MySQL Parallel Query Processing of 60 Billion rows via InfiniDB

Posted by: jtommaney

Tagged in: 10k

Let’s extend the analysis of MySQL parallel query processing with InfiniDB against 60 Billion rows:


The Star Schema Benchmark scale factor for 10k generates 60 billion rows fact rows, with the largest dimension tables at 300 million and 100 million.  Previous results were shown at a scale factor of 1000, generating 6 billion facts and 30 million rows in the largest dimension table. 

 

Vadim Tkachenko with Percona (scale factor 1,000):

http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/ . 

 

Tuning and Scaling SSB Queries (scale factor 1,000):

http://www.infinidb.org/myblog-admin/mysql-parallel-query-processing-of-ssb-queries-via-infinidb-.html

 

The results here were generated with the same queries and basic methodology as detailed in the scale factor 1000 InfiniDB blog entry referenced above, with the exceptions noted below. Only one server configuration was evaluated, in this case 1 User Module running MySQL daemon and InfiniDB connection management processes, and 6 Performance Modules running our distributed global cache and distributed join and block processing engine.  Additional details on the load methodology, cautions, and exceptions are available at:  http://www.infinidb.org/myblog-admin/infinidb-load-60-billion-ssb-rows-trended.html .  As noted, one of 2560 source files was mistakenly deleted prior to import (pebkac error), so the actual row count is 59,977,404,781 - just short of 60 billion.  The servers themselves had 16GB memory and 8 cores connected to fibre channel attached LUNs.

 

Non-Default Parameter Settings:

 

Handling 10x larger data did require some non-default parameter settings.  To handle the larger aggregation the decimal scale was specified at the session level: 

    set infinidb_decimal_scale=1;

    set infinidb_use_decimal_scale = 'ON';

 

Tuning was set up to allow for distributed joins to take place where possible. The non-default  Calpont.xml server parameters were:

 

./configxml.sh setconfig JobList MaxOutstandingRequests 7

./configxml.sh setconfig HashJoin PmMaxMemorySmallSide 1G

./configxml.sh setconfig HashJoin UmMaxMemorySmallSide 13G

./configxml.sh setconfig HashJoin TotalUmMaxMemorySmallSide 14G

./configxml.sh setconfig RowAggregation MaxMemory 10G

 

Details on these parameters are available in the Performance and Tuning Guide available by registering here:  http://www.infinidb.org/downloads/cat_view/71-advanced-user-documentation .

 

Open bug and work-around for this test:

 

A memory leak was identified when joining 300 million customer dimension rows to the 60 billion row fact table as part of Q4.3.  The issue is actually related to the number of rows in the dimension table (300 million customers in this case); joining 60 billion to 30 million did not show this issue.  The work-around was to re-start the system prior to executing Q4.3, although adding more memory would also get past the issue.  https://bugs.launchpad.net/bugs/522707

 

Analysis:

 

Note that both schemas were in the same instance.  The results were basically in-line with expectations around query performance, although the test methodology of running Q1.1 through Q4.3 in a single stream allowed for caching behavior to take place, benefitting the scale factor 1000 results to a greater degree. This shows up most prominently in Q4.3, where the query did zero Physical I/O against the lineorder fact table with 6 Billion vs. 100% PIO for the lineorder fact table at 60 Billion.

select calgettrace()  for ssb_1k :
Desc Mode Table     TableOID ReferencedOIDs                  PIO  LIO     PBE     Elapsed Rows
BPS  PM   customer  3257     (3258)                          0    14650   0       6.494   30000000
DSS  PM   supplier  3243     (3254)                          0    2       -       0.355   2
BPS  PM   supplier  3243     (3244,3253,3248)                9771 25610   0       0.879   399769
HJS  PM   supplier  3243     -                               -    -       -       0.000   -
BPS  PM   part      3228     (3229,3232,3239)                4586 8803    0       0.264   79913
BPS  PM   dateinfo  3273     (3278,3274)                     0    8       0       1.268   729
BPS  PM   lineorder 3208     (3214,3211,3213,3212,3221,3222) 0    5642519 2224128 13.684  2301852
HJS  PM   lineorder 3208     -                               -    -       -       0.000   -
TAS  UM   -         -        -                               -    -       -       13.576  -
TAS  UM   -         -        -                               -    -       -       13.576  -
 
select calgettrace()  for ssb_10k :
Desc Mode Table     TableOID ReferencedOIDs                  PIO      LIO      PBE      Elapsed Rows
BPS  PM   customer  3135     (3136)                          145974   145932   0        47.878  300000000
DSS  PM   supplier  3121     (3132)                          12       0        -        2.228   12
BPS  PM   supplier  3121     (3122,3131,3126)                244161   255834   0        5.391   4001247
HJS  PM   supplier  3121     -                               -        -        -        0.000   -
BPS  PM   part      3172     (3173,3176,3183)                10529    12278    0        1.045   111730
BPS  PM   dateinfo  3187     (3192,3188)                     8        5        0        1.164   729
BPS  PM   lineorder 3086     (3092,3089,3091,3090,3099,3100) 56395771 56278340 22237184 682.780 23098897
HJS  PM   lineorder 3086     -                               -        -        -        0.000   -
TAS  UM   -         -        -                               -        -        -        682.246 -
TAS  UM   -         -        -                               -        -        -        682.246 –

Extended Analysis:

These queries showed an average elapsed time @60 billion rows of around 1,540 seconds versus average elapsed times posted by Percona @6 billion rows of 6,000 - 8,000 seconds or more for other products.  The differences in the test conditions are significant enough to rule out direct comparisons; we used our Enterprise Edition, had more hardware, loaded the data by month, and tuned the queries.  I fully expect that each of the other products would show benefits under other test conditions, but will leave it to them to show extended capabilities against larger data sets or distributed systems. However, I did want to highlight what is possible with the distributed and parallel processing capabilities of InfiniDB. 

 

Let us help you put your data to work. 

Thanks – Jim Tommaney

Comments (0)Add Comment

Write comment
You must be logged in to post a comment. Please register if you do not have an account yet.

busy