MySQL Parallel Query Processing of 60 Billion rows via InfiniDB
Posted by: jtommaney
on Feb 16, 2010
Let’s extend the analysis of MySQL parallel query processing with InfiniDB against 60 Billion rows:
The Star Schema Benchmark scale factor for 10k generates 60 billion rows fact rows, with the largest dimension tables at 300 million and 100 million. Previous results were shown at a scale factor of 1000, generating 6 billion facts and 30 million rows in the largest dimension table.
Vadim Tkachenko with Percona (scale factor 1,000):
http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/ .
Tuning and Scaling SSB Queries (scale factor 1,000):
The results here were generated with the same queries and basic methodology as detailed in the scale factor 1000 InfiniDB blog entry referenced above, with the exceptions noted below. Only one server configuration was evaluated, in this case 1 User Module running MySQL daemon and InfiniDB connection management processes, and 6 Performance Modules running our distributed global cache and distributed join and block processing engine. Additional details on the load methodology, cautions, and exceptions are available at: http://www.infinidb.org/myblog-admin/infinidb-load-60-billion-ssb-rows-trended.html . As noted, one of 2560 source files was mistakenly deleted prior to import (pebkac error), so the actual row count is 59,977,404,781 - just short of 60 billion. The servers themselves had 16GB memory and 8 cores connected to fibre channel attached LUNs.
Non-Default Parameter Settings:
Handling 10x larger data did require some non-default parameter settings. To handle the larger aggregation the decimal scale was specified at the session level:
set infinidb_decimal_scale=1;
set infinidb_use_decimal_scale = 'ON';
Tuning was set up to allow for distributed joins to take place where possible. The non-default Calpont.xml server parameters were:
./configxml.sh setconfig JobList MaxOutstandingRequests 7
./configxml.sh setconfig HashJoin PmMaxMemorySmallSide 1G
./configxml.sh setconfig HashJoin UmMaxMemorySmallSide 13G
./configxml.sh setconfig HashJoin TotalUmMaxMemorySmallSide 14G
./configxml.sh setconfig RowAggregation MaxMemory 10G
Details on these parameters are available in the Performance and Tuning Guide available by registering here: http://www.infinidb.org/downloads/cat_view/71-advanced-user-documentation .
Open bug and work-around for this test:
A memory leak was identified when joining 300 million customer dimension rows to the 60 billion row fact table as part of Q4.3. The issue is actually related to the number of rows in the dimension table (300 million customers in this case); joining 60 billion to 30 million did not show this issue. The work-around was to re-start the system prior to executing Q4.3, although adding more memory would also get past the issue. https://bugs.launchpad.net/bugs/522707
Analysis:
Note that both schemas were in the same instance. The results were basically in-line with expectations around query performance, although the test methodology of running Q1.1 through Q4.3 in a single stream allowed for caching behavior to take place, benefitting the scale factor 1000 results to a greater degree. This shows up most prominently in Q4.3, where the query did zero Physical I/O against the lineorder fact table with 6 Billion vs. 100% PIO for the lineorder fact table at 60 Billion.
select calgettrace() for ssb_1k :
Desc Mode Table TableOID ReferencedOIDs PIO LIO PBE Elapsed Rows
BPS PM customer 3257 (3258) 0 14650 0 6.494 30000000
DSS PM supplier 3243 (3254) 0 2 - 0.355 2
BPS PM supplier 3243 (3244,3253,3248) 9771 25610 0 0.879 399769
HJS PM supplier 3243 - - - - 0.000 -
BPS PM part 3228 (3229,3232,3239) 4586 8803 0 0.264 79913
BPS PM dateinfo 3273 (3278,3274) 0 8 0 1.268 729
BPS PM lineorder 3208 (3214,3211,3213,3212,3221,3222) 0 5642519 2224128 13.684 2301852
HJS PM lineorder 3208 - - - - 0.000 -
TAS UM - - - - - - 13.576 -
TAS UM - - - - - - 13.576 -
select calgettrace() for ssb_10k :
Desc Mode Table TableOID ReferencedOIDs PIO LIO PBE Elapsed Rows
BPS PM customer 3135 (3136) 145974 145932 0 47.878 300000000
DSS PM supplier 3121 (3132) 12 0 - 2.228 12
BPS PM supplier 3121 (3122,3131,3126) 244161 255834 0 5.391 4001247
HJS PM supplier 3121 - - - - 0.000 -
BPS PM part 3172 (3173,3176,3183) 10529 12278 0 1.045 111730
BPS PM dateinfo 3187 (3192,3188) 8 5 0 1.164 729
BPS PM lineorder 3086 (3092,3089,3091,3090,3099,3100) 56395771 56278340 22237184 682.780 23098897
HJS PM lineorder 3086 - - - - 0.000 -
TAS UM - - - - - - 682.246 -
TAS UM - - - - - - 682.246 –
Extended Analysis:
These queries showed an average elapsed time @60 billion rows of around 1,540 seconds versus average elapsed times posted by Percona @6 billion rows of 6,000 - 8,000 seconds or more for other products. The differences in the test conditions are significant enough to rule out direct comparisons; we used our Enterprise Edition, had more hardware, loaded the data by month, and tuned the queries. I fully expect that each of the other products would show benefits under other test conditions, but will leave it to them to show extended capabilities against larger data sets or distributed systems. However, I did want to highlight what is possible with the distributed and parallel processing capabilities of InfiniDB.
Let us help you put your data to work.
Thanks – Jim Tommaney
