Profiling InfiniDB Multi-Join

Posted by: jtommaney

Tagged in: Trace

Welcome to Profiling InfiniDB Multi-Join, part of a series designed to highlight different aspects of the InfiniDB join capabilities. InfiniDB is engineered and optimized to execute join operations for millions or billions of rows using a number of key features:

  1. Hash Joins.
  2. Parallel processing (multi-threaded and optionally distributed).
  3. Join sequencing to minimize data shipping costs. (Right-deep join tree instead of left-deep).
  4. InfiniDB multi-join (stream 1 large table past up to 60 hash maps in one operation).
  5. As a result of the above: Join processing occurs within the InfiniDB engine rather than within MySQL process space, final results are returned rather than individual tables.

Join operations are accomplished in InfiniDB via hash joins, nested loop operations have not been implemented. This is an optimization highly suited for joining large data sets, but may be slower than nested loop/indexed if joining a very few rows. In addition, the hash joins are preferentially structured as 'streaming' hash join operations, where the smaller of two tables being hash joined is scanned first and a hash structure built. The larger table is then read in a single pass operation that does not require materializing results from the largest table, as is the case when the joined rows are part of a group by/aggregation operation. In addition, this eliminates the requirement to maintain indexing to support nested loop joins.

Parallel (multi-threaded) join processing for all versions of InfiniDB, with distributed join capabilities also available with Enterprise Edition. This parallelization happens automatically with InfiniDB for all queries, except cases of few row lookups from a few data blocks/pages. Parallelization is another optimization highly suited for large data sets, but may be slower for very few rows. For a full description of parallel capabilities see our Performance Tuning Guide http://infinidb.org/downloads/doc_download/31-performance-tuning-for-the-infinidb-analytics-database .

Joins are sequenced differently within InfiniDB versus standard MySQL. Rather than left-deep trees highly optimized for individual row lookups, InfiniDB transforms join operations into combinations of right-deep and bushy variations of right-deep trees. This join sequencing has a number of benefits important to InfiniDB performance including minimizing data shipping costs by deferring scans and projection of the largest table as late as possible. As a result of the above, a 2 table join and a 20 table join star schema join have a similar execution plan within InfiniDB. The general process flows is:

  1. Build N small side hash maps in parallel.
  2. Stream the largest table past N hash maps.

Of course, joins aren't free, and a 20 table join certainly takes longer than a 2 table join. InfiniDB was built to have join operations deliver a predictable performance experience as well as provide benefits from scaling, either via more threads, or more servers. Let's take a look using an SSB data set with 600 million fact rows.

Table cardinality involved for the Star Schema Benchmark (SSB) data set.

  • dateinfo 2,556
  • supplier 1,000,000
  • part 1,400,000
  • customer 3,000,000
  • lineorder 600,037,902

The basic query and variations of this query are designed to profile "what happens when you add more joins". The group by/aggregation is not complex by design so that measuring the cost of the join operation is not clouded by other processing time. These two filters repeat for all queries and are part of a 2 table join.
- The p_size filter includes 5 of 50 part sizes. The data is evenly distributed, so includes 10%.
- The lo_orderdate filter will filter 600 million lineorders down to 45, 90, or 180 million.
The data was loaded 1 month at a time based on lo_orderdate. This pattern is detected and used to eliminate I/O.


These two filters repeat for all queries and are part of a 2 table join,
- The p_size filter includes 5 of 50 part sizes. The data is evenly distributed, so includes 10%.
- The lo_orderdate filter will filter 600 million lineorders down to 45, 90, or 180 million.
The data was loaded 1 month at a time based on lo_orderdate. This pattern is detected and used to eliminate I/O.

select lo_discount, sum(lo_revenue), count(*)
from lineorder, part
where lo_orderdate between 19930101 and -- use 19930630 or 19931231 or 19941231 and lo_partkey = p_partkey and p_size in (1,3,5,7,11)-- include 5 of 50 part sizes
group by 1
order by 1;

Filters are added for the other dimension tables as well, but are designed to not significantly change the cardinality:
- The s_suppkey filter excludes 5 of 1 million suppliers.
- The c_custkey filter excludes 5 of 3 million customers.
- The date filter excludes one holiday that falls in November.

select lo_discount, sum(lo_revenue), count(*)
from lineorder, part, supplier, customer, dateinfo
where lo_orderdate between 19930101 and 19930630 and lo_partkey = p_partkey and p_size in (1,3,5,7,11) -- include 5 of 50 part sizes
and lo_suppkey = s_suppkey and s_suppkey not in (20,40,80,160,200) -- exclude 5 of 1 million suppliers
and lo_custkey = c_custkey and c_custkey not in (10,30,50,70,110) -- exclude 5 of 3 million customers
and lo_orderdate = d_datekey and not (d_monthnuminyear=11 and d_holidayfl=1) -- exclude 1 of 365 days group by 1
order by 1;

To measure join behavior beyond the 4 dimension tables included in SSB, the 4 same tables are referenced via alias up to 5 times each and the queries are structured to repeat the above filters and cardinality. As these queries are set up, the 4 table join aggregates the same rows and columns as the 20 table join. This was kept static by design to better measure join processing. Full query text available at http://docs.google.com/Doc?docid=0AWI1rlqF3OcyZGhjcTVqcnFfMWZxcmdndGYy&hl=en


select lo_discount, sum(lo_revenue), count(*)
from lineorder, part p1, supplier s1, customer c1, dateinfo d1,
part p2, supplier s2, customer c2, dateinfo d2
where lo_orderdate between 19930101 and 19931231 and lo_partkey = p1.p_partkey and p1.p_size in (1,3,5,7,11) and lo_suppkey = s1.s_suppkey and s1.s_suppkey not in (20,40,80,160,200) and lo_custkey = c1.c_custkey and c1.c_custkey not in (10,30,50,70,110) and lo_orderdate = d1.d_datekey and not (d1.d_monthnuminyear=11 and d1.d_holidayfl=1) and lo_partkey = p2.p_partkey and p2.p_size in (1,3,5,7,11) and lo_suppkey = s2.s_suppkey and s2.s_suppkey not in (20,40,80,160,200) and lo_custkey = c2.c_custkey and c2.c_custkey not in (10,30,50,70,110) and lo_orderdate = d2.d_datekey and not (d2.d_monthnuminyear=11 and d2.d_holidayfl=1) group by 1
order by 1;

The first results come from a single server implementation and show multi-threaded join operations in action. The queries use the above described filters against 1, 2, 3, 4, 8, 12, 16, or 20 SSB dimension tables, and the lineorder column scan finds 45, 90, or 180 million records.

Trending the elapsed time shows a relatively predictable performance curve, so looks good there. How about if we add more processing power? The same queries and data with 4 performance modules in InfiniDB's scale out architecture (Enterprise Edition) show similar predictability as joins are added and further accelerates the queries.

 



Some insight into what is happening under the covers can come from reviewing some trace output (calgettrace) that reports column usage, block touches, elapsed time, blocks eliminated, and rows for each table/column operation involved.

--------------
select lo_discount, sum(lo_revenue), count(*)
from lineorder, part, supplier, customer, dateinfo
where lo_orderdate between 19930101 and 19930630 and lo_partkey = p_partkey and p_size in (1,3,5,7,11) and lo_suppkey = s_suppkey and s_suppkey not in (20,40,80,160,200) and lo_custkey = c_custkey and c_custkey not in (10,30,50,70,110) and lo_orderdate = d_datekey and not (d_monthnuminyear=11 and d_holidayfl=1) group by 1
order by 1
--------------

+-------------+------------------+----------+
| lo_discount | sum(lo_revenue) | count(*) |
+-------------+------------------+----------+
| 0.00 | 1256942863944.00 | 411197 |
| 1.00 | 1242275387105.00 | 410392 |
| 2.00 | 1230022787579.00 | 410506 |
| 3.00 | 1222918049830.00 | 411874 |
| 4.00 | 1207020379870.00 | 410623 |
| 5.00 | 1193737956597.00 | 410687 |
| 6.00 | 1179302700381.00 | 411001 |
| 7.00 | 1166447185676.00 | 410318 |
| 8.00 | 1154794708036.00 | 410412 |
| 9.00 | 1144011969314.00 | 410524 |
| 10.00 | 1135850057862.00 | 412094 |
+-------------+------------------+----------+
11 rows in set, 1 warning (3.15 sec)

--------------
select calgettrace()
--------------

+----------------------------------+
| calgettrace() |
+----------------------------------+
| Desc Mode Table TableOID ReferencedOIDs PIO LIO PBE Elapsed Rows BPS PM part 4477 (4485,4478) 0 1369 0 0.042 139780 BPS PM supplier 3048 (3049) 0 490 0 0.091 999995 BPS PM customer 3062 (3063) 0 1466 0 0.536 2999995 BPS PM dateinfo 3078 (3079,3092,3094,3089) 0 15 0 0.002 2549 BPS PM lineorder 3028 (3034,3032,3033,3031,3040,3041) 0 183056 264945 2.130 4519640 HJS PM lineorder 3028 - - - - 0.000 - ADS UM - - - - - - 1.984 -

For a quick overview of the results there were 4 batch primitive steps (BPS) referencing smaller tables that touched (logical I/O or LIO) between 15 and 1466 blocks of data (8k), with duration for each of the operations between .002 and .536 seconds. The output then reports a BPS and a HJS (hash join step) for lineorder, with one additional piece of information available, the PBE column. This indicates that the scan of lo_orderdate avoided touching 264945 blocks of data, with PBE indicating Partition Blocks Eliminated. The ADS is an aggregation delivery step that is effectively a part 2 of a two step aggregation operation, with part 1 occurring within the previous operation. The column heading indicates Elapsed for the seconds, but perhaps Duration would be a better as a number of operations occur in parallel. Summing elapsed is 4.785 seconds, but the query completed in 3.15 seconds, clearly indicating parallel behavior. The trace output does not currently indicate this as cleanly as possible, but the scans of the 4 smaller tables happen concurrently, and the lineorder BPS, HJS, and ADS also execute in parallel (pipeline parallelization).

  1. Build N small side hash maps in parallel, with the max Elapsed/Duration of 0.536 seconds.
  2. Stream the largest table past N hash maps, with the max Elapsed/Duration of 2.130 seconds.


So, this trace output reports ~ 2.67 seconds consumed within InfiniDB versus 3.15 seconds actual client elapsed time, reporting something like 85% of the time consumption. Any sql parsing or other misc. activities within MySQL does not show through this trace methodology. Let's now look at joining linorder with 20 tables:

select lo_discount, sum(lo_revenue), count(*)
from lineorder, part p1, supplier s1, customer c1, dateinfo d1,
part p2, supplier s2, customer c2, dateinfo d2,
part p3, supplier s3, customer c3, dateinfo d3,
part p4, supplier s4, customer c4, dateinfo d4,
part p5, supplier s5, customer c5, dateinfo d5
where lo_orderdate between 19930101 and 19930630 and lo_partkey = p1.p_partkey and p1.p_size in (1,3,5,7,11) and lo_suppkey = s1.s_suppkey and s1.s_suppkey not in (20,40,80,160,200) and lo_custkey = c1.c_custkey and c1.c_custkey not in (10,30,50,70,110) and lo_orderdate = d1.d_datekey and not (d1.d_monthnuminyear=11 and d1.d_holidayfl=1) and lo_partkey = p2.p_partkey and p2.p_size in (1,3,5,7,11) and lo_suppkey = s2.s_suppkey and s2.s_suppkey not in (20,40,80,160,200) and lo_custkey = c2.c_custkey and c2.c_custkey not in (10,30,50,70,110) and lo_orderdate = d2.d_datekey and not (d2.d_monthnuminyear=11 and d2.d_holidayfl=1) and lo_partkey = p3.p_partkey and p3.p_size in (1,3,5,7,11) and lo_suppkey = s3.s_suppkey and s3.s_suppkey not in (20,40,80,160,200) and lo_custkey = c3.c_custkey and c3.c_custkey not in (10,30,50,70,110) and lo_orderdate = d3.d_datekey and not (d3.d_monthnuminyear=11 and d3.d_holidayfl=1) and lo_partkey = p4.p_partkey and p4.p_size in (1,3,5,7,11) and lo_suppkey = s4.s_suppkey and s4.s_suppkey not in (20,40,80,160,200) and lo_custkey = c4.c_custkey and c4.c_custkey not in (10,30,50,70,110) and lo_orderdate = d4.d_datekey and not (d4.d_monthnuminyear=11 and d4.d_holidayfl=1) and lo_partkey = p5.p_partkey and p5.p_size in (1,3,5,7,11) and lo_suppkey = s5.s_suppkey and s5.s_suppkey not in (20,40,80,160,200) and lo_custkey = c5.c_custkey and c5.c_custkey not in (10,30,50,70,110) and lo_orderdate = d5.d_datekey and not (d5.d_monthnuminyear=11 and d5.d_holidayfl=1) group by 1
order by 1
--------------

+-------------+------------------+----------+
| lo_discount | sum(lo_revenue) | count(*) |
+-------------+------------------+----------+
| 0.00 | 1256942863944.00 | 411197 |
| 1.00 | 1242275387105.00 | 410392 |
| 2.00 | 1230022787579.00 | 410506 |
| 3.00 | 1222918049830.00 | 411874 |
| 4.00 | 1207020379870.00 | 410623 |
| 5.00 | 1193737956597.00 | 410687 |
| 6.00 | 1179302700381.00 | 411001 |
| 7.00 | 1166447185676.00 | 410318 |
| 8.00 | 1154794708036.00 | 410412 |
| 9.00 | 1144011969314.00 | 410524 |
| 10.00 | 1135850057862.00 | 412094 |
+-------------+------------------+----------+
11 rows in set, 1 warning (13.35 sec)

--------------
select calgettrace()
--------------

+----------------------------------------------------------------------------------------------------------+
| calgettrace() |
+----------------------------------------------------------------------------------------------------------+
| Desc Mode Table TableOID ReferencedOIDs PIO LIO PBE Elapsed Rows
1) build N small side hash maps in parallel
BPS PM p5 4477 (4485,4478) 0 1369 0 0.073 139780 BPS PM p4 4477 (4485,4478) 0 1369 0 0.157 139780 BPS PM p3 4477 (4485,4478) 0 1369 0 0.114 139780 BPS PM p2 4477 (4485,4478) 0 1369 0 0.143 139780 BPS PM p1 4477 (4485,4478) 0 1369 0 0.107 139780 BPS PM s5 3048 (3049) 0 490 0 0.152 999995 BPS PM s4 3048 (3049) 0 490 0 0.152 999995 BPS PM s3 3048 (3049) 0 490 0 0.176 999995 BPS PM s2 3048 (3049) 0 490 0 0.157 999995 BPS PM s1 3048 (3049) 0 490 0 0.181 999995 BPS PM c5 3062 (3063) 0 1466 0 0.721 2999995 BPS PM c4 3062 (3063) 0 1466 0 0.915 2999995 BPS PM c3 3062 (3063) 0 1466 0 1.323 2999995 BPS PM c2 3062 (3063) 0 1466 0 1.528 2999995 BPS PM c1 3062 (3063) 0 1466 0 1.108 2999995 BPS PM d5 3078 (3079,3094,3089) 0 14 0 0.007 2549 BPS PM d4 3078 (3079,3094,3089) 0 14 0 0.003 2549 BPS PM d3 3078 (3079,3094,3089) 0 14 0 0.004 2549 BPS PM d2 3078 (3079,3094,3089) 0 14 0 0.003 2549 BPS PM d1 3078 (3079,3092,3094,3089) 0 15 0 0.003 2549
2) stream the largest table past N hash maps
BPS PM lineorder 3028 (3034,3032,3033,3031,3040,3041) 0 183056 264945 9.661 4519640 HJS PM lineorder 3028 - - - - 0.000 - ADS UM - - - - - - 9.416 -

The 20 table join follows a similar job processing model

  1. Build N small side hash maps in parallel shows the longest duration of 1.528. That task happens concurrently with the other tables, but does not necessarily start immediately. Rather the total load that can be issued by a query is gated by some parameters.
  2. Stream the largest table past N hash maps, with the max Elapsed/Duration of 9.416 seconds.


So this trace output reports about 82% of the wall-time for this query. This is short of the ultimate accounting for time as taught by Cary Millsap of Method R and formerly Hotsos, but we believe it is incremental towards that goal.

So come on in to InfiniDB.org, grab a download and put your data to work.


Comments (0)Add Comment

Write comment
You must be logged in to post a comment. Please register if you do not have an account yet.

busy