.... ...........

Contributed by Calpont, InfiniDB Community Edition is an open source, scale-up analytics database engine for your data warehousing, business intelligence and read-intensive application needs. Enabled via MySQLTM and purpose-built for an analytical workload with column-oriented technology at its core, the multi-threaded capabilities of InfiniDB Community Edition fully encompass query, transactional support and bulk load operations.  So come on in, grab a download and get started.

Performant InfiniDB

A source for: - performance tuning recommendations. - profiling different distributed and non-distributed operations: i.e. what works well, and under what conditions. - providing comparative measures

InfiniDB TnT (Tips ‘n Tricks) 1 – data type declarations

Posted by: jtommaney

Tagged in: Untagged 

Columnar DBMS tuning may require some new approaches versus traditional row-based tuning; gone are full/partitioned table scans, index lookups, and nested-loop join operations, instead replaced by full/partitioned column scans, access by RowID, and hash join operations.  This series of articles is intended to measure potential benefits of different approaches to jump-start your expertise with this new paradigm.  Today’s topic is data type declarations and the goal is to help understand the potential benefits of declaring the smallest data type that supports the business data needs. 

 

The test data for this query is a Star Schema Benchmark (SSB) data set at a scale factor of 100, yielding 600 million fact rows.  For this example the table was altered to add additional columns.  For this example the two new columns small_qty and small_disc contain the same data values and distribution as the lo_quantity and lo_discount fields, just in a 1 byte instead of an 8 byte representation.  Actual values vary from 0 to 50 quantities and 0 to 10 for discounts.

 

 

mysql> desc lineorder;

+------------------+---------------+------+-----+---------+-------+

| Field            | Type          | Null | Key | Default | Extra |

+------------------+---------------+------+-----+---------+-------+

| lo_orderkey      | bigint(20)    | YES  |     | NULL    |       |

| lo_linenumber    | int(11)       | YES  |     | NULL    |       |

| lo_custkey       | int(11)       | YES  |     | NULL    |       |

| lo_partkey       | int(11)       | YES  |     | NULL    |       |

| lo_suppkey       | int(11)       | YES  |     | NULL    |       |

| lo_orderdate     | int(11)       | YES  |     | NULL    |       |

| lo_orderpriority | char(15)      | YES  |     | NULL    |       |

| lo_shippriority  | char(1)       | YES  |     | NULL    |       |

| lo_quantity      | decimal(12,2) | YES  |     | NULL    |       |

| lo_extendedprice | decimal(12,2) | YES  |     | NULL    |       |

| lo_ordtotalprice | decimal(12,2) | YES  |     | NULL    |       |

| lo_discount      | decimal(12,2) | YES  |     | NULL    |       |

| lo_revenue       | decimal(12,2) | YES  |     | NULL    |       |

| lo_supplycost    | decimal(12,2) | YES  |     | NULL    |       |

| lo_tax           | decimal(12,2) | YES  |     | NULL    |       |

| lo_commitdate    | int(11)       | YES  |     | NULL    |       |

| lo_shipmode      | char(10)      | YES  |     | NULL    |       |

-- new stuff below --

| char2            | char(2)       | YES  |     | NULL    |       |

| small_qty        | tinyint(4)    | YES  |     | NULL    |       |

| small_disc       | tinyint(4)    | YES  |     | NULL    |       |

+------------------+---------------+------+-----+---------+-------+

20 rows in set (0.01 sec)

 

 

 

Trivial use case to provide relative cost:

 

select calflushcache();

select count(lo_quantity) from lineorder where lo_quantity = 1;

select count(lo_quantity) from lineorder where lo_quantity = 1;

 

select calflushcache();

select count(small_qty) from lineorder where small_qty = 1;

select count(small_qty) from lineorder where small_qty = 1;

 

select calflushcache();

select count(lo_discount) from lineorder where lo_discount = 1;

select count(lo_discount) from lineorder where lo_discount = 1;

 

select calflushcache();

select count(small_disc) from lineorder where small_disc = 1;

select count(small_disc) from lineorder where small_disc = 1;

 

 

trivial query:
count(column) where column = 1

Run 1 Elapsed Seconds

Run 2 Elapsed Seconds

lo_quantity  (8 byte)

15.34

4.11

small_qty (1 byte)

4.24

3.34

lo_discount (8 byte)

15.18

4.38

small_disc (1 byte)

4.45

3.61

 

Run1 Average

Run2 Average

Base

15.26

4.25

Tighter Data type

4.35

3.48

 

 

Quick Notes for Trivial Use Case:

 

1) A significant up-tick in performance is possible when reducing Physical I/O (PIO).

2) An 8-byte from cache and a 1-byte from disk run in close to the same time.

3) For the 1-byte column, disk I/O is no longer the over-riding factor for the query.

 

 

Simplified SSB for series Q1:

 

-- Q1.1 baseline

select count(*)

from dateinfo, lineorder                                       

where lo_orderdate = d_datekey

and d_year = 1993

and lo_orderdate between 19930101 and 19931231                   -- added matching predicate

and lo_discount between 1  and 3

and lo_quantity < 25;

 

The test scenario; flush the cache, then run the baseline Q1.1 query twice.  Then modify the query to use the new 1-byte columns for discount and quantity.  Flushed the cache and ran the new query twice.  Execute the same steps for Q1.2 and Q1.3.

 

 

simplified SSB queries:
select count(*) from …

Run 1 Elapsed Seconds

Run 2 Elapsed Seconds

Q1.1 baseline

6.7

2.07

Q1.1 tighter datatype

3.06

1.85

Q1.2 baseline

1.29

0.32

Q1.2 tighter datatype

0.48

0.28

Q1.3 baseline

6.41

1.36

Q1.3 tighter datatype

2.68

1.12

 

Run1 Average

Run2 Average

Base

4.80

1.25

Tighter Datatype

2.07

1.08

 

 

Quick Notes for SSB series Q1:

 

For more complex queries with additional CPU costs for join operations the I/O cost becomes somewhat less pronounced, but shows the same general pattern.  However, all of the analysis here presumes that all blocks needed for the query are cached for subsequent runs, i.e. the database will run as an in-memory database.  Let’s extend the analysis to look at one more use case where the data blocks for the query can’t be cached.

 

 

 

 

Q1.3 without date filter:

 

select calflushcache();

-- Q1.3 600 million tighter datatype

select count(*)

from dateinfo, lineorder                                       

where lo_orderdate = d_datekey

and small_disc between 5 and 7           

and small_qty between 26 and 35;

select calgetstats();

 

 

Remove date filter from Q1.3

Run 1 Elapsed Seconds

Run 2 Elapsed Seconds

Q1.3 600 million

38.01

38.05

Q1.3 600 million tighter datatype

14.92

6.59

 

 

Quick Notes for SSB series Q1.3 without date filter:

 

We now see the use case where the second run with the larger data type no longer fits in memory.  The second query runs in about the same time for the baseline, while the tighter data type runs in about 40% of the baseline for the first run, and about 17% of baseline for additional runs. 

 

 

Overall Conclusions:

 

Smaller Data Types reduce overall I/O requirements for the query which:

 

1)      Can reduce 50% or more of PIO under conditions where the data blocks have not been previously cached.

2)      Can reduce 100% of PIO for subsequent queries when the tighter data type declarations allow more data to be cached in the same number of blocks.

 

Note also that the benefits may be harder to measure when the database easily fits in memory without any optimization. 

 

 

Let us help you put your data to work.

 

 

Thanks – Jim Tommaney

Chief Product Architect

 

 

 


Let’s extend the analysis of MySQL parallel query processing with InfiniDB against 60 Billion rows:


The Star Schema Benchmark scale factor for 10k generates 60 billion rows fact rows, with the largest dimension tables at 300 million and 100 million.  Previous results were shown at a scale factor of 1000, generating 6 billion facts and 30 million rows in the largest dimension table. 

 

Vadim Tkachenko with Percona (scale factor 1,000):

http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/ . 

 

Tuning and Scaling SSB Queries (scale factor 1,000):

http://www.infinidb.org/myblog-admin/mysql-parallel-query-processing-of-ssb-queries-via-infinidb-.html

 

The results here were generated with the same queries and basic methodology as detailed in the scale factor 1000 InfiniDB blog entry referenced above, with the exceptions noted below. Only one server configuration was evaluated, in this case 1 User Module running MySQL daemon and InfiniDB connection management processes, and 6 Performance Modules running our distributed global cache and distributed join and block processing engine.  Additional details on the load methodology, cautions, and exceptions are available at:  http://www.infinidb.org/myblog-admin/infinidb-load-60-billion-ssb-rows-trended.html .  As noted, one of 2560 source files was mistakenly deleted prior to import (pebkac error), so the actual row count is 59,977,404,781 - just short of 60 billion.  The servers themselves had 16GB memory and 8 cores connected to fibre channel attached LUNs.

 

Non-Default Parameter Settings:

 

Handling 10x larger data did require some non-default parameter settings.  To handle the larger aggregation the decimal scale was specified at the session level: 

    set infinidb_decimal_scale=1;

    set infinidb_use_decimal_scale = 'ON';

 

Tuning was set up to allow for distributed joins to take place where possible. The non-default  Calpont.xml server parameters were:

 

./configxml.sh setconfig JobList MaxOutstandingRequests 7

./configxml.sh setconfig HashJoin PmMaxMemorySmallSide 1G

./configxml.sh setconfig HashJoin UmMaxMemorySmallSide 13G

./configxml.sh setconfig HashJoin TotalUmMaxMemorySmallSide 14G

./configxml.sh setconfig RowAggregation MaxMemory 10G

 

Details on these parameters are available in the Performance and Tuning Guide available by registering here:  http://www.infinidb.org/downloads/cat_view/71-advanced-user-documentation .

 

Open bug and work-around for this test:

 

A memory leak was identified when joining 300 million customer dimension rows to the 60 billion row fact table as part of Q4.3.  The issue is actually related to the number of rows in the dimension table (300 million customers in this case); joining 60 billion to 30 million did not show this issue.  The work-around was to re-start the system prior to executing Q4.3, although adding more memory would also get past the issue.  https://bugs.launchpad.net/bugs/522707

 

Analysis:

 

Note that both schemas were in the same instance.  The results were basically in-line with expectations around query performance, although the test methodology of running Q1.1 through Q4.3 in a single stream allowed for caching behavior to take place, benefitting the scale factor 1000 results to a greater degree. This shows up most prominently in Q4.3, where the query did zero Physical I/O against the lineorder fact table with 6 Billion vs. 100% PIO for the lineorder fact table at 60 Billion.

select calgettrace()  for ssb_1k :
Desc Mode Table     TableOID ReferencedOIDs                  PIO  LIO     PBE     Elapsed Rows
BPS  PM   customer  3257     (3258)                          0    14650   0       6.494   30000000
DSS  PM   supplier  3243     (3254)                          0    2       -       0.355   2
BPS  PM   supplier  3243     (3244,3253,3248)                9771 25610   0       0.879   399769
HJS  PM   supplier  3243     -                               -    -       -       0.000   -
BPS  PM   part      3228     (3229,3232,3239)                4586 8803    0       0.264   79913
BPS  PM   dateinfo  3273     (3278,3274)                     0    8       0       1.268   729
BPS  PM   lineorder 3208     (3214,3211,3213,3212,3221,3222) 0    5642519 2224128 13.684  2301852
HJS  PM   lineorder 3208     -                               -    -       -       0.000   -
TAS  UM   -         -        -                               -    -       -       13.576  -
TAS  UM   -         -        -                               -    -       -       13.576  -
 
select calgettrace()  for ssb_10k :
Desc Mode Table     TableOID ReferencedOIDs                  PIO      LIO      PBE      Elapsed Rows
BPS  PM   customer  3135     (3136)                          145974   145932   0        47.878  300000000
DSS  PM   supplier  3121     (3132)                          12       0        -        2.228   12
BPS  PM   supplier  3121     (3122,3131,3126)                244161   255834   0        5.391   4001247
HJS  PM   supplier  3121     -                               -        -        -        0.000   -
BPS  PM   part      3172     (3173,3176,3183)                10529    12278    0        1.045   111730
BPS  PM   dateinfo  3187     (3192,3188)                     8        5        0        1.164   729
BPS  PM   lineorder 3086     (3092,3089,3091,3090,3099,3100) 56395771 56278340 22237184 682.780 23098897
HJS  PM   lineorder 3086     -                               -        -        -        0.000   -
TAS  UM   -         -        -                               -        -        -        682.246 -
TAS  UM   -         -        -                               -        -        -        682.246 –

Extended Analysis:

These queries showed an average elapsed time @60 billion rows of around 1,540 seconds versus average elapsed times posted by Percona @6 billion rows of 6,000 - 8,000 seconds or more for other products.  The differences in the test conditions are significant enough to rule out direct comparisons; we used our Enterprise Edition, had more hardware, loaded the data by month, and tuned the queries.  I fully expect that each of the other products would show benefits under other test conditions, but will leave it to them to show extended capabilities against larger data sets or distributed systems. However, I did want to highlight what is possible with the distributed and parallel processing capabilities of InfiniDB. 

 

Let us help you put your data to work. 

Thanks – Jim Tommaney


I wanted to offer another InfiniDB load rate metric using the SSB lineorder fact table.  In this case we are using a scale factor of 10,000 which translates to 60 Billion rows.  As a point of reference, the recent Percona benchmark was at a scale factor of 1000 (6 billion rows) http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/ .  The load rate per hour varied only slightly across the entire run, averaging about 478 million rows per hour.  As always, your actual load rate will vary based on your disk, table, and column definitions, but you should expect consistent load times across very wide cardinality ranges. 

The table is the Lineorder table as defined here:

Some disclaimers about the SSB at 10,000.  Use of the benchmark at this scale  appears to be bleeding edge:
NOTE: Data generation for scale factors >  1000 GB is still in development,
        and is not yet supported.
Your resulting data set MAY NOT BE COMPLIANT!
In addition, I munged one of 2560 files prior to import and so I am missing about 24 million records vs. the expected.   


mysql> desc ssb_10k.lineorder;
+------------------+---------------+------+-----+---------+-------+
| Field            | Type          | Null | Key | Default | Extra |
+------------------+---------------+------+-----+---------+-------+
| lo_orderkey      | bigint(20)    | YES  |     | NULL    |       |
| lo_linenumber    | int(11)       | YES  |     | NULL    |       |
| lo_custkey       | int(11)       | YES  |     | NULL    |       |
| lo_partkey       | int(11)       | YES  |     | NULL    |       |
| lo_suppkey       | int(11)       | YES  |     | NULL    |       |
| lo_orderdate     | int(11)       | YES  |     | NULL    |       |
| lo_orderpriority | char(15)      | YES  |     | NULL    |       |
| lo_shippriority  | char(1)       | YES  |     | NULL    |       |
| lo_quantity      | decimal(12,2) | YES  |     | NULL    |       |
| lo_extendedprice | decimal(12,2) | YES  |     | NULL    |       |
| lo_ordtotalprice | decimal(12,2) | YES  |     | NULL    |       |
| lo_discount      | decimal(12,2) | YES  |     | NULL    |       |
| lo_revenue       | decimal(12,2) | YES  |     | NULL    |       |
| lo_supplycost    | decimal(12,2) | YES  |     | NULL    |       |
| lo_tax           | decimal(12,2) | YES  |     | NULL    |       |
| lo_commitdate    | int(11)       | YES  |     | NULL    |       |
| lo_shipmode      | char(10)      | YES  |     | NULL    |       |
+------------------+---------------+------+-----+---------+-------+

Example syntax to import:
     /usr/local/Calpont/bin/colxml ssb_10k -t lineorder -j 10000
     /usr/local/Calpont/bin/cpimport -j 10000

Actual row count: 59,977,404,781

Example data: 

|1|1|73799965|1551894|41364203|19960102|5-LOW|0|17.00|3307894.00|18660018.00|
4.00|3175578.00|116749.00|2.00|19960212|TRUCK|

A stable and predictable load rate can be very important when dealing with larger and larger data sets. Let us help you put your data to work!

Thanks - Jim Tommaney,  Chief Product Architect


InfiniDB load rate trended through 23.9 billion rows

Posted by: jtommaney

Tagged in: Untagged 

Just a quick graphical presentation of the load rate with InfiniDB importing 23.9 billion rows in about 12 1/2 hours.  Rows per second is just north of .5 million rows per second. 
Your milage will obviously vary based on row width, disk configuration, etc., but the expectation would be that the load rate is stable over a very wide range of cardinalities. 





This table was created and optimized as the fact table for a star schema data model originally implemented on a popular row based dbms. As a result, this table may be narrower than some others. The import used the default cpimport (bulk load) parameters, so 4 total threads are allocated for the bulk load process.  The total seconds were 45,247 and 23,963,603,232 rows were loaded in 6305 individual jobs. 

  create table pos_day
(
  period_id        int,
  location_id      int,
  product_id       int,
  upc_product_id   int,
  adj_first_order_date_id  int,
  recmd_item_flag  char(1),
  new_item_flag    char(1),
  promo_activity_code   char(1),
  item_source_code      char(1),
  sales_unit       decimal(7,2),
  sales_retail     decimal(7,2),
  gross_profit     decimal(7,2),
  promo_unit       decimal(7,2)   ) engine=infinidb;

Example syntax to import:
     /usr/local/Calpont/bin/colxml point_of_sale -t pos_day -j 1129
     /usr/local/Calpont/bin/cpimport -j 1129

Example load data:  6306|5139|124910|98548|1009|Y|N||I|1|0.99|0.41||

Let us help you put your data to work, and also what additional features are of interest. 
Thanks - Jim Tommaney




MySQL Parallel Query Processing of SSB Queries via InfiniDB

Posted by: jtommaney

Tagged in: Untagged 

Let's take a look at MySQL parallel query processing with the InfiniDB storage engine with ~ 6 billion row fact table with the Star Schema Benchmark (SSB).

The star schema benchmark description and a data generator (dbgen) can be found here: http://www.cs.umb.edu/~xuedchen/research/publications/ .  The SSB benchmark modifies TPC-H/DBT-3 following data warehousing rules as described in some of the publications.

Note:  The load files for the fact table as emitted by dbgen randomizes the dates across the 6 billion rows. For all examples here, the data was split into months, but not sorted, prior to cpimport (bulk load) into InfiniDB. This corresponds more closely to typical data load patterns, and, in my opinion, corresponds to a more realistic data set. This allows for significant I/O elimination to take place, dramatically improving performance.

In addition, for 6 of the 13 queries, an additional predicate was added to duplicate a filter originally applied to a dimension table onto the fact table. For Q1.1, the d_year filter was effectively added to the fact table with the following lo_orderdate predicate. A future enhancement will effectively deliver the same I/O elimination automatically.  

-- Q1.1
select sum(lo_extendedprice*lo_discount) as revenue
from dateinfo, lineorder where lo_orderdate = d_datekey
and d_year = 1993
and lo_orderdate between 19930101 and 19931231 -- added matching predicate
and lo_discount between 1 and 3
and lo_quantity < 25;

To give a comparative benefit from the I/O elimination (think partition elimination), the query runs in:

About 22 seconds from disk with the added predicate,
about 10 seconds from cache with the added predicate,
versus about 120 seconds without the predicate.

The timings were captured by running the SSB queries from Q1.1 to Q4.3 in sequence without flushing the cache or restarting system. These times are from the second execution of the script to replicate a warm cache scenario.  The system configuration was then modified to add or remove performance modules (servers). The test script starts with 4 PMs (pm1, pm2, pm3, pm4) and removes from service to test different configurations. Re-enabling modules works the same way.
idbmysql -vvv ssb_1000 < ssb_w_trace.sql > run_ssb_trace_1000sf_4PM_run1.log
idbmysql -vvv ssb_1000 < ssb_w_trace.sql > run_ssb_trace_1000sf_4PM_run2.log

cc altersystem-disablemodule pm3,pm4 y
idbmysql -vvv ssb_1000 < ssb_w_trace.sql > run_ssb_trace_1000sf_2PM_run1.log
idbmysql -vvv ssb_1000 < ssb_w_trace.sql > run_ssb_trace_1000sf_2PM_run2.log

cc altersystem-disablemodule pm2 y
idbmysql -vvv ssb_1000 < ssb_w_trace.sql > run_ssb_trace_1000sf_1PM_run1.log
idbmysql -vvv ssb_1000 < ssb_w_trace.sql > run_ssb_trace_1000sf_1PM_run2.log


The test includes a single performance module implementation (similar to open source single server) as well as the scale-out feature offered as part of the InfiniDB Enterprise Edition. The overall parallel query capabilities are similar for the single server open source edition as well as the enterprise. The cc command shown above is a call to the Calpont Console capabilities to administer the system.  In addition, these tests were run with the default tuning parameter settings. 
From the standpoint of the parallel query capabilities, adding additional cores either by scale-out (enterprise) or scale-up (available with single server) offer approximately the same benefits for many query categories.

The queries/script as run can be found at:
http://docs.google.com/Doc?docid=0AWI1rlqF3OcyZGhjcTVqcnFfNGN3N3RodmMz&hl=en


The timings for the queries against the 6 billion row fact table and up to 30 million rows in the dimension tables for 1,2 and 4 Performance Modules (PMs) are (all times in seconds):

  ----- PM Count -----
  1PM 2PMs 4PMs
Q1.1 81 41.03 22.11
Q1.2 7.91 4.15 2.46
Q1.3 50.73 25.43 16.06
Q2.1 482.25 247.68 129.15
Q2.2 385.21 200.88 103.3
Q2.3 354.32 180.51 96.78
Q3.1 830.53 432.64 228.84
Q3.2 548.38 286.23 150.81
Q3.3 473.07 216.29 113.29
Q3.4 24.48 15.87 8.49
Q4.1 999.43 519.98 280.04
Q4.2 255.01 130.94 57.27
Q4.3 158.44 82.65 26.35


The same metrics displayed graphically show the performance improvement via adding more cores (in this case through scale-out). 


To better understand the results we can view the elapsed time as log(seconds) to compare to a linear trend line, so the graph now includes one additonal line titled Linear Trend Line. 




The overall trend lines show very good scalability across a wide variety of cases.  A look into the details shows a couple of variations from the ideal linear scalabilty trend line.  The results could be classified into 3 general groups:

Group 1 - Most queries show the same slope as Linear Trend Line (purple towards the middle of the graph), indicating something approaching linear scaling.

Group 2 - A few show some modest variations from the best case scaling, indicating that some portion of the query is not fully parallel, or potentially some variation in the timings that may go away with a larger sample size.

Group 3 - Some better than linear scaling occurs for queries 4.2 and 4.3 where the query is more than twice as fast as resources double. This is due to a larger cache allowing elimination of significant amounts of physical I/O.  So, with caching, greater than linear scalability is possible.  Obviously, this is conditional depending on the exact queries, data, and memory.  

This chart shows the reduction in elapsed time as the parallel query is distributed across more cores.  1 -- 2 shows the reduction in time as the system scales from 8 cores (on one PM) to 16 cores (on two PMs).  2 -- 4 shows the reduction in time as the system scales from 16 cores to 32 cores (on 4 PMs).

Group 2 queries are highlighted in yellow, and Group 3 queries are highlighted in green. 

  Scalability Factor
 1 -- 2  2 -- 4
Q1.1 0.507 0.539
Q1.2 0.525 0.593
Q1.3 0.501 0.632
Q2.1 0.514 0.521
Q2.2 0.521 0.514
Q2.3 0.509 0.536
Q3.1 0.521 0.529
Q3.2 0.522 0.527
Q3.3 0.457 0.524
Q3.4 0.648 0.535
Q4.1 0.520 0.539
Q4.2 0.513 0.437
Q4.3 0.522 0.319
Average 0.522 0.519

The overall average reduction in time approaches, but does not reach .50, the theoretical best case result for perfect linear scalability.  I would characterize the Star Schema Benchmark data set and queries as showing very strong parallel query for MySQL with the InfiniDB storage engine. 
Let us know if you have any questions or would like to see some additional data sets profiled.  We look forward to helping you put your data to work. 

Thanks - JT


Profiling InfiniDB Multi-Join

Posted by: jtommaney

Tagged in: Trace

Welcome to Profiling InfiniDB Multi-Join, part of a series designed to highlight different aspects of the InfiniDB join capabilities. InfiniDB is engineered and optimized to execute join operations for millions or billions of rows using a number of key features:

  1. Hash Joins.
  2. Parallel processing (multi-threaded and optionally distributed).
  3. Join sequencing to minimize data shipping costs. (Right-deep join tree instead of left-deep).
  4. InfiniDB multi-join (stream 1 large table past up to 60 hash maps in one operation).
  5. As a result of the above: Join processing occurs within the InfiniDB engine rather than within MySQL process space, final results are returned rather than individual tables.

Join operations are accomplished in InfiniDB via hash joins, nested loop operations have not been implemented. This is an optimization highly suited for joining large data sets, but may be slower than nested loop/indexed if joining a very few rows. In addition, the hash joins are preferentially structured as 'streaming' hash join operations, where the smaller of two tables being hash joined is scanned first and a hash structure built. The larger table is then read in a single pass operation that does not require materializing results from the largest table, as is the case when the joined rows are part of a group by/aggregation operation. In addition, this eliminates the requirement to maintain indexing to support nested loop joins.

Parallel (multi-threaded) join processing for all versions of InfiniDB, with distributed join capabilities also available with Enterprise Edition. This parallelization happens automatically with InfiniDB for all queries, except cases of few row lookups from a few data blocks/pages. Parallelization is another optimization highly suited for large data sets, but may be slower for very few rows. For a full description of parallel capabilities see our Performance Tuning Guide http://infinidb.org/downloads/doc_download/31-performance-tuning-for-the-infinidb-analytics-database .

Joins are sequenced differently within InfiniDB versus standard MySQL. Rather than left-deep trees highly optimized for individual row lookups, InfiniDB transforms join operations into combinations of right-deep and bushy variations of right-deep trees. This join sequencing has a number of benefits important to InfiniDB performance including minimizing data shipping costs by deferring scans and projection of the largest table as late as possible. As a result of the above, a 2 table join and a 20 table join star schema join have a similar execution plan within InfiniDB. The general process flows is:

  1. Build N small side hash maps in parallel.
  2. Stream the largest table past N hash maps.

Of course, joins aren't free, and a 20 table join certainly takes longer than a 2 table join. InfiniDB was built to have join operations deliver a predictable performance experience as well as provide benefits from scaling, either via more threads, or more servers. Let's take a look using an SSB data set with 600 million fact rows.

Table cardinality involved for the Star Schema Benchmark (SSB) data set.

  • dateinfo 2,556
  • supplier 1,000,000
  • part 1,400,000
  • customer 3,000,000
  • lineorder 600,037,902

The basic query and variations of this query are designed to profile "what happens when you add more joins". The group by/aggregation is not complex by design so that measuring the cost of the join operation is not clouded by other processing time. These two filters repeat for all queries and are part of a 2 table join.
- The p_size filter includes 5 of 50 part sizes. The data is evenly distributed, so includes 10%.
- The lo_orderdate filter will filter 600 million lineorders down to 45, 90, or 180 million.
The data was loaded 1 month at a time based on lo_orderdate. This pattern is detected and used to eliminate I/O.


These two filters repeat for all queries and are part of a 2 table join,
- The p_size filter includes 5 of 50 part sizes. The data is evenly distributed, so includes 10%.
- The lo_orderdate filter will filter 600 million lineorders down to 45, 90, or 180 million.
The data was loaded 1 month at a time based on lo_orderdate. This pattern is detected and used to eliminate I/O.

select lo_discount, sum(lo_revenue), count(*)
from lineorder, part
where lo_orderdate between 19930101 and -- use 19930630 or 19931231 or 19941231 and lo_partkey = p_partkey and p_size in (1,3,5,7,11)-- include 5 of 50 part sizes
group by 1
order by 1;

Filters are added for the other dimension tables as well, but are designed to not significantly change the cardinality:
- The s_suppkey filter excludes 5 of 1 million suppliers.
- The c_custkey filter excludes 5 of 3 million customers.
- The date filter excludes one holiday that falls in November.

select lo_discount, sum(lo_revenue), count(*)
from lineorder, part, supplier, customer, dateinfo
where lo_orderdate between 19930101 and 19930630 and lo_partkey = p_partkey and p_size in (1,3,5,7,11) -- include 5 of 50 part sizes
and lo_suppkey = s_suppkey and s_suppkey not in (20,40,80,160,200) -- exclude 5 of 1 million suppliers
and lo_custkey = c_custkey and c_custkey not in (10,30,50,70,110) -- exclude 5 of 3 million customers
and lo_orderdate = d_datekey and not (d_monthnuminyear=11 and d_holidayfl=1) -- exclude 1 of 365 days group by 1
order by 1;

To measure join behavior beyond the 4 dimension tables included in SSB, the 4 same tables are referenced via alias up to 5 times each and the queries are structured to repeat the above filters and cardinality. As these queries are set up, the 4 table join aggregates the same rows and columns as the 20 table join. This was kept static by design to better measure join processing. Full query text available at http://docs.google.com/Doc?docid=0AWI1rlqF3OcyZGhjcTVqcnFfMWZxcmdndGYy&hl=en


select lo_discount, sum(lo_revenue), count(*)
from lineorder, part p1, supplier s1, customer c1, dateinfo d1,
part p2, supplier s2, customer c2, dateinfo d2
where lo_orderdate between 19930101 and 19931231 and lo_partkey = p1.p_partkey and p1.p_size in (1,3,5,7,11) and lo_suppkey = s1.s_suppkey and s1.s_suppkey not in (20,40,80,160,200) and lo_custkey = c1.c_custkey and c1.c_custkey not in (10,30,50,70,110) and lo_orderdate = d1.d_datekey and not (d1.d_monthnuminyear=11 and d1.d_holidayfl=1) and lo_partkey = p2.p_partkey and p2.p_size in (1,3,5,7,11) and lo_suppkey = s2.s_suppkey and s2.s_suppkey not in (20,40,80,160,200) and lo_custkey = c2.c_custkey and c2.c_custkey not in (10,30,50,70,110) and lo_orderdate = d2.d_datekey and not (d2.d_monthnuminyear=11 and d2.d_holidayfl=1) group by 1
order by 1;

The first results come from a single server implementation and show multi-threaded join operations in action. The queries use the above described filters against 1, 2, 3, 4, 8, 12, 16, or 20 SSB dimension tables, and the lineorder column scan finds 45, 90, or 180 million records.

Trending the elapsed time shows a relatively predictable performance curve, so looks good there. How about if we add more processing power? The same queries and data with 4 performance modules in InfiniDB's scale out architecture (Enterprise Edition) show similar predictability as joins are added and further accelerates the queries.

 



Some insight into what is happening under the covers can come from reviewing some trace output (calgettrace) that reports column usage, block touches, elapsed time, blocks eliminated, and rows for each table/column operation involved.

--------------
select lo_discount, sum(lo_revenue), count(*)
from lineorder, part, supplier, customer, dateinfo
where lo_orderdate between 19930101 and 19930630 and lo_partkey = p_partkey and p_size in (1,3,5,7,11) and lo_suppkey = s_suppkey and s_suppkey not in (20,40,80,160,200) and lo_custkey = c_custkey and c_custkey not in (10,30,50,70,110) and lo_orderdate = d_datekey and not (d_monthnuminyear=11 and d_holidayfl=1) group by 1
order by 1
--------------

+-------------+------------------+----------+
| lo_discount | sum(lo_revenue) | count(*) |
+-------------+------------------+----------+
| 0.00 | 1256942863944.00 | 411197 |
| 1.00 | 1242275387105.00 | 410392 |
| 2.00 | 1230022787579.00 | 410506 |
| 3.00 | 1222918049830.00 | 411874 |
| 4.00 | 1207020379870.00 | 410623 |
| 5.00 | 1193737956597.00 | 410687 |
| 6.00 | 1179302700381.00 | 411001 |
| 7.00 | 1166447185676.00 | 410318 |
| 8.00 | 1154794708036.00 | 410412 |
| 9.00 | 1144011969314.00 | 410524 |
| 10.00 | 1135850057862.00 | 412094 |
+-------------+------------------+----------+
11 rows in set, 1 warning (3.15 sec)

--------------
select calgettrace()
--------------

+----------------------------------+
| calgettrace() |
+----------------------------------+
| Desc Mode Table TableOID ReferencedOIDs PIO LIO PBE Elapsed Rows BPS PM part 4477 (4485,4478) 0 1369 0 0.042 139780 BPS PM supplier 3048 (3049) 0 490 0 0.091 999995 BPS PM customer 3062 (3063) 0 1466 0 0.536 2999995 BPS PM dateinfo 3078 (3079,3092,3094,3089) 0 15 0 0.002 2549 BPS PM lineorder 3028 (3034,3032,3033,3031,3040,3041) 0 183056 264945 2.130 4519640 HJS PM lineorder 3028 - - - - 0.000 - ADS UM - - - - - - 1.984 -

For a quick overview of the results there were 4 batch primitive steps (BPS) referencing smaller tables that touched (logical I/O or LIO) between 15 and 1466 blocks of data (8k), with duration for each of the operations between .002 and .536 seconds. The output then reports a BPS and a HJS (hash join step) for lineorder, with one additional piece of information available, the PBE column. This indicates that the scan of lo_orderdate avoided touching 264945 blocks of data, with PBE indicating Partition Blocks Eliminated. The ADS is an aggregation delivery step that is effectively a part 2 of a two step aggregation operation, with part 1 occurring within the previous operation. The column heading indicates Elapsed for the seconds, but perhaps Duration would be a better as a number of operations occur in parallel. Summing elapsed is 4.785 seconds, but the query completed in 3.15 seconds, clearly indicating parallel behavior. The trace output does not currently indicate this as cleanly as possible, but the scans of the 4 smaller tables happen concurrently, and the lineorder BPS, HJS, and ADS also execute in parallel (pipeline parallelization).

  1. Build N small side hash maps in parallel, with the max Elapsed/Duration of 0.536 seconds.
  2. Stream the largest table past N hash maps, with the max Elapsed/Duration of 2.130 seconds.


So, this trace output reports ~ 2.67 seconds consumed within InfiniDB versus 3.15 seconds actual client elapsed time, reporting something like 85% of the time consumption. Any sql parsing or other misc. activities within MySQL does not show through this trace methodology. Let's now look at joining linorder with 20 tables:

select lo_discount, sum(lo_revenue), count(*)
from lineorder, part p1, supplier s1, customer c1, dateinfo d1,
part p2, supplier s2, customer c2, dateinfo d2,
part p3, supplier s3, customer c3, dateinfo d3,
part p4, supplier s4, customer c4, dateinfo d4,
part p5, supplier s5, customer c5, dateinfo d5
where lo_orderdate between 19930101 and 19930630 and lo_partkey = p1.p_partkey and p1.p_size in (1,3,5,7,11) and lo_suppkey = s1.s_suppkey and s1.s_suppkey not in (20,40,80,160,200) and lo_custkey = c1.c_custkey and c1.c_custkey not in (10,30,50,70,110) and lo_orderdate = d1.d_datekey and not (d1.d_monthnuminyear=11 and d1.d_holidayfl=1) and lo_partkey = p2.p_partkey and p2.p_size in (1,3,5,7,11) and lo_suppkey = s2.s_suppkey and s2.s_suppkey not in (20,40,80,160,200) and lo_custkey = c2.c_custkey and c2.c_custkey not in (10,30,50,70,110) and lo_orderdate = d2.d_datekey and not (d2.d_monthnuminyear=11 and d2.d_holidayfl=1) and lo_partkey = p3.p_partkey and p3.p_size in (1,3,5,7,11) and lo_suppkey = s3.s_suppkey and s3.s_suppkey not in (20,40,80,160,200) and lo_custkey = c3.c_custkey and c3.c_custkey not in (10,30,50,70,110) and lo_orderdate = d3.d_datekey and not (d3.d_monthnuminyear=11 and d3.d_holidayfl=1) and lo_partkey = p4.p_partkey and p4.p_size in (1,3,5,7,11) and lo_suppkey = s4.s_suppkey and s4.s_suppkey not in (20,40,80,160,200) and lo_custkey = c4.c_custkey and c4.c_custkey not in (10,30,50,70,110) and lo_orderdate = d4.d_datekey and not (d4.d_monthnuminyear=11 and d4.d_holidayfl=1) and lo_partkey = p5.p_partkey and p5.p_size in (1,3,5,7,11) and lo_suppkey = s5.s_suppkey and s5.s_suppkey not in (20,40,80,160,200) and lo_custkey = c5.c_custkey and c5.c_custkey not in (10,30,50,70,110) and lo_orderdate = d5.d_datekey and not (d5.d_monthnuminyear=11 and d5.d_holidayfl=1) group by 1
order by 1
--------------

+-------------+------------------+----------+
| lo_discount | sum(lo_revenue) | count(*) |
+-------------+------------------+----------+
| 0.00 | 1256942863944.00 | 411197 |
| 1.00 | 1242275387105.00 | 410392 |
| 2.00 | 1230022787579.00 | 410506 |
| 3.00 | 1222918049830.00 | 411874 |
| 4.00 | 1207020379870.00 | 410623 |
| 5.00 | 1193737956597.00 | 410687 |
| 6.00 | 1179302700381.00 | 411001 |
| 7.00 | 1166447185676.00 | 410318 |
| 8.00 | 1154794708036.00 | 410412 |
| 9.00 | 1144011969314.00 | 410524 |
| 10.00 | 1135850057862.00 | 412094 |
+-------------+------------------+----------+
11 rows in set, 1 warning (13.35 sec)

--------------
select calgettrace()
--------------

+----------------------------------------------------------------------------------------------------------+
| calgettrace() |
+----------------------------------------------------------------------------------------------------------+
| Desc Mode Table TableOID ReferencedOIDs PIO LIO PBE Elapsed Rows
1) build N small side hash maps in parallel
BPS PM p5 4477 (4485,4478) 0 1369 0 0.073 139780 BPS PM p4 4477 (4485,4478) 0 1369 0 0.157 139780 BPS PM p3 4477 (4485,4478) 0 1369 0 0.114 139780 BPS PM p2 4477 (4485,4478) 0 1369 0 0.143 139780 BPS PM p1 4477 (4485,4478) 0 1369 0 0.107 139780 BPS PM s5 3048 (3049) 0 490 0 0.152 999995 BPS PM s4 3048 (3049) 0 490 0 0.152 999995 BPS PM s3 3048 (3049) 0 490 0 0.176 999995 BPS PM s2 3048 (3049) 0 490 0 0.157 999995 BPS PM s1 3048 (3049) 0 490 0 0.181 999995 BPS PM c5 3062 (3063) 0 1466 0 0.721 2999995 BPS PM c4 3062 (3063) 0 1466 0 0.915 2999995 BPS PM c3 3062 (3063) 0 1466 0 1.323 2999995 BPS PM c2 3062 (3063) 0 1466 0 1.528 2999995 BPS PM c1 3062 (3063) 0 1466 0 1.108 2999995 BPS PM d5 3078 (3079,3094,3089) 0 14 0 0.007 2549 BPS PM d4 3078 (3079,3094,3089) 0 14 0 0.003 2549 BPS PM d3 3078 (3079,3094,3089) 0 14 0 0.004 2549 BPS PM d2 3078 (3079,3094,3089) 0 14 0 0.003 2549 BPS PM d1 3078 (3079,3092,3094,3089) 0 15 0 0.003 2549
2) stream the largest table past N hash maps
BPS PM lineorder 3028 (3034,3032,3033,3031,3040,3041) 0 183056 264945 9.661 4519640 HJS PM lineorder 3028 - - - - 0.000 - ADS UM - - - - - - 9.416 -

The 20 table join follows a similar job processing model

  1. Build N small side hash maps in parallel shows the longest duration of 1.528. That task happens concurrently with the other tables, but does not necessarily start immediately. Rather the total load that can be issued by a query is gated by some parameters.
  2. Stream the largest table past N hash maps, with the max Elapsed/Duration of 9.416 seconds.


So this trace output reports about 82% of the wall-time for this query. This is short of the ultimate accounting for time as taught by Cary Millsap of Method R and formerly Hotsos, but we believe it is incremental towards that goal.

So come on in to InfiniDB.org, grab a download and put your data to work.



Our performance tuning guide is now available at infinidb.org.  Subjects touched on  include:
  How are hash joins implemented with InfiniDB?    How is work parallelized across multiple cores, and optionally multiple servers? 
  What "rules of thumb" change with InfiniDB's column storage?
  What is the fastest way to load data with InfiniDB?  
  What is a Batch Primitive Step (BPS)?  

- Jim Tommaney
Chief Product Architect

http://infinidb.org/downloads/cat_view/71-advanced-user-documentation




Profiling InfiniDB join behavior with SSB data set.

Posted by: jtommaney

Tagged in: Untagged 


There will be a series of posts to help describe InfiniDB join behavior to help understand what is possible as join conditions change.

Join Profiling Topics:
 1) Profiling Overview (this post) - Profile a 5 table join as cardinality changes.  Show scalability.
 2) Profiling Multi-Join - Explain and measure InfiniDB Multi-Join (join n tables in 1 scalable operation). Profile scalability.  3) Profile In vs. Hash Join - Compare In (list) performance versus join performance.  Profile Scalability.
 4) TBD - any recommended topics or follow up on any questions. 
Table cardinality involved for the Star Schema Benchmark (SSB) data set is:
      2,556  dateinfo
  1,000,000  supplier
  1,400,000  part    3,000,000  customer
600,037,902  lineorder

The star schema benchmark description and a data generator (dbgen) can be found here: http://www.cs.umb.edu/~xuedchen/research/publications/

The SSB benchmark modifies TPC-H/DBT-3 following data warehousing rules as described in some of the publications. 


Note:  The output for the fact table as emitted by dbgen randomizes the dates across the 600 million rows. For all examples here, the data was split into months, but not sorted, prior to cpimport (bulk load) into Infinidb. This corresponds more closely to typical data load patterns, and, in my opinion, corresponds to a more realistic data set.

The data load and installation methodology:

1) install software with default tuning values.
2) create tables.
3) execute dbgen to create data source files.
4) split the lineorder.tbl file into 80 buckets based on lo_orderdate.
5) execute cpimport for all the tables.
6) run the queries 2 or 3 times to stabilize performance.    
Note that no indexing is required or used. Because of this, the performance measured here would be similar for other column filters that return similar cardinality.


This test uses linux hotplug syntax to take cores offline - ( echo 0 > /sys/devices/system/cpu/cpu7/online ), rather than running on 3 different servers. Therefore, this should only be considered an approximation of 2 core or 4 core behavior.

Example query for 15% of the fact table is:

-- --------------------------------------------------------------
-- 600 million total facts
-- 12 months of 80 months - 15%            (lineorder, dateinfo)
-- 12% of data from other dimension tables (part, customer, supplier)
-- --------------------------------------------------------------
   select  p_category, s_nation, c_nation, d_yearmonthnum, sum(lo_extendedprice), count(*)
     from  part, customer, supplier, dateinfo, lineorder
    where  lo_partkey = p_partkey
      and  lo_suppkey = s_suppkey
      and  lo_custkey = c_custkey
      and  lo_orderdate = d_datekey
      and  p_category in ( 'MFGR#11','MFGR#23','MFGR#41' )
        and  s_nation   in ( 'RUSSIA', 'PERU','CANADA' )
      and  c_nation   in ( 'UNITED KINGDOM', 'ARGENTINA', 'EGYPT'  )
      and  d_yearmonthnum between 199501   and 199512
      and  lo_orderdate   between 19950101 and 19951231
 group by  1, 2, 3, 4
 order by  1, 2, 3, 4 ;

Ran in the following elapsed seconds:

23.0        with 2 cores

12.43     with 4 cores

7.77       with 8 cores

4.89     with 2 Performance Modules (Enterprise Edition)

3.13     with 4 Performance Modules (Enterprise Edition)

 

Spreadsheet containing full results available at:

Spreadsheet_with_Graph


Commentary on the results:
1)  The smallest data set (30 million facts with 4 tables ranging from 250 rows to 1.5 million rows) showed decreased benefits with 2 PMs or 4 PMs (2.01 vs. 1.44 elapsed seconds).
2)  The largest data sets (largest was 180 million facts with 4 tables ranging from 766 rows to 9 million rows) generally scaled well with an average scalability factor of .582 for the largest (where .5 is perfect scalability).

Conclusion:  There are some non-distributed operations involved in setting up these join operations such that scalability is not perfectly linear. The cost of these non-scalable operations is about a second or so (for this size data set).  Scaling does not change this time interval so the second or so shows as a higher cost for small queries, but disappears in the noise for larger queries; Amdahl's Law in action.

Other operations (aggregation, scans, or queries with fewer joins) generally have smaller set-up time and typically show better scalability factor.  Stay posted for other profiles. 

 


With many thanks to Vadim at Percona for his analysis of different capabilities of different columnar dbms. Definitely good information, and well documented.  Queries and results available at:

http://www.mysqlperformanceblog.com/2009/11/02/air-traffic-queries-in-infinidb-early-alpha/

Of course InfiniDB, does offer multi-threaded processing for all offerings and distributed processing (enabled with Enterprise Edition additional functionality) that was beyond the scope of his test platform. 

So, using the same airline on-time queries, we can show the relative performance of the queries as we scale both threads and servers. 

System under test is a Dell server with dual Xeon CPU's, with quad cores (8 total cores).

Intel(R) Xeon(R) CPU E5335 @ 2.00GHz


To gain an approximation of different server capabilities, individual cores were then taken offline using Linux hotplug capabilities and the queries were run:

echo 0 > /sys/devices/system/cpu/cpu7/online
. . .

echo 0 > /sys/devices/system/cpu/cpu2/online

The timings are captured below, with some analysis to follow. The 2 core timings measured here appear to correspond well with Vadim's timings ( total of 178.6 seconds reported by Vadim, and 174.56 seconds measured here). Columns 3-5 below show benefits of additional cores, columns 6 and 7 show benefits of additional servers.


Query

InfiniDB
metrics -Vadim  at Percona

Community InfiniDB Repeated w/ 2 cores enabled

Community InfiniDB Repeated w/ 4 cores enabled

Community InfiniDB Repeated w/ 8 cores enabled

InfiniDB Two Performance Modules (Enterprise Edition)

InfiniDB Four Performance Modules (Enterprise Edition)

Q0

NA

 

 

 

 

 

Q1

6.79

6.97

3.43

1.91

1.29

0.87

Q2

4.59

 

4.83

2.48

1.71

1.34

0.69

Q3

4.96

5.56

3.19

1.93

1.7

0.96

Q4

0.75

0.8

0.43

0.36

0.32

0.32

Q5

NA

 

 

 

 

 

Q6

NA

 

 

 

 

 

Q7

NA

 

 

 

 

 

Q8 (1y)

8.13

8.03

4.17

2.54

2.23

2.02

Q8 (2y)

16.54

16.03

8.75

5.13

3.26

3.79

Q8 (3y)

24.46

23.62

12.66

7.72

5.15

3.3

Q8 (4y)

32.49

31.21

16.61

10.06

5.77

3.44

Q8 (10y)

70.35

68.3

37.06

23.08

14.09

7.16

Q9

9.54

9.21

4.55

2.4

1.39

0.86

 

 

Q0, Q5, Q6, Q7 – on roadmap

 

Q1, Q2, Q3, Q4,Q9 - show good scalability overall, however queries under about 1/2 second show either smaller benefits from scaling, or no benefit (Q4 shows the close to the same time for 4 cores and above).

 

Q8 variations - show overall good scalability, but we think there may be some performance opportunities here. 

 

 

Look to this site for additional scalability measurements on Scan operations, Aggregation, and (Hash) Join operations. 

 

 

 


InfiniDB Top 3 Tuning Parameters

Posted by: jtommaney

Tagged in: Tuning , Performance , Parameters

There are a few parameters that may be worth looking at when first installing InfiniDB; NumBlocksPct for managing memory, PmMaxMemorySmallSide for optimizing join behaviors, and MaxOutstandingRequests which changes how individual queries share resources.
The NumBlocksPct parameter manages the amount of memory on each server allocated to store data blocks and is expressed as a % of server memory. For example, to allocate 5 GB for the data buffer cache for a server with 8GB memory, set the parameter to 62.5. This memory utilization will show up within the PrimProc process on a single server installation, or on each distributed Performance Module for an Enterprise Edition installation. This memory is not allocated at startup, instead growing as additional data blocks are cached. In addition, this would reserve up to 3 GB for ExeMgr process space for intermediate results handling, as needed.

Enterprise Edition note: for an EE installation, all available memory on the User Module is automatically available for intermediate results handling, no parameter setting is required. The NumBlocksPct can be set to a much higher percentage, 90% or more because of tightly controlled dynamic memory requirements on the Performance Module.
The PmMaxMemorySmallSide parameter controls where a join takes place within InfiniDB, whether it is run within the User Module process space (ExeMgr), or within the Performance Module process space (PrimProc). In all cases the join operation is a hash join that is optimized for joining millions or billions of rows. For a single server implementation, the parameter should be set so that joins are done within the Performance Module process space to reduce some intra-server data passing. The default parameter setting allows joins where the smaller side of the join operation is less than about 1 million rows to happen within the Performance Module. Raising the parameter value will generally cause more joins to happen within the Performance Module process space.
Enterprise Edition note: for an EE installation, a Performance Module join operation is generally more performant, and more scalable, than a User Module join operation. However this may need additional memory on each Performance Module and so is a trade-off with NumBlocksPct that controls data buffer cache size. Would your most important queries get faster by having more blocks of data cached or by satisfying larger joins in a distributed manner?
The MaxOutstandingRequests parameter determines how aggressively requests to execute individual filters, joins and aggregations are issued by the User Module process to the Performance Module(s) process(es). Each request will execute against up to 8 million rows, and this parameter controls how many requests should be 'active' at any given point in time. A parameter setting of 5 will attempt to keep at least 40 million being actively worked on until all rows in the table are have been processed. Depending on server capabilities and actual operations, this can be accomplished in a fractional second, or some number of seconds. For most cases, small queries should be able to be completed without waiting for completion of any large query through this method. Reducing the parameter can minimize these small delays for smaller queries.