Columnar DBMS tuning may require some new approaches versus traditional row-based tuning; gone are full/partitioned table scans, index lookups, and nested-loop join operations, instead replaced by full/partitioned column scans, access by RowID, and hash join operations. This series of articles is intended to measure potential benefits of different approaches to jump-start your expertise with this new paradigm. Today’s topic is data type declarations and the goal is to help understand the potential benefits of declaring the smallest data type that supports the business data needs.
The test data for this query is a Star Schema Benchmark (SSB) data set at a scale factor of 100, yielding 600 million fact rows. For this example the table was altered to add additional columns. For this example the two new columns small_qty and small_disc contain the same data values and distribution as the lo_quantity and lo_discount fields, just in a 1 byte instead of an 8 byte representation. Actual values vary from 0 to 50 quantities and 0 to 10 for discounts.
mysql> desc lineorder;
+------------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+---------------+------+-----+---------+-------+
| lo_orderkey | bigint(20) | YES | | NULL | |
| lo_linenumber | int(11) | YES | | NULL | |
| lo_custkey | int(11) | YES | | NULL | |
| lo_partkey | int(11) | YES | | NULL | |
| lo_suppkey | int(11) | YES | | NULL | |
| lo_orderdate | int(11) | YES | | NULL | |
| lo_orderpriority | char(15) | YES | | NULL | |
| lo_shippriority | char(1) | YES | | NULL | |
| lo_quantity | decimal(12,2) | YES | | NULL | |
| lo_extendedprice | decimal(12,2) | YES | | NULL | |
| lo_ordtotalprice | decimal(12,2) | YES | | NULL | |
| lo_discount | decimal(12,2) | YES | | NULL | |
| lo_revenue | decimal(12,2) | YES | | NULL | |
| lo_supplycost | decimal(12,2) | YES | | NULL | |
| lo_tax | decimal(12,2) | YES | | NULL | |
| lo_commitdate | int(11) | YES | | NULL | |
| lo_shipmode | char(10) | YES | | NULL | |
-- new stuff below --
| char2 | char(2) | YES | | NULL | |
| small_qty | tinyint(4) | YES | | NULL | |
| small_disc | tinyint(4) | YES | | NULL | |
+------------------+---------------+------+-----+---------+-------+
20 rows in set (0.01 sec)
Trivial use case to provide relative cost:
select calflushcache();
select count(lo_quantity) from lineorder where lo_quantity = 1;
select count(lo_quantity) from lineorder where lo_quantity = 1;
select calflushcache();
select count(small_qty) from lineorder where small_qty = 1;
select count(small_qty) from lineorder where small_qty = 1;
select calflushcache();
select count(lo_discount) from lineorder where lo_discount = 1;
select count(lo_discount) from lineorder where lo_discount = 1;
select calflushcache();
select count(small_disc) from lineorder where small_disc = 1;
select count(small_disc) from lineorder where small_disc = 1;
|
trivial query: count(column) where column = 1
|
Run 1 Elapsed Seconds
|
Run 2 Elapsed Seconds
|
|
lo_quantity (8 byte)
|
15.34
|
4.11
|
|
small_qty (1 byte)
|
4.24
|
3.34
|
|
lo_discount (8 byte)
|
15.18
|
4.38
|
|
small_disc (1 byte)
|
4.45
|
3.61
|
|
|
Run1 Average
|
Run2 Average
|
|
Base
|
15.26
|
4.25
|
|
Tighter Data type
|
4.35
|
3.48
|
Quick Notes for Trivial Use Case:
1) A significant up-tick in performance is possible when reducing Physical I/O (PIO).
2) An 8-byte from cache and a 1-byte from disk run in close to the same time.
3) For the 1-byte column, disk I/O is no longer the over-riding factor for the query.
Simplified SSB for series Q1:
-- Q1.1 baseline
select count(*)
from dateinfo, lineorder
where lo_orderdate = d_datekey
and d_year = 1993
and lo_orderdate between 19930101 and 19931231 -- added matching predicate
and lo_discount between 1 and 3
and lo_quantity < 25;
The test scenario; flush the cache, then run the baseline Q1.1 query twice. Then modify the query to use the new 1-byte columns for discount and quantity. Flushed the cache and ran the new query twice. Execute the same steps for Q1.2 and Q1.3.
|
simplified SSB queries: select count(*) from …
|
Run 1 Elapsed Seconds
|
Run 2 Elapsed Seconds
|
|
Q1.1 baseline
|
6.7
|
2.07
|
|
Q1.1 tighter datatype
|
3.06
|
1.85
|
|
Q1.2 baseline
|
1.29
|
0.32
|
|
Q1.2 tighter datatype
|
0.48
|
0.28
|
|
Q1.3 baseline
|
6.41
|
1.36
|
|
Q1.3 tighter datatype
|
2.68
|
1.12
|
|
|
Run1 Average
|
Run2 Average
|
|
Base
|
4.80
|
1.25
|
|
Tighter Datatype
|
2.07
|
1.08
|
Quick Notes for SSB series Q1:
For more complex queries with additional CPU costs for join operations the I/O cost becomes somewhat less pronounced, but shows the same general pattern. However, all of the analysis here presumes that all blocks needed for the query are cached for subsequent runs, i.e. the database will run as an in-memory database. Let’s extend the analysis to look at one more use case where the data blocks for the query can’t be cached.
Q1.3 without date filter:
select calflushcache();
-- Q1.3 600 million tighter datatype
select count(*)
from dateinfo, lineorder
where lo_orderdate = d_datekey
and small_disc between 5 and 7
and small_qty between 26 and 35;
select calgetstats();
|
Remove date filter from Q1.3
|
Run 1 Elapsed Seconds
|
Run 2 Elapsed Seconds
|
|
Q1.3 600 million
|
38.01
|
38.05
|
|
Q1.3 600 million tighter datatype
|
14.92
|
6.59
|
Quick Notes for SSB series Q1.3 without date filter:
We now see the use case where the second run with the larger data type no longer fits in memory. The second query runs in about the same time for the baseline, while the tighter data type runs in about 40% of the baseline for the first run, and about 17% of baseline for additional runs.
Overall Conclusions:
Smaller Data Types reduce overall I/O requirements for the query which:
1) Can reduce 50% or more of PIO under conditions where the data blocks have not been previously cached.
2) Can reduce 100% of PIO for subsequent queries when the tighter data type declarations allow more data to be cached in the same number of blocks.
Note also that the benefits may be harder to measure when the database easily fits in memory without any optimization.
Let us help you put your data to work.
Thanks – Jim Tommaney
Chief Product Architect