.... ...........

Contributed by Calpont, InfiniDB Community Edition is an open source, scale-up analytics database engine for your data warehousing, business intelligence and read-intensive application needs. Enabled via MySQLTM and purpose-built for an analytical workload with column-oriented technology at its core, the multi-threaded capabilities of InfiniDB Community Edition fully encompass query, transactional support and bulk load operations.  So come on in, grab a download and get started.

InfiniDB Team Blog

News and tidbits from your InfiniDB team.

MySQL User Conference

I'll be presenting "The Thinking Person's Guide to Data Warehouse Design" at the upcoming MySQL User conference. While a lot of people think that bad SQL code is the #1 wrecking ball of data warehouses and marts, the fact is that poor database design is the first cause of both downtime and bad performance. In my presentation, I'll do my best to show how up-front work in a data model pays off and how to take that model into a MySQL physical design, with topics like scale-up/out designs, storage engine decisions, partitioning schemes, indexing issues, and much more being discussed. I'll then wrap up with tips on monitoring and tuning of the design.

Hope to see you there!


MySQL University Recording on MySQL Column Databases Now Available

Posted by: robin

Tagged in: Untagged 

I was privileged to lead a MySQL University session MySQL Universietylast week on MySQL-based column databases - what they are, how they work, why you should use them, cases where you shouldn't use them, benchmarks showing their performance, and more. I covered the main MySQL column databases - Infobright, Kickfire, and InfiniDB - and discussed the various features, benefits, and limitations of each. You can find the slide deck and the recording of the presentation complete with the slide presentation at: http://forge.mysql.com/wiki/MySQL_Column_Databases.  Let me know what you think and if you have any questions about what I presented.


First InfiniDB Community Maintenance Release (1.0.3) Now Available!

Posted by: wweeks

Tagged in: Untagged 

As promised, we are pleased to announce the availability of the 1.0.3 release of InfiniDB Community Edition.  This is the first of our monthly maintenance releases that are scheduled throughout the spring.

This release includes a number of bug fixes that you can see at http://bugs.launchpad.net/infinidb.   We have also included support for prepared statements with bind variables and a performance improvement for queries that select many columns with nested functions and expressions resulting in some queries running up to 10 times faster.  You can download the latest InfiniDB binaries, source code, and updated documentation at: http://infinidb.org/downloads.  We welcome your feedback and thanks for your support of InfiniDB.


Vote and Help us Help You Better

Posted by: robin

Tagged in: Untagged 

Right now, we support InfiniDB on a couple of different Linux flavors. As we march toward a couple of more releases this year, we want to expand our operating system support. We could really use your input on which OS's we should support next. Please go to http://www.infinidb.org/index.php. On the lower left corner you'll see our OS poll - please vote and let us know which OS you use for your production databases. Thanks!


How to use InfiniDB with OpenOffice

Posted by: robin

Tagged in: Untagged 

Last week, I pointed you to a paper I recently wrote on how to use InfiniDB with Microsoft Excel. Of course, I realize there are plenty of people who don’t use Microsoft office tools and use OpenOffice instead, so I’ve just posted another paper on how to use InfiniDB with OpenOffice Calc (OO’s complement of Excel) and Base (OO’s twin of MS Access). FWIW, I found OpenOffice easier and more straightforward to use with InfiniDB and other RDBMS's than Microsoft (and I'm pretty much an MS Office poweruser).

Again, you’ll have to register on our .ORG site to download and read the new paper, but again, don’t worry about us hitting you with a bunch of emails or anything else – we just like to know who’s interested in what on our site.

If you’d like to see other papers written on how-to-use-infinidb-with-, please let me know. And thanks again for your support of InfiniDB.


InfiniDB TnT (Tips ‘n Tricks) 1 – data type declarations

Posted by: jtommaney

Tagged in: Untagged 

Columnar DBMS tuning may require some new approaches versus traditional row-based tuning; gone are full/partitioned table scans, index lookups, and nested-loop join operations, instead replaced by full/partitioned column scans, access by RowID, and hash join operations.  This series of articles is intended to measure potential benefits of different approaches to jump-start your expertise with this new paradigm.  Today’s topic is data type declarations and the goal is to help understand the potential benefits of declaring the smallest data type that supports the business data needs. 

 

The test data for this query is a Star Schema Benchmark (SSB) data set at a scale factor of 100, yielding 600 million fact rows.  For this example the table was altered to add additional columns.  For this example the two new columns small_qty and small_disc contain the same data values and distribution as the lo_quantity and lo_discount fields, just in a 1 byte instead of an 8 byte representation.  Actual values vary from 0 to 50 quantities and 0 to 10 for discounts.

 

 

mysql> desc lineorder;

+------------------+---------------+------+-----+---------+-------+

| Field            | Type          | Null | Key | Default | Extra |

+------------------+---------------+------+-----+---------+-------+

| lo_orderkey      | bigint(20)    | YES  |     | NULL    |       |

| lo_linenumber    | int(11)       | YES  |     | NULL    |       |

| lo_custkey       | int(11)       | YES  |     | NULL    |       |

| lo_partkey       | int(11)       | YES  |     | NULL    |       |

| lo_suppkey       | int(11)       | YES  |     | NULL    |       |

| lo_orderdate     | int(11)       | YES  |     | NULL    |       |

| lo_orderpriority | char(15)      | YES  |     | NULL    |       |

| lo_shippriority  | char(1)       | YES  |     | NULL    |       |

| lo_quantity      | decimal(12,2) | YES  |     | NULL    |       |

| lo_extendedprice | decimal(12,2) | YES  |     | NULL    |       |

| lo_ordtotalprice | decimal(12,2) | YES  |     | NULL    |       |

| lo_discount      | decimal(12,2) | YES  |     | NULL    |       |

| lo_revenue       | decimal(12,2) | YES  |     | NULL    |       |

| lo_supplycost    | decimal(12,2) | YES  |     | NULL    |       |

| lo_tax           | decimal(12,2) | YES  |     | NULL    |       |

| lo_commitdate    | int(11)       | YES  |     | NULL    |       |

| lo_shipmode      | char(10)      | YES  |     | NULL    |       |

-- new stuff below --

| char2            | char(2)       | YES  |     | NULL    |       |

| small_qty        | tinyint(4)    | YES  |     | NULL    |       |

| small_disc       | tinyint(4)    | YES  |     | NULL    |       |

+------------------+---------------+------+-----+---------+-------+

20 rows in set (0.01 sec)

 

 

 

Trivial use case to provide relative cost:

 

select calflushcache();

select count(lo_quantity) from lineorder where lo_quantity = 1;

select count(lo_quantity) from lineorder where lo_quantity = 1;

 

select calflushcache();

select count(small_qty) from lineorder where small_qty = 1;

select count(small_qty) from lineorder where small_qty = 1;

 

select calflushcache();

select count(lo_discount) from lineorder where lo_discount = 1;

select count(lo_discount) from lineorder where lo_discount = 1;

 

select calflushcache();

select count(small_disc) from lineorder where small_disc = 1;

select count(small_disc) from lineorder where small_disc = 1;

 

 

trivial query:
count(column) where column = 1

Run 1 Elapsed Seconds

Run 2 Elapsed Seconds

lo_quantity  (8 byte)

15.34

4.11

small_qty (1 byte)

4.24

3.34

lo_discount (8 byte)

15.18

4.38

small_disc (1 byte)

4.45

3.61

 

Run1 Average

Run2 Average

Base

15.26

4.25

Tighter Data type

4.35

3.48

 

 

Quick Notes for Trivial Use Case:

 

1) A significant up-tick in performance is possible when reducing Physical I/O (PIO).

2) An 8-byte from cache and a 1-byte from disk run in close to the same time.

3) For the 1-byte column, disk I/O is no longer the over-riding factor for the query.

 

 

Simplified SSB for series Q1:

 

-- Q1.1 baseline

select count(*)

from dateinfo, lineorder                                       

where lo_orderdate = d_datekey

and d_year = 1993

and lo_orderdate between 19930101 and 19931231                   -- added matching predicate

and lo_discount between 1  and 3

and lo_quantity < 25;

 

The test scenario; flush the cache, then run the baseline Q1.1 query twice.  Then modify the query to use the new 1-byte columns for discount and quantity.  Flushed the cache and ran the new query twice.  Execute the same steps for Q1.2 and Q1.3.

 

 

simplified SSB queries:
select count(*) from …

Run 1 Elapsed Seconds

Run 2 Elapsed Seconds

Q1.1 baseline

6.7

2.07

Q1.1 tighter datatype

3.06

1.85

Q1.2 baseline

1.29

0.32

Q1.2 tighter datatype

0.48

0.28

Q1.3 baseline

6.41

1.36

Q1.3 tighter datatype

2.68

1.12

 

Run1 Average

Run2 Average

Base

4.80

1.25

Tighter Datatype

2.07

1.08

 

 

Quick Notes for SSB series Q1:

 

For more complex queries with additional CPU costs for join operations the I/O cost becomes somewhat less pronounced, but shows the same general pattern.  However, all of the analysis here presumes that all blocks needed for the query are cached for subsequent runs, i.e. the database will run as an in-memory database.  Let’s extend the analysis to look at one more use case where the data blocks for the query can’t be cached.

 

 

 

 

Q1.3 without date filter:

 

select calflushcache();

-- Q1.3 600 million tighter datatype

select count(*)

from dateinfo, lineorder                                       

where lo_orderdate = d_datekey

and small_disc between 5 and 7           

and small_qty between 26 and 35;

select calgetstats();

 

 

Remove date filter from Q1.3

Run 1 Elapsed Seconds

Run 2 Elapsed Seconds

Q1.3 600 million

38.01

38.05

Q1.3 600 million tighter datatype

14.92

6.59

 

 

Quick Notes for SSB series Q1.3 without date filter:

 

We now see the use case where the second run with the larger data type no longer fits in memory.  The second query runs in about the same time for the baseline, while the tighter data type runs in about 40% of the baseline for the first run, and about 17% of baseline for additional runs. 

 

 

Overall Conclusions:

 

Smaller Data Types reduce overall I/O requirements for the query which:

 

1)      Can reduce 50% or more of PIO under conditions where the data blocks have not been previously cached.

2)      Can reduce 100% of PIO for subsequent queries when the tighter data type declarations allow more data to be cached in the same number of blocks.

 

Note also that the benefits may be harder to measure when the database easily fits in memory without any optimization. 

 

 

Let us help you put your data to work.

 

 

Thanks – Jim Tommaney

Chief Product Architect

 

 

 


A Note on Using InfiniDB and Microsoft Excel

Posted by: robin

Tagged in: Untagged 

A recent survey found that the most used BI tool isn’t what you might think (e.g. Business Objects, Pentaho, Cognos, Jaspersoft, etc.) – it’s Microsoft Excel. And I suppose that’s not so surprising given spreadsheets are the main communication medium for financial and general business analysis.

If you use Excel in this manner, you should know that you can use Excel’s data import capabilities to directly query (via the MySQL ODBC driver) and pull data from InfiniDB into a spreadsheet. But there is one thing you should watch for: for whatever reason, when building a join between two or more tables, Excel’s table/column selection wizard will send a Cartesian join statement to MySQL as it steps through its process. With InfiniDB, we disallow Cartesian joins because (1) most Cartesian joins are a coding mistake; (2) a large Cartesian join that’s executed can kill performance on a database.

So if you want to use Excel and query InfiniDB, you need to cancel out of Excel’s table/column selection wizard when constructing a join, and instead use Excel’s query editor. With Excel’s query editor, you can select tables, visually do your joins, and select the columns you want with no Cartesian join statements being sent to the server.

I’ve written a new white paper on using InfiniDB and Microsoft Excel that’s filled with step-by-step examples, which you can download and read. Given that it’s a white paper, we do require registration on infinidb.org to get it, but rest assured we won’t give out your email address or spam you with correspondence.

Thanks again for your support of InfiniDB!


Interesting (and Good) News for MySQL Data Warehouse Users

Posted by: robin

Tagged in: Untagged 

One aspect of my positions with Calpont, MySQL, and other companies I’ve worked for has been to interact with the various analysts who cover the database scene. It’s definitely an interesting part of the job, especially when you get to query some of the experts who have been around a long time and are good at making solid technology calls on where things are headed.

IDC is one such group of experts and they recently released an interesting set of predictions on where databases – and specifically data warehouses and analytic DB’s – are going. You can read the quick update they issued HERE.

A couple of their ‘within 5 year’ predictions I found particular interesting for MySQL data warehouse users are:

  • Most data warehouses will be stored in a columnar fashion
  • Most large-scale database servers will achieve horizontal scalability through clustering

This is helpful information for MySQL users who want to ensure they’re making the right bets on their data warehouse designs and the storage engines they’re choosing. It’s also good news because MySQL users now have a choice of column databases they can try out and see which is right for them.

I’ve written a new article on the MySQL dev zone on why you should check out a column database if you’re planning to implement a data warehouse, reporting database, or any read-intensive application. It talks about the why and why not’s of column DB’s, discusses why they matter, and provides some interesting benchmark tests of a leading row database vs. InfiniDB.

You can find the new article HERE. Please let us know what you think and what plans you have for your data warehouses.


Roadmap and Schedule for Next Release of InfiniDB

Posted by: robin

Tagged in: Untagged 

We’ve had a good and successful launch of InfiniDB 1.0 (Feb 1), and while we’re pleased with that accomplishment, we’ve now turned our attention to the next release of InfiniDB, which will be 1.1. We’ve got some nice new features, capabilities, and changes planned for you that aren’t very far away. In addition, we plan on backporting some of these changes to a 1.0 patch release so current users can benefit from them now.

You can find the new, updated roadmap available here. You’ll find each new committed feature briefly described, features we’ll implement if time permits, and a schedule showing what new builds will contain what new features, along with when our beta and RC cycles are expected. If all goes as planned, the FINAL InfiniDB 1.1 version will be released the week of July 1.

We’re committed to a release early/often framework, and we look forward to your participation in our upcoming release. If you have any questions about our plans, please let us know. Thanks for your support of InfiniDB!


Let’s extend the analysis of MySQL parallel query processing with InfiniDB against 60 Billion rows:


The Star Schema Benchmark scale factor for 10k generates 60 billion rows fact rows, with the largest dimension tables at 300 million and 100 million.  Previous results were shown at a scale factor of 1000, generating 6 billion facts and 30 million rows in the largest dimension table. 

 

Vadim Tkachenko with Percona (scale factor 1,000):

http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/ . 

 

Tuning and Scaling SSB Queries (scale factor 1,000):

http://www.infinidb.org/myblog-admin/mysql-parallel-query-processing-of-ssb-queries-via-infinidb-.html

 

The results here were generated with the same queries and basic methodology as detailed in the scale factor 1000 InfiniDB blog entry referenced above, with the exceptions noted below. Only one server configuration was evaluated, in this case 1 User Module running MySQL daemon and InfiniDB connection management processes, and 6 Performance Modules running our distributed global cache and distributed join and block processing engine.  Additional details on the load methodology, cautions, and exceptions are available at:  http://www.infinidb.org/myblog-admin/infinidb-load-60-billion-ssb-rows-trended.html .  As noted, one of 2560 source files was mistakenly deleted prior to import (pebkac error), so the actual row count is 59,977,404,781 - just short of 60 billion.  The servers themselves had 16GB memory and 8 cores connected to fibre channel attached LUNs.

 

Non-Default Parameter Settings:

 

Handling 10x larger data did require some non-default parameter settings.  To handle the larger aggregation the decimal scale was specified at the session level: 

    set infinidb_decimal_scale=1;

    set infinidb_use_decimal_scale = 'ON';

 

Tuning was set up to allow for distributed joins to take place where possible. The non-default  Calpont.xml server parameters were:

 

./configxml.sh setconfig JobList MaxOutstandingRequests 7

./configxml.sh setconfig HashJoin PmMaxMemorySmallSide 1G

./configxml.sh setconfig HashJoin UmMaxMemorySmallSide 13G

./configxml.sh setconfig HashJoin TotalUmMaxMemorySmallSide 14G

./configxml.sh setconfig RowAggregation MaxMemory 10G

 

Details on these parameters are available in the Performance and Tuning Guide available by registering here:  http://www.infinidb.org/downloads/cat_view/71-advanced-user-documentation .

 

Open bug and work-around for this test:

 

A memory leak was identified when joining 300 million customer dimension rows to the 60 billion row fact table as part of Q4.3.  The issue is actually related to the number of rows in the dimension table (300 million customers in this case); joining 60 billion to 30 million did not show this issue.  The work-around was to re-start the system prior to executing Q4.3, although adding more memory would also get past the issue.  https://bugs.launchpad.net/bugs/522707

 

Analysis:

 

Note that both schemas were in the same instance.  The results were basically in-line with expectations around query performance, although the test methodology of running Q1.1 through Q4.3 in a single stream allowed for caching behavior to take place, benefitting the scale factor 1000 results to a greater degree. This shows up most prominently in Q4.3, where the query did zero Physical I/O against the lineorder fact table with 6 Billion vs. 100% PIO for the lineorder fact table at 60 Billion.

select calgettrace()  for ssb_1k :
Desc Mode Table     TableOID ReferencedOIDs                  PIO  LIO     PBE     Elapsed Rows
BPS  PM   customer  3257     (3258)                          0    14650   0       6.494   30000000
DSS  PM   supplier  3243     (3254)                          0    2       -       0.355   2
BPS  PM   supplier  3243     (3244,3253,3248)                9771 25610   0       0.879   399769
HJS  PM   supplier  3243     -                               -    -       -       0.000   -
BPS  PM   part      3228     (3229,3232,3239)                4586 8803    0       0.264   79913
BPS  PM   dateinfo  3273     (3278,3274)                     0    8       0       1.268   729
BPS  PM   lineorder 3208     (3214,3211,3213,3212,3221,3222) 0    5642519 2224128 13.684  2301852
HJS  PM   lineorder 3208     -                               -    -       -       0.000   -
TAS  UM   -         -        -                               -    -       -       13.576  -
TAS  UM   -         -        -                               -    -       -       13.576  -
 
select calgettrace()  for ssb_10k :
Desc Mode Table     TableOID ReferencedOIDs                  PIO      LIO      PBE      Elapsed Rows
BPS  PM   customer  3135     (3136)                          145974   145932   0        47.878  300000000
DSS  PM   supplier  3121     (3132)                          12       0        -        2.228   12
BPS  PM   supplier  3121     (3122,3131,3126)                244161   255834   0        5.391   4001247
HJS  PM   supplier  3121     -                               -        -        -        0.000   -
BPS  PM   part      3172     (3173,3176,3183)                10529    12278    0        1.045   111730
BPS  PM   dateinfo  3187     (3192,3188)                     8        5        0        1.164   729
BPS  PM   lineorder 3086     (3092,3089,3091,3090,3099,3100) 56395771 56278340 22237184 682.780 23098897
HJS  PM   lineorder 3086     -                               -        -        -        0.000   -
TAS  UM   -         -        -                               -        -        -        682.246 -
TAS  UM   -         -        -                               -        -        -        682.246 –

Extended Analysis:

These queries showed an average elapsed time @60 billion rows of around 1,540 seconds versus average elapsed times posted by Percona @6 billion rows of 6,000 - 8,000 seconds or more for other products.  The differences in the test conditions are significant enough to rule out direct comparisons; we used our Enterprise Edition, had more hardware, loaded the data by month, and tuned the queries.  I fully expect that each of the other products would show benefits under other test conditions, but will leave it to them to show extended capabilities against larger data sets or distributed systems. However, I did want to highlight what is possible with the distributed and parallel processing capabilities of InfiniDB. 

 

Let us help you put your data to work. 

Thanks – Jim Tommaney


  • «
  •  Start 
  •  Prev 
  •  1 
  •  2 
  •  3 
  •  4 
  •  5 
  •  Next 
  •  End 
  • »