Data Warehousing Tip #9 – Test at Volume

My next tip is to test your BI solution with the volume of data that you are expecting.  This is paramount to building a successful system.  You need to ensure not only that you can report on your data in a timely manner but also that you can load the volume and complexity of data you are expecting.  I would argue that if you haven’t tested at volume you haven’t really tested at all.

Testing at volume is vitally important to your solution, and the earlier you do it the easier your job will be.  It’s about understanding the challenges of the volume of data you’re dealing with as early as possible in the development process, and actually building the solution to handle that volume efficiently in terms of both read and write performance.

Building to spec

How many records are you intending to load per day?  Whether it’s one million or one billion, make sure you’ve tested loading that volume of data, and not at the end of the development process but right at the beginning.  Can you load it?  Can you query it?  The volume of data you are loading has to shape the solution that you build.

There is no point in building a data warehouse to cope with processing one billion new records a day when you actually only need to process one hundred thousand.  You’ll end up spending a lot of time and a lot of money building a system that completely dwarfs your needs, so keep to spec.  But if you don’t know how it’s going to perform processing one hundred thousand rows, do you just go and put everything in place that you can think of?  No.  You actually start processing one hundred thousand rows from the word go.

Getting the data

So you might not know your exact throughput, but you’ll probably have a good idea, or a worst-case estimate (worst case being the highest possible throughput).  You may be in the position where all of your source data is already available.  If this is the case there really is no excuse: when you start to build your solution, load all of it.  If you don’t have the source data available then mock it up.  If it’s coming from flat files, create test files to the exact spec that you are going to receive.  If it’s coming from a database or from an API, do the same thing.
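If you do need to mock the data up inside SQL Server, a tally-style row generator is one quick way to do it.  The sketch below is purely illustrative, assuming a made-up staging table and a ten million row daily target; shape the columns and row count to the exact spec you’re going to receive:

```sql
-- A minimal sketch of mocking source data at volume, assuming a hypothetical
-- staging table and a target of 10 million rows per day. Column names, types
-- and the row count are illustrative; match them to your actual spec.
CREATE TABLE dbo.SensorReadingStage (
    ReadingId    bigint        NOT NULL,
    DeviceId     int           NOT NULL,
    ReadingDate  datetime2(0)  NOT NULL,
    ReadingValue decimal(18,4) NOT NULL
);
GO

;WITH n AS (
    -- Tally-style row generator: cross joining system views yields plenty of rows.
    SELECT TOP (10000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
    CROSS JOIN sys.all_objects AS c
)
INSERT INTO dbo.SensorReadingStage (ReadingId, DeviceId, ReadingDate, ReadingValue)
SELECT rn,
       CAST(rn % 5000 + 1 AS int),                              -- spread across 5,000 devices
       DATEADD(SECOND, CAST(rn % 86400 AS int),
               CAST(CAST(GETDATE() AS date) AS datetime2(0))),  -- spread across the day
       CAST(RAND(CHECKSUM(NEWID())) * 100 AS decimal(18,4))     -- random measure value
FROM n;
```

However you generate it, the point is the same: the data you develop against should look and weigh the same as the data you’ll receive in production.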

“Upgrade”

The most extreme issue resulting from not testing at volume that I’ve witnessed was when a company I was working with performed an upgrade for one of their clients.  The solution was dealing with a couple of thousand inserts a day.  No history was retained as part of the upgrade, so it started on a brand new, blank database.  Within days of the upgrade going live queries were timing out and the system was unusable for the end user.

I was asked to have a look.  The problem was that they’d built a scalar function to parse name and value pairs out of some very large text fields, and that function was being called by stored procedures executed at query time.  Now SQL Server isn’t great at parsing name and value pairs out of unstructured data, especially when it’s having to do it row-by-painstaking-row.  The function itself wasn’t greatly efficient either, parsing the same value multiple times rather than storing the value retrieved initially in a variable and reusing it.

What was even better was that they were calling this function not just once per record, but multiple times for each record as there were numerous name and value pairs to retrieve.  Awesome.  When questioned about why they took this approach they said it had worked fine in testing…
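To give a flavour of the pattern (with hypothetical object and function names, as I obviously can’t reproduce the original code), the queries looked something along these lines:

```sql
-- A hedged reconstruction of the anti-pattern (object and function names are
-- hypothetical): a scalar function re-parses the same large text column once
-- per pair, per row, every single time the query runs.
SELECT e.EventId,
       dbo.GetNameValuePair(e.RawText, 'CustomerRef') AS CustomerRef,
       dbo.GetNameValuePair(e.RawText, 'OrderRef')    AS OrderRef,
       dbo.GetNameValuePair(e.RawText, 'Status')      AS [Status]
FROM dbo.EventRaw AS e
WHERE e.EventDate >= DATEADD(DAY, -1, SYSDATETIME());
-- Fine against a handful of test rows; row-by-row string parsing at query
-- time degrades rapidly as the table grows.
```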

Most queries, however inefficient, will work fine when you test them with a tiny data set.  The system was minuscule in database terms and had gone from zero to only a few thousand records before it had become unusable.  It was an upgrade that went badly wrong, and problems with the system that should never have seen the light of day were discovered by the end user.

The proper solution to the problem would have been to parse these name and value pairs within the source system, which was also part of the solution, giving the database the data in the shape it wanted.  Because the issue needed fixing immediately, and the proper solution would take longer to implement, I put a better workaround in place within SQL Server.  I rebuilt the function to make it as efficient as possible, and then did the parsing as part of an INSTEAD OF INSERT trigger: catching the insert of the unstructured text, parsing it, and loading the parsed data into a new structured table.  The parsing was performed once on insert rather than every time the data was queried.  With the database holding data in the way it was to be consumed by the UI, queries returned instantaneously.
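The shape of the workaround was roughly as sketched below.  The object names are hypothetical and the pairs are just examples, but the idea is the same: catch the insert, parse once, store structured.

```sql
-- A minimal sketch of the workaround, with hypothetical object names: catch
-- the insert of the raw text, parse the pairs once, and store them in a
-- structured table that the UI queries can hit directly.
CREATE TABLE dbo.EventRaw (
    EventId   int          NOT NULL PRIMARY KEY,
    EventDate datetime2(0) NOT NULL,
    RawText   varchar(max) NOT NULL
);

CREATE TABLE dbo.EventParsed (
    EventId     int         NOT NULL PRIMARY KEY,
    CustomerRef varchar(50) NULL,
    OrderRef    varchar(50) NULL,
    [Status]    varchar(20) NULL
);
GO

CREATE TRIGGER dbo.trg_EventRaw_InsteadOfInsert
ON dbo.EventRaw
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Keep the raw text exactly as received.
    INSERT INTO dbo.EventRaw (EventId, EventDate, RawText)
    SELECT EventId, EventDate, RawText
    FROM inserted;

    -- Parse the name/value pairs once, at load time, into the shape the
    -- UI queries actually want (GetNameValuePair being the rebuilt function).
    INSERT INTO dbo.EventParsed (EventId, CustomerRef, OrderRef, [Status])
    SELECT EventId,
           dbo.GetNameValuePair(RawText, 'CustomerRef'),
           dbo.GetNameValuePair(RawText, 'OrderRef'),
           dbo.GetNameValuePair(RawText, 'Status')
    FROM inserted;
END;
```

The UI queries then read from the structured table instead of calling the function at all, which is why they returned instantly.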

Scaling Up

Another situation I was involved in was one where I was asked to scale a system up from 500 thousand inserts per day to 100 million.  Seriously.  Now, this in itself wasn’t a case of not testing at volume, as the 100 million wasn’t a requirement until all of a sudden it was.  However, imagine you knew the anticipated volume but were testing the system with 0.5% of the planned daily inserts.  What is the point in that?  What are you actually testing?  It was one of the first BI solutions I was involved with and getting it to work almost broke me.  I did learn a hell of a lot in the process though.

The one good thing about this situation was that getting the data wasn’t a problem.  They switched the additional feed on almost immediately.  Then I watched everything grind to a halt.  At first I couldn’t get the data into the data warehouse.  We moved from SQL Server Standard Edition to Enterprise so that I could use table partitioning.  I added daily partitions and made other improvements.  I could load the data.
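For anyone who hasn’t set daily partitions up before, the plumbing looks roughly like this.  The table, boundary dates and filegroup placement below are purely illustrative:

```sql
-- A minimal sketch of daily table partitioning, assuming a date-keyed fact
-- table; names, boundary dates and filegroups are illustrative.
CREATE PARTITION FUNCTION pfDaily (date)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-01-02', '2024-01-03');
GO

CREATE PARTITION SCHEME psDaily
AS PARTITION pfDaily ALL TO ([PRIMARY]);
GO

CREATE TABLE dbo.FactReading (
    ReadingDate  date          NOT NULL,
    DeviceId     int           NOT NULL,
    ReadingValue decimal(18,4) NOT NULL
) ON psDaily (ReadingDate);
GO

-- New boundaries get added ahead of each day's load (typically automated).
ALTER PARTITION SCHEME psDaily NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfDaily () SPLIT RANGE ('2024-01-04');
```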

Then I couldn’t load the Analysis Services multidimensional model.  The incremental load started off okay on an empty measure group but very quickly degraded as the measure group got bigger.  This led me to partitioned measure groups with the partitions aligned with the DW table partitions.  I could then load my measure groups.
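Aligning the measure group partitions essentially meant binding each Analysis Services partition to a source query covering exactly one table partition.  Something along these lines, carrying on from the illustrative partitioned table above:

```sql
-- A rough illustration of aligning measure group partitions with the DW table
-- partitions: each Analysis Services partition is bound to a source query that
-- covers exactly one day, matching the table's partition boundaries.
-- (Partition names and the fact table are illustrative.)

-- Source query bound to measure group partition "FactReading_20240101"
SELECT ReadingDate, DeviceId, ReadingValue
FROM dbo.FactReading
WHERE ReadingDate >= '2024-01-01' AND ReadingDate < '2024-01-02';

-- Source query bound to measure group partition "FactReading_20240102"
SELECT ReadingDate, DeviceId, ReadingValue
FROM dbo.FactReading
WHERE ReadingDate >= '2024-01-02' AND ReadingDate < '2024-01-03';
```

With incremental processing only ever touching the current day’s partition, the load stopped degrading as the measure group grew.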

Finally, I couldn’t actually query the data within any reasonable timeframe.  We had a web-based UI and we wanted results in seconds rather than minutes.  Numerous improvements were needed in the multidimensional model.  Profiling the queries and finding that they were still hitting every partition, even when we’d specified the date range, was just one problem.  I pulled the model apart.  Things that maybe weren’t great at 500 thousand records a day, but didn’t cause a problem at those volumes, needed reworking, which in hindsight is what you’d expect when scaling up a solution to handle 200 times the initial throughput.

This is all stuff we could have tackled up front had we known it was the requirement.  If you know the requirement up front, you don’t have any excuse for finding yourself in that situation.  It could still be a big challenge, but we’re here to resolve challenges.  You can’t tackle it if you aren’t aware of it, floating along, oblivious to it.

We did get there in the end.  Just in time for them to decide that they didn’t actually want the system after all.

Don’t leave it to chance

These are both extreme examples, but you never want to be in the position of not testing your system at volume until you go live with it.  Testing at volume from as early as possible in the development phase will help you identify and tackle challenges before they become problems.  You are building a solution that is fit for purpose.

If you aren’t testing at volume you are leaving the performance of your solution to chance.  You may well have built a lot of solutions and built up a lot of knowledge and experience, but no amount of knowledge and experience can compensate for proper volume testing.

Testing at volume will show you where there are immediate issues: where you need to rethink something completely, and where you need to tweak something else.  Be it the ETL/ELT, schema design, indexing, partitioning strategies, you name it, it will be anything that could degrade as you push more data through the data warehouse.  Which, within a data warehouse or indeed any database, is pretty much everything.
