I’ve seen data warehouses that nobody trusts, or are just too complex for end users to report on. Users will voice their concerns (usually quite loudly). Even worse, they will go back to using the relational source to get “accurate” data or because “it’s easier” to get data from there. The data warehouse could become something that you are maintaining but nobody is using, and that’s never a great feeling for a developer. There will always be some resistance to a new reporting solution especially from those that are happy with the existing one, but if your data isn’t right or your system is too hard to work with you will never get buy-in for your new data warehouse, and rightly so.
My second data warehousing tip is to follow some clear rules when building your data warehouse to ensure that your solution delivers accuracy, simplicity, and responsiveness.
Single version of the truth
The first thing that your data warehouse must be is a single definitive version of the truth. This term gets banded about within data warehousing a lot, but it’s value can’t be overstated. This rule is the foundation of your data warehouse. Without this single principle in place the system and everything built on top of it will crumble.
For me, single version of the truth means two things:
The system has to be accurate
This sounds like a no-brainer, but the system must be accurate. There will be challenges to your data, especially if your results differ from those returned from reports built against legacy systems or systems that now feed your data warehouse. There may be cases where reports built from source systems are proven to be inaccurate, and this can be a big win for your shiny new data warehouse. You always need to be able to back this up however. This is where holding your fact data at the leaf level comes in as discussed in tip #1.
You can’t have two ways to ask the same question and get different answers
Your data warehouse could potentially have different ways to come up with different answers to the same question. This could be that there are different ways to route through from your dimension to your fact table (if a snowflake model or many-to-many modelling is being used for example). Or you could potentially have the same metric held in different fact tables for example. Whenever this is the case, your figures have to match, or there has to be a very clear answer as to why they differ. For instance, some metrics could be running totals, and some could be point in time snapshots. If this is the case the metrics themselves and the fact tables that contain them should be named clearly to avoid confusion. Some awesome end-user documentation never goes amiss either.
Your data warehouse has to be simple to query. This starts with a clear dimensional model with clearly defined linkage between tables (and ideally not too many joins). Your dimensional model should be a star or snowflake schema with the fact table at the centre of the model. Not a hybrid relational model, or anything that when plotted on a diagram can’t be nicely laid out or interpreted.
You may have complex source systems. You may have multiple sources coming together in one place. These aren’t excuses for having a complex data warehouse schema however. Your ETL/ELT is the brains of your data warehouse, so do all of your transforms here. Your ETL may be complex, but this is where you want the complexity, hidden away in the back. You don’t want to leave the complexity for the end user. Your data warehouse shouldn’t be telling the end user “Here you go. I’ve thrown all of this stuff together. Now you make sense of it.”. If your warehouse isn’t making the end user’s life easier then it can’t be considered a success.
Data marts are a great way of isolating subsets of functionality. Rather than one big warehouse schema it can be subdivided where there are natural boundaries, such as finance vs operations. This allows the end user to focus on what it is they need to report on rather than overwhelming them with a huge model.
You’ll also want to de-normalise your dimensions wherever possible. It’s standard dimensional modelling stuff, but it’s when things start to depart from the standard that it starts to get messy.
Your end users must be able to understand the dimensional model that you’ve built. The acid test of a data warehouse is whether a (capable) end user can write a query or generate a report from your model without assistance or supporting documentation. For the model to be intuitive it needs consistent and friendly naming of tables, views, and columns. By friendly I mean that the user should be able to understand the meaning of the column from the column name. Not having to look up against supporting documentation to interpret it.
The fact tables should hold data ready to be consumed by the end user. They shouldn’t have to perform any actions on top to get value out of the data (other than simple aggregation), and the model should not be complex. More joins = more complex. More than one way to get from a dimension to a fact table = more complex.
Your data warehouse queries should be responsive. This one is a bit subjective because data warehouses can contain a lot of data. If your end users are going back to the relational source system to run reports because it’s faster then you’ve got a problem though. Hopefully if your model is simple and intuitive then you’ve gone a long way to answering this question. You can boost your performance by reducing the number of joins needed to get from dimension to fact by de-normalising your dimensions, building data marts for specific reporting requirements, clever partitioning strategies, and by clever indexing. Columnstore indexes in particular can be a fantastic addition to a dimensional model. You can always add aggregate fact tables or an Analysis Services database on top of your dimensional model to improve query performance as well.