Important factors in designing a Data Warehouse
Mentioned here are some important factors that will determine how your Data warehouse design will shape up to support the required business needs. These factors will influence the final outcome and should be seriously thought over in the design phase.
Identifying Reports and getting comfortable with the source
One of the most vital requirement before embarking on a warehouse design is to identify what the users and management are expecting from it. This scope has to be documented. It is imperative that users will not be able to provide all such information off-hand, but they will be able to put forward what they always wanted but could never achieve in the source system.Apart from getting this input from users, the concerned Architect designing the system should also study the source system to identify what data could be used for analysis purpose. Such kind of details will reveal itself only if the Architect has been closely involved with the source system for some time.
So my suggestion is that Architects spend some time (weeks or months) to study the source system and just stick around to see how data is flowing and how users are actually interacting with the system. The more time you are involved with the source system, the more your domain expertise increases. This knowledge gain is important to build your comfort level and that helps in making a concrete design.
Some of the worst designs I have come across have been done by folks who had no knowledge of what the online system is actually meant for or how the business actually runs. Even worse, these folks were not able to map the business data into proper dimensions and actually landed up cooking another OLTP-style data warehouse.
Note, by 'worse' I mean designs that could not satisfy the needs, performed so badly that analysis was hard to do, or the aggregated data was just not reliable.
Bottom-up approach
When designing Relational database management system (RDBMS), the Normalization form generally requires you to design in the Top-Down approach, wherein, you separate entities into appropriate Master-Detail relationships. Your hierarchy goes on expanding as you further build relations between tables.In case of Data Warehouse design, you are actually looking at the transactions to establish what kind of reporting is possible. So you already have an existing source system that has to be studied for designing the warehouse. When doing so, focus on the transaction data to identify what facts, measures and metrics are available in the Source system.
In other words, start from the bottom of the hierarchical representation present in the source Online system.The transaction computation values are the one that will reside in FACT Tables and the means to query and summarize this information will reside in the DIMENSION tables. So the bottom-up approach would be to understand what transactions take place on the online system and then arrive at what entities are required to access these.
For instance, in a billing system, identify the transactions that are captured by the system. The system could be storing - invoices, credit notes, credit memos, debit notes, daily sales, goods returned, purchase, purchase returns etc. etc. This forms the Fact.
Next thing to identify is how this data can be queried. The existing masters should be analysed and de-normalized to form appropriate dimensions.
Once you have completed reading this note, refer "Data Warehouse design by example" to understand more on this.
Lowest level of granularity
Technically, I normally recommend identifying the lowest granularity of data before a schema is designed to support a business process. This basically means, identifying what the lowest reporting level of the data is that the Data Warehouse system will be designed to hold.In other words, what is the lowest level of data aggregation. End-users typically want to perform analysis on aggregated data and not look at individual transactions. But there may be a need sometimes to pass on raw online data to warehouse for a very low level reporting.
For instance, the billing system stores information of all items that were processed for an invoice. If management decides to create a data warehouse that will provide information only on customer and money transacted, then the granularity level of the data to be ported to the warehouse would be the invoice. So items transacted for an invoice will not be pushed to the warehouse as no analysis is done on this. This would also mean that in-future if there is a need-felt to actually do analysis at Invoice line level also, the existing design will not support it. Now imagine changing the design to support this need!
You would have understood by now that this factor is very crucial for the very existance of the data warehouse. To explain this further, let me give another example. I have four basic tables here as below, used to store billing information.
CUSTOMER_MASTER (Stores customers)
CUST_NO (PK)
CUST_NAME
ITEM_MASTER (stores items)
ITEM_NO (PK)
ITEM_NAME
INVOICE_HEADER (all invoices)
INVOICE_NO (PK)
INVOICE_DATE
CUSTOMER_NO (FK to CUSTOMER_MASTER)
INVOICE_LINES (lines for every invoice)
INVOICE_NO(FK to INVOICE_HEADER)
LINE_NO
ITEM_NO (FK to ITEM_MASTER)
QTY_SOLD
LINE_TOTAL
In the above example, the lowest level of granularity for transaction is the LINE_NO that reside in the INVOICE_LINES tables. ITEM_NO is the next level that can be aggregated on. In my example here, duplicate items can exist in an invoice. So if my granularity level is LINE_NO, then duplicate items will be reported as is in the Data Warehouse. But if my granularity level is ITEM_NO, then I would show only one line in Data Warehouse for multiple duplicate lines existing in the same invoice.
The next level would be the INVOICE_NO that is present in the INVOICE_HEADER table, and so on. I could create a data aggregation relationship like the below one to identify the different levels.
Invoice Lines (LINE_NO) -> Invoice items (ITEM_NO) -> Invoice header (INVOICE_NO) -> Invoice date (INVOICE_DATE) -> Billed Customer (CUST_NO)
The above relation shows that the data at the lower level can be aggregated upon to get the next level. So summing up the line total at Invoice item level will give me the value of the invoice. Please note, all I am trying to do here is identify what entity level when aggregated gives the next higher entity level.
Now what level does satisfy the business? Let say the management/user has finalized that they are just not interested in the Item level information and will never be. In that case your lowest level of granularity becomes the Invoice header(INVOICE_NO) as that is the next higher aggregated level above item details that can be reported on.
Now if the users decided to also do analysis on Items being sold, the lowest granularity level becomes the invoice items.
Now if the users also wanted to do analysis on how many duplicate items exist in an invoice, the lowest granularity level becomes the invoice lines. This would mean that you are porting the OLTP data in raw form to data warehouse. This is a rare case in real practice and one must debate the rationality behind such requirements.
There is every possibility that the lowest granularity of data you choose for extracting is not the lowest unit in the source system. The most optimal granularity will separate the useful data from the no-use data. The useful data is pushed to the warehouse.
The lower the granularity level, more the number of dimensions. Choose wisely.
Dealing with changing Master data in Production
Ideally, Warehouse data should not be updated and old records should be kept for reference purpose. When an important attribute of a master record is modified in Production, you have a choice to either update the existing dimension record, or create one more dimension record that reflects the change. Creating one more dimension record is recommended to allow tracing back and generating reports on the older snapshot of the data. This should be done only when the Key reporting attributes are modified.For instance, if a customer exists in production with CUST_NO as 30 and CUST_NAME as "My Land Travel Agency". The data warehouse would have the same record existing in one of the dimension. Now if the Customer Name is changed, a new record will be created in Customer dimension.
Before Change
Online Customer Master
CUST_NO(PK) CUST_NAME
30 My Land Travel Agency
Data Warehouse Customer Dimension
CUST_ID(PK) CUST_NO CUST_NAME
1001 30 My Land Travel Agency
After Change
Online Customer Master
CUST_NO(PK) CUST_NAME
30 Your Land Travel Agency
Data Warehouse Customer Dimension
CUST_ID(PK) CUST_NO CUST_NAME
1001 30 My Land Travel Agency
1002 30 Your Land Travel Agency
In the above example, surrogate key (CUST_ID) plays the important role of differentiating the record and linking it with the Fact data.
Not all data needs to be uniquely identified in Data Warehouse. If there is a change in Customer P.O. Box number, there may not be a need to trace back and you can safely update the existing Dimension record.
Identify the important attributes of all the dimensions and design and document the ETL layer to take care of such changes.
Dealing with changing transaction data in Production
Contradictory to the Master data, the transaction data stored in the Fact table is handled in a different way.The ETL layers load increases if you are bringing in ever-changing transaction records to Data Warehouse. My recommendation is to import only transactions in "Completed" form to Data Warehouse. For Instance, an invoice that is in Open state can be modified later to change the value. If this invoice is ported to Data Warehouse, the ETL layer will need to track any updates to the invoice and accordingly perform the same updates in FACT table also. The other option is to port only "completed" invoices to Data Warehouse so that reporting takes place on achieved transactions and not transactions-in-process.
Apart from few exceptions that I have come across, my aim has been to port only completed transactions and this has always worked good for both ETL as well as providing accurate and consistent data for reporting.
If you are porting in-complete transactions to Data Warehouse, cranky users may take advantage of this fact and extrapolate certain figures in Online system to control the output of certain reports. Such extrapolation can then be reverted post-reporting as the transaction is still open. Talk of system flaws...
Incremental or Complete Refresh
Ofcourse, data warehouse has to be updated in incremental mode. That means, only changed and new information hits the Data Warehouse during the ETL process. A design that is build on complete refresh will go down as a bad design once the Data starts growing in volume.Imagine truncating a Fact table and re-populating all transactions from Online system every night. This is OK when the Online system transactions are in few thousands. What happens when online system grows to millions of records? What happens if the online system is purged after few years? I don't need to explain much here.
Incremental updates are performed based-on a column in online transaction table that keeps track of the changes. Mostly this is the date or timestamp field. If you have designed the online system with proper audit in place, then you will have the audit columns pegged against each table. For instance in oracle e-business suite, a standard rule is followed to have the following mandatory audit columns put in each table.
CREATION_DATE
CREATED_BY
LAST_UPDATE_DATE
LAST_UPDATED_BY
Incremental updates should be driven on the column that provides the last update date of the record. This information will allow the ETL to decide if the record has already been pushed to Data warehouse.
I have come across designs that make use of the Primary key in the Source transaction table to perform incremental updates (Transaction No., Invoice No., batch no. etc). Though this is also another method of doing it, I would prefer using the last updated timestamp column as a standard. Reason - change tracking where ever required becomes easier to identify and apply on warehouse when it is based on change timestamp attribute. Besides timestamp will be a common field in all tables, instead of having different keys for different tables.
When to Normalize and When not to
De-normalization is an accepted concept in Data Warehouse and is recommended where ever it can improve reporting and analysis.You will have to take a call when to normalize and when to allow duplication. Dimensions can exist with rows of duplicate business key. As the above mentioned Customer dimension example depicts, multiple records for the same customer are present in the dimension. The records are not exactly the same and do differ in certain attribute.
Facts must exists in proper relational form. Numeric values put in Fact table should not be duplicated. Avoid duplication in Fact to remove design complexities and future complications.
Check the below example to get an idea how Online masters can be merged to form a single dimension in Data Warehouse.
Four Masters are present in Online system - Brand, Category, Sub-Category and Items. Brand and Category are independent masters. Sub-Category master references Category master. Item master references all the other three masters.So a newly created item will have Brand, Category and Sub-Category as its attributes in Item Master.

Now to create dimensions in data warehouse - one option would be to create highly normalized dimensions that do not have any duplicate data - Brand dimension, Category dimension, sub-category dimension and Item dimension. This is a good enough design, but certain reporting will be tedious. For instance if a report is provided with the brand name and should display all items belonging to it, the Brand and the Item dimension will have to be combined to get the list of items, which then can be connected with fact for needful data. It is also to be noted that the more the dimensions, the more the foreign keys that need to be placed in the fact table. This concept of having highly normalized dimensions results in snowflake schema design (dimensions are linked to dimensions). For huge volumes of data, this design may not be good performing, but sometimes is the only option available.
The other alternative is to de-normalize the relational structure to create lesser dimensions that are self-contained. For instance - in the above example, a single Items dimensions can be created with all the other details acting as attributes to the items.
In next posting, I will create a Data warehouse from scratch, based an a sample OLTP module.
Previous Note link: Data Warehouse Architecture
Next recommended Note: Data Warehouse design by example
Best viewed in medium text size. Please refresh (F5) to view the latest information.
This page was created on 13-Jun-09. Last updated on 13-Jun-09.
please forward all queries to amar@amar-padhi.com