amar on web

Data Warehouse - Design by example

If you have gone through my previous postings, you would by now have a fair idea of what designing a data warehouse is all about. For better understanding, lets take a simple Online system example and create a Data mart out of it. For this exercise, I will elaborate on how data in an existing Sales Application can be transformed into Data Mart.

Scope of the Data Warehouse

As discussed in previous posting, the users should be engaged to identify all reports and analysis requirement. For this example, lets assume that the below information has been collected from users regarding what they would like to extract from the data mart at first go -


Apart from this, the architect should also come out with various analysis that can possibly be performed (this will depend on the domain knowledge).

Analyzing the existing online system

The next step is to study and make sense out of the existing online system. The user provided scope should also be vetted to identify if information is available in source system that can satisfy the needs. Any part of the scope that cannot be dealt with should be taken up with the users for more details and clarification.

Functional Process flow
Technical Details

The complete tables setup is appropriately depicted by the below Entity-Relationship diagram.

An ER-Diagram is very important in understanding the flow of data in existing system and how the new Dimensional model diagram should be created. If you are in a position where the source system is not well documented, I would recommend that you highlight to the Project team the importance of spending some time on doing so. Creating and maintaining an ER-Diagram will not only help the Data Mart project, but will also be handy for people supporting the source system.

Will incomplete Transactions be passed to Data Warehouse?

After studying the system, it is identified that


As best practices, this information will not be passed to the Data Warehouse. Only completed and closed transactions will hit the data warehouse system. So two requirements mentioned by the users in scope will be scrapped and a report will be provided in Online system for this.

Note, it is very much possible to pass open transactions to data warehouse but this results in additional effort in the ETL layer and besides there are issues encountered in reporting later on (refer previous posting on Factors affecting the Data warehouse design). I have come across scenarios where Management wanted to see incomplete or subset of transactions in Data warehouse. Transactions that are logically completed can be ported even if the full cycle of the process is not completed. 

For this example, we will assume that the management is interested in having only transactions that are completely closed and the full process cycle has been performed.

Identifying the granularity level of the data for designing datawarehouse

The next applicable step is to identify what depth of data will be ported, in other words, the granularity level of the data. The process flow of the Sales Order (SO) Module gives us the below relation hierarchy for the transactions.


SO lines (SO_LINE_NO) -> SO items (ITEM_CODE) 
-> SO Header (SO_NO) -> Invoices (FIN_DOC_NO)


Based on the user reporting requirements, it is identified that the lowest reporting level will be SO Items. Apart from this, a study of the system also reveals that individual Line level information is not required for any kind of reporting and only item information in an SO will suffice.

So our present design will take all data from the "So items" level and upwards.


The next step is to also identify the Time dimension requirements. Time dimension is a standard dimension in almost all designs and the lowest level has to be finalized to ensure proper data analysis. a time dimension can allow data to be reported and rolled up in the following order for instance.

Minute -> Hour -> Day -> Week 
|
|---> Fortnight
|
|---> Month -> Quarter -> half-Year -> Year -> Bi-Year

For this design, we will assume that the managment is OK with reports on a daily basis and hourly or exact timings are not required.

Designing the Dimensions and Facts

It is eminent from the user request and the system study that reporting is primarily focused on the quantity and value of transactions. The time dimension will have the lowest granularity level as a day (no hourly reporting is required).

Following approach is followed to design the dimension tables. The idea here is to aggregate data from different relational Master tables that can relate to each other into a single dimension. This call has to be taken based on the design of the online system and the reporting needs.


The final design and dimensional model will look as follows.


Verifying the Design

The best way to verify the design is to check if the scope is satisfied.


This concludes the example. I have tried to provide a very basic sample data above to demonstrate how a Relational System can be transformed to a Dimensional Model. Please refer best Practices section (subsequent article) to get more information on what rule-of-thumb should be adhered to.

ETL Development

Once the structure is in place, the final activity from the Data Warehouse perspective is to develop the ETL. This layer should pickup data from the Online system and then merge, group or aggregate before pushing it to Data Mart. This layer can be designed in any language depending on the site preference and need.

The key point here is to identify the unique attribute in the online system tables that gives an indication of record change. For instance, a column like LAST_UPDATE_DATE will inform the ETL layer of if any updates have happened since the last run. I recommend making use of DATE or TIMESTAMP columns for identifying and pushing data in incremental form. You will have to make similar choices based on the existing design of the online system.

The other key point to identify here is what attributes in the Dimensions when changed will result in a new record. For instance, If the Customer Name is changed in online system for some reason, should the existing Data Mart record be updated with the same? Or should a new record be created in the Customer Dimension.

As an exercise, all such attributes should be identified and documented to avoid any future conflicts. Refer "Dealing with changing Master Data in Production" at http://www.amar-padhi.com/oradb_dw_factors_for_designing.html.

As part of this exercise, you will also need to identify how many past years of data should be pushed to Data Warehouse. Best bet would be to push everything so that Data warehouse also becomes a source of reference for data that has been archived in production. This would also mean that it is impertinant that you get in header information in Fact table that will required as source of reference to refer back in history.

Once all pieces are in place, do a full trial run of the ETL. Simulate different scenarios in online system and evaluate how the ETL behaves and if the data is properly loaded.

Design Reports

The Data Warehouse is now ready. Hand it over to the Reporting experts to make use of analytical tools to design the required user reports.


Previous Note link: Important factors in designing a Data Warehouse
Next recommended Note: Data Warehousing best practices


Best viewed in medium text size. Please refresh (F5) to view the latest information.
This page was created on 07-Sep-09. Last updated on 07-Sep-09.
please forward all queries to amar@amar-padhi.com