Data Warehouse - Design by example
If you have gone through my previous postings, you would by now have a fair idea of what designing a data warehouse is all about. For better understanding, lets take a simple Online system example and create a Data mart out of it. For this exercise, I will elaborate on how data in an existing Sales Application can be transformed into Data Mart.
Scope of the Data Warehouse
As discussed in previous posting, the users should be engaged to identify all reports and analysis requirement. For this example, lets assume that the below information has been collected from users regarding what they would like to extract from the data mart at first go -- Allow analysis of daily sales.
- Customer-wise analysis of monthly purchases
- item-wise reporting of daily, weekly and monthly analyis.
- Frequency of sales request from customers.
- List of orders that are still open.
- List of invoices not paid by the customer yet.
- Top 10 customers by value of sale
- Top 10 customers by quantity sold
- Top 10 items by value of sale
- Top 10 items by quantity sold
- Item-wise quantity returned in a period.
- Top 10 invoices by value in a period.
- Which day of the week has high sale for an item.
- Financial year wise sales total for customers.
- Invoice print report
- Sales Order-wise report extraction should be possible from Data mart.
Apart from this, the architect should also come out with various analysis that can possibly be performed (this will depend on the domain knowledge).
Analyzing the existing online system
The next step is to study and make sense out of the existing online system. The user provided scope should also be vetted to identify if information is available in source system that can satisfy the needs. Any part of the scope that cannot be dealt with should be taken up with the users for more details and clarification.Functional Process flow
- Customers are captured in customer master.
- Transaction types determine the document type (invoice, credit note) and are used in sales order entry. These are defined in Transaction Master.
- All items are defined in items master.
- User captures a new sales order in the sales Order entry screen.
- Based on availability of stock, items are shipped to customer from the Sales Order transaction screen. Multiple sales Order transactions can be captured against one sales order, till all the specified order quantity is shipped to customer.
- Each sales order transaction will generate a unique invoice
that will be available in customer billing screen.
- The
sales Order module has three masters - Items (ITEMS_MST), customer
(CUSTOMER_MST) and transaction type (TRX_TYPE_MST). Customer and
Transaction Type masters are referred by the Sales Order entry table.
The Item Master is referred by the Sales Order entry lines table.
- Sales
order entry header (SO_HEADER) and sales order entry lines (SO_LINES)
are the tables for capturing a new sales order and tracking its
progress. Every time quantity is delivered to customer from a sales
order, the transaction is recorded in the Sales Order transaction
header (SO_TRX_HEADER) and lines table (SO_TRX_LINES).
- For every sales order transaction, a financial document will be generated and stored in the customer billing transaction table (CUST_BILLING).
The complete tables setup is appropriately depicted by the below Entity-Relationship diagram.

An ER-Diagram is very important in understanding the flow of data in
existing system and how the new Dimensional model diagram should be
created. If you are in a position where the source system is not well
documented, I would recommend that you highlight to the Project team
the importance of spending some time on doing so. Creating and
maintaining an ER-Diagram will not only help the Data Mart project, but
will also be handy for people supporting the source system.
Will incomplete Transactions be passed to Data Warehouse?
After studying the system, it is identified that
- The Sales Order remain in incomplete state till all
quantity has been delivered to the customer.
- The invoices remain in incomplete state till money has been
received from the customer.
As
best practices, this information will not be passed to the Data
Warehouse. Only completed and closed transactions will hit the data
warehouse system. So two requirements mentioned by the users in scope
will be scrapped and a report will be provided in Online system for
this.
- List of orders that are still open.
- List of invoices not paid by the customer yet.
Note, it is very much possible to pass open transactions to data warehouse but this results in additional effort in the ETL layer and besides there are issues encountered in reporting later on (refer previous posting on Factors affecting the Data warehouse design). I have come across scenarios where Management wanted to see incomplete or subset of transactions in Data warehouse. Transactions that are logically completed can be ported even if the full cycle of the process is not completed.
For this example, we will assume that the management is interested in having only transactions that are completely closed and the full process cycle has been performed.Identifying the granularity level of the data for designing datawarehouse
The next applicable step is to identify what depth of data will be ported, in other words, the granularity level of the data. The process flow of the Sales Order (SO) Module gives us the below relation hierarchy for the transactions.
SO lines (SO_LINE_NO) -> SO items (ITEM_CODE)
-> SO Header (SO_NO) -> Invoices (FIN_DOC_NO)
Based
on the user reporting requirements, it is identified that the lowest
reporting level will be SO Items. Apart from this, a study of the
system also reveals that individual Line level information is not
required for any kind of reporting and only item information in an SO
will suffice.
So our present design will take all data from the "So items" level and upwards.
The next step is to also identify the Time dimension requirements. Time dimension is a standard dimension in almost all designs and the lowest level has to be finalized to ensure proper data analysis. a time dimension can allow data to be reported and rolled up in the following order for instance.
Minute -> Hour -> Day -> Week
|
|---> Fortnight
|
|---> Month -> Quarter -> half-Year -> Year -> Bi-Year
For this design, we will assume that the managment is OK with reports on a daily basis and hourly or exact timings are not required.
Designing the Dimensions and Facts
It
is eminent from the user request and the system study that reporting is
primarily focused on the quantity and value of transactions. The time
dimension will have the lowest granularity level as a day (no hourly
reporting is required).
Following approach is followed to
design the dimension tables. The idea here is to aggregate data from
different relational Master tables that can relate to each other into a
single dimension. This call has to be taken based on the design of the
online system and the reporting needs.
- One item dimension is created with all item attributes.
This is an independent entity.
- One customer dimension is created with all customer attributes. This is an independent entity.
- One
time dimension is created with the day, week, month, quarter and year
information. Time dimension should always be present for reporting.
- Following information is added in Fact table
- Item-wise quantity for every order that is completed
- Item-wise total value for very order that is completed
- It is noted that the Data Mart should have information
about the Sales Order so that users can extract this information as and
when required. This would reduce the aggregation level to order level.
This would also introduce Degenerated
dimension data in the Fact table, this basically means
that header columns that can be queried on are placed in Fact table for
satisfying business requirement.
Though this is not a clean design, I have come across too many of such designs that are working well and satisfying the business requirements. For instance, in this design SO_NO, TRX_NO, FIN_DOC_NO etc is introduced to allow users to query on the information.
- Item-wise quantity for every order that is completed
The final design and dimensional model will look as follows.

Verifying the Design
The best way to verify the design is to check if the scope is
satisfied.
- Allow analysis of daily sales.
OK. Daily sales can be reported as day-wise data is available. This can be driven from any of the dimensions.
- Customer-wise analysis of monthly, quarterly and yearly
purchases.
OK. Customer dimension can be used to get the total values.
- Item-wise reporting of daily, weekly, monthly, quarterly
and yearly sales.
OK. Fact can be queried using the Time and Items dimensions.
- Frequency of sales request from customers.
OK. Can be extracted using the Time and Customer dimensions.
- List of orders that are still open.
N/A. will be handled in Online system as this requires latest data.
- List of invoices not paid by the customer yet.
N/A. will be handled in Online system as this requires latest data.
- Top 10 customers by value of sale
OK. Can be extracted using Customer and optionally time dimensions.
- Top 10 customers by quantity sold
OK. Can be extracted using Customer and optionally time dimensions.
- Top 10 items by value of sale
OK. Can be extracted using Items and optionally time dimensions.
- Top 10 items by quantity sold
OK. Can be extracted using Items and optionally time dimensions.
- Item-wise quantity returned in a period.
OK. Can be extracted using Items and Period dimensions.
- Top 10 invoices by value in a period.
Can be extracted using Orders and Period dimensions.
- Which day of the week has high sale for an item.
Can be extracted using Period and items dimensions.
- Financial year wise sales total for customers.
Can be extracted using the Period and customer dimensions.
- Invoice Print report
N/A. will be handled in Online system. This is an online report and is required for business continuity, will be put in online system.
- Sales Order-wise report extraction should be possible from
Data mart
The data mart Fact table has degenerated dimension data from Sales Order header table to allow grouping of data based on order and invoice information. This requirement is also achieved.
This
concludes the example. I have tried to provide a very basic sample data
above to demonstrate how a Relational System can be transformed to a
Dimensional Model. Please refer best Practices section (subsequent
article) to get more information on what rule-of-thumb should be
adhered to.
ETL Development
Once the structure is in place, the final activity from the Data Warehouse perspective is to develop the ETL. This layer should pickup data from the Online system and then merge, group or aggregate before pushing it to Data Mart. This layer can be designed in any language depending on the site preference and need.The key point here is to identify the unique attribute in the online system tables that gives an indication of record change. For instance, a column like LAST_UPDATE_DATE will inform the ETL layer of if any updates have happened since the last run. I recommend making use of DATE or TIMESTAMP columns for identifying and pushing data in incremental form. You will have to make similar choices based on the existing design of the online system.
The other key point to identify here is what attributes in the Dimensions when changed will result in a new record. For instance, If the Customer Name is changed in online system for some reason, should the existing Data Mart record be updated with the same? Or should a new record be created in the Customer Dimension.
As an exercise, all such attributes should be identified and documented to avoid any future conflicts. Refer "Dealing with changing Master Data in Production" at http://www.amar-padhi.com/oradb_dw_factors_for_designing.html.
As part of this exercise, you will also need to identify how many past years of data should be pushed to Data Warehouse. Best bet would be to push everything so that Data warehouse also becomes a source of reference for data that has been archived in production. This would also mean that it is impertinant that you get in header information in Fact table that will required as source of reference to refer back in history.
Once all pieces are in place, do a full trial run of the ETL. Simulate different scenarios in online system and evaluate how the ETL behaves and if the data is properly loaded.
Design Reports
The Data Warehouse is now ready. Hand it over to the Reporting experts to make use of analytical tools to design the required user reports.
Previous Note link: Important factors
in designing a Data Warehouse
Next recommended Note: Data
Warehousing best practices
Best viewed in medium text size. Please refresh (F5) to view the latest information.
This page was created on 07-Sep-09. Last updated on 07-Sep-09.
please forward all queries to amar@amar-padhi.com