Basics of Data Warehousing
This posting is aimed at providing facts about Data Warehousing and Data Mart concepts. I have come across different setups and different approaches to designing and implementing a data warehouse. Apart from what theory teaches you, my aim is to actually put in here practical facts that are in use in different setups.
Note, I am a proponent of the Star Schema concept and my experience has been on Oracle Databases only. So my views may be biased in this regards, but I will ensure that I put in the facts as they exist.
What is Data Warehouse?
It is a warehouse of an organization's electronically stored Data. Data Warehouse are designed to facilitate reporting and analysis. Reporting generated from Data Warehouse databases are used by management for decision making. It is a vital tool for identifying business trends that aid management in taking proper decisions.Data warehouse designs must be well suited for making Queries, this would sometimes result in de-normalized structures to facilitate the business need.
Data Warehouses form a very crucial component of the Decision Support System (DSS) that the management relies on to take daily decisions and make choices that actually impacts the course of the organization.
Data warehouse is important! This is not just a back office system that is secondary to an Online System. Online systems are import for daily business process and so is the Data Warehouse that is relied on to take crucial decisions that can even severly impact business.Different Online systems can form
the source of data input for a single Data warehousing. I haven't come
across systems where Data warehouse is a source of data input to
another Data Warehouse systems, but there is no reason that it cannot
be. Technically, data warehouse should be so designed as to process a
large set of data at one time. This contradicts with OLTP systems where
the key focus is to allow best response time for online transactions.
Yes
most OLTP systems do support some form of reporting also, but this
should be limited to online and immediate status reports that are
mission critical for business to run (e.g. Invoices, delivery orders,
customer credit notes, purchase orders, Goods receipt voucher etc.).
All batch and analysis reports should be moved to Data warehouse
instance. Anything that is not an online report is a good candidate for
Data warehouse instance reports. The rate at which the Data warehouse
is updated from Online system is business dependent. From my
experience, most systems have a night window to push data from online
to Data Warehouse, and certain type of information is updated more
frequently (e.g.
GL balances, daily sales etc.).
From my experience, Data
Warehouse designs can vary based on business requirements and needs,
approach taken by the technical team and even the availability of time
and
resources. For instance, by implementing normalization on a set of data
you will most often arrive at the same data holding structures. But
there are more than one ways of designing a Data warehouse given the
same set of data. The best approach depends on the one that best suits
the present business needs and is accommodate future growth.
I have also come across designs where technical team decided to replicate the OLTP table structure in another schema and call it a Data Warehouse. Well this looks good when the data is just building. After few months or years of data in place, such designs will mostly be regretted.
It is important to undertand that the architecture of an Online and Data warehouse databases are built on different approaches and they are required to serve different purpose most important for their existance.
What is Data Mart?
A Data Mart is a subset of a Data Warehouse. Most organizations do start designing a data mart to attend to immediate needs. To keep it simple, consider Data Mart as a data reserve that satisfies certain aspect of business or just one application (or a process). Data Warehouse is a super set that engulfs all such mini Data marts to form one big reservoir of information.
(below pic is courtesy of Oracle Data Warehouse Manual)

What application layers are involved in Data Warehouse?
Lets take a look at what layers of applications are present in a Data Warehouse design. This will give an idea about what we are dealing with. Note, this is just a basic representation of how most standard data warehouses are implemented. There can be deviations from what I discuss here, based on the business need analysis and the decisions taken by IT team thereafter.
Data Sources : Data warehouses are not transaction capture system and cannot exist without proper sources in place to push data to it. Data Warehouse sources are primarily the Online systems that are capturing the daily transactions.
ETL:
This stands for Extraction, Transformation and Loading.
This
layer basically "scrubs, massages and polishes" the data and then loads
it to the data warehouse schema. No this does not mean changing the
data itself, it
simply transforms the data to make it more appropriate for reporting
purpose and load it into data Warehouse
structures.
In simple technical terms, this could be a simple PL/SQL program unit running periodically to pull data from sources and push it to Data warehouse tables.
Staging Area: is a space to carry out ETL processing before pushing data to Data Warehouse. This area receives and stores data from online sources. This data is then worked-on and transformed to fit into the Data Warehouse. Note data from multiple sources can be pushed to a single table in Data Warehouse for appropriate reporting.
Staging area is not mandatory, but is very useful when multiple sources exist. I have come across designs where the ETL layer resides in the Source system itself and data is compiled and pushed directly to the Data Warehouse, without any intermediate usage of a staging area.
Data Warehouse Database: This is the Database that forms the warehouse for all incoming data. All reporting is performed in this database. In general, there will be one schema that will be used to hold data from all different sources. Yes, multiple schemas can also be used if there is a need felt to logically separate data.
Metadata: is
data
about the data present in the warehouse database. It gives context,
relationship and meaning to the data that can then be used for
reporting and analysis. For instance, BI Tools will create metadata
information to
present proper labels and relationship to end-users for easy
understanding and reporting purpose.
Some popular OLAP tools are SAP's Business Objects and IBM's Cognos.
Data Mining: This falls under Business Intelligence section too, it is the act of identifying patterns in the gathered data. As the term says, you actually dig into data and try out various permutations to identify an emerging pattern that could be useful to make an impacting decision. For instance, a pattern could emerge that states that a specific product or brand sales more on the internet rather than on the market shelf in a certain geographical location and could result in excellent tax savings!
What info is required to create a Data Warehouse?
It is vital that the kind of reporting that user wants to extract and analyze be identified so that a Data Warehouse architect starts thinking in those terms. You need to orient the design towards the needs of the end-users. This vital input is sometimes ignored and can result in changes to the design, sometimes even restarting from scratch.
Besides this, an Architect should also study the source system to identify how data is stored. This is vital also for coming up with a more clear scope for your design.
So to sum up:-
- Input from users/management on the kind of reports they want.
- Design of the source system to evaluate how data will be extracted and placed.
- Identifying the lowest level of data granularity that will be reported on (I will discuss more on this in a subsequent posting).
- Architect who studies the business requirements, chalk-out and implement the warehouse design.
- DBA who can physically design and implement the architecture on proper hardware.
- Developers who can code the ETL layer based on inputs from the Architect.
- Reports Developers who can custom design the initial set of required reports.
I
will also elaborate more on how to design from scratch by
putting
forward an example in a subsequent posting on my website here. This
will clarify all doubts
that you may have as regards this note.
So what is the difference between Online (OLTP) and Data Warehouse Systems?
This may be a bit theory but will give you some quick
understanding on how the two differ.
| Property | Online Systems | Data Warehouse Systems |
| Purpose | Capture Data and ensure business continuity. | Analyze data and aid in decision making. |
| Object Format | Object Oriented and Tabular | Subject Oriented and Dimensional. |
| Structure | Normalized
structure, good for DML. The objects are Relational and/or Object-Oriented |
Normalized and De-normalize structure, good for
Queries. Relational objects created in Star schema, snowflake schema or fact constellation schema designs. Fact-dimension relationship implemented. Data is Organized but can be de-normalized based on the reporting requirement. |
| Focus | Focus is on speed of transaction. DML must be fast, aim is to maximize performance for DML. | Focus is on speed of retrieval (reading-data) and not transaction. |
| Data Granularity | Data is retained at minute level | Data could be either at the minute level or could be aggregating. It depends on what granularity level is being pushed to the Data Warehouse. |
| SQL Joins | Queries utilize joins across multiple tables to fetch required data. Few to many joins possible. | Queries aimed at making less number of joins to fetch data. Minimizing no. of tables in queries contribute to performance. Limited to few joins for faster querying. |
| Index usage | Limited indexes for better DML. | More indexes for better query. |
| Adhoc Queries | Adhoc queries are no-no in OLTP, but is unavoidable sometimes. | Is designed for adhoc queries and mixing and matching data for different analysis. |
| Old Transactions | OLTP systems may purge old transactions depending on the minimum retention of data as per audit needs (and government policies). | Old Data resides in these systems for all kind of permutation and combination reporting, as well as data mining. |
Next recommended Note: Data Warehouse Architecture
Best viewed in medium text size. Please refresh (F5) to view the latest information.
This page was created on 05-Jun-09. Last updated on 05-Jun-09.
please forward all queries to amar@amar-padhi.com