WHAT-IS . NET
Information and answers to all your common and special questions.
Copyright ©2009 What-is.Net  All rights reserved.
Last Updated: Sep 2009
What is Data Warehousing?
Source System Identification:
Source System Identification: In order to build the data warehouse, the appropriate data must be located. Typically, this  will involve both the current OLTP (On-Line Transaction Processing) system where the "day-to-day" information about the  business resides, and historical data for prior periods, which may be contained in some form of "legacy" system. Often these  legacy systems are not relational databases, so much effort is required to extract the appropriate data.

Data Warehouse Design and Creation:
This describes the process of designing the warehouse, with care taken to ensure that the design supports the types of  queries the warehouse will be used for. This is an involved effort that requires both an understanding of the database schema  to be created, and a great deal of interaction with the user community. The design is often an iterative process and it must  be modified a number of times before the model can be stabilized. Great care must be taken at this stage, because once the  model is populated with large amounts of data, some of which may be very difficult to recreate, the model can not easily be  changed.

Data Acquisition:
This is the process of moving company data from the source systems into the warehouse. It is often the most time-consuming  and costly effort in the data warehousing project, and is performed with software products known as ETL  (Extract/Transform/Load) tools. There are currently over 50 ETL tools on the market. The data acquisition phase can cost  millions of dollars and take months or even years to complete. Data acquisition is then an ongoing, scheduled process, which  is executed to keep the warehouse current to a pre-determined period in time, (i.e. the warehouse is refreshed monthly).

Changed Data Capture:
The periodic update of the warehouse from the transactional system(s) is complicated by the difficulty of identifying which  records in the source have changed since the last update. This effort is referred to as "changed data capture". Changed data  capture is a field of endeavor in itself, and many products are on the market to address it. Some of the technologies that  are used in this area are Replication servers, Publish/Subscribe, Triggers and Stored Procedures, and Database Log Analysis.

Data Cleansing:
This is typically performed in conjunction with data acquisition (it can be part of the "T" in "ETL"). A data warehouse that  contains incorrect data is not only useless, but also very dangerous. The whole idea behind a data warehouse is to enable  decision-making. If a high level decision is made based on incorrect data in the warehouse, the company could suffer severe  consequences, or even complete failure. Data cleansing is a complicated process that validates and, if necessary, corrects  the data before it is inserted into the warehouse. For example, the company could have three "Customer Name" entries in its  various source systems, one entered as "IBM", one as "I.B.M.", and one as "International Business Machines". Obviously, these  are all the same customer. Someone in the organization must make a decision as to which is correct, and then the data  cleansing tool will change the others to match the rule. This process is also referred to as "data scrubbing" or "data  quality assurance". It can be an extremely complex process, especially if some of the warehouse inputs are from older  mainframe file systems (commonly referred to as "flat files" or "sequential files").

Data Aggregation:
is process is often performed during the "T" phase of ETL, if it is performed at all. Data warehouses can be designed to  store data at the detail level (each individual transaction), at some aggregate level (summary data), or a combination of  both. The advantage of summarized data is that typical queries against the warehouse run faster. The disadvantage is that  information, which may be needed to answer a query, is lost during aggregation. The tradeoff must be carefully weighed,  because the decision can not be undone without rebuilding and repopulating the warehouse. The safest decision is to build the  warehouse with a high level of detail, but the cost in storage can be extreme.

Now that the warehouse has been built and populated, it becomes possible to extract meaningful information from it that will  provide a competitive advantage and a return on investment. This is done with tools that fall within the general rubric of  "Business Intelligence".

Business Intelligence (BI)
A very broad field indeed, it contains technologies such as Decision Support Systems (DSS), Executive Information Systems  (EIS), On-Line Analytical Processing (OLAP), Relational OLAP (ROLAP), Multi-Dimensional OLAP (MOLAP), Hybrid OLAP (HOLAP, a  combination of MOLAP and ROLAP), and more. BI can be broken down into four broad fields:

Multi-dimensional Analysis Tools:
Tools that allow the user to look at the data from a number of different "angles". These tools often use a multi-dimensional  database referred to as a "cube".

Query tools:
Tools that allow the user to issue SQL (Structured Query Language) queries against the warehouse and get a result set back.

Data Mining Tools:
Tools that automatically search for patterns in data. These tools are usually driven by complex statistical formulas. The  easiest way to distinguish data mining from the various forms of OLAP is that OLAP can only answer questions you know to ask,  data mining answers questions you didn't necessarily know to ask.

Data Visualization Tools:
Tools that show graphical representations of data, including complex three-dimensional data pictures. The theory is that the  user can "see" trends more effectively in this manner than when looking at complex statistical graphs. Some vendors are  making progress in this area using the Virtual Reality Modeling Language (VRML).


Metadata Management
Throughout the entire process of identifying, acquiring, and querying the data, metadata management takes place. Metadata is  defined as "data about data". An example is a column in a table. The datatype (for instance a string or integer) of the  column is one piece of metadata. The name of the column is another. The actual value in the column for a particular row is  not metadata - it is data. Metadata is stored in a Metadata Repository and provides extremely useful information to all of  the tools mentioned previously. Metadata management has developed into an exacting science that can provide huge returns to  an organization. It can assist companies in analyzing the impact of changes to database tables, tracking owners of individual  data elements ("data stewards"), and much more. It is also required to build the warehouse, since the ETL tool needs to know  the metadata attributes of the sources and targets in order to "map" the data properly. The BI tools need the metadata for  similar reasons.
The data warehousing market consists of tools, technologies, and methodologies that allow for the construction, usage,  management, and maintenance of the hardware and software used for a data warehouse, as well as the actual data itself.

Data Warehousing is a complex field, with many vendors vying for market awareness. The complexity of the technology and the  interactions between the various tools, and the high price points for the products require companies to perform careful  technology evaluation before embarking on a warehousing project. However, the potential for enormous returns on investment  and competitive advantage make data warehousing difficult to ignore.

Data warehousing is essentially what you need to do in order to create a data warehouse, and what you do with it. It is the  process of creating, populating, and then querying a data warehouse and can involve a number of discrete technologies such  as:
Data Warehousing
Return to HOME Page