Hello everyone, I’m trying to improve my English skills, so from now on I will try to write in English as much as possible. I’m also reading Kimball and Caserta’s “The Data Warehouse ETL Toolkit”, and here I will post my chapter summaries. Hope it helps you like it’s helping me 🙂
Chapter 01 covers three main topics: requirements, architecture, and the data warehouse mission.
Requirements are sub-divided into business needs, compliance requirements, data profiling, security requirements, data integration, data latency, archiving and lineage, end user delivery interface, available skills, legacy licenses.
Architecture is sub-divided into ETL Tools vs. Hand Coding, The Back Room, and The Front Room.
Data Warehouse Mission is sub-divided into what the data warehouse is and what it is not.
Besides those three main topics, Chapter 01 also discusses industry terms that are not used consistently, resolving architectural conflict (the hybrid bus approach), how the data warehouse is changing, and the mission of the ETL team.
I’m not going to cover every topic and sub-topic; instead, I’ll highlight in this summary what I find most relevant, based on my current ETL understanding and perspective.
On the Requirements topic, Kimball lists several possible requirements your company might need to address in order to implement DW/ETL; however, he states that Business Needs is the most fundamental and important one.
Kimball says that Business Needs are (sic) the information content that end users need to make informed business decisions. It means that every time the ETL team is assessing data sources to ingest, they need to take into consideration how that data source can be used to support users’ “informed business decisions”.
Data Integration is an important topic further discussed in Chapter 05, but it is important to know that it’s routine to have multiple data sources and, therefore, to implement some type of data integration, by which Kimball means conformed dimensions and conformed facts. In other words, (sic) conforming facts means agreeing on common business metrics across separate databases so that these numbers can be compared mathematically by calculating differences and ratios. In the ETL system, data integration is a separate step identified in our data flow as the conform step. Physically, this step involves enforcing common names of conformed dimension attributes and facts, as well as enforcing common domain contents and common units of measurement.
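To make the conform step a bit more concrete, here is a minimal Python sketch of enforcing common names for dimension attributes across two sources. The column names and the mapping are hypothetical, my own illustration rather than anything from the book:

```python
# Conformed column names agreed across the enterprise (hypothetical labels).
CONFORMED_NAMES = {
    "cust_nm": "customer_name",   # source A's label
    "customer": "customer_name",  # source B's label
    "ord_dt": "order_date",
    "order_date": "order_date",
}

def conform_row(row):
    """Rename source-specific columns to their conformed names."""
    return {CONFORMED_NAMES.get(col, col): value for col, value in row.items()}

row_a = {"cust_nm": "Acme", "ord_dt": "2024-01-05"}
row_b = {"customer": "Acme", "order_date": "2024-01-06"}

# After conforming, rows from both sources share the same attribute names,
# so they can be compared and combined.
assert conform_row(row_a).keys() == conform_row(row_b).keys()
```

The same idea extends to enforcing common domain contents and units of measurement, not just names.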
Archiving and Lineage
I find this topic particularly interesting. It says that every data warehouse will very likely need various copies of old data, either for comparison with new data to generate change capture or for reprocessing. Kimball’s recommendation is that we should archive data at every major staging point. So, if you consider Kimball’s ETL approach of ECCD (extract, clean, conform, deliver), you will have four points of staging and archiving. If you are interested in checking the data quality of your ETL process, or just need to reprocess some old data, the archives will help you, as Kimball says (sic): it is almost always less of a headache to read data back in from permanent media than it is to reprocess data through the ETL system at a later time. And, of course, it may be impossible to reprocess data according to the old processing algorithms if enough time has passed.
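As an illustration (mine, not the book’s), a staging-point archive can be as simple as writing a compressed, dated snapshot after each ECCD step; the file layout and names here are my own assumptions:

```python
import gzip
import json
import pathlib
from datetime import date

def archive(records, step, base_dir="archive"):
    """Snapshot the data at a staging point (extract, clean, conform, or
    deliver) so it can be read back later instead of reprocessed."""
    path = pathlib.Path(base_dir) / step / f"{date.today().isoformat()}.jsonl.gz"
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path
```

A call like `archive(cleaned_rows, "clean")` right after the cleaning step leaves a permanent copy you can reload later without rerunning the upstream logic.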
Architecture
On this topic Kimball says that architecture decisions should be made early and as a whole. The main decision is whether to use a proprietary ETL tool or to hand-code your ETL. As of today I rarely see people hand coding ETL, but I could be wrong. Either way, whichever you choose, make sure you stick to it; a change in such an architectural decision can lead to an entire overhaul of your ETL system. This topic also covers the back room and front room architecture, both of which are key aspects of Kimball’s approach to ETL.
Exception and Quality Handling
Kimball advocates that errors, exceptions, and issues that occur during, or as a result of, the ETL process should be handled by a system-wide, uniform mechanism for reporting all instances of exceptions, and not by a series of random little ad hoc alerts and comments placed in files. The same applies when dealing with quality issues. More on quality in Chapter 04.
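One way to picture such a uniform mechanism, sketched in Python with field names I made up, is a single exception record shape that every step writes to one shared log instead of scattering ad hoc messages:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class EtlException:
    """One row in a system-wide exception log (field names are illustrative)."""
    job_name: str
    step: str                      # extract | clean | conform | deliver
    severity: str                  # e.g. "warning" or "error"
    message: str
    record_key: Optional[str] = None
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# In a real system this would be a shared database table, not a Python list.
EXCEPTION_LOG: list = []

def report_exception(exc: EtlException) -> None:
    """Every ETL step reports problems through this one entry point."""
    EXCEPTION_LOG.append(exc)
```

Because every step reports through the same entry point, you can audit all exceptions in one place instead of hunting through files.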
Recovery and Restart
Kimball does not go to great lengths on this matter, but he says it is very important that all ETL jobs be designed so that, if a job needs to run twice for some reason (a bug, an error, etc.), no data is lost, corrupted, or tampered with. This is of very high importance: it is certain that sooner or later some of your ETL jobs will either terminate abnormally or be mistakenly run more than once, and without this safeguard your data warehouse will end up receiving incorrect data from your ETL.
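A common way to get this rerun safety (my example, not the book’s) is to make each load idempotent, for instance by keying every load on a batch id and wiping that batch before inserting:

```python
def load_batch(target, batch_id, rows):
    """Idempotent load: remove any rows left by a previous (possibly failed)
    run of this batch, then insert, so running the job twice leaves the
    target in the same state as running it once."""
    target[:] = [r for r in target if r["batch_id"] != batch_id]
    target.extend({"batch_id": batch_id, **row} for row in rows)

warehouse = []
load_batch(warehouse, "2024-01-05", [{"amount": 10}, {"amount": 20}])
load_batch(warehouse, "2024-01-05", [{"amount": 10}, {"amount": 20}])  # rerun
assert len(warehouse) == 2  # the rerun did not duplicate anything
```

The delete-then-insert pattern is only one option; the point is that a rerun must converge to the same final state.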
The Back Room – Preparing the Data
Figure 1. Back Room
The back room is where the ETL team works: it is where data management happens and where the ETL team prepares the data. No query services should be provided to users in the back room, because by doing so we relieve the ETL team from:
- Providing detailed security at a row, column or application level
- Building query performance-enhancing indexes and aggregations
- Guaranteeing that all datasets are consistent with each other
Kimball draws a restaurant analogy, which is very good but lengthy; we can summarize it as: the kitchen (back room) is where things are prepared, and the dining room (front room) is where dishes are served, well presented, to the final users.
The word staging is very important to Kimball: he uses it to define the steps, or moments, at which data is saved/archived to files or tables for later use or for recovery/restart purposes. He states that (sic) staging almost always implies a temporary or permanent physical snapshot of data. We assume that every ETL system supporting the data warehouse is structured with these four steps and that data is staged (written to disk) in parallel with the data being transferred to the next stage.
Figure 2. ECCD, the four staging steps of a data warehouse.
In the back room there are four steps, which are detailed in later chapters: extracting, cleaning, conforming, and delivering.
Note: although most transformations are performed in the cleaning and conforming steps, the best place to resolve certain legacy data format issues is the extract step. These include resolving repeating groups, REDEFINEs, and overloaded columns, and performing low-level data conversions, including converting bit encoding to character, EBCDIC to ASCII, and packed decimal to integer. More on this in Chapter 03.
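For a feel of what those low-level conversions look like, here is a Python sketch of two of them, EBCDIC-to-text and packed decimal (COMP-3) to integer. The code page choice (cp500) is my assumption, since mainframes use several EBCDIC variants:

```python
def ebcdic_to_str(raw: bytes) -> str:
    """Decode an EBCDIC field (code page 500) to a Python string."""
    return raw.decode("cp500")

def unpack_comp3(raw: bytes) -> int:
    """Decode an IBM packed-decimal (COMP-3) field: two decimal digits per
    byte, with the low nibble of the last byte holding the sign."""
    nibbles = []
    for b in raw:
        nibbles.append((b >> 4) & 0x0F)
        nibbles.append(b & 0x0F)
    sign = nibbles.pop()  # the final nibble is the sign, not a digit
    value = int("".join(str(d) for d in nibbles))
    return -value if sign == 0x0D else value

assert ebcdic_to_str(bytes([0xC8, 0xC9])) == "HI"
assert unpack_comp3(bytes([0x12, 0x34, 0x5C])) == 12345   # 0xC sign = positive
assert unpack_comp3(bytes([0x12, 0x34, 0x5D])) == -12345  # 0xD sign = negative
```

Doing these conversions once, in the extract step, means every downstream step works with plain text and integers.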
Note: Figure 2 makes it look like you must perform all of ECCD (extract, clean, conform, deliver) serially, with well-defined boundaries between each pair of steps, but in practice there will be multiple simultaneous flows of data through the ETL system, and frequently some of the cleaning steps are embedded in the logic that performs extraction.
Extracting: in this step the raw data coming from various sources is written directly to disk, with minimal to no restructuring. This allows the initial load to be as simple and fast as possible, so as not to burden the source it is extracting from. That is important in case the source needs to be re-read after something happens (like errors or interruptions). This initial data can be used multiple times and archived for future reference, or kept as a reference for change checks in the next ETL cycle and then discarded.
Cleaning: in this step data quality checks are applied, such as validating values (e.g., are zip codes in a valid format and range?), ensuring consistency across values (e.g., are the zip code and the city consistent?), removing duplicates, and applying and checking business rules (e.g., do premium customers have the correct premium status associated with them?).
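A tiny Python example of the first kind of check, validating zip code format (US ZIP/ZIP+4 here; the function and its return shape are my own illustration):

```python
import re

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")  # US ZIP or ZIP+4

def clean_zip(raw):
    """Return (validated_zip, None), or (None, reason) for the exception log."""
    value = raw.strip()
    if not ZIP_RE.match(value):
        return None, f"invalid zip format: {raw!r}"
    return value, None

assert clean_zip(" 94105 ") == ("94105", None)
assert clean_zip("9410")[0] is None  # rejected, with a reason to report
```

Returning a reason alongside the value makes it easy to feed rejects into a uniform exception-reporting mechanism rather than silently dropping them.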
Data-cleaning transformations may involve human intervention and the exercise of judgment. Depending on the level of transformation applied, it is sometimes important to semi-permanently save the results of data cleaning, since they are important to track, irreversible, and often complex.
The data exceptions found in the data-cleaning step should be reported back to the source in order to improve the overall data quality of the source.
Conforming: every time you have more than one data source you will very likely need to conform the data. This includes all the text labels of your dimensions and numeric measures that may be on different scales, e.g., some distances in kilometers and others in miles. Chapters 05 and 06 talk more about this.
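The kilometers/miles case can be sketched like this; the unit labels and the choice of kilometers as the standard are assumptions for illustration:

```python
KM_PER_MILE = 1.609344  # international mile, exact by definition

def to_km(value, unit):
    """Conform a distance measure to the agreed standard unit (kilometers)."""
    if unit == "km":
        return value
    if unit == "mi":
        return value * KM_PER_MILE
    raise ValueError(f"unknown distance unit: {unit}")

# Distances from two sources now compare on the same scale.
assert to_km(5, "km") == 5
assert abs(to_km(10, "mi") - 16.09344) < 1e-9
```

Raising on an unknown unit, rather than guessing, keeps bad source data from silently entering the warehouse.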
Delivering: the back room prepares data to be ready for querying. The final and crucial back-room step is physically structuring the data into a set of schemas known as dimensional models (star schemas). Chapter 05 talks more about this.
ODS – Operational Data Store
According to Kimball the ODS has been absorbed by the data warehouse, but we know that some companies still use it.
An ODS is a hot extract from the operational system’s database, and it was particularly important when data warehouses could not provide the grain of data, or the speed, that users needed.
Since data warehouses can now be refreshed much faster, at a daily or even near-real-time pace, the ODS is no longer needed.
The Front Room – Data Access
The front room is where data is displayed and accessed by users; it is the part of the warehouse users actually get to see.
The whole purpose of the ETL system is to feed the presentation layer of dimensionally modeled tables that are directly accessed by query tools, report writers, dashboards, and OLAP cubes. Chapters 05 and 06 talk more about modeling.
Kimball takes special interest in data marts. He says people usually misunderstand the concept of a data mart; for Kimball, a data mart is (sic) a set of dimensional tables supporting a business process. In Kimball’s view, data marts are always measurement-intensive subject areas, and they are surrounded by descriptive entities. Other important points about data marts:
- Data marts should not duplicate data. A company can have only one ORDERS data mart, so all user tools and queries in the company’s various departments should access this data mart to get a single, consistently labeled version of ORDERS.
- Data marts should be as atomic as possible, containing all the information needed to drill down to the deepest level. Here Kimball states that aggregation-only data marts rest on an incorrect assumption.
About Data Warehouse Mission
Since the data warehouse is a decision support system, the main criterion of success is whether the data warehouse effectively contributes to the most important decision-making processes in the organization.
The current scenario is that we have been ingesting years and years of data into our transactional databases, but it has been really difficult to pull that data out and make use of it, so the ETL system must play a major role in handing the data to the end-user applications in a usable form. And that is what the data warehouse is about.
What Data Warehouse Is
Data warehousing is the process of taking data from one or many sources and transforming it into organized information in a user-friendly format to encourage data analysis and support fact-based business decision making.
In other words, (sic) a data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.
What Data Warehouse Is Not
As Kimball states, the concept of a data warehouse is often misunderstood. He says that a data warehouse is not a product, nor a language, nor a project, nor a data model, nor a copy of your transaction systems.
A data warehouse includes several components, such as design suites, techniques, tools, and products. The ETL is a major component, but it alone does not make up a data warehouse.
The Mission of the ETL Team
At the highest level, the ETL team’s mission is to build the back room of the data warehouse, and the back room must support the following steps:
- Extracting data from the original sources
- Quality assuring and cleaning data
- Conforming the labels and measures in the data to achieve consistency across the original sources
- Delivering data in a physical format that can be used by query tools, report writers, and dashboards.
Ralph Kimball and Joe Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, Wiley, 2004.