Well, after much reading, it is time to get my hands dirty and start actually modeling some Data Vault models. Be aware that this is my very first attempt at turning a 3NF model into a Data Vault model.
For this I’m using the SQL Server Northwind database; you can get it here: https://github.com/microsoft/sql-server-samples/blob/master/samples/databases/northwind-pubs/instnwnd.sql
The data model
Here’s the full Northwind 3NF model, with the tables we are going to cover highlighted in green.
Note: for this project I’m not considering the hierarchical (self-referencing) relationship in the Employees table.

So, the natural order would be to create hubs, links and sats, but before doing that I first want to create a simple ER diagram to reflect my understanding of the model. I am using diagrams.net for this.

As you can see, only strong entities survived the ERD mapping. I’m mainly talking about the Order Details table, since it’s really a bridge table between Orders and Products. In reality, the ERD would precede the 3NF model, but for this exercise I’m drawing it anyway, just for the sake of understanding how the business looks at the process.
Okay, now that we have our ERD, our 3NF model, and an understanding of the business, we can start turning our model into a Data Vault.
Creating the Data Vault
So, first step is to find all HUBS in our model, but what are HUBS?
A HUB is where we store business keys; hubs are the representation of our business entities.
What about business keys? What is a business key?
A business key is an identifier the business uses to identify a business object: a social security number identifies a person, a customer ID identifies a customer, and likewise a driver’s license number, a product ID, etc.
Employees
By analyzing the employees table we find that EmployeeID is our business key. So, our HUB can look like this:

As you can see, our business key EmployeeID is not the PK of our HUB; its hash key is. That is, we MD5-hash the BK and use the result as the PK of our HUBs.
Besides the BK and its HK (hash key), we also have LOAD_DTS and REC_SRC as metadata in our HUB table.
LOAD_DTS: the load date indicates when the business key initially arrived in the data warehouse. The load date should be the same for all data that arrived in the same batch.
REC_SRC: the record source is hard-coded and applied to maintain traceability of the arriving data set. It is best to refrain from using a generalized record source, such as “SAP” for all SAP data. Instead, use the lowest level of granularity, for example “SAP.FINANCE.GL” to indicate the general ledger module in the financial application of SAP.
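To make the hash key and the metadata columns concrete, here is a minimal sketch of how a single hub row could be derived, using the same MD5/varchar(35) convention the load scripts later in this article use (the granular record source string here is an illustrative example, not what the load script below actually uses):

```sql
-- One hub row for EmployeeID = 1: the PK is the MD5 hash of the business key.
-- HASHBYTES('md5', ...) returns a 16-byte varbinary.
SELECT
    HASHBYTES('md5', CONVERT(varchar(35), 1)) AS HK_EmployeeID,  -- hash key (PK)
    CAST(GETDATE() AS date)                   AS LOAD_DTS,       -- batch load date
    'NORTHWIND.dbo.Employees'                 AS REC_SRC,        -- low-granularity source
    1                                         AS EmployeeID;     -- business key
```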
Now that we have created our first HUB and understand better how to model it, let’s create the remaining ones.
Other HUBS
So, the remaining HUBS required are: PRODUCTS, SHIPPERS, ORDERS, CUSTOMERS.

LINKS
We have identified our business keys and HUBs; now it’s time to create LINKs between them. LINKs represent relationships.
Here’s my approach on modeling LINKS for this model:
- We have a transaction between ORDERS and PRODUCTS: many products can be on many orders, and many orders can contain many products.

As you can see in the image, our LINK table takes the HKs from all related tables, which is important because together they represent the grain of this relationship. We also add LOAD_DTS and REC_SRC as metadata fields.
It is important to understand that the PK HK_Products_Orders is a hash of the BKs from Products and Orders. It is not a hash of the hubs’ hash keys; it is a hash of the business keys themselves.
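In SQL terms, the link hash key could be computed as sketched below. The '|' delimiter between business keys is a common convention I am assuming here (the load script later in this article concatenates them without one):

```sql
-- Link hash key: MD5 over the concatenated business keys, not over the hub hash keys.
-- The '|' delimiter (an assumed convention) avoids collisions such as
-- ('1','23') and ('12','3') producing the same concatenated string.
SELECT HASHBYTES(
           'md5',
           CONVERT(varchar(35), 77) + '|' + CONVERT(varchar(35), 10248)
       ) AS HK_Products_Orders;  -- ProductID = 77, OrderID = 10248 are example values
```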
Moving on, the other LINK needs to represent the grain of the order transaction itself, which is why we need to include all the other entities, aka HUBs, in that LINK.

Once HUBS and LINKS are set, it’s time to get to the SATELLITES (SATS).
SATELLITES
Now that we have HUBS and LINKS put together, let’s work on our SATS. Here’s how SATS are composed:
- They have a composite PK: the hash key from the HUB + LOAD_DTS
- SATS store all descriptive attributes/information
Best practice when designing SATs is to keep all descriptive attributes of the same grain in the same SAT; if an attribute is at another grain, move it to a different SAT. Also, if your data changes at different paces, separate it into different SATs. You can also split SATs by data type and source, but that will depend on how you approach your model.
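As an illustrative sketch of splitting by rate of change for the Employees table (the table names and the exact column split are my own assumptions, not part of the model built below):

```sql
-- Slowly changing personal attributes in one satellite...
CREATE TABLE [raw].[SAT_EMP_PERSONAL] (
    [HK_EmployeeID] [varbinary](35) NOT NULL,
    [LOAD_DTS]      [date]          NOT NULL,
    [REC_SRC]       [varchar](100)  NOT NULL,
    [LastName]      [nvarchar](20)  NULL,
    [FirstName]     [nvarchar](10)  NULL,
    [BirthDate]     [datetime]      NULL,
    CONSTRAINT [PK_SAT_EMP_PERSONAL] PRIMARY KEY CLUSTERED ([HK_EmployeeID] ASC, [LOAD_DTS] ASC)
);

-- ...and faster-changing job/contact attributes in another.
CREATE TABLE [raw].[SAT_EMP_CONTACT] (
    [HK_EmployeeID] [varbinary](35) NOT NULL,
    [LOAD_DTS]      [date]          NOT NULL,
    [REC_SRC]       [varchar](100)  NOT NULL,
    [Title]         [nvarchar](30)  NULL,
    [Extension]     [nvarchar](4)   NULL,
    [HomePhone]     [nvarchar](24)  NULL,
    CONSTRAINT [PK_SAT_EMP_CONTACT] PRIMARY KEY CLUSTERED ([HK_EmployeeID] ASC, [LOAD_DTS] ASC)
);
```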
So, for Employees, the attributes in orange change more slowly, whereas the attributes in green might be updated more often. ReportsTo is out of the scope of this article.

This is how our model should look with our new SATS:

This is our final model:

Okay, but how do I create this model using SQL Server database?
First we need to create our HUB tables, then populate them.
Here’s how we can create our first HUB_EMPLOYEES table. Be advised that what I’m doing here is not best practice: in the real world you should have a staging area to load your data from, and here we are not creating one. I will address this topic in the future.
LOADING HUBS
The SQL code I used to create HUB_EMPLOYEES:
CREATE TABLE [raw].[HUB_EMPLOYEES] (
[HK_EmployeeID] [varbinary](35) NOT NULL,
[LOAD_DTS] [date] NOT NULL,
[REC_SRC] [varchar](100) NOT NULL,
[EmployeeID] [int] NOT NULL,
CONSTRAINT [PK_HUB_EMPLOYEE] PRIMARY KEY CLUSTERED
(
[HK_EmployeeID] ASC
) ON [PRIMARY],
CONSTRAINT [UK_HUB_EMPLOYEE] UNIQUE NONCLUSTERED
(
[EmployeeID] ASC
) ON [PRIMARY]
) ON [PRIMARY]
;
And here’s how to populate our new HUB_EMPLOYEES:
INSERT INTO [raw].[HUB_EMPLOYEES]
(
[HK_EmployeeID]
,[LOAD_DTS]
,[REC_SRC]
,[EmployeeID]
)
SELECT HASHBYTES( 'md5', convert( varchar(35), STG.EmployeeID ) )
, GETDATE()
, 'NORTHWIND'
, STG.EmployeeID
FROM [dbo].[Employees] STG
WHERE
NOT EXISTS ( SELECT 1 FROM [RAW].[HUB_EMPLOYEES] HUB WHERE HUB.[EmployeeID] = STG.[EmployeeID] )
;

Note: be advised that in a real scenario your INSERT INTO statement will most likely have a WHERE clause that filters on a date stamp, to make sure you only perform incremental loads instead of full loads every time.
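A sketch of what that incremental filter could look like, assuming the source table carried a change timestamp column (Northwind’s Employees table has no such column, so LastModified here is hypothetical):

```sql
-- Hypothetical incremental filter: pick up only rows changed since the last load.
-- Northwind has no LastModified column on Employees; this is illustrative only.
SELECT STG.EmployeeID
FROM [dbo].[Employees] STG
WHERE STG.LastModified > (
          SELECT MAX(HUB.LOAD_DTS)
          FROM [raw].[HUB_EMPLOYEES] HUB
      )
  AND NOT EXISTS (
          SELECT 1
          FROM [raw].[HUB_EMPLOYEES] HUB
          WHERE HUB.EmployeeID = STG.EmployeeID
      );
```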
You can repeat this same logic for all the other hubs.
LOADING LINKS
Links represent our relationships, the connection between various entities. Therefore, link loading should occur after the hub loading.
For links you can use the following SQL code:
/* LINK_EMP_CUST_SHIP_ORDERS */
CREATE TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS](
[HK_Emp_Cust_Ship_Orders] [varbinary](35) NOT NULL,
[LOAD_DTS] [date] NOT NULL,
[REC_SRC] [varchar](100) NOT NULL,
[HK_EmployeeID] [varbinary](35) NOT NULL,
[HK_OrderID] [varbinary](35) NOT NULL,
[HK_CustomerID] [varbinary](35) NOT NULL,
[HK_ShipperID] [varbinary](35) NOT NULL,
CONSTRAINT [PK_LINK_EMP_CUST_SHIP_ORDERS] PRIMARY KEY CLUSTERED
(
[HK_Emp_Cust_Ship_Orders] ASC
),
CONSTRAINT [UK_LINK_EMP_CUST_SHIP_ORDERS] UNIQUE NONCLUSTERED
(
[HK_EmployeeID] ASC,
[HK_OrderID] ASC,
[HK_CustomerID] ASC,
[HK_ShipperID] ASC
)
) ON [PRIMARY]
GO
/* ESTABLISHING RELATIONSHIPS BETWEEN LINK AND HUBS */
ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS] WITH CHECK ADD CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_EMPLOYEES] FOREIGN KEY([HK_EmployeeID])
REFERENCES [raw].[HUB_EMPLOYEES] ([HK_EmployeeID])
GO
ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS] WITH CHECK ADD CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_ORDERS] FOREIGN KEY([HK_OrderID])
REFERENCES [raw].[HUB_ORDERS] ([HK_OrderID])
GO
ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS] WITH CHECK ADD CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_CUSTOMER] FOREIGN KEY([HK_CustomerID])
REFERENCES [raw].[HUB_CUSTOMER] ([HK_CustomerID])
GO
ALTER TABLE [raw].[LINK_EMP_CUST_SHIP_ORDERS] WITH CHECK ADD CONSTRAINT [FK_LINK_EMP_CUST_SHIP_ORDERS_HUB_SHIPPERS] FOREIGN KEY([HK_ShipperID])
REFERENCES [raw].[HUB_SHIPPERS] ([HK_ShipperID])
GO


Now I will load my LINK. Again, be advised that this is not the recommended practice; I’m doing it this way just so we can have the tables filled with data and you can see the end result. Here’s how I did it:
INSERT INTO [raw].[LINK_EMP_CUST_SHIP_ORDERS]
(
[HK_Emp_Cust_Ship_Orders]
,[LOAD_DTS]
,[REC_SRC]
,[HK_EmployeeID]
,[HK_OrderID]
,[HK_CustomerID]
,[HK_ShipperID]
)
SELECT distinct
HASHBYTES( 'md5', convert( varchar(35), e.[EmployeeID] ) + convert( varchar(35), o.[OrderID] ) + convert( varchar(35), c.[CustomerID] ) + convert( varchar(35), s.[ShipperID] ) )
,GETDATE()
,'NORTHWIND'
,HASHBYTES( 'md5', convert( varchar(35), e.[EmployeeID] ) ) -- e.EmployeeID
,HASHBYTES( 'md5', convert( varchar(35), o.[OrderID] ) ) -- o.OrderID
,HASHBYTES( 'md5', convert( varchar(35), c.[CustomerID] ) ) -- c.CustomerID
,HASHBYTES( 'md5', convert( varchar(35), s.[ShipperID] ) ) -- s.ShipperID
FROM Orders O
INNER JOIN Customers C ON C.CustomerID = O.CustomerID
INNER JOIN Employees E ON E.EmployeeID = O.EmployeeID
INNER JOIN Shippers S ON S.ShipperID = O.ShipVia
order by 1,2,3,4
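If you wanted to make this link load rerunnable like the hub load above, one option (a sketch, mirroring the NOT EXISTS pattern already used for HUB_EMPLOYEES) is to guard on the link’s unique key:

```sql
-- Guard clause to skip link rows that were already loaded; add this WHERE to
-- the SELECT feeding the INSERT (before the ORDER BY) so reruns do not
-- violate UK_LINK_EMP_CUST_SHIP_ORDERS. Aliases match the query above.
WHERE NOT EXISTS (
    SELECT 1
    FROM [raw].[LINK_EMP_CUST_SHIP_ORDERS] L
    WHERE L.HK_EmployeeID = HASHBYTES( 'md5', convert( varchar(35), e.[EmployeeID] ) )
      AND L.HK_OrderID    = HASHBYTES( 'md5', convert( varchar(35), o.[OrderID] ) )
      AND L.HK_CustomerID = HASHBYTES( 'md5', convert( varchar(35), c.[CustomerID] ) )
      AND L.HK_ShipperID  = HASHBYTES( 'md5', convert( varchar(35), s.[ShipperID] ) )
)
```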

LOADING SATS
Now that we have HUBS and LINKS loaded, it’s time to create and load SATS.
Script to create and load SATS:
/* CREATING SATS */
CREATE TABLE [raw].[SAT_EMP_OTHERS](
[HK_EmployeeID] [varbinary](35) NOT NULL,
[LOAD_DTS] [date] NOT NULL,
[REC_SRC] [varchar](100) NOT NULL,
[Address] [nvarchar](60) NULL,
[City] [nvarchar](60) NULL,
[Region] [nvarchar](60) NULL,
[PostalCode] [nvarchar](60) NULL,
[Country] [nvarchar](60) NULL,
[HomePhone] [nvarchar](60) NULL,
[Extension] [nvarchar](60) NULL,
[Photo] [image] NULL,
[Notes] [ntext] NULL,
[PhotoPath] [nvarchar](255) NULL,
CONSTRAINT [PK_SAT_EMP_OTHERS] PRIMARY KEY CLUSTERED
(
[HK_EmployeeID] ASC,
[LOAD_DTS] ASC
) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [raw].[SAT_EMP_OTHERS] WITH CHECK ADD CONSTRAINT [FK_SAT_EMP_OTHERS_HUB_EMPLOYEES] FOREIGN KEY([HK_EmployeeID])
REFERENCES [raw].[HUB_EMPLOYEES] ([HK_EmployeeID])
GO
/* LOADING SATS */
INSERT INTO [raw].[SAT_EMP_OTHERS]
(
[HK_EmployeeID]
,[LOAD_DTS]
,[REC_SRC]
,[Address]
,[City]
,[Region]
,[PostalCode]
,[Country]
,[HomePhone]
,[Extension]
,[Photo]
,[Notes]
,[PhotoPath]
)
SELECT
HE.HK_EmployeeID
,'2022-04-12' -- replacing GETDATE() here in order to mimic that this data was loaded the same day the hub data was loaded
,'NORTHWIND'
,E.Address
,E.City
,E.Region
,E.PostalCode
,E.Country
,E.HomePhone
,E.Extension
,E.Photo
,E.Notes
,E.PhotoPath
FROM Employees E
INNER JOIN RAW.HUB_EMPLOYEES HE ON HE.EmployeeID = E.EmployeeID
Here’s the load result:

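On subsequent runs, satellites are usually loaded delta-only: a new row is inserted only when the descriptive attributes actually changed. A minimal sketch of that idea follows; for brevity it compares only Address and City, whereas a production loader would typically compare a precomputed hash of all attributes (a so-called hash diff):

```sql
-- Delta load sketch: insert a new satellite row only when tracked attributes
-- changed since the latest row for that employee. Only Address and City are
-- compared here for brevity; real loaders compare a HASH_DIFF column instead.
INSERT INTO [raw].[SAT_EMP_OTHERS]
    ([HK_EmployeeID], [LOAD_DTS], [REC_SRC], [Address], [City], [Region],
     [PostalCode], [Country], [HomePhone], [Extension], [Photo], [Notes], [PhotoPath])
SELECT HE.HK_EmployeeID, CAST(GETDATE() AS date), 'NORTHWIND',
       E.Address, E.City, E.Region, E.PostalCode, E.Country,
       E.HomePhone, E.Extension, E.Photo, E.Notes, E.PhotoPath
FROM [dbo].[Employees] E
INNER JOIN [raw].[HUB_EMPLOYEES] HE ON HE.EmployeeID = E.EmployeeID
OUTER APPLY (
    SELECT TOP (1) 1 AS HasRow, S.[Address], S.[City]
    FROM [raw].[SAT_EMP_OTHERS] S
    WHERE S.HK_EmployeeID = HE.HK_EmployeeID
    ORDER BY S.LOAD_DTS DESC
) CUR
WHERE CUR.HasRow IS NULL                                      -- first load for this key
   OR ISNULL(CUR.[Address], N'') <> ISNULL(E.[Address], N'')  -- or Address changed
   OR ISNULL(CUR.[City], N'')    <> ISNULL(E.[City], N'');    -- or City changed
```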
This process repeats for all the SATs. In the end, this is how our Data Vault model looks in the SSMS diagram:

You can find the full query script in this file:
https://github.com/p3dro-carvalho/p3dro-carvalho/blob/main/SQLQuery_loadingDV.sql
Well, it’s not a fully functional Data Vault yet, considering all the automation required and whatnot, but it’s a start. More to come in the next posts! See you there.
References
https://www.udemy.com/course/modeling-data-warehouse-with-data-vault-for-beginners/
https://simplesqlbi.home.blog/2019/06/29/part-3-data-vault-for-beginners/
Comments
1.
CONSTRAINT [UK_HUB_EMPLOYEE] UNIQUE NONCLUSTERED
(
[EmployeeID] ASC
) ON [PRIMARY],
Are you not missing REC_SRC here?
2.
[LOAD_DTS] [date] NOT NULL, -- why not DATETIME instead of DATE?
Reply to 2: it can be DATE or DATETIME, depending on the granularity at which you want to store your data.
Reply to 1: REC_SRC identifies the record source, and HUBs are entities where you store business keys. So, do you want duplicated business keys when they come from different sources? If you do, add REC_SRC to your constraint; if you don’t (like me, since I want to normalize my business keys), do not add REC_SRC there.
Q: SELECT HASHBYTES( 'md5', convert( varchar(35), STG.EmployeeID ) ) -- also here, don't we want to add REC_SRC?
A: Only if you want to differentiate the same business key coming from different sources. If you want to normalize your business keys, you should not add REC_SRC there.
Q: I thought we needed to differentiate business keys coming from different sources, because the same value, e.g. Product_ID=2002, may mean different things in each source system.
A: As I understand from the literature, business keys should be normalized, like when Kimball talks about conformed dimensions. Let me know if you find somewhere saying otherwise, please.