This text is just a summary of the information I found relevant while reading the official Data Vault documentation. The documentation is a 41-page book, and you can find it here: https://www.amazon.com.br/gp/product/B009I8FW9Y/ref=ppx_yo_dt_b_d_asin_title_o00?ie=UTF8&psc=1
Introduction
This text focuses on the standards for Loading and Modeling the Data Vault. However, Data Vault is now in its 2.0 version, so why am I reading this book? Well, according to my internet research this book is still the best way to get information about modeling a data vault: it cuts to the chase and focuses on this matter only, while the 2.0 books cover several other aspects that I'm not interested in right now, so I will save the 2.0 books for the near future.
Data Vault Modeling Standards
Entity types
Hubs: list of business keys
Links: list of relationships between business keys
Satellites (sats, for short): Descriptive information
Stand-alone: tables like calendar/time. They do not store history, and their keys are proliferated throughout the model. Also known as REFERENCE tables. Hubs, Links and Sats should never be stand-alone entities.
Common field elements
Sequence ID (required) – if surrogate keys are used, it is the primary key of all tables.
Load Date Time Stamp (required) – an attribute present on Hubs, Links, Sats, but on Sats it should be part of the primary key.
Record Source (required) – the name of the source system that generated the information. Used to trace back to the source.
Load End-dates (required) – this attribute represents the life of the record in a satellite.
HUBs rules
Definition of HUB: a list of uniquely identified business keys that have a very low propensity to change.
- A hub must have at least one business key (BK).
- A hub key should not be composite.
- Exception: when two sources use the same key with different meanings.
- A hub should support at least one Sat. Hubs with no Sats are probably bad design.
- A hub BK must be either a system-created key or a true BK that is the single basis for finding information in the source system. BKs are often referred to as Natural Keys (NKs).
- A hub can contain Surrogate Keys (SKs) if the database doesn’t work well with NKs.
- A hub load-date-time attribute should not be part of the hub primary key structure.
- A hub primary key cannot contain a record source.
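To make these rules concrete, here is a minimal sketch of a hub as I understand it, using a made-up HUB_CUSTOMER table and SQLite purely for illustration (the column names borrow from the naming conventions listed further below and are not from the book):

```python
import sqlite3

# Illustrative hub with one business key: the surrogate sequence is the primary
# key, the business key is unique, and the load date time stamp and record
# source stay OUT of the primary key, as the rules above require.
ddl_hub_customer = """
CREATE TABLE HUB_CUSTOMER (
    SQN_CUSTOMER INTEGER PRIMARY KEY,   -- surrogate sequence ID
    CUSTOMER_BK  TEXT NOT NULL UNIQUE,  -- business (natural) key
    LDDTS        TEXT NOT NULL,         -- load date time stamp
    RSRC         TEXT NOT NULL          -- record source
)"""

conn = sqlite3.connect(":memory:")
conn.execute(ddl_hub_customer)
```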
LINKs rules
Definition of LINK: a list of uniquely identified composite relationships between two hub keys. It must combine two or more hubs, or it can use only one hub with two keys from the same hub for hierarchical purposes. Hierarchical links should have at least one Sat to indicate the effectivity of the relationship.
- A link must contain at least two hubs, or two linked primary keys.
- A link can contain two keys imported from the same hub for hierarchical or rolled up relationship.
- A link load-date-time attribute should not be part of the link primary key structure.
- A link composite key must be unique.
- A link may contain surrogate sequence key.
- A link may contain 2 or more hub keys.
- A link grain is determined by the number of imported hub or link parent keys.
- A link is a transaction, or a hierarchy, or a relationship.
- A link may have zero or more Sats attached. A hierarchical link must have at least one Sat.
- A link must be at the lowest level of the grain.
- A link must represent at most, 1 instance of a relationship at any given time.
- A link cannot be associated to a reference structure (hub-sat, code-description).
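Again a minimal sketch of my own, with made-up HUB_CUSTOMER/HUB_PRODUCT names: the link imports both hub sequences, the pair itself must be unique, and the load date time stamp stays out of the primary key.

```python
import sqlite3

# Illustrative link between two hubs: its grain is the pair of imported hub
# sequences, and the pair must represent at most one instance of the
# relationship, hence the UNIQUE constraint.
ddl_link_customer_product = """
CREATE TABLE LINK_CUSTOMER_PRODUCT (
    SQN_CUSTOMER_PRODUCT INTEGER PRIMARY KEY,  -- surrogate sequence ID
    SQN_CUSTOMER         INTEGER NOT NULL,     -- imported from HUB_CUSTOMER
    SQN_PRODUCT          INTEGER NOT NULL,     -- imported from HUB_PRODUCT
    LDDTS                TEXT NOT NULL,        -- load date time stamp (not in the PK)
    RSRC                 TEXT NOT NULL,        -- record source
    UNIQUE (SQN_CUSTOMER, SQN_PRODUCT)
)"""

conn = sqlite3.connect(":memory:")
conn.execute(ddl_link_customer_product)
```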
SATELLITEs rules
Definition of SATELLITE: any data with a propensity to change, any descriptive data about a BK. Data in a Sat must be separated by type and rate of change.
- A sat MUST have at least one hub or link PK imported.
- A sat cannot be attached to more than one hub.
- A sat MUST have load-date-time attribute as part of its PK.
- A sat may contain a sequence identifier or an ordering identifier as part of its PK for uniqueness.
- A sat must contain record of source system for trackability.
- A sat must contain at least one descriptive attribute about the hub or link it is related to.
- A sat may contain aggregated attributes.
- A sat purpose is to store data over time.
- A sat may contain FK to reference tables.
- A sat may contain FK to a reference structure (reference structure is a single hub with a sat, I assume it’s not related to the data vault as a whole, like a reference table)
- A sat may contain FK to calendar table, geography tables.
- A sat may contain a load-end-date; it's not mandatory, but it is a best practice.
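A minimal sketch of a satellite, again with made-up names: the primary key is the imported hub sequence plus the load date time stamp, so history piles up over time, and the load end date is kept because of the best practice above.

```python
import sqlite3

# Illustrative satellite on HUB_CUSTOMER: PK = imported hub sequence + load
# date time stamp; LEDTS (load end date) is optional but kept as best practice.
ddl_sat_customer_details = """
CREATE TABLE SAT_CUSTOMER_DETAILS (
    SQN_CUSTOMER  INTEGER NOT NULL,  -- imported hub primary key
    LDDTS         TEXT NOT NULL,     -- load date time stamp (part of the PK)
    LEDTS         TEXT,              -- load end date time stamp (best practice)
    RSRC          TEXT NOT NULL,     -- record source
    CUSTOMER_NAME TEXT,              -- descriptive attributes...
    CUSTOMER_TIER TEXT,
    PRIMARY KEY (SQN_CUSTOMER, LDDTS)
)"""

conn = sqlite3.connect(":memory:")
conn.execute(ddl_sat_customer_details)
```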
Naming conventions
Entities
- Hubs, HUB_, H_
- Links, LINK_, L_
- Hierarchical Links, HLINK_, HIER_
- Satellites, SAT_, S_
Fields
- Record source, REC_SRC_, RSRC_
- Sequence IDs, SEQ_ID_, SQN_
- Date time stamps, DTS_
- Date stamps, DT_
- Time stamps, TM_
- Load Date time stamps, LDDTS_
- User, USR_
- Occurrence number, OCNUM_
- End date time stamps, LEDTS_
Avoiding Outer Joins
Insert an empty SAT record for every new hub key; even if the attributes are null, you at least have the hub key filled in. This allows queries to use equi-joins (inner joins) and avoid outer joins.
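Here is a small sketch of the idea, with the same made-up tables and SQLite only for illustration: when a new hub key arrives, an "empty" sat row goes in with it, so the hub-to-sat join can always be an inner join.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HUB_CUSTOMER (SQN_CUSTOMER INTEGER PRIMARY KEY, CUSTOMER_BK TEXT UNIQUE,
                           LDDTS TEXT, RSRC TEXT);
CREATE TABLE SAT_CUSTOMER_DETAILS (SQN_CUSTOMER INTEGER, LDDTS TEXT, RSRC TEXT,
                                   CUSTOMER_NAME TEXT, PRIMARY KEY (SQN_CUSTOMER, LDDTS));
""")

load_dts = datetime.now(timezone.utc).isoformat()

# Insert the new hub key and, in the same pass, an "empty" satellite row whose
# descriptive attributes are still NULL.
cur = conn.execute(
    "INSERT INTO HUB_CUSTOMER (CUSTOMER_BK, LDDTS, RSRC) VALUES (?, ?, ?)",
    ("CUST-001", load_dts, "CRM"))
conn.execute(
    "INSERT INTO SAT_CUSTOMER_DETAILS (SQN_CUSTOMER, LDDTS, RSRC, CUSTOMER_NAME) "
    "VALUES (?, ?, ?, NULL)",
    (cur.lastrowid, load_dts, "CRM"))

# Every hub key now has at least one sat row, so this equi-join loses nothing:
rows = conn.execute("""
    SELECT h.CUSTOMER_BK, s.CUSTOMER_NAME
    FROM HUB_CUSTOMER h
    JOIN SAT_CUSTOMER_DETAILS s ON s.SQN_CUSTOMER = h.SQN_CUSTOMER
""").fetchall()
print(rows)  # [('CUST-001', None)]
```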
Data Vault Loading Standards
The data warehouse should truly represent the source systems in their original format, and the only things that should be done to the data are basic defaults and error reporting.
Batch Loading
For batch loading use a set of staging tables and load data to refresh them. Do NOT keep history in these staging tables, so every time you load data into your stage make sure it is truncated first. All processes must be consistent, repeatable, highly parallel and scalable.
Batch Loading – Staging Definitions
The staging definitions should populate the required fields upon load. The definitions should follow the rules below.
Required fields
These fields should be populated on the way into the staging tables. Keep transformation to a minimum during the load of the stages in order to make loads as parallel and independent as possible. Keep the structure as CLOSE to the source system as possible (a sketch follows the list below).
- Staging sequence number
- Load date
- Record source
- Extract date (optional)
- Delta flag / new row flag (optional)
- Target table sequence number (optional)
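A minimal sketch of a staging load under these rules, with a made-up STG_CUSTOMER table: truncate first, stamp every row with a staging sequence number, the batch load date and the record source, and keep the source columns untouched.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE STG_CUSTOMER (
    STG_SQN       INTEGER PRIMARY KEY,  -- staging sequence number
    LDDTS         TEXT NOT NULL,        -- load date (same for the whole batch)
    RSRC          TEXT NOT NULL,        -- record source
    CUSTOMER_BK   TEXT,                 -- source columns kept as-is
    CUSTOMER_NAME TEXT
)""")

def load_staging(conn, source_rows, record_source):
    """Truncate-and-load: staging keeps no history, only the current batch."""
    load_dts = datetime.now(timezone.utc).isoformat()
    conn.execute("DELETE FROM STG_CUSTOMER")  # truncate before every load
    conn.executemany(
        "INSERT INTO STG_CUSTOMER (LDDTS, RSRC, CUSTOMER_BK, CUSTOMER_NAME) "
        "VALUES (?, ?, ?, ?)",
        [(load_dts, record_source, bk, name) for bk, name in source_rows])
    conn.commit()

load_staging(conn, [("CUST-001", "Alice"), ("CUST-002", "Bob")], "CRM")
```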
Goals for stage table loads
- Parallel – all stage loads should be independent of any other stage load
- Run when data is ready on the source
- Big network bandwidth, lots of memory, lots of disk space
Data type changes going into staging tables
- Char/string to datetime
- Char/string to numeric
- Split over-loaded fields
- Char to varchar, trim
- Upper case all strings, or lower case
Best practice: if disk space is available, store both the original value in char mode (prior to the conversion) and the new value in the “domain aligned field” (see the sketch after the default values below).
Default values
Just suggestions here, you can use whatever values you want, so long as they are in agreement with your user.
- Defaults for datetime: 1/1/1970 00:00:00.0000
- Defaults for numeric: NULL
- Defaults for char: (1)
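A small sketch of the domain alignment with these defaults, using made-up field names: try the conversion, fall back to the agreed default when the value doesn't parse, and keep the original char value next to the aligned one.

```python
from datetime import datetime

DEFAULT_DATETIME = datetime(1970, 1, 1)  # agreed default for datetimes
DEFAULT_NUMERIC = None                   # agreed default for numerics (NULL)

def to_datetime(raw):
    """Char/string to datetime; keep the raw value next to the aligned value."""
    try:
        aligned = datetime.strptime(raw.strip(), "%Y-%m-%d %H:%M:%S")
    except (ValueError, AttributeError):
        aligned = DEFAULT_DATETIME
    return {"ORDER_DATE_ORIG": raw, "ORDER_DATE": aligned}

def to_numeric(raw):
    """Char/string to numeric; NULL (None) when the value does not parse."""
    try:
        aligned = float(raw.strip())
    except (ValueError, AttributeError):
        aligned = DEFAULT_NUMERIC
    return {"AMOUNT_ORIG": raw, "AMOUNT": aligned}

print(to_datetime("2012-05-01 10:30:00"))  # parses cleanly
print(to_datetime("not a date"))           # falls back to 1/1/1970
print(to_numeric("12.50"), to_numeric(""))
```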
Batch – HUB loading
Watch for duplicates across staging tables, and also make sure the case chosen is consistent (upper/lower/etc.).
- Gather a unique union of ALL DISTINCT keys in the staging tables that are targeted to that hub.
- Check to see if your records already exist in your target table.
- If your records already exist in the target table, filter them out from your load.
- If your records don't already exist, assign the next sequence number and insert.
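A minimal sketch of these hub load steps, with the made-up STG_CUSTOMER and HUB_CUSTOMER tables and SQLite for illustration: distinct business keys from staging, minus the keys already in the hub, inserted with the next sequence.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STG_CUSTOMER (CUSTOMER_BK TEXT, RSRC TEXT);
CREATE TABLE HUB_CUSTOMER (SQN_CUSTOMER INTEGER PRIMARY KEY, CUSTOMER_BK TEXT UNIQUE,
                           LDDTS TEXT, RSRC TEXT);
INSERT INTO STG_CUSTOMER VALUES ('CUST-001', 'CRM'), ('cust-001', 'CRM'), ('CUST-002', 'CRM');
""")

load_dts = datetime.now(timezone.utc).isoformat()

# Distinct business keys (upper-cased for a consistent case), minus the keys
# already present in the hub; the INTEGER PRIMARY KEY hands out the next
# sequence number on insert.
conn.execute("""
    INSERT INTO HUB_CUSTOMER (CUSTOMER_BK, LDDTS, RSRC)
    SELECT UPPER(s.CUSTOMER_BK), ?, MIN(s.RSRC)
    FROM STG_CUSTOMER s
    WHERE UPPER(s.CUSTOMER_BK) NOT IN (SELECT CUSTOMER_BK FROM HUB_CUSTOMER)
    GROUP BY UPPER(s.CUSTOMER_BK)
""", (load_dts,))
conn.commit()

print(conn.execute("SELECT SQN_CUSTOMER, CUSTOMER_BK FROM HUB_CUSTOMER").fetchall())
# e.g. [(1, 'CUST-001'), (2, 'CUST-002')]
```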
Batch – LINK loading
The goal of loading links is to have NO DUPLICATES across business key structures (sequence IDs). Each representation of an intersection or relationship must be unique.
- Gather the business keys.
- Default the NULL business keys to an unknown value or “zero record” in the hub.
- Go get the hub sequence ID for that individual business key. Repeat this for all hub keys.
- Check to see if this collection of sequences already exists in the link; if so, filter them out.
- If it doesn’t exist, generate a new sequence for the link and insert it.
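A minimal sketch of the link load, with made-up tables again: NULL business keys default to the hub's "zero record", each key is resolved to its hub sequence, and only combinations the link does not have yet get a new link sequence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HUB_CUSTOMER (SQN_CUSTOMER INTEGER PRIMARY KEY, CUSTOMER_BK TEXT UNIQUE);
CREATE TABLE HUB_PRODUCT  (SQN_PRODUCT  INTEGER PRIMARY KEY, PRODUCT_BK  TEXT UNIQUE);
CREATE TABLE STG_SALES    (CUSTOMER_BK TEXT, PRODUCT_BK TEXT, RSRC TEXT);
CREATE TABLE LINK_CUSTOMER_PRODUCT (
    SQN_CUSTOMER_PRODUCT INTEGER PRIMARY KEY,
    SQN_CUSTOMER INTEGER, SQN_PRODUCT INTEGER, LDDTS TEXT, RSRC TEXT,
    UNIQUE (SQN_CUSTOMER, SQN_PRODUCT));

-- sequence 0 is the "zero record" for unknown business keys
INSERT INTO HUB_CUSTOMER VALUES (0, 'UNKNOWN'), (1, 'CUST-001');
INSERT INTO HUB_PRODUCT  VALUES (0, 'UNKNOWN'), (1, 'PROD-001');
INSERT INTO STG_SALES VALUES ('CUST-001', 'PROD-001', 'POS'), ('CUST-001', NULL, 'POS');
""")

# Resolve business keys to hub sequences (NULL keys fall back to the zero
# record), then insert only the combinations the link does not have yet.
conn.execute("""
    INSERT INTO LINK_CUSTOMER_PRODUCT (SQN_CUSTOMER, SQN_PRODUCT, LDDTS, RSRC)
    SELECT DISTINCT hc.SQN_CUSTOMER, hp.SQN_PRODUCT, datetime('now'), s.RSRC
    FROM STG_SALES s
    JOIN HUB_CUSTOMER hc ON hc.CUSTOMER_BK = COALESCE(s.CUSTOMER_BK, 'UNKNOWN')
    JOIN HUB_PRODUCT  hp ON hp.PRODUCT_BK  = COALESCE(s.PRODUCT_BK,  'UNKNOWN')
    WHERE NOT EXISTS (
        SELECT 1 FROM LINK_CUSTOMER_PRODUCT l
        WHERE l.SQN_CUSTOMER = hc.SQN_CUSTOMER AND l.SQN_PRODUCT = hp.SQN_PRODUCT)
""")
conn.commit()

print(conn.execute(
    "SELECT SQN_CUSTOMER, SQN_PRODUCT FROM LINK_CUSTOMER_PRODUCT").fetchall())
# e.g. [(1, 1), (1, 0)]
```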
Batch – SATELLITE loading
Satellite loading has two objectives: 1) load only delta changes; 2) split the work by type of data and rate of change.
- Gather the link or hub surrogate key for the dependent satellite.
- Join those SKs to a current load-date table, and compare the most recent data in the staging to the satellite.
- Select only those rows that have changed.
- Place the new load-date into the PK of the satellite.
- Track the rows to be inserted, to make it easy to end-date old rows in the next pass, as well as easy to update the PIT and bridge tables.
- Insert the new satellite records.
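A minimal sketch of the delta check, with made-up tables: compare the staged values against the most recent satellite row per key and insert a new row, with the new load date in the PK, only when something actually changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STG_CUSTOMER (SQN_CUSTOMER INTEGER, CUSTOMER_NAME TEXT, RSRC TEXT);
CREATE TABLE SAT_CUSTOMER_DETAILS (
    SQN_CUSTOMER INTEGER, LDDTS TEXT, LEDTS TEXT, RSRC TEXT, CUSTOMER_NAME TEXT,
    PRIMARY KEY (SQN_CUSTOMER, LDDTS));

-- existing history: key 1 is unchanged in the new feed, key 2 has changed
INSERT INTO SAT_CUSTOMER_DETAILS VALUES (1, '2012-01-01', NULL, 'CRM', 'Alice'),
                                        (2, '2012-01-01', NULL, 'CRM', 'Bob');
INSERT INTO STG_CUSTOMER VALUES (1, 'Alice', 'CRM'), (2, 'Robert', 'CRM');
""")

load_dts = "2012-02-01"

# Compare staging to the current (most recent) satellite row per key and insert
# only the rows whose descriptive attributes changed; the new load date becomes
# part of the satellite primary key.
conn.execute("""
    INSERT INTO SAT_CUSTOMER_DETAILS (SQN_CUSTOMER, LDDTS, LEDTS, RSRC, CUSTOMER_NAME)
    SELECT s.SQN_CUSTOMER, ?, NULL, s.RSRC, s.CUSTOMER_NAME
    FROM STG_CUSTOMER s
    LEFT JOIN SAT_CUSTOMER_DETAILS cur
           ON cur.SQN_CUSTOMER = s.SQN_CUSTOMER
          AND cur.LDDTS = (SELECT MAX(LDDTS) FROM SAT_CUSTOMER_DETAILS x
                           WHERE x.SQN_CUSTOMER = s.SQN_CUSTOMER)
    WHERE cur.SQN_CUSTOMER IS NULL                  -- brand-new key
       OR cur.CUSTOMER_NAME IS NOT s.CUSTOMER_NAME  -- or a real delta
""", (load_dts,))
conn.commit()

print(conn.execute(
    "SELECT SQN_CUSTOMER, LDDTS, CUSTOMER_NAME FROM SAT_CUSTOMER_DETAILS "
    "WHERE LDDTS = '2012-02-01'").fetchall())
# [(2, '2012-02-01', 'Robert')] -- only the changed key got a new row
```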
Tracking and Traceability
- Loading load dates
- Every record in a batch cycle must contain exactly the same load-date and time UNLESS each record is tied by a load cycle ID. This way, the loaded group can be backed out and re-applied if necessary.
- Load-end dates
- Load end dates should be 1 second or 1 millisecond BEFORE the next active record, depending on the grain of the RDBMS engine (I need to understand this better; see the sketch at the end of this section). This provides a temporal view of the data, allowing queries to execute spatial-temporal analysis.
- Last seen dates
- Last seen dates are generated by the load process. They can be defined two ways, either as the last time we saw the data for the satellite, or the last time we saw the hub key on the feed.
- Tracking satellite
- The tracking satellite is a system-generated, system-managed satellite. It contains fields that are generated during the load cycles. It tracks the history of arrival and updates to the other tables like hubs, links and satellites.
- Record sources
- Record sources are the metadata about where the data in that record came from. They can contain the source system, source application, and sometimes even the source function. The more granular, the better.
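Back to the load-end dates above, here is a small sketch of how I read that rule, assuming millisecond grain and made-up dates: each historical row is end-dated just before the next row's load date, and the current row stays open.

```python
from datetime import datetime, timedelta

# Load dates of successive satellite rows for one hub key (oldest first).
load_dates = [datetime(2012, 1, 1), datetime(2012, 2, 1), datetime(2012, 3, 1)]

# Assuming millisecond grain: each row is end-dated 1 ms before the next row's
# load date, so the records tile time with no gaps and no overlaps. The most
# recent row stays open (end date None/NULL).
grain = timedelta(milliseconds=1)
end_dates = [nxt - grain for nxt in load_dates[1:]] + [None]

for start, end in zip(load_dates, end_dates):
    print(start, "->", end)
# 2012-01-01 00:00:00 -> 2012-01-31 23:59:59.999000
# 2012-02-01 00:00:00 -> 2012-02-29 23:59:59.999000
# 2012-03-01 00:00:00 -> None
```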