3.2. Data Load Processing (ETL ELT)

The data loading process (ETL or ELT) is a key component of implementing Data Vault. This methodology requires a structured approach to extracting, transforming, and loading data to ensure accuracy, consistency, and scalability of the repository.

Loading Stages: Staging, Raw Vault, Business Vault

When working with Data Vault, the data loading process consists of three key stages: Staging, Raw Vault, and Business Vault. Each stage plays a role in ensuring data integrity, performance, and analytical value.

1. Staging: Preparatory Stage

Purpose:
Staging (intermediate storage) is used for loading and temporarily storing data received from various sources.

Characteristics:

Data in Staging is stored in its raw form.
No long-term storage: data is typically cleared after processing.
Used for initial data quality checks.

Typical Actions in the Staging Stage:

Data Extraction: Retrieving data from sources (files, databases, APIs).
Standardization: Converting data formats to a unified form (e.g., date or encoding transformations).
Data Validation: Identifying missing, duplicate, or incorrect values.

Example:
A source provides passenger data:

PassengerID	Name	DateOfBirth	SourceSystem	LoadDate
C001	John Smith	1980-01-01	CRM	2025-01-01

2. Raw Vault: Data Storage in Raw Form

Purpose:
Raw Vault is the central repository where data from sources is stored in its raw form, supplemented with Data Vault keys and other technical attributes.

Characteristics:

Tracking change history: any data changes are recorded.
Independence from data sources.
Primary focus on integrity and reliability.

Actions in the Raw Vault Stage:

Creating Hubs: Generating unique surrogate keys for business keys.
Creating Links: Defining relationships between hubs.
Creating Satellites: Storing additional attributes and historical changes.

Example:
Loading a passenger into Hub_Passengers:

HubPassengerKey	PassengerID	LoadDate	SourceSystem
1	C001	2025-01-01	CRM

Loading name and date of birth into Sat_PassengerDetails:

HubPassengerKey	Name	DateOfBirth	LoadDate	SourceSystem
1	John Smith	1980-01-01	2025-01-01	CRM

3. Business Vault: Data Storage for Analytics

Purpose:
Business Vault extends Raw Vault by adding business logic, aggregates, and additional calculations to prepare data for analytics.

Characteristics:

Data is processed and enriched based on business rules.
All intermediate results are preserved for transparency.
Traceability is maintained (ability to track data origin).

Actions in the Business Vault Stage:

Data Enrichment: Calculating KPIs, combining data from different sources.
Creating Point-in-Time Tables: Simplifying access to data for a specific moment.
Aggregation: Building bridge tables or satellites with aggregated values.

Example:
Calculating the total number of flights for a passenger:

HubPassengerKey	TotalFlights	CalculationDate	LoadDate
1	5	2025-01	2025-02-01

Advantages of the Three-Stage Loading Process

Flexibility:
Clear separation of stages allows the model to adapt to changes in sources or business requirements.
Integrity:
Data history is preserved at the Raw Vault level, preventing information loss.
Transparency:
Each stage provides data in a format suitable for its purpose:
- Staging – for validation and preparation.
- Raw Vault – for storage and integration.
- Business Vault – for analytics and reporting.

Conclusion

The three-stage data loading process is the foundation of Data Vault. It ensures reliable data storage, accessibility for analytics, and adaptability to changes in business logic. Proper use of the Staging, Raw Vault, and Business Vault stages makes the system flexible, transparent, and scalable.

Key Process Stages

1. Data Extraction (Extract)

At this stage, data is extracted from various sources, including databases, file storage, APIs, and real-time data streams.

Purpose: Obtain data in its rawest form, preserving all attributes.
Example tools: Apache NiFi, SSIS, Talend.

Recommendations:

Extract data in its original encoding and format.
Preserve all primary business keys and technical attributes (e.g., timestamps).

2. Data Preparation (Load/Transform)

This stage can be divided into two approaches:

ETL (Extract, Transform, Load):
Data is transformed first and then loaded into the repository.
- Used for scenarios requiring preliminary data cleaning before loading.
- Example: aggregation, deduplication.
ELT (Extract, Load, Transform):
Data is loaded in raw form into the repository and then transformed using SQL or processing tools.
- Suitable for large data volumes and leveraging DBMS capabilities.
- Example tools: dbt, Snowflake, BigQuery.

Data Loading Process in Data Vault:

Hubs:
Contain unique business keys and a minimal set of attributes (e.g., load date).
- During loading, verify the uniqueness of business keys.
- Generate a surrogate key for each new business key.
Links:
Describe relationships between hubs.
- Verify the correspondence of business keys in related hubs.
- Generate unique surrogate keys for each new relationship.
Satellites:
Store attributes related to hubs or links.
- Data loading follows temporal logic (Load Date and End Date).
- Maintain change history for all attributes.

Example of an ETL Process for Data Vault

Source Data:

Source 1 (CRM):

CustomerID Name Email

CUST001 John Doe This email address is being protected from spambots. You need JavaScript enabled to view it.
Source 2 (Booking):

BookingID CustomerID FlightID

B001 CUST001 F123

	CustomerID	Name	Email
	CUST001	John Doe	This email address is being protected from spambots. You need JavaScript enabled to view it.

	BookingID	CustomerID	FlightID
	B001	CUST001	F123

Process:

Customer Hub (Hub_Passengers):
- CustomerID is extracted.
- A Surrogate Key is generated for each new unique CustomerID.
HubKey CustomerID LoadDate

1 CUST001 2025-01-01
Flight Hub (Hub_Flights):
- FlightID is extracted from the booking system.
- A Surrogate Key is generated for each unique FlightID.
HubKey FlightID LoadDate

101 F123 2025-01-02
Link (Link_PassengerFlights):
- A relationship is created between CustomerID and FlightID via their surrogate keys.
LinkKey HubPassengerKey HubFlightKey LoadDate

1001 1 101 2025-01-02
Customer Satellite (Sat_PassengerDetails):
- Customer attributes (Name, Email) are loaded.
- Change history is maintained.
HubPassengerKey Name Email LoadDate EndDate

1 John Doe This email address is being protected from spambots. You need JavaScript enabled to view it. 2025-01-01 NULL

HubKey	CustomerID	LoadDate
1	CUST001	2025-01-01

HubKey	FlightID	LoadDate
101	F123	2025-01-02

LinkKey	HubPassengerKey	HubFlightKey	LoadDate
1001	1	101	2025-01-02

HubPassengerKey	Name	Email	LoadDate	EndDate
1	John Doe	This email address is being protected from spambots. You need JavaScript enabled to view it.	2025-01-01	NULL

Why BookingID from Source 2 (Booking) Cannot Be Used in Link_PassengerFlights

Using BookingID from the data source as a key for the relationship in Link_PassengerFlights contradicts the core principles of Data Vault. Here’s why it is not recommended:

1. Short Lifespan of Business Keys

Business keys, such as BookingID, depend on the data source that provides them. They can:

Change over time (e.g., during system migration or changes in identification logic).
Be deleted or modified in the source system.

If a business key from an external system changes, it will disrupt the relationship in the model and compromise data integrity.

2. Potential Conflicts Between Sources

If data comes from multiple sources, situations may arise where:

Two different sources use the same BookingID to represent different entities.
One source changes the format or length of BookingID.

In such cases, the relationship based on BookingID will be incorrect.

3. Data Vault Principles: Using Surrogate Keys

Data Vault uses surrogate keys (replacement keys) for all hubs and links to:

Ensure model independence from source structures.
Simplify maintenance and data migration.
Avoid issues with duplicate or changing business keys.

In your example, surrogate keys are used for the relationship in Link_PassengerFlights:

HubPassengerKey (generated in Hub_Passengers).
HubFlightKey (generated in Hub_Flights).

4. Storing BookingID in a Satellite

Although BookingID is not used to create the relationship, it can be stored in a satellite associated with Link_PassengerFlights, for example:

LinkKey	BookingID	LoadDate	EndDate
1001	B001	2025-01-02	NULL

This allows:

Preserving the original business key for auditing and tracing.
Using BookingID in analytics without compromising the model’s structural integrity.

5. Scalability and Model Unification

Using surrogate keys allows relationships between hubs to be built independently of future data sources. If a new data source with a different booking identifier is added, the repository structure remains unchanged.

Summary

BookingID can be used for analysis but not for building relationships within Data Vault. Using surrogate keys ensures independence, scalability, and long-term data integrity.

Recommendations for Successful Loading

Log Each Load:
Record load status, row count, errors, and execution time.
Maintain Idempotency:
Data loading should be repeatable without duplicating data.
Use Parallel Processes:
Split data into independent streams (e.g., hubs, links, satellites) to speed up loading.
Ensure Data Quality Monitoring:
Verify data integrity at all loading stages.

Conclusion

The ETL/ELT process in Data Vault requires strict adherence to methodology and the use of tools optimized for processing large data volumes. This process ensures flexibility, scalability, and preservation of change history, making Data Vault an ideal choice for modern data repositories.

Loading Stages: Staging, Raw Vault, Business Vault

1. Staging: Preparatory Stage

2. Raw Vault: Data Storage in Raw Form

3. Business Vault: Data Storage for Analytics

Advantages of the Three-Stage Loading Process

Conclusion

Key Process Stages

1. Data Extraction (Extract)

2. Data Preparation (Load/Transform)

Example of an ETL Process for Data Vault

Source Data:

Process:

Why BookingID from Source 2 (Booking) Cannot Be Used in Link_PassengerFlights

1. Short Lifespan of Business Keys

2. Potential Conflicts Between Sources

3. Data Vault Principles: Using Surrogate Keys

4. Storing BookingID in a Satellite

5. Scalability and Model Unification

Summary

Recommendations for Successful Loading

Conclusion

Feedback from students on the DataVault course