The data loading process (ETL or ELT) is a key component of implementing Data Vault. This methodology requires a structured approach to extracting, transforming, and loading data to ensure accuracy, consistency, and scalability of the repository.
Loading Stages: Staging, Raw Vault, Business Vault
When working with Data Vault, the data loading process consists of three key stages: Staging, Raw Vault, and Business Vault. Each stage plays a role in ensuring data integrity, performance, and analytical value.
1. Staging: Preparatory Stage
Purpose:
Staging (intermediate storage) is used for loading and temporarily storing data received from various sources.
Characteristics:
- Data in Staging is stored in its raw form.
- No long-term storage: data is typically cleared after processing.
- Used for initial data quality checks.
Typical Actions in the Staging Stage:
- Data Extraction: Retrieving data from sources (files, databases, APIs).
- Standardization: Converting data formats to a unified form (e.g., date or encoding transformations).
- Data Validation: Identifying missing, duplicate, or incorrect values.
Example:
A source provides passenger data:
PassengerID | Name | DateOfBirth | SourceSystem | LoadDate |
---|---|---|---|---|
C001 | John Smith | 1980-01-01 | CRM | 2025-01-01 |
2. Raw Vault: Data Storage in Raw Form
Purpose:
Raw Vault is the central repository where data from sources is stored in its raw form, supplemented with Data Vault keys and other technical attributes.
Characteristics:
- Tracking change history: any data changes are recorded.
- Independence from data sources.
- Primary focus on integrity and reliability.
Actions in the Raw Vault Stage:
- Creating Hubs: Generating unique surrogate keys for business keys.
- Creating Links: Defining relationships between hubs.
- Creating Satellites: Storing additional attributes and historical changes.
Example:
Loading a passenger into Hub_Passengers:
HubPassengerKey | PassengerID | LoadDate | SourceSystem |
---|---|---|---|
1 | C001 | 2025-01-01 | CRM |
Loading name and date of birth into Sat_PassengerDetails:
HubPassengerKey | Name | DateOfBirth | LoadDate | SourceSystem |
---|---|---|---|---|
1 | John Smith | 1980-01-01 | 2025-01-01 | CRM |
3. Business Vault: Data Storage for Analytics
Purpose:
Business Vault extends Raw Vault by adding business logic, aggregates, and additional calculations to prepare data for analytics.
Characteristics:
- Data is processed and enriched based on business rules.
- All intermediate results are preserved for transparency.
- Traceability is maintained (ability to track data origin).
Actions in the Business Vault Stage:
- Data Enrichment: Calculating KPIs, combining data from different sources.
- Creating Point-in-Time Tables: Simplifying access to data for a specific moment.
- Aggregation: Building bridge tables or satellites with aggregated values.
Example:
Calculating the total number of flights for a passenger:
HubPassengerKey | TotalFlights | CalculationDate | LoadDate |
---|---|---|---|
1 | 5 | 2025-01 | 2025-02-01 |
Advantages of the Three-Stage Loading Process
-
Flexibility:
Clear separation of stages allows the model to adapt to changes in sources or business requirements. -
Integrity:
Data history is preserved at the Raw Vault level, preventing information loss. -
Transparency:
Each stage provides data in a format suitable for its purpose:- Staging – for validation and preparation.
- Raw Vault – for storage and integration.
- Business Vault – for analytics and reporting.
Conclusion
The three-stage data loading process is the foundation of Data Vault. It ensures reliable data storage, accessibility for analytics, and adaptability to changes in business logic. Proper use of the Staging, Raw Vault, and Business Vault stages makes the system flexible, transparent, and scalable.
Key Process Stages
1. Data Extraction (Extract)
At this stage, data is extracted from various sources, including databases, file storage, APIs, and real-time data streams.
- Purpose: Obtain data in its rawest form, preserving all attributes.
- Example tools: Apache NiFi, SSIS, Talend.
Recommendations:
- Extract data in its original encoding and format.
- Preserve all primary business keys and technical attributes (e.g., timestamps).
2. Data Preparation (Load/Transform)
This stage can be divided into two approaches:
-
ETL (Extract, Transform, Load):
Data is transformed first and then loaded into the repository.- Used for scenarios requiring preliminary data cleaning before loading.
- Example: aggregation, deduplication.
-
ELT (Extract, Load, Transform):
Data is loaded in raw form into the repository and then transformed using SQL or processing tools.- Suitable for large data volumes and leveraging DBMS capabilities.
- Example tools: dbt, Snowflake, BigQuery.
Data Loading Process in Data Vault:
- Hubs:
Contain unique business keys and a minimal set of attributes (e.g., load date).- During loading, verify the uniqueness of business keys.
- Generate a surrogate key for each new business key.
- Links:
Describe relationships between hubs.- Verify the correspondence of business keys in related hubs.
- Generate unique surrogate keys for each new relationship.
- Satellites:
Store attributes related to hubs or links.- Data loading follows temporal logic (Load Date and End Date).
- Maintain change history for all attributes.
Example of an ETL Process for Data Vault
Source Data:
- Source 1 (CRM):
CustomerID Name Email CUST001 John Doe This email address is being protected from spambots. You need JavaScript enabled to view it. - Source 2 (Booking):
BookingID CustomerID FlightID B001 CUST001 F123
Process:
-
Customer Hub (Hub_Passengers):
- CustomerID is extracted.
- A Surrogate Key is generated for each new unique CustomerID.
HubKey CustomerID LoadDate 1 CUST001 2025-01-01 -
Flight Hub (Hub_Flights):
- FlightID is extracted from the booking system.
- A Surrogate Key is generated for each unique FlightID.
HubKey FlightID LoadDate 101 F123 2025-01-02 -
Link (Link_PassengerFlights):
- A relationship is created between CustomerID and FlightID via their surrogate keys.
LinkKey HubPassengerKey HubFlightKey LoadDate 1001 1 101 2025-01-02 -
Customer Satellite (Sat_PassengerDetails):
- Customer attributes (Name, Email) are loaded.
- Change history is maintained.
HubPassengerKey Name Email LoadDate EndDate 1 John Doe This email address is being protected from spambots. You need JavaScript enabled to view it. 2025-01-01 NULL
Why BookingID from Source 2 (Booking) Cannot Be Used in Link_PassengerFlights
Using BookingID from the data source as a key for the relationship in Link_PassengerFlights contradicts the core principles of Data Vault. Here’s why it is not recommended:
1. Short Lifespan of Business Keys
Business keys, such as BookingID, depend on the data source that provides them. They can:
- Change over time (e.g., during system migration or changes in identification logic).
- Be deleted or modified in the source system.
If a business key from an external system changes, it will disrupt the relationship in the model and compromise data integrity.
2. Potential Conflicts Between Sources
If data comes from multiple sources, situations may arise where:
- Two different sources use the same BookingID to represent different entities.
- One source changes the format or length of BookingID.
In such cases, the relationship based on BookingID will be incorrect.
3. Data Vault Principles: Using Surrogate Keys
Data Vault uses surrogate keys (replacement keys) for all hubs and links to:
- Ensure model independence from source structures.
- Simplify maintenance and data migration.
- Avoid issues with duplicate or changing business keys.
In your example, surrogate keys are used for the relationship in Link_PassengerFlights:
- HubPassengerKey (generated in Hub_Passengers).
- HubFlightKey (generated in Hub_Flights).
4. Storing BookingID in a Satellite
Although BookingID is not used to create the relationship, it can be stored in a satellite associated with Link_PassengerFlights, for example:
LinkKey | BookingID | LoadDate | EndDate |
---|---|---|---|
1001 | B001 | 2025-01-02 | NULL |
This allows:
- Preserving the original business key for auditing and tracing.
- Using BookingID in analytics without compromising the model’s structural integrity.
5. Scalability and Model Unification
Using surrogate keys allows relationships between hubs to be built independently of future data sources. If a new data source with a different booking identifier is added, the repository structure remains unchanged.
Summary
BookingID can be used for analysis but not for building relationships within Data Vault. Using surrogate keys ensures independence, scalability, and long-term data integrity.
Recommendations for Successful Loading
- Log Each Load:
Record load status, row count, errors, and execution time. - Maintain Idempotency:
Data loading should be repeatable without duplicating data. - Use Parallel Processes:
Split data into independent streams (e.g., hubs, links, satellites) to speed up loading. - Ensure Data Quality Monitoring:
Verify data integrity at all loading stages.
Conclusion
The ETL/ELT process in Data Vault requires strict adherence to methodology and the use of tools optimized for processing large data volumes. This process ensures flexibility, scalability, and preservation of change history, making Data Vault an ideal choice for modern data repositories.