When we start modeling using the Data Vault methodology, the first and main component of the model is Hubs. The hub is the heart of our model because it represents the key business entities that we want to capture in the data warehouse.
Hub is a table that is responsible for the unique identification of business entities in the data warehouse. Each record in the hub represents a unique business key received from one or more sources.
Key principles of working with hubs:
-
Data minimalism:
The hub contains only:- A unique business key (or surrogate key).
- Metadata about the data load date (Load Date).
- Data source (Source System), if required.
All other data (e.g., entity attributes or relationships between entities) are stored in satellites and links.
-
Flexibility and scalability:
Hubs remain unchanged even if new sources are added or business rules change. -
Focus on uniqueness:
One business key always corresponds to one record in the hub, regardless of which source the data came from.
Example: Hub "Passengers"
Imagine we are creating a data warehouse for an airline. We need to consider passenger data from various systems:
- Booking system (BookingSystem).
- CRM system (CRMSystem).
- Flight management system (FlightSystem).
Each of these systems contains unique business keys for the same passenger:
- Passport number from BookingSystem.
- Unique customer ID from CRMSystem.
- Reservation number (PNR) from FlightSystem.
Hub_Passengers structure
The "Passengers" hub stores only unique business keys and metadata:
Surrogate Key | Business Key | Load Date | Source System |
---|---|---|---|
1 | 1234567890 | 2025-01-03 | BookingSystem |
1 | CUST001 | 2025-01-03 | CRMSystem |
1 | PNR12345 | 2025-01-03 | FlightSystem |
- Surrogate Key: A unique key that links different business keys to a single passenger.
- Business Key: Unique identifiers of a passenger in different systems.
- Load Date: The date when information was loaded into the hub.
- Source System: The source of the data.
Attributes and Relationships:
Let's jump ahead a little...
All additional information (such as surname, date of birth, contacts) about the passenger is stored in satellites, while relationships between passengers and their business keys are stored in links.
Example of a Satellite Table:
Hub Key | Surname | Date of Birth | Source System | Load Date | End Date |
---|---|---|---|---|---|
1 | Smith | 1990-05-10 | CRMSystem | 2025-01-03 | NULL |
- All entity attributes are stored in the satellite.
- Data historization is supported (through Load Date and End Date).
A question may arise: "Why did you put Date of Birth in the satellite? This parameter cannot change for a client."
You are absolutely right — Date of Birth is an immutable attribute of the client. At first glance, it could indeed be placed in the hub because it has a static nature. However, in the Data Vault methodology, there are specific rules explaining why such attributes are better placed in satellites:
1. Centralization of Business Keys in the Hub
The hub should contain only:
- A unique business key.
- A system surrogate key (if used).
- Metadata for loading (Load Date and Source System, if necessary).
All other attributes, even immutable ones, relate to business data rather than entity identification. According to Data Vault rules, all business data (even immutable ones) should be placed in satellites.
2. Flexibility and Scalability
Storing immutable attributes in a satellite allows:
- Adding or modifying the data structure without changing the hub.
- Preserving the original information, even if it was incorrect during the initial load (e.g., the date of birth could have been entered incorrectly).
Conclusion:
Hubs are the foundation of Data Vault. They are minimalist, scalable, and focused on the unique identification of entities. All business logic, attributes, and relationships are moved to satellites and links, allowing for the creation of flexible, long-lasting data warehouses.