Metadata Management is a key process that ensures transparency, ease of administration, and support for changes in the data warehouse. Metadata describes the structure of data, its origin, quality, and the relationships between elements of the model.
Why is metadata management needed?
- Simplified administration: Metadata allows for the automation of data loading and updating tasks.
- Flexibility: When data sources or business requirements change, the system easily adapts.
- Transparency: Metadata provides insight into the origin of data (data lineage) and its usage.
- Data quality: Enables tracking and resolving issues related to duplication, missing data, or incorrect data.
Main types of metadata
1. Technical metadata:
   - Descriptions of table structures (hubs, links, satellites).
   - Information about keys, indexes, and data types.

   Example: the table Metadata_Tables contains descriptions of all tables in the Data Vault.

```sql
CREATE TABLE Metadata_Tables (
    TableName    NVARCHAR(100),
    TableType    NVARCHAR(50),   -- Hub, Link, Satellite
    SourceSystem NVARCHAR(100),
    CreatedDate  DATE
);

INSERT INTO Metadata_Tables VALUES
    ('Hub_Passengers',       'Hub',       'BookingSystem', '2025-01-01'),
    ('Sat_PassengerDetails', 'Satellite', 'BookingSystem', '2025-01-01');
```
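For instance, a loading framework can use this catalog to decide which tables to process and in what order. A minimal sketch (the source-system filter is illustrative):

```sql
-- List all Data Vault tables fed by one source system, hubs first,
-- so an automated loader can process them in dependency order.
SELECT TableName, TableType
FROM Metadata_Tables
WHERE SourceSystem = 'BookingSystem'
ORDER BY CASE TableType WHEN 'Hub' THEN 1 WHEN 'Link' THEN 2 ELSE 3 END;
```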
2. Business metadata:
   - Descriptions of business keys and their values.
   - Information about data transformation rules.

   Example: the table Metadata_BusinessKeys describes how business keys are linked to their sources.

```sql
CREATE TABLE Metadata_BusinessKeys (
    BusinessKeyName    NVARCHAR(100),
    SourceSystem       NVARCHAR(100),
    TransformationRule NVARCHAR(200)
);

INSERT INTO Metadata_BusinessKeys VALUES
    ('PassengerID', 'CRMSystem',     'Trim and Uppercase'),
    ('BookingID',   'BookingSystem', 'As Is');
```
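A code generator could translate these rule names into SQL expressions when building loading scripts. A hedged sketch (the rule-to-expression mapping below is an assumption, not part of the model):

```sql
-- Map each rule name from Metadata_BusinessKeys to the SQL expression
-- a generator would splice into its loading scripts.
SELECT
    BusinessKeyName,
    CASE TransformationRule
        WHEN 'Trim and Uppercase' THEN 'UPPER(LTRIM(RTRIM(' + BusinessKeyName + ')))'
        WHEN 'As Is'              THEN BusinessKeyName
        ELSE BusinessKeyName  -- fall back to the raw column for unmapped rules
    END AS GeneratedExpression
FROM Metadata_BusinessKeys;
```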
3. Operational metadata:
   - Logs of ETL/ELT processes.
   - Data loading history (dates, volumes, statuses).

   Example: the table Metadata_LoadHistory stores information about loading processes.

```sql
CREATE TABLE Metadata_LoadHistory (
    TableName  NVARCHAR(100),
    LoadDate   DATETIME,
    RowCount   INT,
    LoadStatus NVARCHAR(50)
);

INSERT INTO Metadata_LoadHistory VALUES
    ('Hub_Passengers',       '2025-01-02 10:00:00', 1000, 'Success'),
    ('Sat_PassengerDetails', '2025-01-02 10:10:00', 2000, 'Success');
```
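This history is what makes the data quality tracking mentioned above practical. A minimal sketch (the 100-row threshold is an illustrative assumption):

```sql
-- Surface failed loads and suspiciously small loads from the history.
SELECT TableName, LoadDate, RowCount, LoadStatus
FROM Metadata_LoadHistory
WHERE LoadStatus <> 'Success'
   OR RowCount < 100
ORDER BY LoadDate DESC;
```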
Example: Managing changes using metadata
Scenario: a new field, CustomerTier, has been added to the data source and needs to be integrated into the model.

1. Updating metadata:
   Records for the new field are added to the metadata tables.

```sql
INSERT INTO Metadata_Tables VALUES
    ('Sat_CustomerDetails', 'Satellite', 'CRMSystem', '2025-01-03');

INSERT INTO Metadata_BusinessKeys VALUES
    ('CustomerTier', 'CRMSystem', 'As Is');
```
2. Automating ETL process generation:
   Based on the metadata, the SQL script for processing the new field is generated automatically.

```sql
INSERT INTO Sat_CustomerDetails (HubCustomerKey, LoadDate, CustomerTier)
SELECT
    HubCustomerKey,
    GETDATE() AS LoadDate,
    CustomerTier
FROM Staging_Customers;
```
Tools for metadata management
1. Metadata management systems:
   - Apache Atlas: for tracking lineage and ensuring compliance.
   - Collibra: for data cataloging and business metadata management.

2. Metadata repositories:
   - Dedicated tables in a relational database that store the metadata, as in the examples above.

3. BI tools:
   - Power BI or Tableau can integrate with metadata repositories to provide data quality analytics (see the sketch below).
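As an illustration, a reporting view over the operational metadata gives a BI tool something to visualize directly. A minimal sketch (the view name and shape are assumptions):

```sql
-- Daily load-quality summary per table, suitable as a BI data source.
CREATE VIEW vw_LoadQuality AS
SELECT
    TableName,
    CAST(LoadDate AS DATE) AS LoadDay,
    SUM(RowCount)          AS TotalRows,
    SUM(CASE WHEN LoadStatus <> 'Success' THEN 1 ELSE 0 END) AS FailedLoads
FROM Metadata_LoadHistory
GROUP BY TableName, CAST(LoadDate AS DATE);
```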
Conclusion
Metadata management is an integral part of effective Data Vault operations. With a structured approach and automation based on metadata, administration can be simplified, changes can be implemented faster, and high data quality can be ensured. Using metadata not only improves technical implementation but also makes data more transparent and accessible to end users.
Let's take a closer look at one of these topics: automating ETL process generation.

Automating ETL process generation
Automating ETL processes in Data Vault significantly reduces development time, minimizes errors, and ensures standardization. Using metadata, code can be dynamically generated to load data into hubs, links, and satellites, as well as to handle changes in data sources.
Benefits of automation
- Development speed: Generating ETL code based on metadata allows for quick creation of new processes.
- Flexibility: Easier to adapt the model to changes in data sources.
- Consistency: Automatically generated code follows established standards.
- Scalability: Simplifies the maintenance of large and complex models.
Approaches to automation
1. Using metadata:
   Metadata describes table structures, the relationships between them, and data transformation rules.

   Example metadata table:

```sql
CREATE TABLE Metadata_ETL (
    TableName          NVARCHAR(100),
    ColumnName         NVARCHAR(100),
    ColumnType         NVARCHAR(50),
    SourceTable        NVARCHAR(100),
    SourceColumn       NVARCHAR(100),
    TransformationRule NVARCHAR(200)
);

INSERT INTO Metadata_ETL VALUES
    ('Hub_Passengers',        'PassengerKey',    'INT',           'Staging_Passengers', 'PassengerID', 'Generate Surrogate Key'),
    ('Sat_PassengerDetails',  'Name',            'NVARCHAR(100)', 'Staging_Passengers', 'FullName',    'As Is'),
    ('Link_PassengerFlights', 'HubPassengerKey', 'INT',           'Staging_Bookings',   'PassengerID', 'Generate Surrogate Key');
```
2. Dynamic SQL code generation:
   Using the metadata, SQL loading scripts can be generated automatically.

   Example of dynamic SQL generation: based on the metadata, a script is built that loads data into a hub. Note that the columns are aggregated per target table so that a single INSERT is produced rather than one per metadata row (STRING_AGG requires SQL Server 2017 or later).

```sql
DECLARE @SQL NVARCHAR(MAX);

SELECT @SQL =
    'INSERT INTO ' + TableName
    + ' (' + STRING_AGG(ColumnName, ', ') + ') SELECT '
    + STRING_AGG(SourceColumn, ', ')
    + ' FROM ' + MAX(SourceTable) + ';'
FROM Metadata_ETL
WHERE TableName = 'Hub_Passengers'
GROUP BY TableName;

EXEC sp_executesql @SQL;
```
3. ETL process templates:
   Universal ETL templates are created with parameters that are filled in from the metadata.

   Example template:

```sql
INSERT INTO {TargetTable} ({TargetColumns})
SELECT {SourceColumns}
FROM {SourceTable};
```

   Python script to fill the template:

```python
import pandas as pd

# 'connection' is assumed to be an open DB-API or SQLAlchemy connection
metadata = pd.read_sql("SELECT * FROM Metadata_ETL", connection)

# Generate one INSERT statement per target table
for table in metadata['TableName'].unique():
    rows = metadata[metadata['TableName'] == table]
    target_columns = ", ".join(rows['ColumnName'])
    source_columns = ", ".join(rows['SourceColumn'])
    source_table = rows['SourceTable'].iloc[0]
    sql = f"""
    INSERT INTO {table} ({target_columns})
    SELECT {source_columns}
    FROM {source_table};
    """
    print(sql)  # print while debugging; execute via the connection in production
```
4. Automation tools:
   - dbt (Data Build Tool): for writing and managing SQL templates (see the sketch below).
   - Talend, Informatica, Apache NiFi: automation through graphical interfaces.
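As an illustration of the dbt approach, a hub load can be expressed as a model whose SQL is templated with Jinja. A minimal sketch; the model file name, the source definition, and the incremental filter column are assumptions about a hypothetical project:

```sql
-- models/hub_passengers.sql (hypothetical dbt model)
-- Assumes a 'staging' source with Staging_Passengers declared in sources.yml.
{{ config(materialized='incremental') }}

SELECT
    HASHBYTES('SHA1', CAST(PassengerID AS NVARCHAR(MAX))) AS PassengerKey,
    GETDATE() AS LoadDate
FROM {{ source('staging', 'Staging_Passengers') }}
{% if is_incremental() %}
WHERE LoadTimestamp > (SELECT MAX(LoadDate) FROM {{ this }})
{% endif %}
```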
Example: Automatic data loading into a hub
Scenario: data needs to be loaded into the hub Hub_Passengers from the staging table Staging_Passengers.

1. Step 1: Adding metadata
   The metadata describes how the staging-table columns map to the hub. PassengerKey is typed BINARY(20) so that it matches the SHA1 hash generated in step 2.

```sql
INSERT INTO Metadata_ETL VALUES
    ('Hub_Passengers', 'PassengerKey', 'BINARY(20)', 'Staging_Passengers', 'PassengerID',   'Generate Surrogate Key'),
    ('Hub_Passengers', 'LoadDate',     'DATE',       'Staging_Passengers', 'LoadTimestamp', 'CAST to DATE');
```
2. Step 2: Generating SQL code
   Based on the metadata, the SQL script is produced.

```sql
INSERT INTO Hub_Passengers (PassengerKey, LoadDate)
SELECT
    HASHBYTES('SHA1', CAST(PassengerID AS NVARCHAR(MAX))) AS PassengerKey,
    CAST(LoadTimestamp AS DATE) AS LoadDate
FROM Staging_Passengers;
```
3. Step 3: Automatic execution
   An ETL framework (e.g., Python or dbt) fills the template and executes the generated code, as sketched below.
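One way to make this step schedulable is to wrap generation and execution in a stored procedure that a scheduler (SQL Agent, Airflow, a Python runner) can call per target table. A hedged sketch; the procedure name, the success-only logging, and the omission of TransformationRule handling are simplifying assumptions:

```sql
CREATE PROCEDURE Load_FromMetadata
    @TargetTable NVARCHAR(100)
AS
BEGIN
    DECLARE @SQL NVARCHAR(MAX), @Rows INT;

    -- Rebuild the INSERT for the requested table from Metadata_ETL
    -- (TransformationRule handling omitted for brevity).
    SELECT @SQL =
        'INSERT INTO ' + TableName
        + ' (' + STRING_AGG(ColumnName, ', ') + ') SELECT '
        + STRING_AGG(SourceColumn, ', ')
        + ' FROM ' + MAX(SourceTable) + ';'
    FROM Metadata_ETL
    WHERE TableName = @TargetTable
    GROUP BY TableName;

    -- Capture the affected row count from inside the dynamic batch
    SET @SQL = @SQL + N' SET @cnt = @@ROWCOUNT;';
    EXEC sp_executesql @SQL, N'@cnt INT OUTPUT', @cnt = @Rows OUTPUT;

    -- Record the run in the operational metadata (see Metadata_LoadHistory)
    INSERT INTO Metadata_LoadHistory (TableName, LoadDate, RowCount, LoadStatus)
    VALUES (@TargetTable, GETDATE(), @Rows, 'Success');
END;
```

For example: EXEC Load_FromMetadata @TargetTable = 'Hub_Passengers';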
Conclusion
Automating ETL processes through metadata and SQL code generation makes the process of loading data into Data Vault fast, flexible, and standardized. This reduces development costs, minimizes the risk of errors, and allows for effective change management.