Metadata Management is a key process that ensures transparency, ease of administration, and support for changes in the data warehouse. Metadata describes the structure of data, its origin, quality, and the relationships between elements of the model.
Why is metadata management needed?
- Simplified administration: Metadata allows for the automation of data loading and updating tasks.
- Flexibility: When data sources or business requirements change, the system easily adapts.
- Transparency: Metadata provides insight into the origin of data (data lineage) and its usage.
- Data quality: Enables tracking and resolving issues related to duplication, missing data, or incorrect data.
Main types of metadata
1. Technical metadata:
   - Descriptions of table structures (hubs, links, satellites).
   - Information about keys, indexes, and data types.

   Example: the table Metadata_Tables contains descriptions of all tables in the Data Vault.

```sql
CREATE TABLE Metadata_Tables (
    TableName    NVARCHAR(100),
    TableType    NVARCHAR(50),   -- Hub, Link, Satellite
    SourceSystem NVARCHAR(100),
    CreatedDate  DATE
);

INSERT INTO Metadata_Tables VALUES
    ('Hub_Passengers',       'Hub',       'BookingSystem', '2025-01-01'),
    ('Sat_PassengerDetails', 'Satellite', 'BookingSystem', '2025-01-01');
```
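For instance, a loading framework can use this catalog to decide which tables to process and in what order. A minimal sketch (the source-system filter is illustrative):

```sql
-- List all Data Vault tables fed by one source system, hubs first,
-- so an automated loader can process them in dependency order.
SELECT TableName, TableType
FROM Metadata_Tables
WHERE SourceSystem = 'BookingSystem'
ORDER BY CASE TableType WHEN 'Hub' THEN 1 WHEN 'Link' THEN 2 ELSE 3 END;
```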
2. Business metadata:
   - Descriptions of business keys and their values.
   - Information about data transformation rules.

   Example: the table Metadata_BusinessKeys describes how business keys are linked to their sources.

```sql
CREATE TABLE Metadata_BusinessKeys (
    BusinessKeyName    NVARCHAR(100),
    SourceSystem       NVARCHAR(100),
    TransformationRule NVARCHAR(200)
);

INSERT INTO Metadata_BusinessKeys VALUES
    ('PassengerID', 'CRMSystem',     'Trim and Uppercase'),
    ('BookingID',   'BookingSystem', 'As Is');
```
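A code generator could translate these rule names into SQL expressions when building loading scripts. A hedged sketch (the rule-to-expression mapping below is an assumption, not part of the model):

```sql
-- Map each rule name from Metadata_BusinessKeys to the SQL expression
-- a generator would splice into its loading scripts.
SELECT
    BusinessKeyName,
    CASE TransformationRule
        WHEN 'Trim and Uppercase' THEN 'UPPER(LTRIM(RTRIM(' + BusinessKeyName + ')))'
        WHEN 'As Is'              THEN BusinessKeyName
        ELSE BusinessKeyName  -- fall back to the raw column for unmapped rules
    END AS GeneratedExpression
FROM Metadata_BusinessKeys;
```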
3. Operational metadata:
   - Logs of ETL/ELT processes.
   - Data loading history (dates, volumes, statuses).

   Example: the table Metadata_LoadHistory stores information about loading processes.

```sql
CREATE TABLE Metadata_LoadHistory (
    TableName  NVARCHAR(100),
    LoadDate   DATETIME,
    RowCount   INT,
    LoadStatus NVARCHAR(50)
);

INSERT INTO Metadata_LoadHistory VALUES
    ('Hub_Passengers',       '2025-01-02 10:00:00', 1000, 'Success'),
    ('Sat_PassengerDetails', '2025-01-02 10:10:00', 2000, 'Success');
```
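This history is what makes the data quality tracking mentioned above practical. A minimal sketch (the 100-row threshold is an illustrative assumption):

```sql
-- Surface failed loads and suspiciously small loads from the history.
SELECT TableName, LoadDate, RowCount, LoadStatus
FROM Metadata_LoadHistory
WHERE LoadStatus <> 'Success'
   OR RowCount < 100
ORDER BY LoadDate DESC;
```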
Example: Managing changes using metadata
Scenario: a new field, CustomerTier, has been added to the data source and needs to be integrated into the model.

1. Updating metadata:
   Records for the new field are added to the metadata tables.

```sql
INSERT INTO Metadata_Tables VALUES
    ('Sat_CustomerDetails', 'Satellite', 'CRMSystem', '2025-01-03');

INSERT INTO Metadata_BusinessKeys VALUES
    ('CustomerTier', 'CRMSystem', 'As Is');
```
2. Automating ETL process generation:
   Based on the metadata, the SQL script for processing the new field is generated automatically.

```sql
INSERT INTO Sat_CustomerDetails (HubCustomerKey, LoadDate, CustomerTier)
SELECT
    HubCustomerKey,
    GETDATE() AS LoadDate,
    CustomerTier
FROM Staging_Customers;
```
Tools for metadata management
1. Metadata management systems:
   - Apache Atlas: for tracking lineage and ensuring compliance.
   - Collibra: for data cataloging and business metadata management.

2. Metadata repositories:
   - Dedicated tables in a relational database that store the metadata, as in the examples above.

3. BI tools:
   - Power BI or Tableau can integrate with metadata repositories to provide data quality analytics (see the sketch below).
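As an illustration, a reporting view over the operational metadata gives a BI tool something to visualize directly. A minimal sketch (the view name and shape are assumptions):

```sql
-- Daily load-quality summary per table, suitable as a BI data source.
CREATE VIEW vw_LoadQuality AS
SELECT
    TableName,
    CAST(LoadDate AS DATE) AS LoadDay,
    SUM(RowCount)          AS TotalRows,
    SUM(CASE WHEN LoadStatus <> 'Success' THEN 1 ELSE 0 END) AS FailedLoads
FROM Metadata_LoadHistory
GROUP BY TableName, CAST(LoadDate AS DATE);
```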
Conclusion
Metadata management is an integral part of effective Data Vault operations. With a structured approach and automation based on metadata, administration can be simplified, changes can be implemented faster, and high data quality can be ensured. Using metadata not only improves technical implementation but also makes data more transparent and accessible to end users.
Let's take a closer look at one of these topics: automating ETL process generation.

Automating ETL process generation
Automating ETL processes in Data Vault significantly reduces development time, minimizes errors, and ensures standardization. Using metadata, code can be dynamically generated to load data into hubs, links, and satellites, as well as to handle changes in data sources.
Benefits of automation
- Development speed: Generating ETL code based on metadata allows for quick creation of new processes.
- Flexibility: Easier to adapt the model to changes in data sources.
- Consistency: Automatically generated code follows established standards.
- Scalability: Simplifies the maintenance of large and complex models.
Approaches to automation
1. Using metadata:
   Metadata describes table structures, the relationships between them, and data transformation rules.

   Example metadata table:

```sql
CREATE TABLE Metadata_ETL (
    TableName          NVARCHAR(100),
    ColumnName         NVARCHAR(100),
    ColumnType         NVARCHAR(50),
    SourceTable        NVARCHAR(100),
    SourceColumn       NVARCHAR(100),
    TransformationRule NVARCHAR(200)
);

INSERT INTO Metadata_ETL VALUES
    ('Hub_Passengers',        'PassengerKey',    'INT',           'Staging_Passengers', 'PassengerID', 'Generate Surrogate Key'),
    ('Sat_PassengerDetails',  'Name',            'NVARCHAR(100)', 'Staging_Passengers', 'FullName',    'As Is'),
    ('Link_PassengerFlights', 'HubPassengerKey', 'INT',           'Staging_Bookings',   'PassengerID', 'Generate Surrogate Key');
```
2. Dynamic SQL code generation:
   Using the metadata, SQL loading scripts can be generated automatically.

   Example of dynamic SQL generation: based on the metadata, a script is built that loads data into a hub. Note that the columns are aggregated per target table so that a single INSERT is produced rather than one per metadata row (STRING_AGG requires SQL Server 2017 or later).

```sql
DECLARE @SQL NVARCHAR(MAX);

SELECT @SQL =
    'INSERT INTO ' + TableName
    + ' (' + STRING_AGG(ColumnName, ', ') + ') SELECT '
    + STRING_AGG(SourceColumn, ', ')
    + ' FROM ' + MAX(SourceTable) + ';'
FROM Metadata_ETL
WHERE TableName = 'Hub_Passengers'
GROUP BY TableName;

EXEC sp_executesql @SQL;
```
3. ETL process templates:
   Universal ETL templates are created with parameters that are filled in from the metadata.

   Example template:

```sql
INSERT INTO {TargetTable} ({TargetColumns})
SELECT {SourceColumns}
FROM {SourceTable};
```

   Python script to fill the template:

```python
import pandas as pd

# 'connection' is assumed to be an open DB-API or SQLAlchemy connection
metadata = pd.read_sql("SELECT * FROM Metadata_ETL", connection)

# Generate one INSERT statement per target table
for table in metadata['TableName'].unique():
    rows = metadata[metadata['TableName'] == table]
    target_columns = ", ".join(rows['ColumnName'])
    source_columns = ", ".join(rows['SourceColumn'])
    source_table = rows['SourceTable'].iloc[0]
    sql = f"""
    INSERT INTO {table} ({target_columns})
    SELECT {source_columns}
    FROM {source_table};
    """
    print(sql)  # print while debugging; execute via the connection in production
```
4. Automation tools:
   - dbt (Data Build Tool): for writing and managing SQL templates (see the sketch below).
   - Talend, Informatica, Apache NiFi: automation through graphical interfaces.
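As an illustration of the dbt approach, a hub load can be expressed as a model whose SQL is templated with Jinja. A minimal sketch; the model file name, the source definition, and the incremental filter column are assumptions about a hypothetical project:

```sql
-- models/hub_passengers.sql (hypothetical dbt model)
-- Assumes a 'staging' source with Staging_Passengers declared in sources.yml.
{{ config(materialized='incremental') }}

SELECT
    HASHBYTES('SHA1', CAST(PassengerID AS NVARCHAR(MAX))) AS PassengerKey,
    GETDATE() AS LoadDate
FROM {{ source('staging', 'Staging_Passengers') }}
{% if is_incremental() %}
WHERE LoadTimestamp > (SELECT MAX(LoadDate) FROM {{ this }})
{% endif %}
```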
Example: Automatic data loading into a hub
Scenario: data needs to be loaded into the hub Hub_Passengers from the staging table Staging_Passengers.

1. Step 1: Adding metadata
   The metadata describes how the staging-table columns map to the hub. PassengerKey is typed BINARY(20) so that it matches the SHA1 hash generated in step 2.

```sql
INSERT INTO Metadata_ETL VALUES
    ('Hub_Passengers', 'PassengerKey', 'BINARY(20)', 'Staging_Passengers', 'PassengerID',   'Generate Surrogate Key'),
    ('Hub_Passengers', 'LoadDate',     'DATE',       'Staging_Passengers', 'LoadTimestamp', 'CAST to DATE');
```
2. Step 2: Generating SQL code
   Based on the metadata, the SQL script is produced.

```sql
INSERT INTO Hub_Passengers (PassengerKey, LoadDate)
SELECT
    HASHBYTES('SHA1', CAST(PassengerID AS NVARCHAR(MAX))) AS PassengerKey,
    CAST(LoadTimestamp AS DATE) AS LoadDate
FROM Staging_Passengers;
```
3. Step 3: Automatic execution
   An ETL framework (e.g., Python or dbt) fills the template and executes the generated code, as sketched below.
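One way to make this step schedulable is to wrap generation and execution in a stored procedure that a scheduler (SQL Agent, Airflow, a Python runner) can call per target table. A hedged sketch; the procedure name, the success-only logging, and the omission of TransformationRule handling are simplifying assumptions:

```sql
CREATE PROCEDURE Load_FromMetadata
    @TargetTable NVARCHAR(100)
AS
BEGIN
    DECLARE @SQL NVARCHAR(MAX), @Rows INT;

    -- Rebuild the INSERT for the requested table from Metadata_ETL
    -- (TransformationRule handling omitted for brevity).
    SELECT @SQL =
        'INSERT INTO ' + TableName
        + ' (' + STRING_AGG(ColumnName, ', ') + ') SELECT '
        + STRING_AGG(SourceColumn, ', ')
        + ' FROM ' + MAX(SourceTable) + ';'
    FROM Metadata_ETL
    WHERE TableName = @TargetTable
    GROUP BY TableName;

    -- Capture the affected row count from inside the dynamic batch
    SET @SQL = @SQL + N' SET @cnt = @@ROWCOUNT;';
    EXEC sp_executesql @SQL, N'@cnt INT OUTPUT', @cnt = @Rows OUTPUT;

    -- Record the run in the operational metadata (see Metadata_LoadHistory)
    INSERT INTO Metadata_LoadHistory (TableName, LoadDate, RowCount, LoadStatus)
    VALUES (@TargetTable, GETDATE(), @Rows, 'Success');
END;
```

For example: EXEC Load_FromMetadata @TargetTable = 'Hub_Passengers';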
Conclusion
Automating ETL processes through metadata and SQL code generation makes the process of loading data into Data Vault fast, flexible, and standardized. This reduces development costs, minimizes the risk of errors, and allows for effective change management.