Introduction

To work with the dataset Airline_Delay_Cause.csv Sources and Installation of programs we will build a Data Vault model. It will provide flexibility, scalability, and the ability to integrate data from various sources. Data Vault is suitable for analytical purposes, such as analyzing flight delays, cancellation reasons, and airline performance.


Data Vault Structure

The main components of our model will be:

  1. Hubs — store unique business keys.
  2. Links — establish relationships between entities.
  3. Satellites — store attributes describing business keys or relationships.

1. Hubs

In our case, we will need three hubs:

  1. Hub_Flights

    • Unique flights.
    • Fields:
      • FlightKey (Surrogate Key).
      • FlightNumber (Business Key).
      • LoadDate.
  2. Hub_Airports

    • List of airports.
    • Fields:
      • AirportKey (Surrogate Key).
      • AirportCode (Business Key).
      • LoadDate.
  3. Hub_Airlines

    • List of airlines.
    • Fields:
      • AirlineKey (Surrogate Key).
      • AirlineCode (Business Key).
      • LoadDate.

2. Links

  1. Link_FlightRoutes

    • Relationship between flights and routes (departure and destination airports).
    • Fields:
      • FlightRouteKey (Surrogate Key).
      • FlightKey.
      • OriginAirportKey.
      • DestinationAirportKey.
      • LoadDate.
  2. Link_FlightDelays

    • Relationship between flights and delays.
    • Fields:
      • FlightDelayKey (Surrogate Key).
      • FlightKey.
      • DelayCategoryKey (e.g., "weather", "carrier").
      • LoadDate.

3. Satellites

  1. Sat_FlightDetails

    • Flight details.
    • Fields:
      • FlightKey.
      • FlightDate.
      • ScheduledDepartureTime.
      • ScheduledArrivalTime.
      • LoadDate.
  2. Sat_AirportDetails

    • Additional information about airports.
    • Fields:
      • AirportKey.
      • AirportName.
      • City.
      • State.
      • LoadDate.
  3. Sat_DelayDetails

    • Delay details.
    • Fields:
      • FlightDelayKey.
      • DelayMinutes.
      • DelayCause.
      • LoadDate.

Data Example

Original CSV:

FlightNumber Origin Destination DelayMinutes DelayCause FlightDate AirlineCode OriginName DestinationName ScheduledDepartureTime ScheduledArrivalTime
1234 LAX JFK 120 Weather 2025-01-01 AA Los Angeles New York 08:00 16:00

Hub_Flights:

FlightKey FlightNumber LoadDate
1 1234 2025-01-03

Hub_Airports:

AirportKey AirportCode LoadDate
1 LAX 2025-01-03
2 JFK 2025-01-03

Hub_Airlines:

AirlineKey AirlineCode LoadDate
1 AA 2025-01-03

Link_FlightRoutes:

FlightRouteKey FlightKey OriginAirportKey DestinationAirportKey LoadDate
1 1 1 2 2025-01-03

Sat_DelayDetails:

FlightDelayKey FlightKey DelayMinutes DelayCause LoadDate
1 1 120 Weather 2025-01-03

Result

After implementing Data Vault, students will be able to:

  1. Analyze delays by airports and airlines.
  2. Build flight routes.
  3. Create analytical reports and dashboards.

Practice

download the dataset Airline_Delay_Cause.csv Sources and Installation of programs to your computer

Designing a Data Vault model is a key stage in creating a data warehouse that allows us to structure information so that it is flexible, scalable, and meets business requirements. In this section, we will create three main databases:

  1. Staging_Area — for loading raw data from the source file.
  2. Raw_Vault — for storing the Data Vault model, where data is structured into hubs, links, and satellites.
  3. Business_Vault — for adding business logic and creating data marts.

We will use the following tools:

  • MS SQL Express — for creating the database and implementing the Data Vault structure.
  • SSMS (SQL Server Management Studio) — for working with the database.
  • Pandas — for loading source data from the Airline_Delay_Cause.csv file into the Staging_Area database.

Important: the goal of this training course is to study Data Vault concepts and structure. We will not delve into the use of ETL tools in detail to focus on building the data warehouse structure.


Data Vault Database Structure

  1. Staging_Area:

    • Contains raw data loaded directly from the source file.
  2. Raw_Vault:

    • Hub Schema — hub tables containing unique business keys.
    • Link Schema — relationship tables between hubs.
    • Sat Schema — satellite tables for storing details, changing information, and timestamps.
  3. Business_Vault:

    • Includes business logic and data transformed for analytical tasks.

SQL Code for Creating Databases and Schemas

1. Creating Databases

-- Creating the Staging_Area database
CREATE DATABASE Staging_Area;
GO

-- Creating the Raw_Vault database
CREATE DATABASE Raw_Vault;
GO

-- Creating the Business_Vault database
CREATE DATABASE Business_Vault;
GO

2. Creating Schemas in the Raw_Vault Database

-- Switching to the Raw_Vault database
USE Raw_Vault;
GO

-- Creating schemas for hubs, links, and satellites
CREATE SCHEMA Hub;
CREATE SCHEMA Link;
CREATE SCHEMA Sat;
GO

SQL Code for Creating Tables

1. Tables in the Staging_Area Database

USE Staging_Area;
GO

CREATE TABLE Staging_AirlineData (
    Year INT,
    Month INT,
    DayofMonth INT,
    Carrier NVARCHAR(10),
    OriginAirport NVARCHAR(10),
    DestAirport NVARCHAR(10),
    FlightNum NVARCHAR(50),
    DepDelay INT,
    ArrDelay INT,
    Cancelled BIT,
    CancellationCode NVARCHAR(10),
    WeatherDelay INT
);
GO

2. Tables in the Hub Schema of the Raw_Vault Database

USE Raw_Vault;
GO

-- Hub for flights (Flight)
CREATE TABLE Hub.Flights (
    FlightKey NVARCHAR(255) PRIMARY KEY,
    FlightNumber NVARCHAR(50) NOT NULL,
    LoadDate DATETIME NOT NULL
);

-- Hub for airports (Airport)
CREATE TABLE Hub.Airports (
    AirportCode NVARCHAR(10) PRIMARY KEY,
    LoadDate DATETIME NOT NULL
);

-- Hub for airlines (Carrier)
CREATE TABLE Hub.Airlines (
    AirlineCode NVARCHAR(10) PRIMARY KEY,
    LoadDate DATETIME NOT NULL
);

3. Tables in the Link Schema of the Raw_Vault Database

-- Link for flight routes
CREATE TABLE Link.FlightRoutes (
    FlightRouteKey NVARCHAR(255) PRIMARY KEY,
    FlightKey NVARCHAR(255) NOT NULL,
    Origin NVARCHAR(10) NOT NULL,
    Destination NVARCHAR(10) NOT NULL,
    LoadDate DATETIME NOT NULL
);

4. Tables in the Sat Schema of the Raw_Vault Database

-- Satellite for flight details
CREATE TABLE Sat.FlightDetails (
    FlightKey NVARCHAR(255) NOT NULL,
    FlightDate DATE NOT NULL,
    ScheduledDepartureTime NVARCHAR(10),
    ScheduledArrivalTime NVARCHAR(10),
    LoadDate DATETIME NOT NULL
);

Summary

In this section, we created three databases (Staging_Area, Raw_Vault, Business_Vault) and prepared the structure for building the Data Vault model. We also set up schemas in the Raw_Vault database for hubs, links, and satellites.

The next step will be loading data from the source file Airline_Delay_Cause.csv into the Staging_Area database using Python and Pandas, designing and implementing the data loading process from Staging_Area to Raw_Vault.

Feedback from students on the DataVault course