Introduction
To work with the dataset Airline_Delay_Cause.csv Sources and Installation of programs we will build a Data Vault model. It will provide flexibility, scalability, and the ability to integrate data from various sources. Data Vault is suitable for analytical purposes, such as analyzing flight delays, cancellation reasons, and airline performance.
Data Vault Structure
The main components of our model will be:
- Hubs — store unique business keys.
- Links — establish relationships between entities.
- Satellites — store attributes describing business keys or relationships.
1. Hubs
In our case, we will need three hubs:
-
Hub_Flights
- Unique flights.
- Fields:
- FlightKey (Surrogate Key).
- FlightNumber (Business Key).
- LoadDate.
-
Hub_Airports
- List of airports.
- Fields:
- AirportKey (Surrogate Key).
- AirportCode (Business Key).
- LoadDate.
-
Hub_Airlines
- List of airlines.
- Fields:
- AirlineKey (Surrogate Key).
- AirlineCode (Business Key).
- LoadDate.
2. Links
-
Link_FlightRoutes
- Relationship between flights and routes (departure and destination airports).
- Fields:
- FlightRouteKey (Surrogate Key).
- FlightKey.
- OriginAirportKey.
- DestinationAirportKey.
- LoadDate.
-
Link_FlightDelays
- Relationship between flights and delays.
- Fields:
- FlightDelayKey (Surrogate Key).
- FlightKey.
- DelayCategoryKey (e.g., "weather", "carrier").
- LoadDate.
3. Satellites
-
Sat_FlightDetails
- Flight details.
- Fields:
- FlightKey.
- FlightDate.
- ScheduledDepartureTime.
- ScheduledArrivalTime.
- LoadDate.
-
Sat_AirportDetails
- Additional information about airports.
- Fields:
- AirportKey.
- AirportName.
- City.
- State.
- LoadDate.
-
Sat_DelayDetails
- Delay details.
- Fields:
- FlightDelayKey.
- DelayMinutes.
- DelayCause.
- LoadDate.
Data Example
Original CSV:
| FlightNumber | Origin | Destination | DelayMinutes | DelayCause | FlightDate | AirlineCode | OriginName | DestinationName | ScheduledDepartureTime | ScheduledArrivalTime |
|---|---|---|---|---|---|---|---|---|---|---|
| 1234 | LAX | JFK | 120 | Weather | 2025-01-01 | AA | Los Angeles | New York | 08:00 | 16:00 |
Hub_Flights:
| FlightKey | FlightNumber | LoadDate |
|---|---|---|
| 1 | 1234 | 2025-01-03 |
Hub_Airports:
| AirportKey | AirportCode | LoadDate |
|---|---|---|
| 1 | LAX | 2025-01-03 |
| 2 | JFK | 2025-01-03 |
Hub_Airlines:
| AirlineKey | AirlineCode | LoadDate |
|---|---|---|
| 1 | AA | 2025-01-03 |
Link_FlightRoutes:
| FlightRouteKey | FlightKey | OriginAirportKey | DestinationAirportKey | LoadDate |
|---|---|---|---|---|
| 1 | 1 | 1 | 2 | 2025-01-03 |
Sat_DelayDetails:
| FlightDelayKey | FlightKey | DelayMinutes | DelayCause | LoadDate |
|---|---|---|---|---|
| 1 | 1 | 120 | Weather | 2025-01-03 |
Result
After implementing Data Vault, students will be able to:
- Analyze delays by airports and airlines.
- Build flight routes.
- Create analytical reports and dashboards.
Practice
download the dataset Airline_Delay_Cause.csv Sources and Installation of programs to your computer
Designing a Data Vault model is a key stage in creating a data warehouse that allows us to structure information so that it is flexible, scalable, and meets business requirements. In this section, we will create three main databases:
- Staging_Area — for loading raw data from the source file.
- Raw_Vault — for storing the Data Vault model, where data is structured into hubs, links, and satellites.
- Business_Vault — for adding business logic and creating data marts.
We will use the following tools:
- MS SQL Express — for creating the database and implementing the Data Vault structure.
- SSMS (SQL Server Management Studio) — for working with the database.
- Pandas — for loading source data from the
Airline_Delay_Cause.csvfile into the Staging_Area database.
Important: the goal of this training course is to study Data Vault concepts and structure. We will not delve into the use of ETL tools in detail to focus on building the data warehouse structure.
Data Vault Database Structure
-
Staging_Area:
- Contains raw data loaded directly from the source file.
-
Raw_Vault:
- Hub Schema — hub tables containing unique business keys.
- Link Schema — relationship tables between hubs.
- Sat Schema — satellite tables for storing details, changing information, and timestamps.
-
Business_Vault:
- Includes business logic and data transformed for analytical tasks.
SQL Code for Creating Databases and Schemas
1. Creating Databases
-- Creating the Staging_Area database
CREATE DATABASE Staging_Area;
GO
-- Creating the Raw_Vault database
CREATE DATABASE Raw_Vault;
GO
-- Creating the Business_Vault database
CREATE DATABASE Business_Vault;
GO
2. Creating Schemas in the Raw_Vault Database
-- Switching to the Raw_Vault database
USE Raw_Vault;
GO
-- Creating schemas for hubs, links, and satellites
CREATE SCHEMA Hub;
CREATE SCHEMA Link;
CREATE SCHEMA Sat;
GO
SQL Code for Creating Tables
1. Tables in the Staging_Area Database
USE Staging_Area;
GO
CREATE TABLE Staging_AirlineData (
Year INT,
Month INT,
DayofMonth INT,
Carrier NVARCHAR(10),
OriginAirport NVARCHAR(10),
DestAirport NVARCHAR(10),
FlightNum NVARCHAR(50),
DepDelay INT,
ArrDelay INT,
Cancelled BIT,
CancellationCode NVARCHAR(10),
WeatherDelay INT
);
GO
2. Tables in the Hub Schema of the Raw_Vault Database
USE Raw_Vault;
GO
-- Hub for flights (Flight)
CREATE TABLE Hub.Flights (
FlightKey NVARCHAR(255) PRIMARY KEY,
FlightNumber NVARCHAR(50) NOT NULL,
LoadDate DATETIME NOT NULL
);
-- Hub for airports (Airport)
CREATE TABLE Hub.Airports (
AirportCode NVARCHAR(10) PRIMARY KEY,
LoadDate DATETIME NOT NULL
);
-- Hub for airlines (Carrier)
CREATE TABLE Hub.Airlines (
AirlineCode NVARCHAR(10) PRIMARY KEY,
LoadDate DATETIME NOT NULL
);
3. Tables in the Link Schema of the Raw_Vault Database
-- Link for flight routes
CREATE TABLE Link.FlightRoutes (
FlightRouteKey NVARCHAR(255) PRIMARY KEY,
FlightKey NVARCHAR(255) NOT NULL,
Origin NVARCHAR(10) NOT NULL,
Destination NVARCHAR(10) NOT NULL,
LoadDate DATETIME NOT NULL
);
4. Tables in the Sat Schema of the Raw_Vault Database
-- Satellite for flight details
CREATE TABLE Sat.FlightDetails (
FlightKey NVARCHAR(255) NOT NULL,
FlightDate DATE NOT NULL,
ScheduledDepartureTime NVARCHAR(10),
ScheduledArrivalTime NVARCHAR(10),
LoadDate DATETIME NOT NULL
);
Summary
In this section, we created three databases (Staging_Area, Raw_Vault, Business_Vault) and prepared the structure for building the Data Vault model. We also set up schemas in the Raw_Vault database for hubs, links, and satellites.
The next step will be loading data from the source file Airline_Delay_Cause.csv into the Staging_Area database using Python and Pandas, designing and implementing the data loading process from Staging_Area to Raw_Vault.