In this section, we will look at how to organize loading data from a source system into the data warehouse using the ETL/ELT approach. Our scenario works with the file Airline_Delay_Cause.csv, which first needs to be loaded into the Staging Area and then moved to the Raw Vault, following the Data Vault concept.
Main Stages of the ETL/ELT Process
- Extract: Data is extracted from the file and loaded into an intermediate table (Staging_Area).
- Load: Raw data from the Staging Area is moved to Raw Vault tables (Hubs, Links, Satellites).
- Transform: Business transformation logic is applied later, when the Business Vault and data marts are built.
Implementation of the ETL/ELT Process
We will use Python with Pandas and SQLAlchemy to extract the data and load it into the Staging Area, and then write SQL scripts to transfer it from Staging to the Raw Vault. I intentionally avoid dedicated ETL tools in this section, since mastering a specific tool is not the goal of this course. What matters is understanding the basic principles of data loading, and simple SQL scripts demonstrate them well.
1. Loading Data into the Staging Area
To load data from the file Airline_Delay_Cause.csv into the table Staging_AirlineData, we will use Pandas together with SQLAlchemy and the pyodbc driver to connect to MS SQL.
import pandas as pd
from sqlalchemy import create_engine

# Read the CSV file into a DataFrame
file_path = 'Airline_Delay_Cause.csv'
data = pd.read_csv(file_path)

# Connect to MS SQL Express with Windows authentication; the driver name
# must match an ODBC driver installed on the machine.
engine = create_engine(
    'mssql+pyodbc://localhost\\SQLEXPRESS/Staging_Area'
    '?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes'
)

# Load into Staging_AirlineData, replacing the table if it already exists.
# The small chunksize keeps rows x columns under SQL Server's 2100-parameter
# limit that applies with method='multi'.
data.to_sql('Staging_AirlineData', con=engine, if_exists='replace',
            index=False, method='multi', chunksize=50)
print("Data successfully loaded into the Staging Area!")
2. Transferring Data from Staging to Raw Vault
After loading the data into the Staging Area, we will transfer it to the Raw Vault tables using SQL queries. The load order matters in Data Vault: hubs are loaded first, then links, then satellites, because links and satellites reference the hubs' surrogate keys.
Creating ETL Scripts for Data Transfer
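The scripts below assume the Raw Vault tables already exist. For reference, here is a minimal sketch of what Hub_Airline might look like; the column types are assumptions inferred from the INSERT statements that follow:
CREATE TABLE Raw_Vault.Hub.Hub_Airline (
    Airline_SurrogateKey UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
    Airline_Code         NVARCHAR(10)     NOT NULL UNIQUE, -- the business key
    Load_Date            DATETIME         NOT NULL,
    Record_Source        NVARCHAR(100)    NOT NULL
);
Hub_Flight, the link table, and the satellite follow the same pattern: a surrogate key, the business key or foreign keys, Load_Date, and Record_Source.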
- Loading data into hubs (Hubs):
INSERT INTO Raw_Vault.Hub.Hub_Airline (Airline_SurrogateKey, Airline_Code, Load_Date, Record_Source)
SELECT
    NEWID() AS Airline_SurrogateKey, -- Generate a surrogate key for each new business key
    src.Airline_Code,
    GETDATE() AS Load_Date,
    'Staging_AirlineData' AS Record_Source
FROM
    (SELECT DISTINCT Airline_Code
     FROM Staging_Area.dbo.Staging_AirlineData) AS src
WHERE NOT EXISTS ( -- A hub holds each business key once; skip keys already loaded
    SELECT 1
    FROM Raw_Vault.Hub.Hub_Airline h
    WHERE h.Airline_Code = src.Airline_Code
);
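The link load in the next step also joins to Hub_Flight, so that hub must be populated first. Its load mirrors the one above; this sketch assumes the staging table carries a Flight_Code column, as the link script below implies:
INSERT INTO Raw_Vault.Hub.Hub_Flight (Flight_SurrogateKey, Flight_Code, Load_Date, Record_Source)
SELECT
    NEWID() AS Flight_SurrogateKey,
    src.Flight_Code,
    GETDATE() AS Load_Date,
    'Staging_AirlineData' AS Record_Source
FROM
    (SELECT DISTINCT Flight_Code
     FROM Staging_Area.dbo.Staging_AirlineData) AS src
WHERE NOT EXISTS (
    SELECT 1
    FROM Raw_Vault.Hub.Hub_Flight h
    WHERE h.Flight_Code = src.Flight_Code
);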
- Loading data into links (Links):
INSERT INTO Raw_Vault.Link.Link_AirlineFlights (Link_SurrogateKey, Airline_SurrogateKey, Flight_SurrogateKey, Load_Date, Record_Source)
SELECT
    NEWID() AS Link_SurrogateKey,
    src.Airline_SurrogateKey,
    src.Flight_SurrogateKey,
    GETDATE() AS Load_Date,
    'Staging_AirlineData' AS Record_Source
FROM
    (SELECT DISTINCT h.Airline_SurrogateKey, f.Flight_SurrogateKey
     FROM Staging_Area.dbo.Staging_AirlineData s
     JOIN Raw_Vault.Hub.Hub_Airline h ON s.Airline_Code = h.Airline_Code
     JOIN Raw_Vault.Hub.Hub_Flight f ON s.Flight_Code = f.Flight_Code) AS src
WHERE NOT EXISTS ( -- A link row represents a unique relationship; skip pairs already loaded
    SELECT 1
    FROM Raw_Vault.Link.Link_AirlineFlights l
    WHERE l.Airline_SurrogateKey = src.Airline_SurrogateKey
      AND l.Flight_SurrogateKey = src.Flight_SurrogateKey
);
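Because these are inner joins, any staging row whose Airline_Code or Flight_Code has no match in the hubs is silently dropped. A quick data-quality check finds such orphans (shown here for the airline key):
SELECT s.*
FROM Staging_Area.dbo.Staging_AirlineData s
LEFT JOIN Raw_Vault.Hub.Hub_Airline h ON s.Airline_Code = h.Airline_Code
WHERE h.Airline_SurrogateKey IS NULL;
If the hubs are loaded from the same staging table immediately beforehand, this query should return no rows.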
- Loading data into satellites (Satellites):
INSERT INTO Raw_Vault.Sat.Sat_AirlineDetails (Airline_SurrogateKey, Airline_Name, Year, Month, Total_Delays, Load_Date, Record_Source)
SELECT
h.Airline_SurrogateKey,
s.Airline_Name,
s.Year,
s.Month,
s.Total_Delays,
GETDATE() AS Load_Date,
'Staging_AirlineData' AS Record_Source
FROM
Staging_Area.dbo.Staging_AirlineData s
JOIN Raw_Vault.Hub.Hub_Airline h ON s.Airline_Code = h.Airline_Code;
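Note that this satellite load appends every staging row on every run. In practice, satellites usually store a hash of the descriptive attributes (a hashdiff) and insert a row only when the attributes have actually changed. A minimal sketch of that pattern, assuming a HashDiff VARBINARY(16) column has been added to Sat_AirlineDetails:
INSERT INTO Raw_Vault.Sat.Sat_AirlineDetails (Airline_SurrogateKey, Airline_Name, Year, Month, Total_Delays, HashDiff, Load_Date, Record_Source)
SELECT
    h.Airline_SurrogateKey,
    s.Airline_Name,
    s.Year,
    s.Month,
    s.Total_Delays,
    HASHBYTES('MD5', CONCAT(s.Airline_Name, '|', s.Year, '|', s.Month, '|', s.Total_Delays)) AS HashDiff,
    GETDATE() AS Load_Date,
    'Staging_AirlineData' AS Record_Source
FROM
    Staging_Area.dbo.Staging_AirlineData s
    JOIN Raw_Vault.Hub.Hub_Airline h ON s.Airline_Code = h.Airline_Code
WHERE NOT EXISTS ( -- Skip rows whose attribute hash is already stored for this key
    SELECT 1
    FROM Raw_Vault.Sat.Sat_AirlineDetails sat
    WHERE sat.Airline_SurrogateKey = h.Airline_SurrogateKey
      AND sat.HashDiff = HASHBYTES('MD5', CONCAT(s.Airline_Name, '|', s.Year, '|', s.Month, '|', s.Total_Delays))
);
A full implementation would compare only against the most recent satellite row per key, but even this simplified version prevents duplicate snapshots.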
Explanations and Examples
- Surrogate Keys:
  - Surrogate keys are generated using the NEWID() function to ensure the uniqueness of each record in hubs and links (a deterministic, hash-based alternative is sketched after this list).
  - This allows linking data from different systems and ensures the integrity of the model.
- Schemas and Data Organization:
  - Using separate schemas for hubs (Hub), links (Link), and satellites (Sat) in the Raw_Vault database simplifies management and ensures a clear separation of logic.
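As mentioned above, a common alternative to NEWID() in Data Vault 2.0 is to derive the surrogate key deterministically by hashing the business key. The same business key then always produces the same surrogate key, which makes loads repeatable and lets hubs, links, and satellites be loaded in parallel. A sketch, assuming the key columns are typed VARBINARY(16):
SELECT
    HASHBYTES('MD5', UPPER(LTRIM(RTRIM(Airline_Code)))) AS Airline_HashKey, -- deterministic surrogate key
    Airline_Code
FROM Staging_Area.dbo.Staging_AirlineData;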
Conclusion
This approach to building the ETL/ELT process allows us to:
- Simplify data management through a clear structure.
- Make full use of SQL for processing large volumes of data.
- Flexibly integrate Python for initial data loading and preprocessing.
In the next section, we will explore how to add business logic to the Business Vault.