
Flight Delay Prediction using Deep Learning

Air travel has become an important part of our lives, and with this comes the problem of flights being delayed. Deep learning models can automatically learn hierarchical representations from data, which makes them well suited to flight delay prediction. In this article, we will build a flight delay predictor using the TensorFlow framework.

How can we use deep learning to build a flight delay predictor?

Building a Flight Delay Predictor

We will use the US Domestic Flights Delay Prediction (2013-2018) dataset to train and test the model. It has various features such as flight date, origin, destination, scheduled departure time, distance, arrival delay and many more. Now let's load the dataset into our Kaggle notebook and look at a few data points.

import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

data = pd.read_csv('/kaggle/input/us-domestic-flights-delay-prediction-2013-2018/flight_delay_predict.csv')
data.head()


Output:

   is_delay  Year  Quarter  Month  DayofMonth  DayOfWeek  FlightDate  Reporting_Airline  Origin  OriginState  Dest  DestState  CRSDepTime  Cancelled  Diverted  Distance  DistanceGroup  ArrDelay  ArrDelayMinutes  AirTime
0       1.0  2014        1      1           1          3  2014-01-01                 UA     LAX           CA   ORD         IL         900        0.0       0.0    1744.0              7      43.0             43.0    218.0
1       0.0  2014        1      1           1          3  2014-01-01                 AA     IAH           TX   DFW         TX        1750        0.0       0.0     224.0              1       2.0              2.0     50.0
2       1.0  2014        1      1           1          3  2014-01-01                 AA     LAX           CA   ORD         IL        1240        0.0       0.0    1744.0              7      26.0             26.0    220.0
3       1.0  2014        1      1           1          3  2014-01-01                 AA     DFW           TX   LAX         CA        1905        0.0       0.0    1235.0              5     159.0            159.0    169.0
4       0.0  2014        1      1           1          3  2014-01-01                 AA     DFW           TX   CLT         NC        1115        0.0       0.0     936.0              4     -13.0              0.0    108.0
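
Before plotting, it can help to check how many values are missing and how balanced the is_delay label is. The sketch below is one possible way to do that. The helper name summarize and the toy rows are made up for illustration (they are not records from the dataset), so the snippet runs on its own; in the notebook, the same helper could be called on data.

```python
import numpy as np
import pandas as pd

def summarize(df, target="is_delay"):
    """Return missing-value counts per column and the share of each target class."""
    missing = df.isna().sum()
    class_share = df[target].value_counts(normalize=True)
    return missing, class_share

# Toy rows for illustration only (not real records from the dataset)
toy = pd.DataFrame({
    "is_delay": [1.0, 0.0, 1.0, 0.0],
    "Distance": [1744.0, 224.0, np.nan, 936.0],
    "AirTime": [218.0, 50.0, 220.0, 108.0],
})

missing, class_share = summarize(toy)
print(missing)
print(class_share)
```

A heavily imbalanced label or many missing values in AirTime or Distance would be worth handling before the train/test split.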

EDA (Exploratory Data Analysis) and Model Building

EDA is a very important step in understanding the data. It helps us understand the structure, distribution, and relationships within the dataset. One important step of EDA is visualizing the dataset. We can visualize the average arrival delays at different origin and destination airports.

avg_delay_by_origin = data.groupby('Origin')['ArrDelay'].mean().reset_index()

bar_plot = px.bar(avg_delay_by_origin, x='Origin', y='ArrDelay', title='Average Arrival Delay by Origin Airport')
bar_plot.update_layout(xaxis_title='Origin Airport', yaxis_title='Average Arrival Delay')

bar_plot.show()

Output:

[Figure: bar chart of average arrival delay by origin airport]


avg_delay_by_dest = data.groupby('Dest')['ArrDelay'].mean().reset_index()

bar_plot_dest = px.bar(avg_delay_by_dest, x='Dest', y='ArrDelay', title='Average Arrival Delay by Destination Airport')
bar_plot_dest.update_layout(xaxis_title='Destination Airport', yaxis_title='Average Arrival Delay')

bar_plot_dest.show()


Output:

[Figure: bar chart of average arrival delay by destination airport]
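
With many airports on the x-axis, the bar charts can be hard to read, and averages for airports with only a few flights can be noisy. As a rough sketch, the same groupby can be extended to rank airports by mean arrival delay while ignoring airports with too few rows. The helper name top_delay_airports and the min_flights cut-off are assumptions made for this example; the toy frame reuses the Origin and ArrDelay values from the five sample rows shown earlier.

```python
import pandas as pd

def top_delay_airports(df, key, n=3, min_flights=2):
    """Mean ArrDelay per airport, keeping airports with at least min_flights rows."""
    stats = df.groupby(key)["ArrDelay"].agg(["mean", "count"])
    stats = stats[stats["count"] >= min_flights]
    return stats.sort_values("mean", ascending=False).head(n)

# Origin and ArrDelay taken from the five sample rows printed by data.head()
toy = pd.DataFrame({
    "Origin": ["LAX", "IAH", "LAX", "DFW", "DFW"],
    "ArrDelay": [43.0, 2.0, 26.0, 159.0, -13.0],
})

ranked = top_delay_airports(toy, "Origin")
print(ranked)
```

On the full dataset, a much larger min_flights value would probably be more sensible; the value used here only suits the five-row toy frame.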


# Correlation heatmap of the numeric columns
numeric_data = data.select_dtypes(include=['number'])

corr_matrix = numeric_data.corr()

plt.figure(figsize=(15, 10))
sns.heatmap(corr_matrix, annot=True)
plt.show()

Output:

[Figure: correlation heatmap of the numeric columns]


data['FlightDate'] = pd.to_datetime(data['FlightDate'])

# The mean of the 0/1 is_delay flag is the share of delayed flights in each month
avg_delay_month = data.groupby(data['FlightDate'].dt.month)['is_delay'].mean().reset_index()
fig = px.bar(avg_delay_month, x='FlightDate', y='is_delay',
             labels={'FlightDate': 'Month', 'is_delay': 'Average Delay'},
             title='Average Delay by Month')
fig.update_traces(marker_color='skyblue')
fig.show()

Output:

[Figure: bar chart of average delay by month]


Splitting the Data

Now, let's get into the main part of this blog, which is the model building. First, we will assign the features (AirTime and Distance) and the target variables (ArrDelayMinutes and is_delay) to X and y respectively. Then we will split the dataset, with 80% of the data for training and the remaining 20% for testing. Finally, we will scale the features using the StandardScaler class from sklearn, fitting it on the training set only and reusing the fitted scaler on the test set.

# Splitting the data into training and testing sets
X = data[['AirTime', 'Distance']]
y = data[['ArrDelayMinutes', 'is_delay']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
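
To illustrate why the scaler is fitted on the training rows only, here is a small self-contained sketch. The data is synthetic (random stand-ins for AirTime and Distance, not the real dataset), and the variable names with the _scaled suffix are chosen just for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins for AirTime (minutes) and Distance (miles)
X = np.column_stack([rng.normal(120, 40, 1000), rng.normal(900, 300, 1000)])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_scaled = scaler.transform(X_test)        # same statistics reused on the test rows

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1 by construction, while the test columns are only close to that. This is expected: the test set is treated as unseen data, so its statistics are not allowed to influence the preprocessing.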

Model Building

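Now, we will define the architecture of our model using the Sequential model from TensorFlow Keras. We will use three dense layers: two hidden layers with the ReLU activation function, each followed by dropout, and a linear output layer with two units, one for the delay in minutes and one for the is_delay flag. Then we will compile the model with mean squared error as the loss function and the Adam optimizer. Finally, we will train the model using the fit() function, evaluate it on the test set, and save the model into our working directory. Note that accuracy is a classification metric, so the accuracy value reported alongside a mean squared error loss should not be read as a measure of how well the delays are predicted.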

model = Sequential()
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='linear'))

model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=1)
score, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score)
print('Test accuracy:', accuracy)

model.save('/kaggle/working/model.h5')

Output:

Epoch 1/5
40890/40890 ━━━━━━━━━━━━━━━━━━━━ 68s 2ms/step - accuracy: 0.9959 - loss: 793.4816
Epoch 2/5
40890/40890 ━━━━━━━━━━━━━━━━━━━━ 66s 2ms/step - accuracy: 1.0000 - loss: 803.0837
Epoch 3/5
40890/40890 ━━━━━━━━━━━━━━━━━━━━ 66s 2ms/step - accuracy: 1.0000 - loss: 781.1000
Epoch 4/5
40890/40890 ━━━━━━━━━━━━━━━━━━━━ 66s 2ms/step - accuracy: 1.0000 - loss: 751.3886
Epoch 5/5
40890/40890 ━━━━━━━━━━━━━━━━━━━━ 82s 2ms/step - accuracy: 1.0000 - loss: 777.7186
Test loss: 729.39306640625
Test accuracy: 1.0
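
Because the model has two outputs with different meanings, it may be more informative to score each column separately: a mean absolute error for the delay in minutes, and a thresholded accuracy for is_delay. The sketch below shows one way to do this with NumPy. The helper name evaluate_outputs, the 0.5 threshold, and the target and prediction arrays are all made up for illustration; in the notebook, y_pred would come from model.predict(X_test) and y_true from y_test.

```python
import numpy as np

def evaluate_outputs(y_true, y_pred, threshold=0.5):
    """MAE on column 0 (delay minutes) and thresholded accuracy on column 1 (is_delay)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae_minutes = np.mean(np.abs(y_true[:, 0] - y_pred[:, 0]))
    predicted_label = (y_pred[:, 1] >= threshold).astype(float)
    delay_accuracy = np.mean(predicted_label == y_true[:, 1])
    return mae_minutes, delay_accuracy

# Made-up targets and predictions, for illustration only
y_true = np.array([[43.0, 1.0], [2.0, 0.0], [26.0, 1.0], [0.0, 0.0]])
y_pred = np.array([[40.0, 0.9], [10.0, 0.6], [20.0, 0.4], [5.0, 0.1]])

mae, acc = evaluate_outputs(y_true, y_pred)
print(f"MAE (minutes): {mae:.2f}, is_delay accuracy: {acc:.2f}")
```

For these toy arrays the helper reports an MAE of 5.50 minutes and an is_delay accuracy of 0.50, which are easier to interpret than a single combined loss value.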

Now, we will take input from the user, preprocess it and predict the output.

# Real-time Prediction
air_time = float(input("Enter Air Time in minutes: "))
distance = float(input("Enter Distance in miles: "))
user_input = np.array([[air_time, distance]])
user_input_scaled = scaler.transform(user_input)
predictions = model.predict(user_input_scaled)
if predictions[0][1] >= 0.5:
    print(f"The flight is delayed by {predictions[0][0]} minutes.")
else:
    print("The flight is not delayed.")

Output:

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 56ms/step
The flight is delayed by 75.59285736083984 minutes.
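
Since the output layer is linear, the predicted minutes are not guaranteed to be non-negative, and the raw float is awkward to read. A small helper along the following lines could format a single prediction row. The name describe_prediction, the rounding, and the clamping to zero are choices made for this sketch, and the delay score of 0.8 in the first call is invented for the example.

```python
def describe_prediction(pred_row, threshold=0.5):
    """Turn one [minutes, delay_score] prediction into a readable message."""
    minutes, delay_score = float(pred_row[0]), float(pred_row[1])
    if delay_score >= threshold:
        # A linear output can go negative; clamp so the message stays sensible
        return f"The flight is delayed by about {max(minutes, 0.0):.0f} minutes."
    return "The flight is not delayed."

# The second value in each row is a hypothetical is_delay score
print(describe_prediction([75.59285736083984, 0.8]))
print(describe_prediction([12.0, 0.2]))
```

In the notebook, the helper could be called as describe_prediction(predictions[0]) in place of the if/else block.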

Conclusion

In this blog, we looked at the problem of flight delays and how deep learning can be applied to it. Through hands-on experience, we learned how to explore and preprocess the data, build and train a deep learning model with TensorFlow, and use it to predict the delay for user-supplied inputs.
