library(dplyr)
library(knitr)
library(ggplot2)
library(tidyr)
library(bayesrules)

Analyzing Factors Influencing Flight Delays
Bayesian linear regression (BLR) is a statistical approach that uses Bayesian principles to model relationships between variables. It combines prior knowledge with data to make predictions and handle uncertainty in the estimates. Unlike traditional Frequentist methods, BLR provides a probability-based approach for both estimating values and making forecasts.
Linear regression itself is a key technique for predicting one variable based on another by finding the best-fitting line that minimizes the gap between what we predict and what actually happens (“What Is Linear Regression?” 2021). This method is used in both Frequentist and Bayesian approaches and forms the groundwork for more advanced analysis.
In Bayesian linear regression, Bayes’ Theorem is used to update prior beliefs about model parameters with new data, resulting in a posterior distribution that reflects a range of possible values rather than a single estimate (Bayes 1763). This method takes into account previous knowledge and measures the uncertainty in predictions, giving a richer and often more useful picture compared to the single-point estimates from Frequentist approaches (Koehrsen 2018).
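To make the prior-to-posterior mechanics concrete, here is a minimal, hypothetical sketch (separate from the flight-delay analysis; all numbers are made up for illustration). With a Normal prior on a mean delay and a known observation standard deviation, Bayes' Theorem gives a closed-form posterior that is a precision-weighted compromise between prior and data:

```r
# Hypothetical illustration: Normal prior on a mean delay, known sd.
# Prior: mu ~ N(m0, s0^2); data: y_i ~ N(mu, sigma^2) with sigma known.
m0 <- 0; s0 <- 10           # prior mean and sd (weakly informative)
sigma <- 15                 # assumed known observation sd (minutes)
y <- c(12, 25, 18, 30, 22)  # made-up delay observations (minutes)
n <- length(y)

# Closed-form conjugate update (precision-weighted average):
post_var  <- 1 / (1 / s0^2 + n / sigma^2)
post_mean <- post_var * (m0 / s0^2 + sum(y) / sigma^2)
c(post_mean = post_mean, post_sd = sqrt(post_var))
```

The posterior mean (about 14.8 here) lands between the prior mean of 0 and the sample mean of 21.4, and the posterior sd is smaller than the prior sd: updated beliefs with reduced uncertainty, which is exactly the behavior described above.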
The Bayesian Analysis Toolkit (BAT) provides a real-world application of Bayesian methods, utilizing Markov Chain Monte Carlo (MCMC) techniques for posterior sampling and model comparison. BAT allows for flexible modeling and easy numerical integration, showing how Bayesian techniques can be applied in practical situations (Caldwell, Kollár, and Kröninger 2009).
Visualization plays a critical role in Bayesian workflows, as highlighted by Gabry et al. (Gabry et al. 2019). Tools such as trace plots and Hamiltonian Monte Carlo (HMC) diagnostics are crucial for figuring out if a model is working well and identifying any issues. In addition, techniques like posterior predictive checks and leave-one-out (LOO) cross-validation help fine-tune the model by testing how well it predicts new data.
Zyphur and Oswald (Zyphur and Oswald 2015) compare Bayesian and Frequentist methods, showing that Bayesian techniques often offer more dependable results. They do this by incorporating prior knowledge and overcoming some of the challenges that Frequentist methods face, especially with small sample sizes. Their study highlights how Bayesian methods can enhance predictions and help in better understanding the data.
Van de Schoot et al. (Schoot et al. 2021) explore how Bayesian statistics are becoming increasingly relevant and adaptable in today’s research. They point out that, with the help of modern deep learning and powerful computers, Bayesian methods are significantly improving the analysis of complex data. For instance, techniques like variational autoencoders, which combine Bayesian ideas with deep learning, are great for managing high-dimensional data and making precise predictions. This blend of Bayesian methods with advanced technology shows how Bayesian analysis is evolving to tackle modern challenges and boost scientific progress.
Gelman et al. (Gelman et al. 2013) offer a detailed look at Bayesian methods, including how they apply to linear regression. They highlight the advantages of using Bayesian approaches to handle uncertainty and improve models by incorporating prior information, which in turn boosts predictive accuracy and overall model performance.
Recent progress in variational inference, as highlighted by Blei, Kucukelbir, and McAuliffe (Blei, Kucukelbir, and McAuliffe 2017), offers scalable alternatives to MCMC methods. Variational inference works by approximating posterior distributions through optimization, making it a more efficient choice for handling large datasets and complex models.
Carpenter et al. (Carpenter et al. 2017) focus on how Bayesian methods are used in hierarchical models, particularly for complex systems. They point out that Bayesian hierarchical modeling is especially useful for handling data with multiple levels and capturing detailed relationships between variables, which improves both the flexibility and the accuracy of models in practical situations.
In conclusion, Bayesian linear regression is a powerful tool for modeling and predicting how variables are related. It stands out because it blends prior knowledge with a way to handle uncertainty. With the help of advanced diagnostics, practical tools like BAT, and recent breakthroughs in computing and deep learning, Bayesian methods show just how flexible and effective they can be for analyzing complex data in today’s world.
The dataset utilized in this analysis is derived from the “Airline Delay Cause” dataset, which encompasses flight arrival and delay information for U.S. airports. The dataset includes various attributes such as the number of arriving flights, delay reasons, cancellations, and diversions, spanning multiple years and airlines. For this study, the focus was specifically on data from August 2023, which consists of 171,666 observations.
Link to dataset:
https://www.kaggle.com/datasets/sriharshaeedala/airline-delay?resource=download
AirlineDelays <- read.csv("Airline_Delay_Cause.csv")
head(AirlineDelays)
  year month carrier      carrier_name airport
1 2023 8 9E Endeavor Air Inc. ABE
2 2023 8 9E Endeavor Air Inc. ABY
3 2023 8 9E Endeavor Air Inc. AEX
4 2023 8 9E Endeavor Air Inc. AGS
5 2023 8 9E Endeavor Air Inc. ALB
6 2023 8 9E Endeavor Air Inc. ATL
airport_name arr_flights
1 Allentown/Bethlehem/Easton, PA: Lehigh Valley International 89
2 Albany, GA: Southwest Georgia Regional 62
3 Alexandria, LA: Alexandria International 62
4 Augusta, GA: Augusta Regional at Bush Field 66
5 Albany, NY: Albany International 92
6 Atlanta, GA: Hartsfield-Jackson Atlanta International 1636
arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
1 13 2.25 1.60 3.16 0 5.99
2 10 1.97 0.04 0.57 0 7.42
3 10 2.73 1.18 1.80 0 4.28
4 12 3.69 2.27 4.47 0 1.57
5 22 7.76 0.00 2.96 0 11.28
6 256 55.98 27.81 63.64 0 108.57
arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
1 2 1 1375 71 761 118
2 0 1 799 218 1 62
3 1 0 766 56 188 78
4 1 1 1397 471 320 388
5 2 0 1530 628 0 134
6 32 11 29768 9339 4557 4676
security_delay late_aircraft_delay
1 0 425
2 0 518
3 0 444
4 0 218
5 0 768
6 0 11196
dim(AirlineDelays)
[1] 171666     21
delayed_summary <- AirlineDelays %>%
  summarise(
    # Note: each row is a carrier-airport aggregate and arr_delay holds total
    # delay minutes, so these are counts of rows, not of individual flights
    total_delayed_flights = sum(arr_delay > 0, na.rm = TRUE),
    count_15_to_30 = sum(arr_delay > 15 & arr_delay <= 30, na.rm = TRUE),
    count_30_to_60 = sum(arr_delay > 30 & arr_delay <= 60, na.rm = TRUE),
    count_over_60 = sum(arr_delay > 60, na.rm = TRUE)
  )
print(delayed_summary)
  total_delayed_flights count_15_to_30 count_30_to_60 count_over_60
1 164638 2543 3919 157889
# List unique carriers
unique_carriers <- AirlineDelays %>%
distinct(carrier, carrier_name) # You can include carrier_name for more clarity
# Display the unique carriers
print(unique_carriers)
   carrier                 carrier_name
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 F9 Frontier Airlines Inc.
7 G4 Allegiant Air
8 HA Hawaiian Airlines Inc.
9 MQ Envoy Air
10 NK Spirit Air Lines
11 OH PSA Airlines Inc.
12 OO SkyWest Airlines Inc.
13 UA United Air Lines Inc.
14 WN Southwest Airlines Co.
15 YX Republic Airline
16 QX Horizon Air
17 YV Mesa Airlines Inc.
18 EV ExpressJet Airlines LLC
19 EV ExpressJet Airlines Inc.
20 VX Virgin America
21 US US Airways Inc.
22 FL AirTran Airways Corporation
23 MQ American Eagle Airlines Inc.
total_arrived_flights_corrected <- AirlineDelays %>%
group_by(carrier, airport) %>%
summarise(arrived_flights = sum(arr_flights, na.rm = TRUE)) %>%
summarise(total_flights = sum(arrived_flights))
print(total_arrived_flights_corrected)
# A tibble: 21 × 2
carrier total_flights
<chr> <dbl>
1 9E 1466178
2 AA 7973061
3 AS 1990957
4 B6 2609697
5 DL 8661561
6 EV 2786903
7 F9 1158943
8 FL 143429
9 G4 613357
10 HA 727265
# ℹ 11 more rows
year: The year of the data.
month: The month of the data.
carrier: Carrier code.
carrier_name: Carrier name.
airport: Airport code.
airport_name: Airport name.
arr_flights: Number of arriving flights.
arr_del15: Number of flights delayed by 15 minutes or more.
carrier_ct: Carrier count (delay due to the carrier).
weather_ct: Weather count (delay due to weather).
nas_ct: NAS (National Airspace System) count (delay due to the NAS).
security_ct: Security count (delay due to security).
late_aircraft_ct: Late aircraft count (delay due to late aircraft arrival).
arr_cancelled: Number of flights canceled.
arr_diverted: Number of flights diverted.
arr_delay: Total arrival delay, in minutes.
carrier_delay: Delay minutes attributed to the carrier.
weather_delay: Delay minutes attributed to weather.
nas_delay: Delay minutes attributed to the NAS.
security_delay: Delay minutes attributed to security.
late_aircraft_delay: Delay minutes attributed to late aircraft arrival.
# Summarize the dataset by carrier
carrier_summary <- AirlineDelays %>%
group_by(carrier) %>%
summarise(
total_flights = sum(arr_flights, na.rm = TRUE),
total_delayed_flights = sum(arr_del15, na.rm = TRUE) # Total delayed flights (15 minutes or more)
)
# Calculate total metrics for August
total_arrived_flights <- sum(AirlineDelays$arr_flights, na.rm = TRUE)
total_cancelled_flights <- sum(AirlineDelays$arr_cancelled, na.rm = TRUE)
total_diverted_flights <- sum(AirlineDelays$arr_diverted, na.rm = TRUE)
# Total delayed flights (where arr_del15 > 0)
total_delayed_flights <- sum(AirlineDelays$arr_del15, na.rm = TRUE)
# Breakdown of reasons for delays
delay_reasons <- AirlineDelays %>%
summarise(
carrier_delay = sum(carrier_ct, na.rm = TRUE),
weather_delay = sum(weather_ct, na.rm = TRUE),
nas_delay = sum(nas_ct, na.rm = TRUE),
security_delay = sum(security_ct, na.rm = TRUE),
late_aircraft_delay = sum(late_aircraft_ct, na.rm = TRUE)
)
# Calculate percentages for delay reasons
delay_percentages <- delay_reasons / total_delayed_flights * 100
# Create the summary table with detailed sections for delays
summary_table <- data.frame(
Characteristic = c("Total Months of Data (August)",
"Total Carriers",
"Total Arrived Flights (Count Data)",
"Total Delayed Flights (15+ min)",
" - Carrier Delays",
" - Weather Delays",
" - NAS Delays",
" - Security Delays",
" - Late Aircraft Delays",
"Total Cancelled Flights",
"Total Diverted Flights",
"Cancelled Flights (%)",
"Diverted Flights (%)"),
Value = c(1, # Total months (August)
length(unique(AirlineDelays$carrier)),
round(total_arrived_flights, 2),
round(total_delayed_flights, 2),
round(delay_reasons$carrier_delay, 2),
round(delay_reasons$weather_delay, 2),
round(delay_reasons$nas_delay, 2),
round(delay_reasons$security_delay, 2),
round(delay_reasons$late_aircraft_delay, 2),
round(total_cancelled_flights, 2),
round(total_diverted_flights, 2),
round((total_cancelled_flights / total_arrived_flights) * 100, 2),
round((total_diverted_flights / total_arrived_flights) * 100, 2))
)
# Format the delay reasons to include percentages in parentheses
summary_table$Characteristic[5:9] <- paste0(
summary_table$Characteristic[5:9],
" (", round(delay_percentages, 2), "%)"
)
# Display the summary table
kable(summary_table, caption = "Table 1: Summary of Flight Arrivals, Delays, Cancellations, and Diversions (August Data)",
      format.args = list(big.mark = ",", scientific = FALSE))

| Characteristic | Value |
|---|---|
| Total Months of Data (August) | 1.00 |
| Total Carriers | 21.00 |
| Total Arrived Flights (Count Data) | 62,146,805.00 |
| Total Delayed Flights (15+ min) | 11,375,095.00 |
| - Carrier Delays (31.34%) | 3,565,080.59 |
| - Weather Delays (3.39%) | 385,767.94 |
| - NAS Delays (29.21%) | 3,322,432.52 |
| - Security Delays (0.24%) | 26,930.39 |
| - Late Aircraft Delays (35.82%) | 4,074,891.00 |
| Total Cancelled Flights | 1,290,923.00 |
| Total Diverted Flights | 148,007.00 |
| Cancelled Flights (%) | 2.08 |
| Diverted Flights (%) | 0.24 |
Dataset 1 takes a deep dive into the airline data, offering interesting insights into how different airlines performed. The dataset is made up of information regarding flight arrivals and delays across various U.S. airports, focusing specifically on data from August 2023. It includes a total of 171,666 observations with a variety of relevant variables related to airline performance. Overall, there were a whopping 62,146,805 arriving flights, showing just how busy the skies were. However, not everything went smoothly; 11,375,095 of those flights were delayed by at least 15 minutes, which is a significant number and definitely impacts travelers.
When I looked at the reasons for these delays, it became clear that most were due to issues on the airlines’ end, with 3,565,081 flights (31.34%) delayed because of carrier-related problems. Weather played a role too, causing delays for 385,768 flights (3.39%). National Airspace System delays accounted for another 3,322,433 flights (29.21%). Security delays were relatively minor, affecting just 26,930 flights (0.24%), while late arrivals of other aircraft were responsible for 4,074,891 flights (35.82%).
On the cancellation front, there were 1,290,923 flights that were called off, which is about 2.08% of all arriving flights. Additionally, 148,007 flights were diverted, roughly 0.24%.
These numbers highlight the challenges that airlines face, especially with delays stemming mainly from their own operations. While weather and air traffic issues also contribute, it’s clear that improving airline efficiency could make a real difference. Overall, the relatively low rates of cancellations and diversions suggest that, despite the delays, the month appears to have been fairly stable for airline operations.
# Summarize the dataset by carrier
summary_by_airline <- AirlineDelays %>%
group_by(carrier, carrier_name) %>%
summarise(
total_flights = sum(arr_flights, na.rm = TRUE),
delayed_flights = sum(arr_del15, na.rm = TRUE),
cancelled_flights = sum(arr_cancelled, na.rm = TRUE)
) %>%
ungroup() %>%
mutate(
on_time_flights = total_flights - delayed_flights,
on_time_percentage = round((on_time_flights / total_flights) * 100, 0),
delayed_percentage = round((delayed_flights / total_flights) * 100, 0),
cancelled_percentage = round((cancelled_flights / total_flights) * 100, 0)
)
# Create a new table with selected columns
table1 <- summary_by_airline %>%
select(carrier, carrier_name, on_time_flights, on_time_percentage,
delayed_flights, delayed_percentage,
cancelled_flights, cancelled_percentage)
# Display the summary table
kable(table1, caption = "Table 2: Flight Summary by Airline (August 2023)",
      format.args = list(big.mark = ",", scientific = FALSE))

| carrier | carrier_name | on_time_flights | on_time_percentage | delayed_flights | delayed_percentage | cancelled_flights | cancelled_percentage |
|---|---|---|---|---|---|---|---|
| 9E | Endeavor Air Inc. | 1,265,619 | 86 | 200,559 | 14 | 31,742 | 2 |
| AA | American Airlines Inc. | 6,434,800 | 81 | 1,538,261 | 19 | 167,510 | 2 |
| AS | Alaska Airlines Inc. | 1,673,227 | 84 | 317,730 | 16 | 25,877 | 1 |
| B6 | JetBlue Airways | 1,961,778 | 75 | 647,919 | 25 | 59,509 | 2 |
| DL | Delta Air Lines Inc. | 7,462,677 | 86 | 1,198,884 | 14 | 85,395 | 1 |
| EV | ExpressJet Airlines Inc. | 2,097,091 | 80 | 540,578 | 20 | 85,844 | 3 |
| EV | ExpressJet Airlines LLC | 119,562 | 80 | 29,672 | 20 | 8,198 | 5 |
| F9 | Frontier Airlines Inc. | 867,540 | 75 | 291,403 | 25 | 22,063 | 2 |
| FL | AirTran Airways Corporation | 118,203 | 82 | 25,226 | 18 | 1,812 | 1 |
| G4 | Allegiant Air | 463,414 | 76 | 149,943 | 24 | 24,303 | 4 |
| HA | Hawaiian Airlines Inc. | 637,471 | 88 | 89,794 | 12 | 4,850 | 1 |
| MQ | American Eagle Airlines Inc. | 218,151 | 78 | 61,132 | 22 | 15,074 | 5 |
| MQ | Envoy Air | 1,679,060 | 81 | 394,625 | 19 | 74,306 | 4 |
| NK | Spirit Air Lines | 1,195,368 | 78 | 330,708 | 22 | 34,307 | 2 |
| OH | PSA Airlines Inc. | 1,094,689 | 83 | 231,648 | 17 | 43,897 | 3 |
| OO | SkyWest Airlines Inc. | 5,737,246 | 83 | 1,167,810 | 17 | 138,798 | 2 |
| QX | Horizon Air | 171,737 | 86 | 28,682 | 14 | 3,608 | 2 |
| UA | United Air Lines Inc. | 4,433,496 | 81 | 1,030,741 | 19 | 86,820 | 2 |
| US | US Airways Inc. | 649,641 | 83 | 134,516 | 17 | 12,167 | 2 |
| VX | Virgin America | 237,303 | 79 | 64,605 | 21 | 3,156 | 1 |
| WN | Southwest Airlines Co. | 10,061,654 | 80 | 2,460,563 | 20 | 275,976 | 2 |
| YV | Mesa Airlines Inc. | 744,238 | 81 | 170,060 | 19 | 30,366 | 3 |
| YX | Republic Airline | 1,447,745 | 84 | 270,036 | 16 | 55,345 | 3 |
# Summarize the dataset by carrier
summary_by_airline <- AirlineDelays %>%
group_by(carrier, carrier_name) %>%
summarise(
total_flights = sum(arr_flights, na.rm = TRUE),
delayed_flights = sum(arr_del15, na.rm = TRUE),
cancelled_flights = sum(arr_cancelled, na.rm = TRUE),
.groups = 'drop' # Prevent grouping warnings
)
# Calculate proportions
summary_by_airline <- summary_by_airline %>%
mutate(
delayed_proportion = delayed_flights / total_flights,
cancelled_proportion = cancelled_flights / total_flights
) %>%
select(carrier, carrier_name, total_flights, delayed_proportion, cancelled_proportion)
# Reshape data for plotting
summary_long <- summary_by_airline %>%
pivot_longer(cols = c(delayed_proportion, cancelled_proportion),
names_to = "Flight_Status",
values_to = "Proportion")
# Create the bar plot
ggplot(summary_long, aes(x = reorder(carrier_name, -Proportion), y = Proportion, fill = Flight_Status)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Flight Status Proportions by Airline (August 2023)",
x = "Airline",
y = "Proportion of Flights",
fill = "Flight Status") +
scale_y_continuous(labels = scales::label_percent()) + # Format y-axis as percentage
theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Summarize the dataset by carrier
summary_by_airline <- AirlineDelays %>%
group_by(carrier, carrier_name) %>%
summarise(
total_flights = sum(arr_flights, na.rm = TRUE),
delayed_flights = sum(arr_del15, na.rm = TRUE),
cancelled_flights = sum(arr_cancelled, na.rm = TRUE),
.groups = 'drop' # Prevent grouping warnings
)
# Calculate total counts for pie chart
total_counts <- summary_by_airline %>%
summarise(
total_flights = sum(total_flights),
delayed_flights = sum(delayed_flights),
cancelled_flights = sum(cancelled_flights)
) %>%
mutate(on_time_flights = total_flights - (delayed_flights + cancelled_flights)) %>%
select(-total_flights) %>% # Remove total_flights if not needed in the pie chart
pivot_longer(cols = everything(), names_to = "Flight_Status", values_to = "Count")
# Create the pie chart
ggplot(total_counts, aes(x = "", y = Count, fill = Flight_Status)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
labs(title = "Proportion of Flight Status (August 2023)",
fill = "Flight Status") +
theme_minimal() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), axis.text.x = element_blank())

# Summarize reasons for delays
delay_reasons <- AirlineDelays %>%
summarise(
Carrier_Delay = sum(carrier_ct, na.rm = TRUE),
Weather_Delay = sum(weather_ct, na.rm = TRUE),
NAS_Delay = sum(nas_ct, na.rm = TRUE),
Security_Delay = sum(security_ct, na.rm = TRUE),
Late_Aircraft_Delay = sum(late_aircraft_ct, na.rm = TRUE)
)
# Reshape the data for plotting
delay_reasons_long <- pivot_longer(delay_reasons, cols = everything(),
names_to = "Reason", values_to = "Count")
# Create the bar chart
ggplot(delay_reasons_long, aes(x = Reason, y = Count, fill = Reason)) +
geom_bar(stat = "identity") +
labs(title = "Reasons for Flight Delays (August 2023)",
x = "Reason for Delay",
y = "Number of Delays") +
scale_y_continuous(labels = scales::comma) + # Format y-axis labels as whole numbers
theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Model Specification: Define the linear relationship between the dependent and independent variables.
Choose Priors: Select prior distributions for the model parameters, reflecting any existing knowledge about their values.
Data Collection: Gather relevant data for the variables in the model.
Model Fitting: Use computational methods, such as Markov Chain Monte Carlo (MCMC), to estimate the posterior distributions of the parameters based on the observed data.
Result Interpretation: Analyze the posterior distributions to understand the relationships between variables, including estimating means and credible intervals.
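As a self-contained illustration of these five steps (a toy sketch on simulated data, separate from the brms model fitted later), the workflow can be walked through with a simple grid approximation of the posterior instead of MCMC:

```r
set.seed(1)
# Steps 1 & 3 -- model specification and (simulated) data collection:
# y = b0 + b1*x + noise, with true b0 = 2 and b1 = 3
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 2)

# Step 2 -- priors: independent N(0, 5^2) on b0 and b1 (sigma is fixed
# at 2 here to keep the grid two-dimensional)
grid <- expand.grid(b0 = seq(-5, 8, length.out = 120),
                    b1 = seq(-2, 6, length.out = 120))

# Step 4 -- fitting: posterior proportional to likelihood * prior,
# evaluated on the log scale for numerical stability
log_post <- apply(grid, 1, function(p) {
  sum(dnorm(y, mean = p[1] + p[2] * x, sd = 2, log = TRUE)) +
    dnorm(p[1], 0, 5, log = TRUE) + dnorm(p[2], 0, 5, log = TRUE)
})
post <- exp(log_post - max(log_post))
post <- post / sum(post)

# Step 5 -- interpretation: posterior means of intercept and slope
post_means <- c(b0 = sum(grid$b0 * post), b1 = sum(grid$b1 * post))
post_means
```

The posterior means land near the true values of 2 and 3. brms performs the same computation by MCMC sampling rather than exhaustive grid evaluation, which is what makes it feasible for models with many parameters.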
I chose to focus on weather delays because, even though they’re less common, they pose unique challenges that airlines can’t control. Unlike delays due to carrier or NAS issues, weather is unpredictable and can have a big impact on safety and scheduling. By analyzing weather’s effect on delays, I hoped to uncover patterns that might help airlines anticipate and better prepare for these types of disruptions, especially during certain seasons or in specific regions. This focus could lead to insights that improve the travel experience and help airlines handle weather-related delays more effectively.
The model can be written as
\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \\ \beta_{0} &\sim N(m_0, s_0^2)\\ \beta_1 &\sim N(m_1, s_1^2)\\ \sigma &\sim \text{Exp}(l) \end{align*} \]
Let \(Y_i\) denote the arrival delay for flight \(i\) and \(X_i\) the number of weather-related delay incidents, so that \[ Y_i \mid \beta_0, \beta_1, \sigma \sim \text{N}(\mu_i, \sigma^2) \] with \[ \mu_i = \beta_0 + \beta_1 \cdot X_i \]
This model specification allows us to assess the relationship between weather-related incidents and arrival delays by estimating the effect of weather on delays through the parameter \(\beta_1\). The model assumes that each flight’s delay is conditionally independent, given the model parameters, which means that the delay of one flight does not depend on another once we account for weather-related factors.
Regression Parameters
Intercept: The prior for the intercept \(\beta_0\) is specified as a normal distribution: \[ \beta_0 \sim \text{N}(m_0, s_0^2) \]
Slope: The prior for the slope \(\beta_1\), representing the effect of weather incidents on arrival delays, is also specified as a normal distribution: \[ \beta_1 \sim \text{N}(m_1, s_1^2) \]
Error: The prior for the error term \(\sigma\) is specified as an exponential distribution: \[ \sigma \sim \text{Exp}(l) \]
In this setup:
- \(m_0\) and \(m_1\) are the means for the intercept and slope priors, reflecting initial beliefs about their central values.
- \(s_0^2\) and \(s_1^2\) are the variances for the intercept and slope priors, allowing flexibility in the range of plausible values for each parameter.
- \(l\) is the rate parameter for the exponential prior on the error term \(\sigma\), which controls the expected variability in arrival delays beyond the effect of weather incidents.
These priors were chosen to be relatively non-informative, allowing the data to inform the posterior estimates while still incorporating baseline expectations.
The priors for the model parameters were tuned as follows:
\[ \begin{align*} \beta_0 &\sim N(0, 5^2) \\ \beta_1 &\sim N(0, 5^2) \\ \sigma &\sim \text{Exp}(1) \end{align*} \]
These values were chosen to represent relatively non-informative priors, allowing the data to primarily guide the posterior estimates.
The Bayesian linear regression model is updated with the following parameter specifications:
\[ \begin{align*} Y_i | \beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) \quad \text{with} \quad \mu_i = \beta_0 + \beta_1 X_i \\ \beta_0 &\sim N(0, 5^2) \\ \beta_1 &\sim N(0, 5^2) \\ \sigma &\sim \text{Exp}(1) \end{align*} \]
Where:
- \(Y_i\) represents the arrival delay (in minutes) for the \(i\)-th flight.
- \(X_i\) represents the number of weather-related delay incidents for the \(i\)-th flight.
- \(\mu_i\) is the expected (mean) arrival delay, modeled as a linear function of weather incidents.
- \(\beta_0\), \(\beta_1\), and \(\sigma\) are parameters estimated through the Bayesian model, with priors specified above.
This updated model formulation allows us to understand the influence of weather-related incidents on arrival delays while incorporating prior beliefs and accounting for uncertainty in the parameter estimates.
# Load necessary packages
library(brms)
# Clean the data: Handle missing values for relevant variables (arrival delay and weather-related incidents)
AirlineDelays1 <- AirlineDelays %>%
filter(!is.na(arr_delay), !is.na(weather_ct))
# Define the single-predictor Bayesian model
single_predictor_model <- brm(
arr_delay ~ weather_ct,
data = AirlineDelays1, # Use the cleaned dataset here
family = gaussian(),
prior = c(
prior(normal(0, 5), class = "Intercept"),
prior(normal(0, 5), class = "b"),
prior(exponential(1), class = "sigma")
),
chains = 4,
iter = 2000,
warmup = 1000,
seed = 123
)
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.000834 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 8.34 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 10.796 seconds (Warm-up)
Chain 1: 4.829 seconds (Sampling)
Chain 1: 15.625 seconds (Total)
Chain 1:
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 0.000818 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 8.18 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 11.126 seconds (Warm-up)
Chain 2: 4.641 seconds (Sampling)
Chain 2: 15.767 seconds (Total)
Chain 2:
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 0.000801 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 8.01 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3:
Chain 3: Elapsed Time: 13.242 seconds (Warm-up)
Chain 3: 5.059 seconds (Sampling)
Chain 3: 18.301 seconds (Total)
Chain 3:
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
Chain 4:
Chain 4: Gradient evaluation took 0.000769 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 7.69 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4:
Chain 4:
Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4:
Chain 4: Elapsed Time: 9.549 seconds (Warm-up)
Chain 4: 5.387 seconds (Sampling)
Chain 4: 14.936 seconds (Total)
Chain 4:
# Summarize the model output
summary(single_predictor_model)
 Family: gaussian
Links: mu = identity; sigma = identity
Formula: arr_delay ~ weather_ct
Data: AirlineDelays1 (Number of observations: 171426)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2116.53 7.67 -2131.41 -2100.91 1.00 2103 2096
weather_ct 1041.97 2.66 1036.73 1047.15 1.00 2000 1859
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 8676.19 15.52 8646.95 8706.92 1.00 3708 3157
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Prior to analysis, the dataset was cleaned and preprocessed. Missing values were identified and handled appropriately to ensure the integrity of the model. Specifically, I filtered out any observations where the dependent variable (arrival delay) or independent variable (weather delays) were missing. The cleaning process ensured that the dataset only included relevant entries for the analysis.
The dependent variable in this analysis was the total arrival delay (arr_delay), measured in minutes. The independent variable was weather count (weather_ct), which is the number of delays attributed to weather-related issues.
Bayesian Linear Regression (BLR) is a powerful statistical method that combines the principles of linear regression with Bayesian inference. This approach allows us to model the relationship between a dependent variable (e.g., flight arrival delays) and one or more independent variables (e.g., weather-related delays) while incorporating uncertainty and prior beliefs about these relationships. To analyze the impact of the independent variable on arrival delays, a Bayesian linear regression model was implemented using the brms package in R (R Core Team 2023). The model was formulated as follows:
\[ \text{arr\_delay} \sim \text{weather\_ct}\]
This model specification allows us to assess the relationship between arrival delays and the specified delay reasons while accounting for the uncertainty inherent in the data.
The Bayesian model was fitted using Markov Chain Monte Carlo (MCMC) methods, with the following parameters:
Chains: 4
Iterations: 2000
Warmup: 1000
These settings were chosen to ensure adequate convergence and to provide a stable estimate of the posterior distribution.
Prior distributions for the model coefficients were specified as normal distributions with a mean of 0 and a standard deviation of 5, reflecting a relatively uninformative prior that allows the data to inform the posterior distributions.
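Putting the formula, sampler settings, and priors together, the fit can be sketched as follows (a sketch only; the exact `prior` specification and the `seed` argument are assumptions based on the description above):

```r
library(brms)

# Gaussian Bayesian linear regression of arrival delay on weather incident count
single_predictor_model <- brm(
  arr_delay ~ weather_ct,
  data = AirlineDelays1,
  family = gaussian(),
  prior = prior(normal(0, 5), class = "b"),  # weakly informative prior on the slope
  chains = 4, iter = 2000, warmup = 1000,
  seed = 123  # assumed, for reproducibility
)
```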
Model diagnostics were conducted to evaluate convergence and the quality of the fitted model. The summary of the model output, along with trace plots, was examined to ensure that the MCMC chains had stabilized and adequately explored the parameter space.
Incorporation of Prior Knowledge

- Allows for the inclusion of expert knowledge through prior distributions, enhancing model accuracy.

Uncertainty Quantification

- Provides a full posterior distribution for each parameter, offering a nuanced understanding of uncertainty.

Expanded Hypotheses

- Allows for a broader range of testable hypotheses.
- Results are interpreted intuitively, avoiding reliance on NHST.

Automatic Meta-Analyses

- Combines prior findings with new data seamlessly.
- Facilitates integration of existing research into new studies.

Improved Handling of Small Samples

- Prior findings enhance the robustness of results.
- Reduces issues associated with small-sample studies.

Complex Model Estimation

- Capable of estimating models where traditional methods struggle.
- Handles model complexity effectively, expanding analytical possibilities.
Bayesian methods have some great advantages that make them a good fit for this analysis. They allow us to bring in prior knowledge, making the results more informed. Plus, they make it easier to understand what the findings mean, especially when we’re working with smaller datasets. Bayesian approaches also do a fantastic job of capturing uncertainty, which is really important when we’re dealing with something as complex as airline delays, where a lot of different factors are at play. Overall, these features help us get a clearer picture of what’s going on in the data.
plot(single_predictor_model)

Below, we interpret the trace plots and posterior distributions for the parameters in our Bayesian model. These plots were generated to assess the convergence and distribution of each parameter in the model.
Posterior Distributions
On the left side of the image, we see the posterior distributions of the following parameters:

- b_Intercept: The intercept term, centered around -2120.
- b_weather_ct: The effect of weather counts on arrival delay, centered around 1040.
- sigma: The residual standard deviation, centered around 8700.
These histograms represent the estimated range of values for each parameter after sampling, helping us understand the most likely values.
Trace Plots
On the right side, the trace plots show the sampled values of each parameter across four chains:

- X-axis: Represents the iteration steps.
- Y-axis: Represents the parameter values for each iteration.
Each color corresponds to one of the four chains. The chains appear well-mixed, with no discernible patterns, suggesting convergence.
Convergence Check
To assess convergence, we can visually inspect these plots and check the \(\hat{R}\) values. Values close to 1 indicate good convergence across chains. While these trace plots suggest convergence, the actual \(\hat{R}\) values can be confirmed by examining the model’s numerical summary. In our model, the \(\hat{R}\) value for each parameter is 1.00.
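The \(\hat{R}\) values can also be extracted programmatically (a minimal sketch; `rhat()` is re-exported by brms from the posterior package):

```r
library(brms)

# Split-Rhat for every parameter in the fitted model
rhat_values <- rhat(single_predictor_model)
print(round(rhat_values, 2))

# Flag any parameter above the commonly used 1.01 threshold
any(rhat_values > 1.01, na.rm = TRUE)
```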
# Load necessary libraries
library(brms)
# Assuming your model is named 'single_predictor_model'
model_summary <- summary(single_predictor_model)
# Extract effective sample sizes for the fixed effects (Intercept, weather_ct)
bulk_ess <- model_summary$fixed$Bulk_ESS
tail_ess <- model_summary$fixed$Tail_ESS
# Extract posterior draws; as_draws_df() replaces the deprecated posterior_samples()
posterior_draws <- as.data.frame(as_draws_df(single_predictor_model))
mean_coef <- mean(posterior_draws$b_weather_ct)
median_coef <- median(posterior_draws$b_weather_ct)
sd_coef <- sd(posterior_draws$b_weather_ct)
credible_interval <- quantile(posterior_draws$b_weather_ct, probs = c(0.025, 0.975))
# Print the results (each fixed effect contributes one ESS value)
cat("Effective Sample Size (Bulk):", bulk_ess, "\n")
cat("Effective Sample Size (Tail):", tail_ess, "\n")
cat("Posterior Mean of Weather Count Coefficient:", mean_coef, "\n")
cat("Posterior Median of Weather Count Coefficient:", median_coef, "\n")
cat("Posterior SD of Weather Count Coefficient:", sd_coef, "\n")
cat("95% Credible Interval for Weather Count Coefficient:", credible_interval, "\n")

Effective Sample Size (Bulk): 2102.722 2000.139
Effective Sample Size (Tail): 2095.692 1858.849
Posterior Mean of Weather Count Coefficient: 1041.966
Posterior Median of Weather Count Coefficient: 1041.971
Posterior SD of Weather Count Coefficient: 2.660956
95% Credible Interval for Weather Count Coefficient: 1036.731 1047.15
| Parameter | Estimate | Standard Error | 95% Credible Interval |
|---|---|---|---|
| Intercept | -2116.53 | 7.67 | [-2131.41, -2100.91] |
| Weather Count | 1041.97 | 2.66 | [1036.73, 1047.15] |
| Sigma | 8676.19 | 15.52 | [8646.95, 8706.92] |
| Statistic | Value |
|---|---|
| Number of Observations | 171,426 |
| Model Family | Gaussian |
| Formula | arr_delay ~ weather_ct |
| Iterations | 2000 |
| Warmup | 1000 |
| Chains | 4 |
| Effective Sample Size (Bulk) [Intercept, Weather Count] | [2102.722, 2000.139] |
| Effective Sample Size (Tail) [Intercept, Weather Count] | [2095.692, 1858.849] |
| Posterior Mean of Weather Count Coefficient | 1041.966 |
| Posterior Median of Weather Count Coefficient | 1041.971 |
| Posterior SD of Weather Count Coefficient | 2.660956 |
| 95% Credible Interval for Weather Count Coefficient | [1036.731, 1047.15] |
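Before the manual posterior predictive check that follows, note that brms also offers a one-line version via `pp_check()` (a minimal sketch; the `ndraws` argument, which limits how many posterior draws are plotted, is available in recent brms versions):

```r
library(brms)

# Overlay densities of replicated datasets on the observed arr_delay distribution
pp_check(single_predictor_model, type = "dens_overlay", ndraws = 50)
```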
# Generate posterior predictive samples for a random subset of observations
set.seed(123)
sample_indices <- sample(1:nrow(AirlineDelays1), size = 5000) # Adjust the sample size as needed
posterior_predictive <- posterior_predict(single_predictor_model, newdata = AirlineDelays1[sample_indices, ])
# Extract the observed arrival delays for the sampled subset
observed_arrival_delays <- AirlineDelays1$arr_delay[sample_indices]
# Calculate the mean predicted delay for each observation in the subset
mean_predicted_delays <- apply(posterior_predictive, 2, mean)
# Create a data frame for plotting
ppc_data <- data.frame(
Type = c(rep("Observed", length(observed_arrival_delays)), rep("Predicted", length(mean_predicted_delays))),
Arrival_Delay = c(observed_arrival_delays, mean_predicted_delays)
)
# Plot with log-transformed x-axis
ggplot(ppc_data, aes(x = Arrival_Delay, fill = Type)) +
geom_density(alpha = 0.5) +
labs(title = "Posterior Predictive Check: Observed vs. Predicted Arrival Delays",
x = "Arrival Delays (minutes, log scale)",
y = "Density") +
theme_minimal() +
scale_fill_manual(values = c("Observed" = "lightgray", "Predicted" = "blue")) +
scale_x_log10() # note: non-positive delays are dropped on the log scale

# Load necessary libraries
library(ggplot2)
library(dplyr)
# Create a new data frame for the observed data
observed_data <- AirlineDelays1 %>% # use the cleaned data the model was fitted on
select(weather_ct, arr_delay)
# Generate predictions from the model
# Create a grid of weather_ct values for plotting the fitted line
weather_seq <- seq(min(AirlineDelays1$weather_ct, na.rm = TRUE), max(AirlineDelays1$weather_ct, na.rm = TRUE), length.out = 100)
prediction_data <- data.frame(weather_ct = weather_seq)
# Get the fitted values from the Bayesian model
predicted_values <- posterior_predict(single_predictor_model, newdata = prediction_data)
# Calculate the mean predicted value for each weather incident
mean_predicted <- apply(predicted_values, 2, mean)
prediction_data$arr_delay <- mean_predicted
# Create the plot
ggplot() +
geom_point(data = observed_data, aes(x = weather_ct, y = arr_delay), alpha = 0.3) + # observed data
geom_line(data = prediction_data, aes(x = weather_ct, y = arr_delay), color = "blue", linewidth = 1) + # fitted line (linewidth replaces the deprecated size for lines)
labs(title = "Bayesian Regression: Arrival Delays vs. Weather Incidents",
x = "Number of Weather Incidents",
y = "Arrival Delays (minutes)") +
theme_minimal()

# Density plot for the posterior distribution of weather count coefficient
weather_posterior <- as.data.frame(as_draws_df(single_predictor_model))$b_weather_ct
ggplot(data.frame(weather_posterior), aes(x = weather_posterior)) +
geom_density(fill = "lightblue") +
labs(title = "Posterior Distribution of Weather Count Coefficient",
x = "Weather Count Coefficient (Posterior)",
y = "Density") +
theme_minimal()

Summary
# Load necessary libraries
library(dplyr)
# Load the dataset (replace the file path if needed)
AirlineDelays <- read.csv("Airline_Delay_Cause.csv")
# Show the first few rows of the dataset, especially `weather_ct`
head(AirlineDelays$weather_ct)
# Check the structure of the `weather_ct` variable
str(AirlineDelays$weather_ct)
# Get summary statistics for `weather_ct`
summary(AirlineDelays$weather_ct)

The Bayesian linear regression model aimed to assess the relationship between weather-related incidents (`weather_ct`) and arrival delays (`arr_delay`) using data from the Airline Delays dataset for August 2023. A Gaussian model was employed, with the following regression formula:

$$\text{arr\_delay} \sim \text{weather\_ct}$$

The output from the model indicates a clear positive relationship between the number of weather incidents and the length of arrival delays.
Intercept: The estimated intercept is -2116.53 with a 95% credible interval of [-2131.41, -2100.91]. A negative intercept has no direct practical meaning here, since a total arrival delay cannot be negative when no weather incidents occur; it is better read as an artifact of fitting a straight line to data in which rows with many weather incidents drive the slope, rather than as a prediction for flights unaffected by weather.
Weather Count Coefficient: The regression coefficient for weather_ct was estimated at 1041.97 with a standard error of 2.66, and a 95% credible interval of [1036.73, 1047.15]. This result demonstrates a substantial effect of weather-related incidents on arrival delays. An increase in weather-related incidents by one unit correlates with an estimated increase in arrival delay by approximately 1042 minutes on average. Despite the relatively smaller number of delays attributed to weather, this strong coefficient suggests that when weather incidents do occur, they tend to cause significant disruptions in arrival schedules.
Uncertainty Measures: The estimated standard deviation of the model’s residuals \(\sigma\) is 8676.19, indicating variability in the observed arrival delays beyond the effects of the weather count variable. This could be due to other unmeasured factors affecting delays or the inherent randomness in arrival times.
The convergence diagnostic \(\hat{R}\) was equal to 1.00 for each parameter, indicating successful convergence of the model using the No-U-Turn Sampler (NUTS). Effective sample sizes for each parameter were also large, suggesting that the posterior estimates are reliable and well-sampled.
The model indicates a strong positive relationship between weather-related incidents and arrival delays, with a 95% credible interval for the weather coefficient that lies well away from zero. Despite being less frequent compared to other causes, weather incidents lead to much larger disruptions when they occur. This highlights how important weather is when it comes to managing airline schedules and reducing arrival delays.
This Bayesian linear regression analysis provides valuable insights into how weather-related incidents affect flight arrival delays. The findings show a significant positive relationship between the number of weather incidents and the length of arrival delays, emphasizing that weather can lead to considerable disruptions. However, several aspects of this analysis could benefit from further refinement and expanded research.
One major area for improvement is the inclusion of additional predictor variables. This analysis focused solely on weather-related incidents, but considering other factors, like carrier delays, NAS (National Airspace System) issues, or even socioeconomic variables, could give a fuller picture of what drives flight delays. It would also be useful to explore interactions between these factors, which might reveal more complex and interdependent relationships.
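As a sketch of that extension (assuming the additional delay-cause counts `carrier_ct` and `nas_ct` are available as columns, as in the BTS delay-cause file):

```r
library(brms)

# Refit with additional delay-cause predictors using update() on the existing fit
multi_predictor_model <- update(
  single_predictor_model,
  formula. = arr_delay ~ weather_ct + carrier_ct + nas_ct,
  newdata = AirlineDelays1
)
```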
Another key consideration is the dataset itself. This analysis used data from August 2023, but incorporating data from multiple months or years could help identify seasonal trends and longer-term patterns. Adding real-time data could also be valuable, allowing for insights into how delays change dynamically over time and providing more actionable information for decision-makers.
In addition, the Bayesian linear regression model used in this analysis could be expanded to more sophisticated Bayesian models in future research. For instance, hierarchical models could account for variations between different airports or airlines, offering a more nuanced view of delays. Techniques like variational inference could also be explored to efficiently work with larger datasets and more complex models.
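A varying-intercept version of the current model could be sketched as follows (assumptions: the data contain `airport` and `carrier` identifier columns, as in the BTS delay-cause file):

```r
library(brms)

# Each airport and carrier gets its own baseline delay (partial pooling)
hierarchical_model <- brm(
  arr_delay ~ weather_ct + (1 | airport) + (1 | carrier),
  data = AirlineDelays1,
  family = gaussian(),
  chains = 4, iter = 2000, warmup = 1000
)
```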
Regarding outliers, it was assumed in this analysis that the outliers in arrival delays were valuable and informative rather than anomalies to be removed. This decision may have influenced the results, highlighting an area for future exploration. Future studies could look at using Bayesian methods specifically designed to manage outliers or incorporate domain-specific knowledge to better distinguish and handle extreme values. Conducting a comparison of results with and without outliers would provide clearer insights into their impact.
There were also several assumptions made with respect to the key predictor variable, weather_ct. It was assumed that the reported count data was accurate and consistently measured across flights, that weather incidents affected delays independently of one another, and that the impact was consistent across different airports and times. Additionally, a linear relationship was assumed between weather_ct and arrival delays, which may oversimplify the actual dynamics. Addressing these assumptions in future research could enhance the validity of the results.
This Bayesian linear regression analysis highlights the significant impact of weather-related incidents on flight arrival delays. Although weather issues are less frequent than other delay causes, they have a disproportionately large effect on delay times. This finding suggests that special attention must be given to weather management and forecasting, alongside addressing other common delay factors such as carrier and NAS-related issues.
Using a Bayesian approach makes it possible to account for uncertainty effectively, providing credible intervals for all parameter estimates and offering a deeper understanding of how weather-related incidents affect arrival delays. This probabilistic framework supports more informed decision-making, especially in resource allocation and policy-making aimed at reducing disruptions in airline operations.
Bayesian linear regression serves as an effective tool for exploring complex relationships in the aviation industry, particularly regarding the impact of weather. By integrating prior knowledge and addressing uncertainty, this analysis provides insights that are both informative and practical. For future research, expanding the model to include additional variables, refining data sources, exploring advanced Bayesian models, and developing interactive visualizations are promising avenues. These directions have the potential to enhance the applicability of this work, contributing to more effective strategies for mitigating flight delays and improving airline efficiency.