Bayesian Linear Regression

Analyzing Factors Influencing Flight Delays

Author

Heather Anderson (Advisor: Dr. Seals)

Published

November 23, 2024


Introduction

Bayesian linear regression (BLR) is a statistical approach that uses Bayesian principles to model relationships between variables. It combines prior knowledge with data to make predictions and handle uncertainty in the estimates. Unlike traditional Frequentist methods, BLR provides a probability-based approach for both estimating values and making forecasts.

Linear regression itself is a key technique for predicting one variable based on another by finding the best-fitting line that minimizes the gap between what we predict and what actually happens (“What Is Linear Regression?” 2021). This method is used in both Frequentist and Bayesian approaches and forms the groundwork for more advanced analysis.
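As a concrete illustration, ordinary least squares on a small made-up dataset (the numbers below are arbitrary, not the flight data) recovers the best-fitting line by minimizing the sum of squared residuals:

```r
# Simulate data from a known line: y = 3 + 2x + noise
set.seed(7)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 2)

# lm() finds the intercept and slope minimizing squared residuals
fit <- lm(y ~ x)
coef(fit)  # estimates should land near the true values 3 and 2
```

The Frequentist fit returns a single point estimate for each coefficient; the Bayesian treatment described next replaces these with full distributions.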

In Bayesian linear regression, Bayes’ Theorem is used to update prior beliefs about model parameters with new data, resulting in a posterior distribution that reflects a range of possible values rather than a single estimate (Bayes 1763). This method takes into account previous knowledge and measures the uncertainty in predictions, giving a richer and often more useful picture compared to the single-point estimates from Frequentist approaches (Koehrsen 2018).
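A minimal sketch of this prior-to-posterior updating, using made-up numbers rather than the flight data, is the conjugate normal-normal case, which has a closed form:

```r
# Conjugate normal-normal update for an unknown mean (hypothetical values)
prior_mean <- 0             # prior belief centered at 0
prior_var  <- 25            # fairly diffuse prior
sigma2     <- 1             # assumed known observation variance
y <- c(4.1, 5.3, 3.8, 4.9)  # observed data
n <- length(y)

# Posterior precision = prior precision + data precision
post_var  <- 1 / (1 / prior_var + n / sigma2)
post_mean <- post_var * (prior_mean / prior_var + sum(y) / sigma2)

c(post_mean = post_mean, post_var = post_var)
```

Because the prior is diffuse, the posterior mean (about 4.48) is pulled almost entirely toward the data, and the posterior variance (about 0.25) quantifies the remaining uncertainty, rather than reporting a single point estimate.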

The Bayesian Analysis Toolkit (BAT) provides a real-world application of Bayesian methods, utilizing Markov Chain Monte Carlo (MCMC) techniques for posterior sampling and model comparison. BAT allows for flexible modeling and easy numerical integration, showing how Bayesian techniques can be applied in practical situations (Caldwell, Kollár, and Kröninger 2009).

Visualization plays a critical role in Bayesian workflows, as highlighted by Gabry et al. (Gabry et al. 2019). Tools such as trace plots and Hamiltonian Monte Carlo (HMC) diagnostics are crucial for figuring out if a model is working well and identifying any issues. In addition, techniques like posterior predictive checks and leave-one-out (LOO) cross-validation help fine-tune the model by testing how well it predicts new data.

Zyphur and Oswald (Zyphur and Oswald 2015) compare Bayesian and Frequentist methods, showing that Bayesian techniques often offer more dependable results. They do this by incorporating prior knowledge and overcoming some of the challenges that Frequentist methods face, especially with small sample sizes. Their study highlights how Bayesian methods can enhance predictions and help in better understanding the data.

Van de Schoot et al. (Schoot et al. 2021) explore how Bayesian statistics are becoming increasingly relevant and adaptable in today’s research. They point out that, with the help of modern deep learning and powerful computers, Bayesian methods are significantly improving the analysis of complex data. For instance, techniques like variational autoencoders, which combine Bayesian ideas with deep learning, are great for managing high-dimensional data and making precise predictions. This blend of Bayesian methods with advanced technology shows how Bayesian analysis is evolving to tackle modern challenges and boost scientific progress.

Gelman et al. (Gelman et al. 2013) offer a detailed look at Bayesian methods, including how they apply to linear regression. They highlight the advantages of using Bayesian approaches to handle uncertainty and improve models by incorporating prior information, which in turn boosts predictive accuracy and overall model performance.

Recent progress in variational inference, as highlighted by Blei, Kucukelbir, and McAuliffe (Blei, Kucukelbir, and McAuliffe 2017), offers scalable alternatives to MCMC methods. Variational inference works by approximating posterior distributions through optimization, making it a more efficient choice for handling large datasets and complex models.

Carpenter et al. (Carpenter et al. 2017) focus on how Bayesian methods are used in hierarchical models, particularly for complex systems. They point out that Bayesian hierarchical modeling is especially useful for handling data with multiple levels and capturing detailed relationships between variables, which improves both the flexibility and the accuracy of models in practical situations.

In conclusion, Bayesian linear regression is a powerful tool for modeling and predicting how variables are related. It stands out because it blends prior knowledge with a way to handle uncertainty. With the help of advanced diagnostics, practical tools like BAT, and recent breakthroughs in computing and deep learning, Bayesian methods show just how flexible and effective they can be for analyzing complex data in today’s world.

Methods

Data Source

The dataset utilized in this analysis is derived from the “Airline Delay Cause” dataset, which encompasses flight arrival and delay information for U.S. airports. The dataset includes various attributes such as the number of arriving flights, delay reasons, cancellations, and diversions, spanning multiple years and airlines. For this study, the focus was specifically on data from August 2023, which consists of 171,666 observations.

Link to dataset:

https://www.kaggle.com/datasets/sriharshaeedala/airline-delay?resource=download

Data Exploration

Code
library(dplyr)
library(knitr)
library(ggplot2)
library(tidyr)
library(bayesrules)
Code
AirlineDelays <- read.csv("Airline_Delay_Cause.csv")
head(AirlineDelays)
  year month carrier      carrier_name airport
1 2023     8      9E Endeavor Air Inc.     ABE
2 2023     8      9E Endeavor Air Inc.     ABY
3 2023     8      9E Endeavor Air Inc.     AEX
4 2023     8      9E Endeavor Air Inc.     AGS
5 2023     8      9E Endeavor Air Inc.     ALB
6 2023     8      9E Endeavor Air Inc.     ATL
                                                 airport_name arr_flights
1 Allentown/Bethlehem/Easton, PA: Lehigh Valley International          89
2                      Albany, GA: Southwest Georgia Regional          62
3                    Alexandria, LA: Alexandria International          62
4                 Augusta, GA: Augusta Regional at Bush Field          66
5                            Albany, NY: Albany International          92
6       Atlanta, GA: Hartsfield-Jackson Atlanta International        1636
  arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
1        13       2.25       1.60   3.16           0             5.99
2        10       1.97       0.04   0.57           0             7.42
3        10       2.73       1.18   1.80           0             4.28
4        12       3.69       2.27   4.47           0             1.57
5        22       7.76       0.00   2.96           0            11.28
6       256      55.98      27.81  63.64           0           108.57
  arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
1             2            1      1375            71           761       118
2             0            1       799           218             1        62
3             1            0       766            56           188        78
4             1            1      1397           471           320       388
5             2            0      1530           628             0       134
6            32           11     29768          9339          4557      4676
  security_delay late_aircraft_delay
1              0                 425
2              0                 518
3              0                 444
4              0                 218
5              0                 768
6              0               11196
Code
dim(AirlineDelays)
[1] 171666     21
Code
# Note: arr_delay holds total delay minutes per carrier-airport record,
# so these sums count records in each range, not individual flights
delayed_summary <- AirlineDelays %>%
  summarise(
    total_delayed_flights = sum(arr_delay > 0, na.rm = TRUE),
    count_15_to_30 = sum(arr_delay > 15 & arr_delay <= 30, na.rm = TRUE),
    count_30_to_60 = sum(arr_delay > 30 & arr_delay <= 60, na.rm = TRUE),
    count_over_60 = sum(arr_delay > 60, na.rm = TRUE)
  )

print(delayed_summary)
  total_delayed_flights count_15_to_30 count_30_to_60 count_over_60
1                164638           2543           3919        157889
Code
# List unique carriers
unique_carriers <- AirlineDelays %>%
  distinct(carrier, carrier_name)  # You can include carrier_name for more clarity

# Display the unique carriers
print(unique_carriers)
   carrier                 carrier_name
1       9E            Endeavor Air Inc.
2       AA       American Airlines Inc.
3       AS         Alaska Airlines Inc.
4       B6              JetBlue Airways
5       DL         Delta Air Lines Inc.
6       F9       Frontier Airlines Inc.
7       G4                Allegiant Air
8       HA       Hawaiian Airlines Inc.
9       MQ                    Envoy Air
10      NK             Spirit Air Lines
11      OH            PSA Airlines Inc.
12      OO        SkyWest Airlines Inc.
13      UA        United Air Lines Inc.
14      WN       Southwest Airlines Co.
15      YX             Republic Airline
16      QX                  Horizon Air
17      YV           Mesa Airlines Inc.
18      EV      ExpressJet Airlines LLC
19      EV     ExpressJet Airlines Inc.
20      VX               Virgin America
21      US              US Airways Inc.
22      FL  AirTran Airways Corporation
23      MQ American Eagle Airlines Inc.
Code
total_arrived_flights_corrected <- AirlineDelays %>%
  group_by(carrier, airport) %>%
  summarise(arrived_flights = sum(arr_flights, na.rm = TRUE)) %>%
  summarise(total_flights = sum(arrived_flights))

print(total_arrived_flights_corrected)
# A tibble: 21 × 2
   carrier total_flights
   <chr>           <dbl>
 1 9E            1466178
 2 AA            7973061
 3 AS            1990957
 4 B6            2609697
 5 DL            8661561
 6 EV            2786903
 7 F9            1158943
 8 FL             143429
 9 G4             613357
10 HA             727265
# ℹ 11 more rows

Variables

  • year: The year of the data.

  • month: The month of the data.

  • carrier: Carrier code.

  • carrier_name: Carrier name.

  • airport: Airport code.

  • airport_name: Airport name.

  • arr_flights: Number of arriving flights.

  • arr_del15: Number of flights delayed by 15 minutes or more.

  • carrier_ct: Number of flights whose delay was attributed to the carrier (delay counts are prorated across causes, so values can be fractional).

  • weather_ct: Number of flights whose delay was attributed to weather.

  • nas_ct: Number of flights whose delay was attributed to the NAS (National Airspace System).

  • security_ct: Number of flights whose delay was attributed to security.

  • late_aircraft_ct: Number of flights whose delay was attributed to the late arrival of a previous aircraft.

  • arr_cancelled: Number of flights canceled.

  • arr_diverted: Number of flights diverted.

  • arr_delay: Total arrival delay, in minutes.

  • carrier_delay: Delay minutes attributed to the carrier.

  • weather_delay: Delay minutes attributed to weather.

  • nas_delay: Delay minutes attributed to the NAS.

  • security_delay: Delay minutes attributed to security.

  • late_aircraft_delay: Delay minutes attributed to late aircraft arrival.

Table 1: Summary of Flight Arrivals, Delays, Cancellations, and Diversions

Code
# Summarize the dataset by carrier
carrier_summary <- AirlineDelays %>%
  group_by(carrier) %>%
  summarise(
    total_flights = sum(arr_flights, na.rm = TRUE),
    total_delayed_flights = sum(arr_del15, na.rm = TRUE)  # Total delayed flights (15 minutes or more)
  )

# Calculate total metrics for August
total_arrived_flights <- sum(AirlineDelays$arr_flights, na.rm = TRUE)
total_cancelled_flights <- sum(AirlineDelays$arr_cancelled, na.rm = TRUE)
total_diverted_flights <- sum(AirlineDelays$arr_diverted, na.rm = TRUE)

# Total delayed flights (where arr_del15 > 0)
total_delayed_flights <- sum(AirlineDelays$arr_del15, na.rm = TRUE)

# Breakdown of reasons for delays
delay_reasons <- AirlineDelays %>%
  summarise(
    carrier_delay = sum(carrier_ct, na.rm = TRUE),
    weather_delay = sum(weather_ct, na.rm = TRUE),
    nas_delay = sum(nas_ct, na.rm = TRUE),
    security_delay = sum(security_ct, na.rm = TRUE),
    late_aircraft_delay = sum(late_aircraft_ct, na.rm = TRUE)
  )

# Calculate percentages for delay reasons
delay_percentages <- delay_reasons / total_delayed_flights * 100

# Create the summary table with detailed sections for delays
summary_table <- data.frame(
  Characteristic = c("Total Months of Data (August)", 
                     "Total Carriers", 
                     "Total Arrived Flights (Count Data)",
                     "Total Delayed Flights (15+ min)", 
                     "  - Carrier Delays",                 
                     "  - Weather Delays",                 
                     "  - NAS Delays",                     
                     "  - Security Delays",                
                     "  - Late Aircraft Delays",           
                     "Total Cancelled Flights", 
                     "Total Diverted Flights",
                     "Cancelled Flights (%)", 
                     "Diverted Flights (%)"),
  Value = c(1,  # Total months (August)
            length(unique(AirlineDelays$carrier)), 
            round(total_arrived_flights, 2), 
            round(total_delayed_flights, 2),  
            round(delay_reasons$carrier_delay, 2),  
            round(delay_reasons$weather_delay, 2),  
            round(delay_reasons$nas_delay, 2),      
            round(delay_reasons$security_delay, 2), 
            round(delay_reasons$late_aircraft_delay, 2), 
            round(total_cancelled_flights, 2), 
            round(total_diverted_flights, 2),
            round((total_cancelled_flights / total_arrived_flights) * 100, 2), 
            round((total_diverted_flights / total_arrived_flights) * 100, 2))
)

# Format the delay reasons to include percentages in parentheses
summary_table$Characteristic[5:9] <- paste0(
  summary_table$Characteristic[5:9], 
  " (", round(delay_percentages, 2), "%)"
)

# Display the summary table
kable(summary_table, caption = "Table 1: Summary of Flight Arrivals, Delays, Cancellations, and Diversions (August Data)", 
      format.args = list(big.mark = ",", scientific = FALSE))
Table 1: Summary of Flight Arrivals, Delays, Cancellations, and Diversions (August Data)
Characteristic Value
Total Months of Data (August) 1.00
Total Carriers 21.00
Total Arrived Flights (Count Data) 62,146,805.00
Total Delayed Flights (15+ min) 11,375,095.00
- Carrier Delays (31.34%) 3,565,080.59
- Weather Delays (3.39%) 385,767.94
- NAS Delays (29.21%) 3,322,432.52
- Security Delays (0.24%) 26,930.39
- Late Aircraft Delays (35.82%) 4,074,891.00
Total Cancelled Flights 1,290,923.00
Total Diverted Flights 148,007.00
Cancelled Flights (%) 2.08
Diverted Flights (%) 0.24

Summary of Table 1:

The dataset covers flight arrivals and delays across U.S. airports and contains 171,666 observations on a range of variables related to airline performance. In total, 62,146,805 flights arrived, but 11,375,095 of them (roughly 18%) were delayed by at least 15 minutes, a substantial share with real consequences for travelers.

Breaking those delays down by cause shows that most originated with the airlines themselves: 3,565,081 flights (31.34%) were delayed by carrier-related problems and 4,074,891 (35.82%) by the late arrival of a previous aircraft. National Airspace System issues accounted for 3,322,433 delayed flights (29.21%), weather for 385,768 (3.39%), and security for only 26,930 (0.24%).

On the cancellation front, 1,290,923 flights were cancelled, about 2.08% of all arrivals, and a further 148,007 flights (0.24%) were diverted.

These figures suggest that delays stem mainly from factors within airline operations, with weather and air traffic issues contributing as well, so improving airline efficiency could make a real difference. The low cancellation and diversion rates indicate that, despite the delays, operations were otherwise fairly stable.

Table 2: Flight Summary by Airline

Code
# Summarize the dataset by carrier
summary_by_airline <- AirlineDelays %>%
  group_by(carrier, carrier_name) %>%
  summarise(
    total_flights = sum(arr_flights, na.rm = TRUE),
    delayed_flights = sum(arr_del15, na.rm = TRUE),
    cancelled_flights = sum(arr_cancelled, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(
    on_time_flights = total_flights - delayed_flights,
    on_time_percentage = round((on_time_flights / total_flights) * 100, 0),
    delayed_percentage = round((delayed_flights / total_flights) * 100, 0),
    cancelled_percentage = round((cancelled_flights / total_flights) * 100, 0)
  )

# Create a new table with selected columns
table1 <- summary_by_airline %>%
  select(carrier, carrier_name, on_time_flights, on_time_percentage, 
         delayed_flights, delayed_percentage, 
         cancelled_flights, cancelled_percentage)

# Display the summary table
kable(table1, caption = "Table 2: Flight Summary by Airline (August 2023)", 
      format.args = list(big.mark = ",", scientific = FALSE))
Table 2: Flight Summary by Airline (August 2023)
carrier carrier_name on_time_flights on_time_percentage delayed_flights delayed_percentage cancelled_flights cancelled_percentage
9E Endeavor Air Inc. 1,265,619 86 200,559 14 31,742 2
AA American Airlines Inc. 6,434,800 81 1,538,261 19 167,510 2
AS Alaska Airlines Inc. 1,673,227 84 317,730 16 25,877 1
B6 JetBlue Airways 1,961,778 75 647,919 25 59,509 2
DL Delta Air Lines Inc. 7,462,677 86 1,198,884 14 85,395 1
EV ExpressJet Airlines Inc. 2,097,091 80 540,578 20 85,844 3
EV ExpressJet Airlines LLC 119,562 80 29,672 20 8,198 5
F9 Frontier Airlines Inc. 867,540 75 291,403 25 22,063 2
FL AirTran Airways Corporation 118,203 82 25,226 18 1,812 1
G4 Allegiant Air 463,414 76 149,943 24 24,303 4
HA Hawaiian Airlines Inc. 637,471 88 89,794 12 4,850 1
MQ American Eagle Airlines Inc. 218,151 78 61,132 22 15,074 5
MQ Envoy Air 1,679,060 81 394,625 19 74,306 4
NK Spirit Air Lines 1,195,368 78 330,708 22 34,307 2
OH PSA Airlines Inc. 1,094,689 83 231,648 17 43,897 3
OO SkyWest Airlines Inc. 5,737,246 83 1,167,810 17 138,798 2
QX Horizon Air 171,737 86 28,682 14 3,608 2
UA United Air Lines Inc. 4,433,496 81 1,030,741 19 86,820 2
US US Airways Inc. 649,641 83 134,516 17 12,167 2
VX Virgin America 237,303 79 64,605 21 3,156 1
WN Southwest Airlines Co. 10,061,654 80 2,460,563 20 275,976 2
YV Mesa Airlines Inc. 744,238 81 170,060 19 30,366 3
YX Republic Airline 1,447,745 84 270,036 16 55,345 3

Flight Status Proportions by Airline

Code
# Summarize the dataset by carrier
summary_by_airline <- AirlineDelays %>%
  group_by(carrier, carrier_name) %>%
  summarise(
    total_flights = sum(arr_flights, na.rm = TRUE),
    delayed_flights = sum(arr_del15, na.rm = TRUE),
    cancelled_flights = sum(arr_cancelled, na.rm = TRUE),
    .groups = 'drop'  # Prevent grouping warnings
  )

# Calculate proportions
summary_by_airline <- summary_by_airline %>%
  mutate(
    delayed_proportion = delayed_flights / total_flights,
    cancelled_proportion = cancelled_flights / total_flights
  ) %>%
  select(carrier, carrier_name, total_flights, delayed_proportion, cancelled_proportion)

# Reshape data for plotting
summary_long <- summary_by_airline %>%
  pivot_longer(cols = c(delayed_proportion, cancelled_proportion),
               names_to = "Flight_Status",
               values_to = "Proportion")

# Create the bar plot
ggplot(summary_long, aes(x = reorder(carrier_name, -Proportion), y = Proportion, fill = Flight_Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Flight Status Proportions by Airline (August 2023)",
       x = "Airline",
       y = "Proportion of Flights",
       fill = "Flight Status") +
  scale_y_continuous(labels = scales::label_percent()) +  # Format y-axis as percentage
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Proportion of Flight Status

Code
# Summarize the dataset by carrier
summary_by_airline <- AirlineDelays %>%
  group_by(carrier, carrier_name) %>%
  summarise(
    total_flights = sum(arr_flights, na.rm = TRUE),
    delayed_flights = sum(arr_del15, na.rm = TRUE),
    cancelled_flights = sum(arr_cancelled, na.rm = TRUE),
    .groups = 'drop'  # Prevent grouping warnings
  )

# Calculate total counts for pie chart
total_counts <- summary_by_airline %>%
  summarise(
    total_flights = sum(total_flights),
    delayed_flights = sum(delayed_flights),
    cancelled_flights = sum(cancelled_flights)
  ) %>%
  mutate(on_time_flights = total_flights - (delayed_flights + cancelled_flights)) %>%
  select(-total_flights) %>%  # Remove total_flights if not needed in the pie chart
  pivot_longer(cols = everything(), names_to = "Flight_Status", values_to = "Count")

# Create the pie chart
ggplot(total_counts, aes(x = "", y = Count, fill = Flight_Status)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Proportion of Flight Status (August 2023)",
       fill = "Flight Status") +
  theme_minimal() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), axis.text.x = element_blank())

Reasons for Flight Delays

Code
# Summarize reasons for delays
delay_reasons <- AirlineDelays %>%
  summarise(
    Carrier_Delay = sum(carrier_ct, na.rm = TRUE),
    Weather_Delay = sum(weather_ct, na.rm = TRUE),
    NAS_Delay = sum(nas_ct, na.rm = TRUE),
    Security_Delay = sum(security_ct, na.rm = TRUE),
    Late_Aircraft_Delay = sum(late_aircraft_ct, na.rm = TRUE)
  )

# Reshape the data for plotting
delay_reasons_long <- pivot_longer(delay_reasons, cols = everything(), 
                                    names_to = "Reason", values_to = "Count")

# Create the bar chart
ggplot(delay_reasons_long, aes(x = Reason, y = Count, fill = Reason)) +
  geom_bar(stat = "identity") +
  labs(title = "Reasons for Flight Delays (August 2023)",
       x = "Reason for Delay",
       y = "Number of Delays") +
  scale_y_continuous(labels = scales::comma) +  # Format y-axis labels as whole numbers
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Steps in Bayesian Linear Regression

  1. Model Specification: Define the linear relationship between the dependent and independent variables.

  2. Choose Priors: Select prior distributions for the model parameters, reflecting any existing knowledge about their values.

  3. Data Collection: Gather relevant data for the variables in the model.

  4. Model Fitting: Use computational methods, such as Markov Chain Monte Carlo (MCMC), to estimate the posterior distributions of the parameters based on the observed data.

  5. Result Interpretation: Analyze the posterior distributions to understand the relationships between variables, including estimating means and credible intervals.
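The five steps above can be sketched end to end with a hand-rolled random-walk Metropolis sampler on simulated data standing in for the flight data. This is only an illustrative stand-in for the Stan-based fitting used below; the true parameter values, prior families, and proposal scale are all chosen for demonstration:

```r
set.seed(42)

# Step 3 (data): simulated data; true values beta0 = 2, beta1 = 1.5, sigma = 1
x <- rnorm(100)
y <- 2 + 1.5 * x + rnorm(100, sd = 1)

# Steps 1-2 (model + priors): log posterior = log likelihood + log priors,
# with N(0, 5^2) priors on the coefficients and Exp(1) on sigma
log_post <- function(theta) {
  b0 <- theta[1]; b1 <- theta[2]; sigma <- theta[3]
  if (sigma <= 0) return(-Inf)
  sum(dnorm(y, b0 + b1 * x, sigma, log = TRUE)) +
    dnorm(b0, 0, 5, log = TRUE) +
    dnorm(b1, 0, 5, log = TRUE) +
    dexp(sigma, rate = 1, log = TRUE)
}

# Step 4 (fitting): random-walk Metropolis with a hand-tuned proposal scale
n_iter <- 5000
draws <- matrix(NA_real_, n_iter, 3,
                dimnames = list(NULL, c("b0", "b1", "sigma")))
cur <- c(0, 0, 1)
cur_lp <- log_post(cur)
for (i in seq_len(n_iter)) {
  prop <- cur + rnorm(3, sd = 0.1)
  prop_lp <- log_post(prop)
  if (log(runif(1)) < prop_lp - cur_lp) { cur <- prop; cur_lp <- prop_lp }
  draws[i, ] <- cur
}

# Step 5 (interpretation): posterior means and 95% credible intervals
post <- draws[-(1:1000), ]
colMeans(post)
apply(post, 2, quantile, probs = c(0.025, 0.975))
```

The posterior means land near the true generating values, and the credible intervals summarize the remaining uncertainty, which is exactly what steps 4 and 5 deliver at scale in packages such as brms.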

Defining the Model

Choosing Focus

I chose to focus on weather delays because, even though they’re less common, they pose unique challenges that airlines can’t control. Unlike delays due to carrier or NAS issues, weather is unpredictable and can have a big impact on safety and scheduling. By analyzing weather’s effect on delays, I hoped to uncover patterns that might help airlines anticipate and better prepare for these types of disruptions, especially during certain seasons or in specific regions. This focus could lead to insights that improve the travel experience and help airlines handle weather-related delays more effectively.

The Bayesian Linear Regression Model

The model can be written as

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \\ \beta_{0} &\sim N(m_0, s_0^2)\\ \beta_1 &\sim N(m_1, s_1^2)\\ \sigma &\sim \text{Exp}(l) \end{align*} \]

Model Specification

Let: \[ Y_i | \beta_0, \beta_1, \sigma \sim \text{N}(\mu_i, \sigma^2) \] with \[ \mu_i = \beta_0 + \beta_1 \cdot X_i \]

where:

  • \(Y_i\) represents the arrival delay (in minutes) for the \(i\)-th flight.
  • \(X_i\) represents the number of weather-related delay incidents (weather count) for the \(i\)-th flight.
  • \(\mu_i = \beta_0 + \beta_1 X_i\) is the expected (mean) arrival delay, which depends on the number of weather incidents for each flight.
  • \(\sigma^2\) is the variance of the errors, capturing the variability in arrival delays beyond what is explained by weather incidents.
  • \(\overset{\text{ind}}{\sim}\) indicates conditional independence of each arrival delay given the parameters \(\beta_0\), \(\beta_1\), and \(\sigma\).

This model specification allows us to assess the relationship between weather-related incidents and arrival delays by estimating the effect of weather on delays through the parameter \(\beta_1\). The model assumes that each flight’s delay is conditionally independent, given the model parameters, which means that the delay of one flight does not depend on another once we account for weather-related factors.

Prior Selection

Regression Parameters

  • Intercept: The prior for the intercept \(\beta_0\) is specified as a normal distribution: \[ \beta_0 \sim \text{N}(m_0, s_0^2) \]

  • Slope: The prior for the slope \(\beta_1\), representing the effect of weather incidents on arrival delays, is also specified as a normal distribution: \[ \beta_1 \sim \text{N}(m_1, s_1^2) \]

  • Error: The prior for the error term \(\sigma\) is specified as an exponential distribution: \[ \sigma \sim \text{Exp}(l) \]

In this setup:

  • \(m_0\) and \(m_1\) are the prior means for the intercept and slope, reflecting initial beliefs about their central values.
  • \(s_0^2\) and \(s_1^2\) are the prior variances for the intercept and slope, allowing flexibility in the range of plausible values for each parameter.
  • \(l\) is the rate parameter of the exponential prior on the error term \(\sigma\), which controls the expected variability in arrival delays beyond the effect of weather incidents.

These priors were chosen to be relatively non-informative, allowing the data to inform the posterior estimates while still incorporating baseline expectations.

Tuning Hyperparameters

The priors for the model parameters were tuned as follows:

\[ \begin{align*} \beta_0 &\sim N(0, 5^2) \\ \beta_1 &\sim N(0, 5^2) \\ \sigma &\sim \text{Exp}(1) \end{align*} \]

These values were chosen to represent relatively non-informative priors, allowing the data to primarily guide the posterior estimates.

  • The prior for the intercept \(\beta_0\) is a normal distribution with mean \(0\) and variance \(5^2\), reflecting an initial belief centered on no baseline effect.
  • The prior for the slope \(\beta_1\) is also a normal distribution with mean \(0\) and variance \(5^2\), representing no strong prior belief about the relationship between weather incidents and arrival delays.
  • The prior for the error term \(\sigma\) is an exponential distribution with rate parameter \(1\), which constrains \(\sigma\) to be positive, gives it a prior mean of \(1\), and still allows larger values.
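One way to sanity-check these hyperparameters is a quick prior predictive simulation: draw parameters from the priors and see what arrival delays they imply for a record with, say, two weather-related incidents (a hypothetical predictor value chosen purely for illustration):

```r
set.seed(1)

# Draw parameters from the priors specified above
b0    <- rnorm(10000, 0, 5)
b1    <- rnorm(10000, 0, 5)
sigma <- rexp(10000, rate = 1)

# Implied prior predictive arrival delay when weather_ct = 2
y_sim <- rnorm(10000, b0 + b1 * 2, sigma)
quantile(y_sim, c(0.025, 0.5, 0.975))
```

The implied 95% interval spans roughly -22 to 22 minutes, centered at zero. Whether that range is wide enough for a delay variable measured in minutes is exactly the kind of judgment a prior predictive check is meant to support.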

The Updated Model

The Bayesian linear regression model is updated with the following parameter specifications:

\[ \begin{align*} Y_i | \beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) \quad \text{with} \quad \mu_i = \beta_0 + \beta_1 X_i \\ \beta_0 &\sim N(0, 5^2) \\ \beta_1 &\sim N(0, 5^2) \\ \sigma &\sim \text{Exp}(1) \end{align*} \]

Where: - \(( Y_i )\) represents the arrival delay (in minutes) for the ( i )-th flight. - \(( X_i )\) represents the number of weather-related delay incidents for the ( i )-th flight. - \(( \mu_i )\) is the expected (mean) arrival delay, modeled as a linear function of weather incidents. - \(( \beta_0 )\), \(( \beta_1 )\), and \(( \sigma )\) are parameters estimated through the Bayesian model, with priors specified above.

This updated model formulation allows us to understand the influence of weather-related incidents on arrival delays while incorporating prior beliefs and accounting for uncertainty in the parameter estimates.

Running the Model

Output

Code
# Load necessary packages
library(brms)

# Clean the data: Handle missing values for relevant variables (arrival delay and weather-related incidents)
AirlineDelays1 <- AirlineDelays %>%
    filter(!is.na(arr_delay), !is.na(weather_ct))

# Define the single-predictor Bayesian model
single_predictor_model <- brm(
  arr_delay ~ weather_ct, 
  data = AirlineDelays1,  # Use the cleaned dataset here
  family = gaussian(),
  prior = c(
    prior(normal(0, 5), class = "Intercept"),
    prior(normal(0, 5), class = "b"),
    prior(exponential(1), class = "sigma")
  ),
  chains = 4, 
  iter = 2000, 
  warmup = 1000, 
  seed = 123
)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000834 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 8.34 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 10.796 seconds (Warm-up)
Chain 1:                4.829 seconds (Sampling)
Chain 1:                15.625 seconds (Total)
Chain 1: 

(Chains 2-4 produced analogous sampling progress output.)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 9.549 seconds (Warm-up)
Chain 4:                5.387 seconds (Sampling)
Chain 4:                14.936 seconds (Total)
Chain 4: 

Summary of Model Output

Code
# Summarize the model output
summary(single_predictor_model)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: arr_delay ~ weather_ct 
   Data: AirlineDelays1 (Number of observations: 171426) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept  -2116.53      7.67 -2131.41 -2100.91 1.00     2103     2096
weather_ct  1041.97      2.66  1036.73  1047.15 1.00     2000     1859

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma  8676.19     15.52  8646.95  8706.92 1.00     3708     3157

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Data Preparation

Prior to analysis, the dataset was cleaned and preprocessed. Missing values were identified and handled to preserve the integrity of the model: specifically, I filtered out any observation where the dependent variable (arrival delay) or the independent variable (weather delay count) was missing. This ensured that the dataset contained only entries relevant to the analysis.
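The cleaning step itself is not reproduced in this report. Assuming the raw table is `AirlineDelays` and the cleaned one `AirlineDelays1` (the names used later in the code), a minimal sketch of the filter might look like this, shown here on a tiny stand-in data frame rather than the real dataset:

```r
library(dplyr)

# Tiny stand-in for the real dataset, purely to illustrate the filter
AirlineDelays <- data.frame(
  arr_delay  = c(120, NA, 45),
  weather_ct = c(2, 1, NA)
)

# Keep only rows where both the outcome and the predictor are present
AirlineDelays1 <- AirlineDelays %>%
  filter(!is.na(arr_delay), !is.na(weather_ct))

nrow(AirlineDelays1)   # only the first row survives
```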

Variables Used in Model

The dependent variable in this analysis was the total arrival delay (arr_delay), measured in minutes. The independent variable was weather count (weather_ct), which is the number of delays attributed to weather-related issues.

Bayesian Linear Regression Model

Bayesian Linear Regression (BLR) is a powerful statistical method that combines the principles of linear regression with Bayesian inference. This approach allows us to model the relationship between a dependent variable (e.g., flight arrival delays) and one or more independent variables (e.g., weather-related delays) while incorporating uncertainty and prior beliefs about these relationships. To analyze the impact of the independent variable on arrival delays, a Bayesian linear regression model was implemented using the brms package in R (R Core Team 2023). The model was formulated as follows:

\[ \text{arr\_delay} \sim \text{weather\_ct}\]

This model specification allows us to assess the relationship between arrival delays and the specified delay reasons while accounting for the uncertainty inherent in the data.

Key Concepts

  1. Linear Regression: The basic form of linear regression models the relationship between the dependent variable \( y \) and independent variables \( x_1, x_2, \ldots, x_n \) using a linear equation: $$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \epsilon $$ Here, \( \beta_i \) represents the coefficients that indicate the strength and direction of the relationships, and \( \epsilon \) is the error term.
  2. Bayesian Inference: In Bayesian statistics, we update our beliefs about parameters using Bayes’ theorem: $$ P(\beta | \text{data}) = \frac{P(\text{data} | \beta) \cdot P(\beta)}{P(\text{data})} $$ This formula allows us to derive the posterior distribution of the parameters after observing the data, taking into account our prior beliefs \( P(\beta) \) and the likelihood of the data given those parameters \( P(\text{data} | \beta) \).
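To make the update in Bayes' theorem concrete, here is a small self-contained sketch (simulated data, not the flight dataset) that approximates the posterior of a single regression slope on a grid, using the same N(0, 5) prior adopted later for the model coefficients:

```r
# Grid approximation of Bayes' theorem for one regression slope (illustrative)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)            # true slope = 2, noise sd = 1

beta_grid <- seq(0, 4, length.out = 401)
prior     <- dnorm(beta_grid, mean = 0, sd = 5)   # N(0, 5) prior on the slope
log_lik   <- sapply(beta_grid,
                    function(b) sum(dnorm(y, mean = b * x, sd = 1, log = TRUE)))
post <- prior * exp(log_lik - max(log_lik))       # unnormalized posterior
post <- post / sum(post)                          # normalize over the grid

beta_grid[which.max(post)]   # posterior mode, close to the true slope of 2
```

The posterior concentrates near the data-generating slope even though the prior is centered at zero, which is exactly the prior-to-posterior update the formula above describes.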

Model Fitting

The Bayesian model was fitted using Markov Chain Monte Carlo (MCMC) methods, with the following parameters:

  • Chains: 4

  • Iterations: 2000

  • Warmup: 1000

These settings were chosen to ensure adequate convergence and to provide a stable estimate of the posterior distribution.
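The fitting call itself is not reproduced in this report. A brms call consistent with the settings above would look roughly like the following sketch; the prior specification and the seed are assumptions based on the surrounding text, and running it requires the brms package and a working Stan toolchain:

```r
# Sketch of the assumed fitting call (illustrative, not verbatim from the report)
library(brms)

single_predictor_model <- brm(
  formula = arr_delay ~ weather_ct,
  data    = AirlineDelays1,
  family  = gaussian(),
  prior   = set_prior("normal(0, 5)", class = "b"),  # see Prior Distributions
  chains  = 4,
  iter    = 2000,
  warmup  = 1000,
  seed    = 123
)
```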

Prior Distributions

Prior distributions for the model coefficients were specified as normal distributions with a mean of 0 and a standard deviation of 5, reflecting a relatively uninformative prior that allows the data to inform the posterior distributions.

Model Assessment

Model diagnostics were conducted to evaluate convergence and the quality of the fitted model. The summary of the model output, along with trace plots, was examined to ensure that the MCMC chains had stabilized and adequately explored the parameter space.

Why Bayesian?

Advantages of Bayesian Linear Regression

(Zyphur and Oswald 2015)

  • Incorporation of Prior Knowledge

    - Allows for the inclusion of expert knowledge through prior distributions, enhancing model accuracy.

  • Uncertainty Quantification

    - Provides a full posterior distribution for each parameter, offering a nuanced understanding of uncertainty.

  • Expanded Hypotheses

    - Allows for a broader range of testable hypotheses.

    - Results can be interpreted intuitively, avoiding reliance on null-hypothesis significance testing (NHST).

  • Automatic Meta-Analyses

    - Combines prior findings with new data seamlessly.

    - Facilitates integration of existing research into new studies.

  • Improved Handling of Small Samples

    - Prior findings enhance the robustness of results.

    - Reduces issues associated with small-sample studies.

  • Complex Model Estimation

    - Capable of estimating models where traditional methods struggle.

    - Handles model complexity effectively, expanding analytical possibilities.

Summary

Bayesian methods offer several advantages that make them a good fit for this analysis. They allow prior knowledge to inform the results, make findings easier to interpret, and remain useful even with smaller datasets. They also capture uncertainty well, which matters for something as complex as airline delays, where many different factors are at play. Together, these features give a clearer picture of what is going on in the data.

Analysis and Results

Trace Plots and Posterior Distributions

Code
plot(single_predictor_model)

Interpretation of Each Component

Below, we interpret the trace plots and posterior distributions for the parameters in our Bayesian model. These plots were generated to assess the convergence and distribution of each parameter in the model.

Posterior Distributions

On the left side of the figure, we see the posterior distributions of the following parameters:

  • b_Intercept: The intercept term, centered around -2120.

  • b_weather_ct: The effect of weather counts on arrival delay, centered around 1040.

  • sigma: The residual standard deviation, centered around 8700.

These histograms represent the estimated range of values for each parameter after sampling, helping us understand the most likely values.

Trace Plots

On the right side, the trace plots show the sampled values of each parameter across four chains:

  • X-axis: Represents the iteration steps.

  • Y-axis: Represents the parameter values for each iteration.

Each color corresponds to one of the four chains. The chains appear well-mixed, with no discernible patterns, suggesting convergence.

Convergence Check

To assess convergence, we can visually inspect these plots and check the \(\hat{R}\) values. Ideally, \(\hat{R}\) values close to 1 indicate good convergence across chains. While these trace plots suggest convergence, the actual \(\hat{R}\) values can be confirmed by examining the model’s numerical summary. In our model, the \(\hat{R}\) value is exactly 1.00.
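brms reports this diagnostic automatically, but the computation is simple enough to sketch by hand. The following illustrative base-R function (run here on simulated well-mixed chains, not the model's actual draws) implements the split-chain idea of comparing between-chain and within-chain variance:

```r
# Hand-rolled split R-hat on simulated draws (illustrative only)
set.seed(42)
draws <- matrix(rnorm(4000), nrow = 1000, ncol = 4)  # 4 well-mixed chains

split_rhat <- function(draws) {
  n <- nrow(draws)
  # Split each chain in half, doubling the number of chains
  halves <- cbind(draws[1:(n %/% 2), ], draws[(n %/% 2 + 1):n, ])
  n2 <- nrow(halves)
  chain_means <- colMeans(halves)
  B <- n2 * var(chain_means)              # between-chain variance
  W <- mean(apply(halves, 2, var))        # within-chain variance
  var_hat <- (n2 - 1) / n2 * W + B / n2   # pooled variance estimate
  sqrt(var_hat / W)
}

split_rhat(draws)   # close to 1 for well-mixed chains
```

Values noticeably above 1 would mean the chains disagree about where the posterior mass is, i.e. they have not converged to the same distribution.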

Extracting Values

Code
# Load necessary libraries
library(brms)

# Assuming your model is named 'single_predictor_model'
model_summary <- summary(single_predictor_model)

# Extract effective sample sizes
bulk_ess <- model_summary$fixed$Bulk_ESS
tail_ess <- model_summary$fixed$Tail_ESS

# Calculate statistics from the posterior draws of the weather_ct coefficient
# (posterior_samples() is deprecated in newer brms releases; as_draws_df() is
# the current interface, but both return the same draws here)
posterior_samples <- as.data.frame(posterior_samples(single_predictor_model))
mean_delay <- mean(posterior_samples$`b_weather_ct`)
median_delay <- median(posterior_samples$`b_weather_ct`)
sd_delay <- sd(posterior_samples$`b_weather_ct`)
credible_interval <- quantile(posterior_samples$`b_weather_ct`, probs = c(0.025, 0.975))

# Print the results
cat("Effective Sample Size (Bulk):", bulk_ess, "\n")
Effective Sample Size (Bulk): 2102.722 2000.139 
Code
cat("Effective Sample Size (Tail):", tail_ess, "\n")
Effective Sample Size (Tail): 2095.692 1858.849 
Code
cat("Posterior Mean of Weather Coefficient (min per incident):", mean_delay, "\n")
Posterior Mean of Weather Coefficient (min per incident): 1041.966 
Code
cat("Posterior Median of Weather Coefficient (min per incident):", median_delay, "\n")
Posterior Median of Weather Coefficient (min per incident): 1041.971 
Code
cat("Posterior SD of Weather Coefficient:", sd_delay, "\n")
Posterior SD of Weather Coefficient: 2.660956 
Code
cat("95% Credible Interval for Weather Coefficient:", credible_interval, "\n")
95% Credible Interval for Weather Coefficient: 1036.731 1047.15 

Table 3: Model Parameters and Estimates

Parameter       Estimate   Std. Error   95% Credible Interval
Intercept       -2116.53   7.67         [-2131.41, -2100.91]
Weather Count    1041.97   2.66         [1036.73, 1047.15]
Sigma            8676.19   15.52        [8646.95, 8706.92]

Table 4: Model Diagnostics and Fit Statistics

Statistic                                                  Value
Number of Observations                                     171,426
Model Family                                               Gaussian
Formula                                                    arr_delay ~ weather_ct
Iterations                                                 2000
Warmup                                                     1000
Chains                                                     4
Effective Sample Size (Bulk) [Intercept, Weather Count]    [2102.722, 2000.139]
Effective Sample Size (Tail) [Intercept, Weather Count]    [2095.692, 1858.849]
Posterior Mean of Weather Coefficient (min per incident)   1041.966
Posterior Median of Weather Coefficient (min per incident) 1041.971
Posterior SD of Weather Coefficient                        2.660956
95% Credible Interval for Weather Coefficient              [1036.731, 1047.15]

Posterior Predictive Check: Observed vs. Predicted Arrival Delays

Code
# Generate posterior predictive samples for a random subset of observations
set.seed(123)
sample_indices <- sample(1:nrow(AirlineDelays1), size = 5000)  # Adjust the sample size as needed
posterior_predictive <- posterior_predict(single_predictor_model, newdata = AirlineDelays1[sample_indices, ])

# Extract the observed arrival delays for the sampled subset
observed_arrival_delays <- AirlineDelays1$arr_delay[sample_indices]

# Calculate the mean predicted delay for each observation in the subset
mean_predicted_delays <- apply(posterior_predictive, 2, mean)

# Create a data frame for plotting
ppc_data <- data.frame(
  Type = c(rep("Observed", length(observed_arrival_delays)), rep("Predicted", length(mean_predicted_delays))),
  Arrival_Delay = c(observed_arrival_delays, mean_predicted_delays)
)

# Plot with log-transformed x-axis
ggplot(ppc_data, aes(x = Arrival_Delay, fill = Type)) +
  geom_density(alpha = 0.5) +
  labs(title = "Posterior Predictive Check: Observed vs. Predicted Arrival Delays",
       x = "Arrival Delays (minutes, log scale)",
       y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("Observed" = "lightgray", "Predicted" = "blue")) +
  scale_x_log10()

Regression Line Plot

Code
# Load necessary libraries
library(ggplot2)
library(dplyr)

# Create a new data frame for the observed data
observed_data <- AirlineDelays %>%
  select(weather_ct, arr_delay)

# Generate predictions from the model
# Create a grid of weather_ct values for plotting the fitted line
weather_seq <- seq(min(AirlineDelays$weather_ct, na.rm = TRUE), max(AirlineDelays$weather_ct, na.rm = TRUE), length.out = 100)
prediction_data <- data.frame(weather_ct = weather_seq)

# Get the fitted values from the Bayesian model
predicted_values <- posterior_predict(single_predictor_model, newdata = prediction_data)

# Calculate the mean predicted value for each weather incident
mean_predicted <- apply(predicted_values, 2, mean)
prediction_data$arr_delay <- mean_predicted

# Create the plot
ggplot() +
  geom_point(data = observed_data, aes(x = weather_ct, y = arr_delay), alpha = 0.3) +  # observed data
  geom_line(data = prediction_data, aes(x = weather_ct, y = arr_delay), color = "blue", size = 1) +  # fitted line
  labs(title = "Bayesian Regression: Arrival Delays vs. Weather Incidents",
       x = "Number of Weather Incidents",
       y = "Arrival Delays (minutes)") +
  theme_minimal()

Posterior Distribution of Weather Count Coefficient

Code
# Density plot for the posterior distribution of weather count coefficient
weather_posterior <- as.data.frame(posterior_samples(single_predictor_model))$b_weather_ct
ggplot(data.frame(weather_posterior), aes(x = weather_posterior)) +
  geom_density(fill = "lightblue") +
  labs(title = "Posterior Distribution of Weather Count Coefficient",
       x = "Weather Count Coefficient (Posterior)",
       y = "Density") +
  theme_minimal()

Summary

Code
# Load necessary libraries
library(dplyr)

# Load the dataset (replace the file path if needed)
AirlineDelays <- read.csv("Airline_Delay_Cause.csv")

# Show the first few rows of the dataset, especially `weather_ct`
head(AirlineDelays$weather_ct)

# Check the structure of the `weather_ct` variable
str(AirlineDelays$weather_ct)

# Get summary statistics for `weather_ct`
summary(AirlineDelays$weather_ct)

Model Overview

The Bayesian linear regression model aimed to assess the relationship between weather-related incidents (`weather_ct`) and arrival delays (`arr_delay`) using data from the Airline Delays dataset for August 2023. A Gaussian model was employed, with the following regression formula:

$$\text{arr\_delay} \sim \text{weather\_ct}$$

The output from the model indicates a significant positive relationship between the number of weather incidents and the length of arrival delays.

Key Findings

Intercept: The estimated intercept value is -2116.53 with a 95% credible interval of [-2131.41, -2100.91]. This negative intercept indicates that in the absence of weather-related issues, arrival delays might be significantly shorter, which aligns with the expectation that weather incidents cause an increase in delays.

Weather Count Coefficient: The regression coefficient for weather_ct was estimated at 1041.97 with a standard error of 2.66, and a 95% credible interval of [1036.73, 1047.15]. This result demonstrates a substantial effect of weather-related incidents on arrival delays. An increase in weather-related incidents by one unit correlates with an estimated increase in arrival delay by approximately 1042 minutes on average. Despite the relatively smaller number of delays attributed to weather, this strong coefficient suggests that when weather incidents do occur, they tend to cause significant disruptions in arrival schedules.
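As a quick sanity check on these point estimates, the fitted line implies the following expected delays; this is simple arithmetic on the posterior means reported in the summary, not a new model quantity:

```r
# Expected total arrival delay at a few weather-incident counts,
# using the posterior means from the model summary
intercept  <- -2116.53
slope      <- 1041.97
weather_ct <- c(0, 2, 5, 10)
round(intercept + slope * weather_ct, 2)
# e.g. two weather incidents imply roughly -32.6 minutes; ten imply about 8303
```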

Uncertainty Measures: The estimated standard deviation of the model’s residuals, \(\sigma\), is 8676.19, indicating variability in the observed arrival delays beyond the effects of the weather count variable. This could be due to other unmeasured factors affecting delays or to inherent randomness in arrival times.

The convergence diagnostic \(\hat{R}\) was equal to 1.00 for every parameter, indicating successful convergence of the model using the No-U-Turn Sampler (NUTS). Effective sample sizes for each parameter were also large, suggesting that the posterior estimates are reliable and well-sampled.
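Effective sample size can likewise be sketched by hand. This illustrative base-R function uses the standard autocorrelation-sum formula with a simple truncation rule; because the simulated draws here are independent, the estimate should come out near the raw draw count:

```r
# Illustrative effective-sample-size estimate: N / (1 + 2 * sum of
# positive autocorrelations), truncated at the first non-positive lag
set.seed(7)
x <- rnorm(4000)   # independent draws, so ESS should be near 4000

ess <- function(x, max_lag = 100) {
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
  cutoff <- which(rho <= 0)[1]
  if (!is.na(cutoff)) rho <- rho[seq_len(cutoff - 1)]
  length(x) / (1 + 2 * sum(rho))
}

ess(x)   # close to 4000 for independent draws
```

For strongly autocorrelated MCMC chains the positive-lag sum grows and the ESS drops well below the number of draws, which is why the large Bulk_ESS and Tail_ESS values in the summary are reassuring.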

Interpretation

The model suggests a statistically significant relationship between weather-related incidents and arrival delays. Despite being less frequent compared to other causes, weather incidents lead to much larger disruptions when they occur. This highlights how important weather is when it comes to managing airline schedules and reducing arrival delays.

Discussion and Future Research

This Bayesian linear regression analysis provides valuable insights into how weather-related incidents affect flight arrival delays. The findings show a significant positive relationship between the number of weather incidents and the length of arrival delays, emphasizing that weather can lead to considerable disruptions. However, several aspects of this analysis could benefit from further refinement and expanded research.

One major area for improvement is the inclusion of additional predictor variables. This analysis focused solely on weather-related incidents, but considering other factors, like carrier delays, NAS (National Airspace System) issues, or even socioeconomic variables, could give a fuller picture of what drives flight delays. It would also be useful to explore interactions between these factors, which might reveal more complex and interdependent relationships.

Another key consideration is the dataset itself. This analysis used data from August 2023, but incorporating data from multiple months or years could help identify seasonal trends and longer-term patterns. Adding real-time data could also be valuable, allowing for insights into how delays change dynamically over time and providing more actionable information for decision-makers.

In addition, the Bayesian linear regression model used in this analysis could be expanded to more sophisticated Bayesian models in future research. For instance, hierarchical models could account for variations between different airports or airlines, offering a more nuanced view of delays. Techniques like variational inference could also be explored to efficiently work with larger datasets and more complex models.

Regarding outliers, it was assumed in this analysis that the outliers in arrival delays were valuable and informative rather than anomalies to be removed. This decision may have influenced the results, highlighting an area for future exploration. Future studies could look at using Bayesian methods specifically designed to manage outliers or incorporate domain-specific knowledge to better distinguish and handle extreme values. Conducting a comparison of results with and without outliers would provide clearer insights into their impact.

There were also several assumptions made with respect to the key predictor variable, weather_ct. It was assumed that the reported count data was accurate and consistently measured across flights, that weather incidents affected delays independently of one another, and that the impact was consistent across different airports and times. Additionally, a linear relationship was assumed between weather_ct and arrival delays, which may oversimplify the actual dynamics. Addressing these assumptions in future research could enhance the validity of the results.

Conclusion

This Bayesian linear regression analysis highlights the significant impact of weather-related incidents on flight arrival delays. Although weather issues are less frequent than other delay causes, they have a disproportionately large effect on delay times. This finding suggests that special attention must be given to weather management and forecasting, alongside addressing other common delay factors such as carrier and NAS-related issues.

Using a Bayesian approach allows for effectively accounting for uncertainty, providing credible intervals for all parameter estimates and offering a deeper understanding of how weather-related incidents affect arrival delays. This probabilistic framework supports more informed decision-making, especially in resource allocation and policy-making aimed at reducing disruptions in airline operations.

Bayesian linear regression serves as an effective tool for exploring complex relationships in the aviation industry, particularly regarding the impact of weather. By integrating prior knowledge and addressing uncertainty, this analysis provides insights that are both informative and practical. For future research, expanding the model to include additional variables, refining data sources, exploring advanced Bayesian models, and developing interactive visualizations are promising avenues. These directions have the potential to enhance the applicability of this work, contributing to more effective strategies for mitigating flight delays and improving airline efficiency.

References

Bayes, T. 1763. “An Essay Towards Solving a Problem in the Doctrine of Chances.” Philos. Trans. R. Soc. Lond. 53: 370–418.
Blei, David M, Alp Kucukelbir, and Jon D McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” J. Am. Stat. Assoc. 112 (518): 859–77.
Caldwell, Allen, Daniel Kollár, and Kevin Kröninger. 2009. “BAT – the Bayesian Analysis Toolkit.” Comput. Phys. Commun. 180 (11): 2197–2209.
Carpenter, Bob, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” J. Stat. Softw. 76 (1).
Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2019. “Visualization in Bayesian Workflow.” J. R. Stat. Soc. Ser. A Stat. Soc. 182 (2): 389–402.
Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. 2013. Bayesian Data Analysis. Chapman; Hall/CRC.
Koehrsen, Will. 2018. “Introduction to Bayesian Linear Regression.” https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G Tadesse, Marina Vannucci, et al. 2021. “Bayesian Statistics and Modelling.” Nat. Rev. Methods Primers 1 (1).
“What Is Linear Regression?” 2021. https://www.ibm.com/topics/linear-regression.
Zyphur, Michael J, and Frederick L Oswald. 2015. “Bayesian Estimation and Inference.” J. Manage. 41 (2): 390–420.