Date: 2022-11-03

University of Nottingham Ningbo China

Academic Year 2021/22 Autumn Semester
BUSI 3122 Introduction to Data Science
Joseph Yu

[Coursework]

Yihao SHEN, 20125305
Long LIN, 20125139
Xinyu CHEN, 20125057
Yelin SHAN, 20124956

[Word Count: 1357]

Section A
The client of our predictive model is an insurance company that runs a flight delay insurance business. Flight
delay insurance is a commercial insurance product under which the insurer pays the contractually agreed
compensation whenever a covered delay occurs. Salas (2021) reported that the average delayed departure rate
of major U.S. airlines was around 18% from 2000 to 2020. Meanwhile, Lu et al. (2021) indicated that flight
delay prediction is important for the pricing and operation of flight delay insurance. If the insurer cannot
accurately predict the delay rate, it will set an unprofitable premium and later incur a loss. To reduce such
losses, insurance companies now cooperate more closely with financial institutions, airlines, and travel
websites. Using the information from each of these organizations, insurers have tried to design reasonable
insurance clauses and to use prediction models to calculate the insurance rate (Xu, 2013). Integrated
information is also used in this paper to forecast flight delays more effectively. The target variable of this
model is whether the flight is delayed: “0” indicates that the flight takes off on time or is delayed by at
most 15 minutes, and “1” indicates that the flight is delayed by more than 15 minutes. According to Lock
(2021), airline and airport factors are key contributors to flight delays. Therefore, this paper first
examines the influence of these factors and then explores other related factors.
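
The target encoding described above can be illustrated with a minimal sketch. The function name and the sample delay values are hypothetical; only the 15-minute threshold comes from the text.

```python
DELAY_THRESHOLD_MINUTES = 15  # threshold stated in the text


def delay_label(departure_delay_minutes: float) -> int:
    """Return 1 if the flight departs more than 15 minutes late, otherwise 0."""
    return 1 if departure_delay_minutes > DELAY_THRESHOLD_MINUTES else 0


# Hypothetical departure delays in minutes (a negative value means an early departure).
sample_delays = [-3, 0, 15, 16, 42]
labels = [delay_label(d) for d in sample_delays]
print(labels)
```

A delay of exactly 15 minutes is labelled “0” here, matching the wording "more than 15 minutes" for class “1”.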

Section B
The dataset has 24 attributes, including 23 feature variables and 1 target variable. These feature variables
can be roughly classified into six groups:

Then, the relationships between the features and the target variable are explored. To make the coefficients
comparable, we standardized all feature variables. A logistic regression was then fitted on the standardized
data, and the coefficient of each feature variable on the target variable was collected in the following table.
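
The standardize-then-fit procedure can be sketched roughly as follows. This is not the Orange workflow used in the coursework: it is a plain-Python illustration with a hand-written gradient-descent fit, and the toy values, learning rate, and epoch count are all assumptions. Only the two feature names are taken from the text.

```python
import math


def standardize(column):
    """Z-score a column: subtract its mean and divide by its standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / (std or 1.0) for x in column]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    n, k = len(rows), len(rows[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += error
            for j in range(k):
                grad_w[j] += error * x[j]
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical toy data, not taken from the coursework dataset.
feature_names = ["AVG_MONTHLY_PASS_AIRLINE", "SEGMENT_NUMBER"]
passengers = [10, 20, 30, 40, 50, 60, 70, 80]
segments = [1, 2, 1, 2, 1, 2, 1, 2]
delayed = [1, 1, 1, 0, 1, 0, 0, 0]

columns = [standardize(passengers), standardize(segments)]
rows = list(zip(*columns))
weights, bias = fit_logistic(rows, delayed)

# Rank features by absolute standardized coefficient, largest first.
ranking = sorted(zip(feature_names, weights), key=lambda kv: abs(kv[1]), reverse=True)
for name, weight in ranking:
    print(f"{name}: {weight:+.3f}")
```

Because every column is on the same scale after standardization, the absolute size of each coefficient can be compared across features, which is the reason for standardizing before the fit.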

Furthermore, graphs (Figure 3) that illustrate these relationships are drawn. Starting from the largest
absolute coefficient in the table above, “AVG_MONTHLY_PASS_AIRLINE” has a remarkable impact on delay: the
probability of delay is higher at low passenger numbers than at high ones. The same pattern can be found in
the feature “AIRLINE_FLIGHTS_MONTH”, which in practical terms measures how many flights a given airline
operates. The feature “SEGMENT_NUMBER” indicates the flight frequency of a given aircraft, and it has a
negative relationship with delay. Further, “Evening” is the period with the highest delay rate, 23.51%, as
the bar chart indicates.
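
A delay rate per departure period, like the one read from the bar chart, can be computed with a simple group-by. The sketch below uses hypothetical records and period names; it does not reproduce the 23.51% figure, which comes from the coursework dataset.

```python
from collections import defaultdict


def delay_rate_by_group(records):
    """Return {group: share of delayed flights} for (group, is_delayed) pairs."""
    totals = defaultdict(int)
    delays = defaultdict(int)
    for group, is_delayed in records:
        totals[group] += 1
        delays[group] += is_delayed
    return {group: delays[group] / totals[group] for group in totals}


# Hypothetical (departure period, delayed flag) records.
records = [
    ("Morning", 0), ("Morning", 0), ("Morning", 1), ("Morning", 0),
    ("Evening", 1), ("Evening", 0), ("Evening", 1), ("Evening", 0),
]
rates = delay_rate_by_group(records)
worst_period = max(rates, key=rates.get)
print(rates, worst_period)
```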

A correlation analysis among the numerical features reveals the relationships between different features.
The table below lists the feature pairs whose correlations have the highest absolute values. For instance,
“AVG_MONTHLY_PASS_AIRPORT” and “AIRLINE_AIRPORT_FLIGHTS_MONTH” contain the same information. Additionally,
an increase in the number of planes leads to an increase in the number of flights taking off at the same
time; the size of an airplane affects the number of flights and the ground crew, and is itself affected by
the flight distance. Latitude and longitude are directly related to weather factors and airport location,
and many weather factors are related to each other (e.g., snowfall and snow depth are correlated).
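
The pairwise correlation screening described above can be sketched as follows. The column names and values are hypothetical and chosen only to illustrate the snowfall and snow depth example; the coursework itself used Orange for this step.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical numeric features.
features = {
    "snowfall": [0.0, 1.0, 2.0, 3.0, 4.0],
    "snow_depth": [1.0, 3.0, 5.0, 7.0, 9.0],
    "flight_distance": [5.0, 1.0, 4.0, 2.0, 3.0],
}

# Compute every pairwise correlation and sort by absolute value, largest first.
pairs = sorted(
    ((a, b, pearson(features[a], features[b])) for a, b in combinations(features, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)
for a, b, r in pairs:
    print(f"{a} vs {b}: {r:+.2f}")
```

Pairs near the top of this list carry largely redundant information, so one feature of each such pair is a candidate for removal.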

Section C
This part tests three models (Logistic Regression, kNN and Random Forest) to choose the best performer for
the business solution. In the modelling process, some less informative features are dropped with the
SELECT COLUMNS function to reduce computation.

Figure 5 shows the feature selection process and results for the three models. For categorical feature
selection, the prediction performance of the three models is compared with a different categorical feature
ignored each time, and the table indicates that omitting any particular categorical feature makes little
difference. Consequently, all the categorical features are input to the three models. For numeric features,
the 6 distribution graphs displayed show features whose distributions are nearly the same for both classes
and which therefore contribute little to differentiating “1” from “0”. Consequently, the remaining features
are more informative and contribute more to prediction.

Logistic regression is a model that estimates the probability of class membership for a categorical class.
Through model analysis, the weights of the attributes can be obtained to roughly understand which factors are
significant to the target. kNN is a model that classifies an instance according to its nearest neighbors. Its
main advantages are its intuitive implementation and the absence of distributional assumptions. Flight
profiles share many similarities, which gives practical meaning to using the model. The “Number of neighbors”
was set to 6. The Random Forest model is considered because of its ease of use and its resistance to
overfitting. Although its results are hard to interpret, the model often performs better than other models
empirically. For the model settings, we chose “Do not split subsets smaller than 5” and, after several
attempts, set the “Number of trees” to 30 to keep an intermediate subset size.
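
To make the neighbor-based idea concrete, here is a minimal kNN sketch with the number of neighbors set to 6 as in the text. The Euclidean distance, the tie rule (a 3-3 tie is classified as “0”), and the toy points are assumptions; the Orange kNN widget may behave differently.

```python
import math


def knn_predict(train, query, k=6):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_tuple, label) pairs with labels 0 or 1.
    Assumption: a tie (possible because k is even) is resolved as class 0.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes_for_delay = sum(label for _, label in nearest)
    return 1 if votes_for_delay > k / 2 else 0


# Hypothetical standardized flight profiles: (feature vector, delayed flag).
train = [
    ((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0),
    ((1.0, 1.0), 1), ((1.1, 0.9), 1), ((0.9, 1.2), 1), ((1.2, 1.1), 1),
]
print(knn_predict(train, (1.0, 1.1)))
print(knn_predict(train, (0.1, 0.1)))
```

Features should be on comparable scales before distances are computed, otherwise the feature with the largest range dominates the vote.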

To compare these models, some confusion-matrix-related indexes presented at the “Prediction” and “Test and
Score” nodes are first applied in the evaluation. Figure 6 reveals that the “Random Forest” model has the
best performance on the confusion-matrix-related measures. The model comparison in “Test and Score” also
indicates that Random Forest has a much higher probability of surpassing the other two models on AUC and CA.
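
The confusion-matrix-related indexes can be derived from predictions as in the sketch below. The labels and predictions are hypothetical and are not the results shown in Figure 6.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = delayed)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1
        elif actual == 1 and predicted == 0:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical actual and predicted labels.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

c = confusion_counts(y_true, y_pred)
ca = (c["TP"] + c["TN"]) / len(y_true)       # classification accuracy
precision = c["TP"] / (c["TP"] + c["FP"])    # share of predicted delays that were real
recall = c["TP"] / (c["TP"] + c["FN"])       # share of real delays that were caught
print(c, ca, precision, recall)
```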

ROC curves and AUC values also describe the performance level. According to the ROC curves and the AUC values
derived from them, Random Forest clearly has a more prominent ROC curve and a greater AUC value,
demonstrating its superiority in modelling terms.
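
AUC can be read as the probability that a randomly chosen delayed flight receives a higher predicted score than a randomly chosen on-time flight. The sketch below computes it by direct pairwise comparison; the scores for the two hypothetical models are invented for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels and predicted delay probabilities from two models.
labels = [1, 1, 0, 0, 1, 0]
scores_model_a = [0.9, 0.4, 0.6, 0.2, 0.8, 0.1]
scores_model_b = [0.7, 0.3, 0.6, 0.5, 0.4, 0.2]

auc_a = auc(labels, scores_model_a)
auc_b = auc(labels, scores_model_b)
print(round(auc_a, 3), round(auc_b, 3))
```

The pairwise form is quadratic in the number of flights, so it suits small illustrations; rank-based formulas give the same value more efficiently on large datasets.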

However, because the actual costs of the different outcomes are unbalanced, expected cost is also a crucial
criterion for evaluating models. The following figure shows our expected profit (loss) analysis based on the
confusion matrices. For the insurance company, the largest cost is incurred with a False Negative outcome:
the client pays the lower insurance price of $3 (Kahler, 2019) but receives a compensation of $50 (American
Express, 2020). The cost of a False Positive is assumed to be 0. After the calculation, it can be concluded
that the expected cost is the lowest when using the Random Forest model.
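
The expected cost calculation can be sketched as a weighted average of outcome costs over the confusion matrix. Several things here are assumptions rather than figures from the coursework: the False Negative cost is taken as compensation minus premium ($50 - $3), the True Positive and True Negative costs are set to 0 as placeholders, and the confusion counts for both models are invented.

```python
PREMIUM = 3.0        # insurance price paid by the client (from the text)
COMPENSATION = 50.0  # payout on a delayed flight (from the text)

# Assumed cost per outcome; only FN and FP are discussed in the text.
COSTS = {
    "TP": 0.0,                     # placeholder assumption
    "FP": 0.0,                     # stated as 0 in the text
    "FN": COMPENSATION - PREMIUM,  # assumed net loss on an unpredicted delay
    "TN": 0.0,                     # placeholder assumption
}


def expected_cost(counts, costs=COSTS):
    """Average cost per flight given confusion-matrix counts and per-outcome costs."""
    total = sum(counts.values())
    return sum(counts[outcome] * costs[outcome] for outcome in counts) / total


# Hypothetical confusion-matrix counts for two models on 200 flights.
model_a = {"TP": 30, "FP": 10, "FN": 20, "TN": 140}
model_b = {"TP": 40, "FP": 15, "FN": 10, "TN": 135}

cost_a = expected_cost(model_a)
cost_b = expected_cost(model_b)
print(cost_a, cost_b)
```

Under these assumptions a model is rewarded mainly for reducing False Negatives, which is why a model with more False Positives can still have the lower expected cost.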

Section D
As analyzed above, the Random Forest model is chosen as the final model to assist the client’s business
because it yields the highest expected profit. The internal working scheme of Random Forest mitigates
overfitting, which would otherwise lead the model to match only data identical to what it has already seen.
This matters especially when the records in the flight dataset differ only slightly from each other.
Moreover, this model can also benefit the client’s pricing strategy, operation process, marketing targets,
and customer retention. For example, customers who purchase flights with higher delay potential can be
charged a higher premium in advance to save costs. Also, more customer service staff can be allocated in
advance to insurance on flights with higher delay rates to rationalize human resources. It may be beneficial
to focus particularly on verifying customer information on these flight insurance orders to avoid arbitrage.
Besides, promoting insurance on flights with lower delay rates may also help increase revenue.

The Orange version we use is 3.30.1, and the instructions for applying the model to new data are shown in
Figure 9. The variability and temporality of some data are worth noting: because the data change over time,
the model will need to be updated regularly. Moreover, the target variable is significantly affected by the
conditions at the previous airport (Cheng et al., 2019), yet our dataset only provides general and simple
elements of previous airports. Therefore, additional data on various aspects of the previous airport, such
as weather factors and technical issues, may improve the model. The repetitiveness of some data is another
limitation requiring action, as mentioned in Section C. We also recommend that the client invest in
collecting the flight attendance rate and returned ticket rate for further analysis to sustain high
profitability. Hence, information such as the flight discount rate and the average monthly number of
customers per flight would be worth obtaining. Furthermore, since the insurance company pays customers a
different percentage according to how long their flight has been delayed, recording the detailed delay
duration and separating it into time bands (e.g., 15min-1h; 1h-4h; 4h-6h; over 6h) is also an effective
approach.
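
The suggested delay bands could be derived from a delay duration as in the following sketch. The band labels follow the example in the text; treating each upper bound as inclusive, and labelling delays of 15 minutes or less as "not delayed", are assumptions.

```python
# (upper bound in minutes, band label); upper bounds are assumed to be inclusive.
DELAY_BANDS = [
    (15, "not delayed"),
    (60, "15min-1h"),
    (240, "1h-4h"),
    (360, "4h-6h"),
]


def delay_band(delay_minutes: float) -> str:
    """Map a departure delay in minutes to one of the payout bands."""
    for upper_bound, label in DELAY_BANDS:
        if delay_minutes <= upper_bound:
            return label
    return "over 6h"


# Hypothetical delays in minutes.
sample = [10, 16, 60, 61, 300, 400]
bands = [delay_band(d) for d in sample]
print(bands)
```

With such bands the binary target becomes a multi-class target, so each band can be priced against its own payout percentage.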

References
American Express (2020) Worldwide Travel Inconvenience Insurance. Available at:
https://www.americanexpress.com/content/dam/amex/us/network/documents/fnbodocs/fnbo_telite_docs/Travelite_Flight%20Delay%20Insurance_Terms.pdf
[Accessed 8 December 2021].

Cheng, S. et al. (2019) Study of Flight Departure Delay and Causal Factor Using Spatial Analysis. Journal of Advanced
Transportation, 2019, pp.1-11.

Kahler, M. (2019) Flight Insurance Against Delays and Cancellations. Trip Savvy. Available at:
https://www.tripsavvy.com/flight-insurance-guide-4126743 [Accessed 8 December 2021].

Lock, S. (2021) Share of total minutes flights were delayed in the United States from 2004 to 2019*, by cause. Available
at: https://www.statista.com/statistics/481333/leading-causes-of-flight-delay-in-the-us/ [Accessed 9 December 2021].

Lu, M.D. et al. (2021) Flight Delay Prediction Using Gradient Boosting Machine Learning Classifiers. Available at:
https://www.proquest.com/docview/2535728417?pq-origsite=gscholar&fromopenview=true [Accessed 9 December
2021].

Salas, E. (2021) Share of late departures of major U.S. air carriers from 2000 to 2020. Available at:
https://www.statista.com/statistics/186280/percentage-of-late-departures-by-us-air-carriers-since-1988/ [Accessed 9
December 2021].

Xu, L. (2013) Current Status and Suggestions on Flight Delay Insurance. Available at:
https://en.cnki.com.cn/Article_en/CJFDTotal-MHFX201306007.htm [Accessed 9 December 2021].