University of Nottingham Ningbo China  


Academic Year 2021/22 Autumn Semester  
BUSI 3122 Introduction to Data Science  
Joseph Yu  

[Coursework]  







Yihao SHEN, 20125305  
Long LIN, 20125139  
Xinyu CHEN, 20125057  
Yelin SHAN, 20124956  



[Word Count: 1357]  



Section A  
The client of our predictive model is an insurance company that runs a flight delay insurance business. Flight
delay insurance is a commercial insurance product under which the insurer pays compensation on a contractual
basis whenever a covered delay occurs. Salas (2021) reported that the average delayed departure rate of major
U.S. airlines was around 18% from 2000 to 2020. Meanwhile, Lu et al. (2021) indicated that flight delay
prediction is important for the pricing and operation of flight delay insurance. If the insurance company cannot
accurately predict the delay rate, it will set an unprofitable premium, which later leads to a loss. To reduce
such losses effectively, insurance companies nowadays strengthen cooperation with financial institutions,
airlines, and travel websites. Utilizing the information from each commercial organization, insurance companies
have tried to design the insurance clauses reasonably and to use prediction models to calculate the insurance
rate (Xu, 2013). Integrated information will also be used in this paper to forecast flight delays more
effectively. The target variable of this model is whether the flight is delayed: “0” indicates that the flight
takes off on time or is delayed by no more than 15 minutes, and “1” indicates that the flight is delayed for
more than 15 minutes. Notably, airline and airport factors are key contributors to flight delays, according to
Lock (2021). Therefore, this paper will first reveal the influence of these highly influential factors and then
explore other related factors.
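
The target definition above can be written down as a small rule. This is only an illustrative sketch: the function name and the sample delay values are hypothetical, and only the 15-minute threshold comes from the text.

```python
DELAY_THRESHOLD_MIN = 15  # threshold stated in Section A


def delay_label(departure_delay_min):
    """Return 1 if the flight leaves more than 15 minutes late, otherwise 0."""
    return 1 if departure_delay_min > DELAY_THRESHOLD_MIN else 0


# Hypothetical departure delays in minutes (negative means an early departure).
sample_delays = [-3, 0, 15, 16, 120]
labels = [delay_label(m) for m in sample_delays]
print(labels)  # [0, 0, 0, 1, 1]
```

A delay of exactly 15 minutes is labelled “0” here, which matches the wording “more than 15 minutes” for class “1”.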


Section B  
The dataset has 24 attributes, including 23 feature variables and 1 target variable. These feature variables
can be roughly classified into six groups:

Then, the relationships between the features and the target variable are explored. To make the comparison
reasonable, we standardized all feature variables. Conducting logistic regression on the standardized data,
we collected the coefficient of each feature variable on the target variable as below:

Furthermore, graphs that intuitively illustrate these relationships are drawn (Figure 3). Starting from the
largest absolute value in the table above, “AVG_MONTHLY_PASS_AIRLINE” has a remarkable impact on
delay: the probability of delay is higher at low passenger numbers than at high passenger numbers.
The same pattern can be found in the feature “AIRLINE_FLIGHTS_MONTH”, which has a practical meaning:
how many flights a given airline operates. The feature “SEGMENT_NUMBER” indicates the flight frequency of
a given aircraft, and a negative relationship between it and delay can be found. Further, “Evening” is the
period with the highest delay rate, 23.51%, as the bar chart indicates.
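
The standardize-then-fit step described in this section can be sketched in plain Python. This is not the Orange workflow used for the report: the gradient-descent fitting routine, the learning rate, the number of epochs, and all data values are assumptions chosen for illustration. Because every feature is put on the same scale first, the fitted weights can be compared with each other by sign and absolute value.

```python
import math


def standardize(column):
    """Z-score a column: subtract its mean and divide by its standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    std = std or 1.0  # a constant column would otherwise divide by zero
    return [(x - mean) / std for x in column]


def log_loss(rows, labels, weights, bias):
    """Mean negative log-likelihood of the labels under the logistic model."""
    total = 0.0
    for x, y in zip(rows, labels):
        z = bias + sum(w * xi for w, xi in zip(weights, x))
        p = 1.0 / (1.0 + math.exp(-z))
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(rows)


def fit_logistic(rows, labels, lr=0.1, epochs=500):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical toy data: two numeric features and the 0/1 delay label.
passengers = [100, 120, 150, 200, 300, 350, 400, 500]
segments = [1, 2, 4, 1, 3, 5, 4, 2]
delayed = [1, 1, 0, 1, 0, 0, 1, 0]

z_pass, z_seg = standardize(passengers), standardize(segments)
rows = list(zip(z_pass, z_seg))
weights, bias = fit_logistic(rows, delayed)
print(dict(zip(["passengers", "segments"], weights)))
```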

By conducting correlation analysis among the numerical features, the relationships between different
features can be revealed. The table below lists the feature pairs whose correlation has a higher absolute value.
For instance, “AVG_MONTHLY_PASS_AIRPORT” and “AIRLINE_AIRPORT_FLIGHTS_MONTH”
contain the same information. Additionally, an increase in the number of planes leads to an increase in
the number of flights taking off at the same time; the size of an airplane affects the number of flights
and the ground crew, and is itself affected by the flight distance. Latitude and longitude information is
directly related to weather factors and airport location, and many weather factors are related to each other
(e.g., a correlation between snowfall and snow depth).
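
The correlation screening described above can be sketched as follows. This is a minimal illustration, assuming the Pearson coefficient and a cut-off of 0.8; the column names, values, and threshold are hypothetical and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def strong_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold, strongest first."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return sorted(pairs, key=lambda item: -abs(item[2]))


# Hypothetical values for three numeric features.
columns = {
    "snowfall": [0.0, 0.0, 1.0, 2.0, 4.0],
    "snow_depth": [0.0, 1.0, 1.0, 3.0, 5.0],
    "latitude": [40.0, 25.0, 33.0, 30.0, 41.0],
}
for a, b, r in strong_pairs(columns):
    print(f"{a} ~ {b}: r = {r:.2f}")
```

Pairs that pass the threshold carry largely overlapping information, so one feature of each pair is a candidate for removal before modelling.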



Section C  
This part sets out to test three models (Logistic Regression, kNN and Random Forest) and to choose the best
performer for the business solution. In the modelling process, some less informative features are skipped
through the SELECT COLUMNS function to ease computation.

Figure 5 shows the feature selection process and result for the three models. For categorical feature selection,
the prediction performance of the three models with different categorical features ignored is compared, and the
table indicates that including or excluding any particular categorical feature makes little difference.
Consequently, all the categorical features are input to the three models. For numeric features, the 6 distribution
graphs displayed show a high degree of overlap between the two classes, so these features contribute little to
differentiating “1” from “0”. Consequently, the remaining features are more informative and contribute more
to prediction.

Logistic regression is a model that estimates the probability of class membership for a categorical class.
Through model analysis, the weights of the attributes can be obtained, giving a rough understanding of which
factors are significant to the target. kNN is a model that classifies an instance according to its nearest
neighbors. Its main advantages are its intuitive implementation and the absence of distributional assumptions.
The profiles of flights share many similarities, which gives the model practical meaning here. The “Number of
neighbors” was set to 6. The Random Forest model is considered because of its ease of use and its resistance
to overfitting. Although its results are hard to interpret, the model often performs better than other models
empirically. In terms of model settings, we chose “Do not split subsets smaller than 5” and, after several
attempts, set the “Number of trees” to 30 to keep an intermediate subset size.
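
To make the neighbor idea concrete, the following sketch estimates a delay probability from the 6 nearest training flights. It is a simplified stand-in, not Orange's kNN widget: Euclidean distance and equal neighbor weights are assumptions, and the flight profiles are hypothetical points with two already standardized features.

```python
import math


def knn_delay_probability(train_rows, train_labels, query, k=6):
    """Share of delayed flights (label 1) among the k nearest training rows."""
    order = sorted(range(len(train_rows)),
                   key=lambda i: math.dist(train_rows[i], query))
    nearest = order[:k]
    return sum(train_labels[i] for i in nearest) / len(nearest)


# Hypothetical flight profiles and their 0/1 delay labels.
train_rows = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

print(knn_delay_probability(train_rows, train_labels, (0.5, 0.5), k=6))
print(knn_delay_probability(train_rows, train_labels, (0.5, 0.5), k=3))
```

Returning a share rather than a hard class avoids having to break ties, which can occur with an even number of neighbors such as 6.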

To compare these models, the confusion-matrix-related indices presented at the “Prediction” and “Test and
Score” nodes are first applied in the evaluation. Figure 6 reveals that the Random Forest model
has the best performance on the confusion-matrix-related measures. The model comparison
in “Test and Score” also indicates that Random Forest has a much higher probability of surpassing the
other two models on AUC and CA.
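
The confusion-matrix indices mentioned above can be derived from actual and predicted labels as in the sketch below. The label vectors are hypothetical, and treating “1” (delayed) as the positive class is an assumption.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP and TN, with 1 (delayed) as the positive class."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1:
            counts["TP" if p == 1 else "FN"] += 1
        else:
            counts["FP" if p == 1 else "TN"] += 1
    return counts


def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summary(counts):
    """Classification accuracy (CA), precision and recall from the counts."""
    total = sum(counts.values())
    return {
        "CA": ratio(counts["TP"] + counts["TN"], total),
        "precision": ratio(counts["TP"], counts["TP"] + counts["FP"]),
        "recall": ratio(counts["TP"], counts["TP"] + counts["FN"]),
    }


# Hypothetical labels for ten flights.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts)   # {'TP': 2, 'FN': 2, 'FP': 1, 'TN': 5}
print(summary(counts))
```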


ROC curves and AUC values also explain the performance level. According to the ROC curves and the AUC
values derived from them, Random Forest clearly has a more salient ROC curve and a greater AUC value,
demonstrating its superiority in modelling terms.
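
For reference, AUC can be read as the probability that a randomly chosen delayed flight receives a higher predicted score than a randomly chosen on-time flight. The sketch below computes it by direct pairwise comparison; the labels and scores are hypothetical, and this is not how Orange computes the value internally.

```python
def auc(actual, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # delayed flight ranked above on-time flight
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted delay probabilities.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(actual, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a larger AUC indicates the stronger model.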


However, because the actual costs of the target outcomes are unbalanced, expected cost is also a crucial
criterion for evaluating the models. The following figure shows our expected profit (loss) analysis based on
the confusion matrices. For the insurance company, the largest cost is incurred with a False Negative outcome:
the client pays an insurance price that is $3 lower (Kahler, 2019) but receives a compensation of $50
(American Express, 2020). The cost of a False Positive is assumed to be 0. After the calculation, it can be
concluded that the expected cost is lowest when using the Random Forest model.
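
The expected-cost comparison can be sketched as the cost-weighted share of errors per flight, (FN x cost_fn + FP x cost_fp) / N. The confusion counts below are hypothetical and are not the values from the report. The False Negative cost is left as a parameter because the text does not state how the $3 and $50 figures are combined; the value 50.0 used here is only the compensation amount.

```python
def expected_cost(counts, cost_fn, cost_fp=0.0):
    """Average cost per flight implied by a confusion matrix and per-error costs."""
    total = sum(counts.values())
    return (counts["FN"] * cost_fn + counts["FP"] * cost_fp) / total


# Hypothetical confusion counts for two candidate models on 200 flights.
model_a = {"TP": 20, "FN": 30, "FP": 10, "TN": 140}
model_b = {"TP": 35, "FN": 15, "FP": 25, "TN": 125}

for name, counts in (("model_a", model_a), ("model_b", model_b)):
    print(name, expected_cost(counts, cost_fn=50.0))
# model_a 7.5
# model_b 3.75
```

With a False Positive cost of 0, only the number of False Negatives drives the result, so the model that misses fewer delayed flights has the lower expected cost.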



Section D  
As analyzed above, the Random Forest model is chosen as the final model to assist the client’s business
because it yields the highest expected profit. The intrinsic working scheme of Random Forest mitigates
overfitting, which would otherwise lead to a model that only matches the exact same data, especially when
the records in the flight dataset differ only slightly from each other. Moreover, this model can also benefit
the client’s pricing strategy, operation process, marketing targets, and customer retention. For example,
customers who purchase flights with higher delay potential can be charged a higher premium in advance to
save costs. Also, more customer service staff can be allocated in advance to insurance on flights with higher
delay rates to rationalize human resources. It may be beneficial to focus particularly on verifying customer
information on these flight insurance orders to avoid arbitrage. Besides, promoting insurance on flights with
lower delay rates may also help increase revenue.

The Orange version we use is 3.30.1, and the instructions for applying the model to new data are shown in
Figure 9. The variability and temporality of some data are worth noting: because the data change over time,
the model will need to be updated regularly. Moreover, the target variable is affected significantly
by all the conditions at the previous airport (Cheng et al., 2019). Yet our dataset provides only general and
simple elements of previous airports. Therefore, additional data on various aspects of the previous
airport, such as weather factors and technical issues, may be beneficial for improving the model. The
repetitiveness of some data, as mentioned in Section C, is another limitation requiring action. We also
recommend that the client invest in data on the flight attendance rate and returned ticket rate for further
analysis to sustain high profitability. Hence, information such as the flight discount rate and the average
monthly number of customers per flight would be worth obtaining. Furthermore, since the insurance
company pays a different percentage to customers according to how long their flight has been delayed,
obtaining the detailed delay duration and separating it into time periods (e.g., 15min-1h;
1h-4h; 4h-6h; over 6h) is also an effective approach.
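
The proposed split into delay periods could look like the sketch below. The period labels follow the example in the text; treating each upper bound as inclusive and the name of the first bucket are assumptions, and no payout percentages are attached because the text does not give them.

```python
from bisect import bisect_left

# Upper bounds of the first four buckets, in minutes (15min, 1h, 4h, 6h).
BOUNDS_MIN = [15, 60, 240, 360]
PERIODS = ["not delayed (<=15min)", "15min-1h", "1h-4h", "4h-6h", "over 6h"]


def delay_period(delay_min):
    """Map a departure delay in minutes to its period label."""
    return PERIODS[bisect_left(BOUNDS_MIN, delay_min)]


# Hypothetical departure delays in minutes.
for minutes in (10, 16, 60, 200, 300, 400):
    print(minutes, delay_period(minutes))
```

Replacing the binary target with these periods would turn the task into a multi-class prediction, one class per payout tier.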













References
American Express. (2020). Worldwide Travel Inconvenience Insurance. Available at:
https://www.americanexpress.com/content/dam/amex/us/network/documents/fnbodocs/fnbo_telite_docs/Travelite_Flight%20Delay%20Insurance_Terms.pdf
[Accessed 8 December 2021].

Cheng, S. et al. (2019). Study of Flight Departure Delay and Causal Factor Using Spatial Analysis. Journal of Advanced
Transportation, 2019, pp.1-11.

Kahler, M. (2019). Flight Insurance Against Delays and Cancellations. Trip Savvy. Available at:
https://www.tripsavvy.com/flight-insurance-guide-4126743 [Accessed 8 December 2021].

Lock, S. (2021). Share of total minutes flights were delayed in the United States from 2004 to 2019*, by cause. Available
at: https://www.statista.com/statistics/481333/leading-causes-of-flight-delay-in-the-us/ [Accessed 9 December 2021].

Lu, M.D. et al. (2021). Flight Delay Prediction Using Gradient Boosting Machine Learning Classifiers. Available at:
https://www.proquest.com/docview/2535728417?pq-origsite=gscholar&fromopenview=true [Accessed 9 December
2021].

Salas, E. (2021). Share of late departures of major U.S. air carriers from 2000 to 2020. Available at:
https://www.statista.com/statistics/186280/percentage-of-late-departures-by-us-air-carriers-since-1988/ [Accessed 9
December 2021].

Xu, L. (2013). Current Status and Suggestions on Flight Delay Insurance. Available at:
https://en.cnki.com.cn/Article_en/CJFDTotal-MHFX201306007.htm [Accessed 9 December 2021].