Date: 2023-10-22

—————————————————————————
Project Report for ENGG2112
Road traffic accidents prediction
—————————————————————————
Faculty of Engineering
October 3, 2023
Executive Summary
1. Introduction
Road traffic accidents are a widespread problem that not only threatens lives but also
causes loss of economic productivity and social disruption. Accidents typically result
from the interaction of multiple variables, such as road conditions, driver behavior and
environmental factors. Developing models to predict the severity of these accidents and
identify root causes can help in designing effective preventive measures. These efforts have
the potential to reduce road deaths, injuries and economic losses while enhancing public
safety and traffic management.
This assignment also provides a platform of study for team members with backgrounds in
software engineering and electrical and electronic engineering. The software engineering
skills provide opportunities for data analysis, machine learning model development, and
application implementation. At the same time, electronics and electrical engineering
expertise can provide valuable insights into intelligent transportation systems.
2. Objectives and Problem Statement
1. Build accurate models to classify the severity of traffic accidents.
2. Make predictions based on the models built.
3. Identify the essential factors (based on indicators such as variable importance and
statistical significance) that affect the severity of traffic accidents and, according to their
respective effects, give suggestions to relevant departments and drivers.
4. Evaluate whether there are significant differences in traffic accidents across four states
of the United States.
3. Methodology
3.1 Data preprocessing
The dataset contains many missing values. To fill these gaps, missing entries in factor-type
variables are replaced with the mode, and NAs in continuous variables are replaced with the
mean. Random forest imputation will also be attempted.
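The mode/mean rule described above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual preprocessing code: the function name `impute_simple`, the column names and the toy values are all hypothetical.

```python
import pandas as pd


def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NAs with the mean (numeric columns) or the mode (all other columns)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() ignores NAs; take the first value if several are tied.
            out[col] = out[col].fillna(out[col].mode(dropna=True).iloc[0])
    return out


# Hypothetical toy data, not taken from the accident dataset.
toy = pd.DataFrame(
    {
        "weather": ["Fine", None, "Rain", "Fine"],
        "speed_limit": [30.0, 60.0, None, 30.0],
    }
)
clean = impute_simple(toy)
```

A column that is entirely missing has no mode, so it would need to be dropped or handled separately before calling this function.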
3.2 Classification
This section addresses objectives 1, 2 and 3.
3.2.1 Discriminant analysis
We will perform linear discriminant analysis (LDA) and quadratic discriminant analysis
(QDA), which are based on the multivariate Gaussian distribution and Bayes' theorem.
However, as most of the independent variables are of factor type, these methods are
theoretically a poor fit. They will therefore serve as a baseline against which to compare
methods that are more suitable for factor-type variables.
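A baseline of this kind could be obtained with scikit-learn roughly as shown here. The data is synthetic (generated by `make_classification`) and only stands in for the accident dataset, so the scores it produces say nothing about the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score

# Synthetic three-class data standing in for the accident dataset (assumption).
X, y = make_classification(
    n_samples=600,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

baselines = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

# Mean 5-fold cross-validated accuracy for each baseline model.
baseline_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in baselines.items()
}
```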
3.2.2 K Nearest Neighbor
One method we can use is K Nearest Neighbor (KNN), which can handle both discrete and
continuous features such as location coordinates, road classifications, speed limits, and
weather conditions. Specifically, the attributes selected for prediction include 'Location
Coordinates', 'Weather Conditions', 'Speed Limit', and 'Time', with the aim of predicting
'Accident Severity' or identifying 'High-Risk Areas'. The dataset is then split into training
and test subsets in an 80-20 ratio. Initially, a K-value of 5 is selected, and K is then tuned
through cross-validation for better performance. After training, the model is assessed on
the test subset using evaluation metrics including accuracy and F1 score.
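The KNN workflow above (80-20 split, K starting at 5, tuning by cross-validation, then accuracy and F1 on the test subset) might look like the following in scikit-learn. The synthetic data, the candidate K values, the feature scaling step and the macro-averaged F1 are assumptions made for this sketch; the report does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the accident features (assumption).
X, y = make_classification(
    n_samples=600,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

# 80-20 split, stratified so each severity class appears in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# KNN is distance-based, so features are standardized first; K starts at 5.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Tune K by 5-fold cross-validation on the training subset only.
search = GridSearchCV(
    pipe,
    {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11]},
    cv=5,
    scoring="f1_macro",
)
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
y_pred = search.predict(X_test)
acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
```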
3.2.3 Naïve Bayes
In the Naïve Bayes approach, we operate under the assumption that the features are
independent of each other given the class. For classification, we focus on the target variable
'Accident Severity', which is categorized into three distinct levels: slight, serious, and fatal.
Contributing features selected for the classification model include road classification,
weather conditions, time of day, and speed limit. These attributes serve as the foundation
for predicting accident severity, enabling us to employ Naïve Bayes as a method for risk
assessment. We partition the data into a 70-30 split for training and testing to evaluate the
model's accuracy.
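To make the independence assumption concrete, the sketch below fits scikit-learn's `CategoricalNB` on a 70-30 split and then rebuilds one prediction by hand as prior times the product of per-feature likelihoods. The integer-coded features and the labels are randomly generated placeholders, so the model learns nothing meaningful; the sketch only illustrates the mechanics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
n = 500

# Hypothetical integer-coded features: road class, weather, time band, speed band.
n_categories = [4, 3, 4, 5]
X = np.column_stack([rng.integers(0, k, size=n) for k in n_categories])

# Placeholder labels (0 = fatal, 1 = serious, 2 = slight), drawn at random.
y = rng.choice([0, 1, 2], size=n, p=[0.05, 0.20, 0.75])

# 70-30 split, stratified by severity class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# min_categories guards against a category that is absent from the training subset.
model = CategoricalNB(min_categories=n_categories).fit(X_train, y_train)
proba = model.predict_proba(X_test)

# Rebuild the first prediction by hand: log prior + sum of per-feature log likelihoods.
x0 = X_test[0]
joint = model.class_log_prior_ + sum(
    model.feature_log_prob_[j][:, x0[j]] for j in range(X.shape[1])
)
manual = np.exp(joint - joint.max())
manual /= manual.sum()
```

Here `manual` matches `proba[0]`, which shows that the classifier treats each feature's contribution as a separate factor.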
3.2.4 Logistic regression
Logistic regression is more suitable for factor-type variables, which it handles through
indicator (dummy) variables. It is also a good choice for multi-class classification (in this
case, the three levels of traffic accident severity).
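As an illustration of indicator variables, the following sketch one-hot encodes two factor columns with `pandas.get_dummies` and fits a multi-class logistic regression. The nine rows, the column names and the category labels are invented for the example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented toy rows; not taken from the accident dataset.
df = pd.DataFrame(
    {
        "weather": ["Fine", "Rain", "Snow"] * 3,
        "road_class": ["A", "B", "A", "B", "A", "B", "A", "B", "A"],
        "severity": ["slight", "serious", "fatal"] * 3,
    }
)

# One indicator column per factor level; drop_first removes the reference level
# ("Fine" and "A") so the indicators are not perfectly collinear.
X = pd.get_dummies(df[["weather", "road_class"]], drop_first=True, dtype=float)
y = df["severity"]

model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)  # one column per severity level
```

Each fitted coefficient is then read relative to the dropped reference level of its factor.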
3.2.5 Random forest
Random forest is an ensemble learning method that has proved to be efficient and accurate.
Its tree-based structure also allows us to easily evaluate variable importance and therefore
identify the variables that matter most for predicting, as well as preventing, traffic accidents.
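Variable importance can be read directly from a fitted scikit-learn forest, as sketched below. The data is synthetic and the generic feature names are placeholders, so the resulting ranking is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first three columns are the informative ones.
X, y = make_classification(
    n_samples=600,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    shuffle=False,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Impurity-based importances can favor features with many distinct values, so a permutation-based check on held-out data is a reasonable complement.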
3.3 Comparing the performance in four areas
This section evaluates the frequency and severity of traffic accidents in four areas in the
United Kingdom (i.e. objective 4). The main method used is analysis of variance (ANOVA),
a popular method for comparing the mean levels of different groups.
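A one-way ANOVA across four groups can be run with SciPy as in this sketch. The region names and the numbers are made up purely to show the call; they are not measurements from the dataset.

```python
from scipy.stats import f_oneway

# Made-up observations for four regions (hypothetical names and values).
region_values = {
    "region_a": [1.0, 2.0, 3.0],
    "region_b": [2.0, 3.0, 4.0],
    "region_c": [3.0, 4.0, 5.0],
    "region_d": [4.0, 5.0, 6.0],
}

# Null hypothesis: all four regions share the same mean.
result = f_oneway(*region_values.values())
f_stat, p_value = result.statistic, result.pvalue

# Reject the null at the 5% level if the p-value is below 0.05.
significant = p_value < 0.05
```

For these toy values the between-group mean square is 5 and the within-group mean square is 1, giving F = 5 on (3, 8) degrees of freedom.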
3.4 Simulation Environment
The simulations used an 80-20 training-testing split and were run in PyCharm with
Anaconda. Random cross-validation with ten instances was performed, with the final model
chosen by averaging the four most accurate models.
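One possible reading of this procedure is sketched below: ten random train/validation splits, one model per split, and a final prediction formed by averaging the class probabilities of the four models with the best validation accuracy. The report does not give these details, so the use of `ShuffleSplit`, the random forest as the model, and probability averaging are all assumptions, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, train_test_split

# Synthetic three-class data standing in for the accident dataset (assumption).
X, y = make_classification(
    n_samples=600,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

# Hold out 20% for final testing.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Ten random train/validation splits of the remaining 80%.
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scored = []
for train_idx, val_idx in splitter.split(X_dev):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_dev[train_idx], y_dev[train_idx])
    scored.append((model.score(X_dev[val_idx], y_dev[val_idx]), model))

# Keep the four models with the highest validation accuracy.
top_four = [m for _, m in sorted(scored, key=lambda t: t[0], reverse=True)[:4]]

# Average their predicted class probabilities and pick the most likely class.
avg_proba = np.mean([m.predict_proba(X_test) for m in top_four], axis=0)
y_pred = top_four[0].classes_[avg_proba.argmax(axis=1)]
test_accuracy = float((y_pred == y_test).mean())
```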
4.2 Issues Faced
The coding of the simulations was not entirely problem-free. We faced the following issues,
listed in chronological order, and dealt with them as described.
• The data preprocessing problem:
• The few-sample problem, where the model does not fit well on classes with few samples:
5 Potential for Wider Adoption
It is encouraging that this short project has produced results that appear to be competitive
against the methods proposed by highly regarded research groups. We envisage that future
work might include:
• using the imbalanced-learn library to process the data
•
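As a rough idea of what such rebalancing involves, the sketch below performs random oversampling in plain NumPy: minority-class rows are duplicated at random until every class matches the majority count. It is a stand-in written for illustration, not the imbalanced-learn API (that library provides a comparable `RandomOverSampler` class), and the function name `random_oversample` and the toy labels are hypothetical.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes are equally frequent."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # always keep every original row
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            keep.append(rng.choice(idx, size=target - count, replace=True))
    order = np.concatenate(keep)
    return X[order], y[order]


# Toy imbalanced labels: 2 fatal (0), 5 serious (1), 20 slight (2).
y_toy = np.array([0] * 2 + [1] * 5 + [2] * 20)
X_toy = np.arange(len(y_toy), dtype=float).reshape(-1, 1)

X_bal, y_bal = random_oversample(X_toy, y_toy)
balanced_counts = np.unique(y_bal, return_counts=True)[1]
```

Resampling of this kind should be applied to the training subset only, so that the test subset keeps the original class proportions.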
6 Final results
6.1 Random Forests:
The overall accuracy reached 0.86, indicating generally good performance by the model.
For category 0 (fatal), despite an accuracy of 0.87, the recall is only 0.13, meaning that
about 87% of the actual class 0 samples were not correctly classified.
For categories 1 (serious) and 2 (slight), the model demonstrated relatively high accuracy,
especially with a recall of 0.99 for category 2, showing its stronger performance on these
categories.
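The gap between a high overall accuracy and a low recall on a rare class follows from how the metrics are defined, which the short sketch below works through. The confusion matrix is entirely hypothetical and is not the project's result; it only shows how the three quantities are computed.

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual class, columns = predicted class).
# Class order: 0 = fatal, 1 = serious, 2 = slight. Numbers are invented.
cm = np.array(
    [
        [2, 3, 5],    # 10 actual fatal cases, only 2 predicted as fatal
        [0, 20, 10],  # 30 actual serious cases
        [0, 1, 159],  # 160 actual slight cases
    ]
)

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actual count
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted count
```

In this toy case the accuracy is 0.905 while the recall for class 0 is only 0.2, because the ten fatal cases carry little weight among 200 samples.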
6.2 KNN:
The overall accuracy stands at 0.56, lower than that of the Random Forests model.
Across all three categories, precision and recall appear relatively balanced, though
neither reaches particularly high levels. This may be attributable to KNN's weaker
performance on high-dimensional data.
6.3 Naive Bayes:
The overall accuracy is the lowest among the three models, standing at 0.35.
As with Random Forests, category 0 has a lower recall, while category 2 has a higher
one. Moreover, despite its cross-validation score of 0.35, its performance on the test set is
comparatively weaker. This means that, on this dataset, the Naive Bayes model should not
be used.