COMP9417 Machine Learning Project: Feedback Prize -
Predicting Effective Arguments
Jingyang Ma* Kanghong Yu* Yinglu Yang* Yuan Liu* Yuxin Meng*
z5380704 z5413510 z5390317 z5351688 z5348583
1. Introduction
The secret to success is writing. Argumentative writing, in particular, develops civic
engagement and critical thinking skills and can be strengthened through repeated
practice. However, only 13% of eighth-grade teachers mandate weekly persuasive
writing assignments for their pupils. In addition, black and Hispanic students are more
likely than their white peers to write at a "below basic" level due to resource limitations.
Automated feedback tools make it easier for teachers to grade students' writing
assignments, which in turn helps students develop their writing skills.
Although many automated writing feedback tools are available, they all have
drawbacks, especially when it comes to argumentative writing. Existing tools are
often unable to evaluate an argument's quality in terms of its organization,
support, and idea development. Additionally, many of these writing tools are
prohibitively expensive for teachers, which has a significant negative impact on already
underfunded schools. GSU and Learning Institution Labs have teamed up to encourage
data scientists to enhance automated writing assessments in order to better prepare all
students. Moreover, this public effort might serve as a catalyst for more user-friendly
and high-quality automated writing tools. If these efforts succeed, students will
receive more feedback on the argumentative part of their writing and be able to apply
this skill across a variety of academic fields.
This project is based on the Kaggle competition Feedback Prize - Predicting Effective
Arguments. The aim of the competition is to classify the argumentative elements of
student writing as "effective", "adequate" or "ineffective". To reduce bias, we develop
a model trained on data representative of the population of American students in grades
6 through 12. Models developed from this competition will give students improved
feedback on their argumentative writing. With
automated guidance, students can finish more tasks and eventually develop into
competent, self-assured writers. The dataset, provided by Kaggle, contains
argumentative essays written by students in grades 6-12 in the United States. These
essays are annotated by expert reviewers on common discourse elements in
argumentative writing.
In this project, we pre-processed the dataset and built both machine learning and deep
learning classification models to predict the label ("effective", "adequate" or
"ineffective") required by the task. We measure how good each model is by its
classification accuracy.
2. Implementation
This section is divided into two parts. The first part builds a classification model
based on simple machine learning algorithms; the second builds a classification model
using deep learning algorithms. Different feature extraction and data cleaning steps
are implemented in each part, and we analyse by comparison which model is more
suitable for the project. Finally, the model to be used is determined by comparing
accuracy and output results.
2.1 Basic machine learning algorithms model
2.1.1 Data Exploration
First of all, we need some understanding of the dataset before we process it, so we
perform a general analysis of the data (a quick sketch of this exploration is shown
after Figure 1). There are 36,765 examples in the training set, each with five columns:
discourse_id, essay_id, discourse_text, discourse_type and discourse_effectiveness.
There are no missing values in the dataset. All three levels of discourse effectiveness
appear across the different discourse types, as shown in the following figure,
indicating that the type of argument alone does not determine the final argumentative
effectiveness.
Figure 1 Statistics of discourse_type and discourse_effectiveness
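As a minimal sketch of this initial exploration, assuming the competition's train.csv
has been downloaded into the working directory:

```python
import pandas as pd

train = pd.read_csv("train.csv")

print(train.shape)         # expected: (36765, 5)
print(train.isna().sum())  # confirm there are no missing values
# Cross-tabulate discourse_type against discourse_effectiveness (cf. Figure 1).
print(pd.crosstab(train["discourse_type"], train["discourse_effectiveness"]))
```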
There are cases where the discourse text and discourse type are the same but the
discourse effectiveness differs, as shown in the first and last rows of the figure
below, which indicates that the essay a sentence belongs to also affects the final
prediction.
Figure 2 Same text and type, different effectiveness
Furthermore, the majority of the discourse texts are shorter than 200 words, so they
can be handled by all of the models that follow.
Figure 3 word count of discourse_text
2.1.2 Data processing
The most common approach is the bag-of-words model, which tokenises the words in a
document and represents them as a vector of word counts, with each position in the
vector corresponding to a different token. In our project, we use scikit-learn's
TfidfVectorizer to compute the TF-IDF value of each feature word in a text, which
considers not only the number of occurrences of the word in the text but also how
concentrated the word's distribution is across documents (i.e. the IDF value).
TfidfVectorizer has several parameters we experimented with; notable ones include:
• ngram_range: tuple (min_n, max_n). The lower and upper bounds of the range of
n-gram sizes to extract; all values of n with min_n <= n <= max_n are used.
• norm: 'l1', 'l2', or None, optional. The norm used to normalise term vectors;
None means no normalisation.
• smooth_idf: boolean, optional. Smooths the IDF weights by adding 1 to the
document frequencies, as if an extra document were seen, preventing division by zero.
Firstly, by inspecting the dataset, we see that the target text contains many redundant
features, such as punctuation, capital letters, numbers and other special symbols.
When these are removed in the usual way, contractions of tense are broken apart,
leaving meaningless single letters in the cleaned data. Therefore, when cleaning the
data, we cleaned it according to part of speech so that meaningful words and word
combinations are properly separated during tokenisation. The figure below compares
the original and cleaned data.
Figure 4 Comparison of the original data and cleaned data
Then we use train_test_split from scikit-learn, taking X (all columns except
discourse_effectiveness) and y (discourse_effectiveness only) as inputs, to split off a
certain proportion of test data from the training data. After that, we one-hot encode
discourse_type, vectorise the cleaned text with the TF-IDF vectoriser, and use
sparse.hstack to combine the two sets of features; the combined features become
x_train and x_test. Meanwhile, we apply LabelEncoder to encode the labels, which
makes them straightforward for the models to handle. A sketch of this pipeline is
shown below.
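The following is a minimal sketch of the feature-extraction pipeline just described.
It assumes the training dataframe `train` from the exploration above has a `clean_text`
column produced by the cleaning step; the split ratio and TfidfVectorizer settings are
illustrative rather than the exact values we tuned.

```python
from scipy import sparse
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

X = train[["clean_text", "discourse_type"]]
y = LabelEncoder().fit_transform(train["discourse_effectiveness"])

X_tr, X_te, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TF-IDF features for the cleaned discourse text.
tfidf = TfidfVectorizer(ngram_range=(1, 2), norm="l2", smooth_idf=True)
text_train = tfidf.fit_transform(X_tr["clean_text"])
text_test = tfidf.transform(X_te["clean_text"])

# One-hot encoding for discourse_type.
ohe = OneHotEncoder(handle_unknown="ignore")
type_train = ohe.fit_transform(X_tr[["discourse_type"]])
type_test = ohe.transform(X_te[["discourse_type"]])

# Combine the two sparse feature blocks into the final design matrices.
x_train = sparse.hstack([text_train, type_train]).tocsr()
x_test = sparse.hstack([text_test, type_test]).tocsr()
```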
2.1.3 Machine learning algorithms
2.1.3.1 Logistic regression
The principle of logistic regression is to use the logistic function to map the output of
linear regression from (-∞, +∞) to (0, 1).
The hyperparameters we considered for logistic regression were class_weight, max_iter
and the inverse of the regularisation strength, C. C is used to avoid overfitting; the
tested values ranged from 1 to 2 in increments of 0.1. Since the sample data is
imbalanced ("adequate" accounts for more than half of the labels), we set class_weight
to either None or "balanced". max_iter caps the number of iterations the solver may run
before stopping; larger values give the solver more time to converge at the cost of
longer training time.
Then we used grid search (see the sketch below) to find the optimal parameters. The
highest accuracy of logistic regression is 0.66, obtained with C = 1, max_iter = 500
and class_weight = None.
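A sketch of this grid search, under the assumption that x_train and y_train are the
combined features and encoded labels from Section 2.1.2; the max_iter candidates shown
are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [round(1 + 0.1 * i, 1) for i in range(11)],  # 1.0, 1.1, ..., 2.0
    "max_iter": [100, 500, 1000],                     # illustrative candidates
    "class_weight": [None, "balanced"],               # handle the label imbalance
}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)
```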
2.1.3.2 Naive Bayes
Naïve Bayes algorithm is based on Bayes Theorem, it assumes that each feature is
independent given the value of label and predicts by computing the probability of each
instance for a given label and choosing the one with the highest probability.
In our project, we used two classic naive Bayes variants used in text classification,
Multinomial Naive Bayes (MNB) and Bernoulli Naive Bayes (BNB). The main
difference between BNB and MNB is that BNB is based on the occurrence of a word,
while MNB is based on word frequency. For both BNB and MNB, we changed the
hyperparameter Laplace Smoothing α from 0 to 1 and the step is 0.1, which assigns a
small probability to unseen words, to find the best BNB and MNB model. The highest
accuracy of MNB is 0.643 when alpha is 0.1, and the one of BNB is 0.631 when alpha
is 1.
Figure 5 The accuracy of MNB and BNB with different alpha
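A sketch of the alpha sweep, reusing the x_train/x_test split from Section 2.1.2:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

for name, nb_cls in [("MNB", MultinomialNB), ("BNB", BernoulliNB)]:
    # alpha=0 is clipped internally by scikit-learn and only issues a warning.
    for alpha in np.round(np.arange(0.0, 1.1, 0.1), 1):
        model = nb_cls(alpha=alpha).fit(x_train, y_train)
        print(f"{name} alpha={alpha:.1f} accuracy={model.score(x_test, y_test):.3f}")
```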
2.1.3.3 SVM
Support Vector Machine (SVM) is a kind of supervised learning method, which can be
widely used in statistical classification and regression analysis. It maps the vector to a
higher dimensional space and establishes a hyperplane of maximum spacing in that
space. Two hyperplanes parallel to the separating hyperplane lie on either side of it,
and the separating hyperplane is chosen so that the distance between these two parallel
hyperplanes (the margin) is maximised.
The hyperparameter we considered for SVM is decision_function_shape. As our project is
a multi-class problem, we set it to one-vs-one (‘ovo’), a commonly used multi-class
training strategy. The accuracy of SVM is 0.656.
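A minimal sketch, again assuming x_train/y_train from Section 2.1.2 (note that
scikit-learn's SVC always trains one-vs-one classifiers internally; the
decision_function_shape argument only changes the shape of the decision function):

```python
from sklearn.svm import SVC

svm = SVC(decision_function_shape="ovo")  # one-vs-one multi-class strategy
svm.fit(x_train, y_train)
print("validation accuracy:", svm.score(x_test, y_test))
```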
2.1.3.4 LGBM
We found that the performances of single models were poor, so we decided to try some
ensemble methods. The main idea of GBDT (Gradient Boosting Decision Tree) is to train
decision trees iteratively to obtain an optimal model; it trains effectively and is not
prone to overfitting. LGBM (Light Gradient Boosting Machine) is a framework that
implements the GBDT algorithm. It supports efficient parallel training and offers
faster training speed, lower memory consumption, better accuracy, distributed support
and rapid processing of massive data.
There are a few hyperparameters we can change in LGBM. We set the boosting type to the
default, gbdt, and the objective is clearly multiclass. max_depth is the maximum depth
of each tree and helps avoid overfitting; we changed it from 3 to 7 in increments of 1.
In addition, we varied subsample, the sampling ratio of the training samples, in
increments of 0.1; it is also used to avoid overfitting and to speed up computation.
After running grid search (sketched below), we obtained a best accuracy of 0.646.
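A sketch of the LightGBM grid search, with the same x_train/y_train assumption; the
subsample range endpoints are illustrative since the exact upper bound is not stated
above:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [3, 4, 5, 6, 7],
    "subsample": [round(0.1 * i, 1) for i in range(1, 11)],
}
# subsample_freq must be positive for subsample to take effect in LightGBM.
lgbm = LGBMClassifier(boosting_type="gbdt", objective="multiclass", subsample_freq=1)
search = GridSearchCV(lgbm, param_grid, cv=5, scoring="accuracy")
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)
```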
2.1.4 Model comparison
Figure 6 The accuracy of different machine learning models
From the figure above, it is clear that none of the traditional machine learning
algorithms performs well on this classification problem; their accuracies are only
just above 0.6. This is mainly because the words in a text are closely related, so we
cannot treat a single word as an independent feature. Hence, we next try some deep
learning methods, such as LSTM and BERT.
2.2 Deep learning models
2.2.1 Recurrent neural networks
2.2.1.1 Long short-term memory (LSTM)
Long short-term memory (LSTM) [1] is a recurrent network structure designed to capture
long-range dependencies in sequential data.
The significant difference between an LSTM and a plain RNN is the additional cell-state
path, shown in the top part of the diagram, which can be considered long-term memory.
It is controlled by the forget gate $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$. The
output of $f_t$ is between 0 and 1: $f_t = 1$ represents "completely keep", while
$f_t = 0$ means "drop this value". The middle part of the diagram updates the old cell
state into the new cell state. The candidate values
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ lie between $-1$ and $+1$, and the
input gate $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ determines how much of these
candidate values should be added to the previous cell state. The new cell state is then
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. The output part is similar to an RNN;
the only difference is that the output depends on both the current cell state $C_t$ and
the previous hidden state $h_{t-1}$, with output gate
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ and output $h_t = o_t \odot \tanh(C_t)$.
Figure 7 LSTM structure (source: https://edstem.org/au/courses/8538/lessons/20862/slides/148710)
LSTM can mitigate the vanishing-gradient problem to some extent by using an additive
gradient structure with direct access to the activations of the forget gate, allowing
the network to encourage desired behaviour from the error gradient via frequent gate
updates at each time step of the learning process [2]. It can handle sequences of
around 100 words, but still struggles with sequences of 1000 words or longer [3]. In
addition, the extra gates (forget, input and output) make the network more complex:
each unit has four linear layers, so training requires a lot of resources and time when
the training set becomes large. Moreover, the current time step cannot be computed
until the previous hidden state has been computed, which prevents parallelisation
across time steps.
2.2.1.2 Gated Recurrent Unit (GRU)
The gated recurrent unit (GRU) [4] can be seen as an improved version of the standard
RNN that performs well at remembering previous states. It aims to solve the
vanishing-gradient problem that affects standard recurrent neural networks. A GRU can
also be thought of as a variant of the LSTM, because the two are constructed similarly
and, in certain situations, give similarly good results. The GRU uses two gates, an
update gate and a reset gate, to address the vanishing-gradient problem. The update
gate is calculated as $z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})$, where $x_t$ is the
input word vector at time step $t$, $h_{t-1}$ holds the information of the previous
state, and $W$ and $U$ are the weights for the input and the previous state
respectively. The reset gate $r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})$ has the same
form as the update gate. The state passed to the next time step is then calculated as
$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot h'_t$, where $z_t$ is the update gate and
$h'_t = \tanh(W x_t + r_t \odot U h_{t-1})$. Essentially, these are two vectors that
determine what information should be passed to the output. They are unique in that they
can be trained to retain knowledge from long ago without it being washed away over
time, or to discard information that is irrelevant to the prediction.
2.2.2 BERT
Bidirectional Encoder Representations from Transformers (BERT) is a model published by
Google [6] that has outstanding performance on natural language processing tasks. BERT
is designed to pre-train deep bidirectional representations from unlabelled text by
conditioning on both left and right context simultaneously across all layers. As a
result, the pre-trained BERT model can be fine-tuned with just one extra output layer
to produce state-of-the-art models for a wide range of tasks, such as question
answering and language inference, without requiring significant task-specific
architectural changes.
There are two steps in BERT: pre-training and fine-tuning. BERT’s model architecture
is a multi-layer bidirectional Transformer encoder based on the original implementation
[5]. The authors of BERT state that, in order for BERT to handle diverse downstream
tasks, its input representation can unambiguously represent either a single sentence
or a pair of sentences in one token sequence. Therefore, in their work, a "sentence"
can be two sentences packed together. The initial token of each sequence is always a
special classification token ([CLS]). For classification problems, the aggregated
sequence representation is the final hidden state corresponding to this token.
Sentence pairs are packed into a single sequence and distinguished in two ways. First,
the sentences are separated with a special token ([SEP]). Second, a learned embedding
is added to each token indicating whether it belongs to sentence A or sentence B, as
shown in Figure 8.
Figure 8 Overall pre-training and fine-tuning procedures for BERT [6]
It is worth noting that BERT's pre-training is performed on data without punctuation,
so when we apply BERT to our data, the pre-processing also removes all punctuation.
Each downstream task has a separately fine-tuned model that starts from the same
pre-trained parameters. We do not use the ideas of the BERT pre-training phase in our
project; instead, we simply take the pre-trained BERT model as-is, so we will not
discuss pre-training in depth here.
For the fine-tuning part, the authors suggest plugging the task-specific inputs and
outputs into BERT and fine-tuning all the parameters end-to-end for each task. At the
input, a single text forms a degenerate text pair for text classification. The [CLS]
representation is fed into an output layer for classification tasks such as entailment
or sentiment analysis. The results of applying BERT fine-tuning to different NLP
problems show that BERT brings a significant improvement on NLP challenges.
2.3 Models implementation and data pre-processing
2.3.1 Hyperparameter tuning using cross validation
Holdout method
The simplest and most straightforward way to validate is to split the dataset into two
parts, one for training and the other for validation. In both our machine learning and
deep learning experiments we split the provided training data this way (with a split
ratio of 0.8), using the train_test_split function in scikit-learn.
The disadvantage of the holdout method is obvious: it is sensitive to how the data is
split, so different splits can lead to different optimal models.
K-fold cross-validation
The main idea of K-fold cross-validation is to split the sample data randomly into K
(roughly equal) parts; in each round, K-1 parts are used as the training set and the
remaining part as the validation set, and this is repeated until every part has served
as the validation set. Finally, we select the best model and parameters.
The biggest problem with K-fold is that it is slow to execute, as it is run K times for
each candidate combination of hyperparameters and is not always easy to parallelise.
Furthermore, it can become impractical when the dataset is very large.
Figure 9 K-Fold
GridSearchCV and RandomizedSearchCV
In the traditional machine learning algorithms we used GridSearchCV to find the best
parameters, while in the deep learning part we made use of RandomizedSearchCV.
GridSearchCV can be split into two parts: grid search and cross-validation. Grid search
aims to find, among all candidate parameter combinations, the one with the highest
accuracy on the validation set. cv, a parameter of the GridSearchCV function, sets the
number of cross-validation folds; its default value is 5. When dealing with large
datasets or dozens of parameters, the computational cost is very high and the number of
combinations can explode.
Unlike GridSearchCV, RandomizedSearchCV draws a fixed number of parameter settings from
a specified distribution rather than trying every possible value. It works on the
principle that, if the random sample is large enough, it will find the global optimum
or something close to it. Its results may be slightly less accurate, but they are still
acceptable, and its computational cost is smaller than that of GridSearchCV. The sketch
below contrasts the two.
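A minimal sketch, using the logistic regression setting of Section 2.1.3.1 as the
example estimator; the parameter values shown are illustrative:

```python
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid search: evaluates every combination, cross-validating each with cv=5 folds.
param_grid = {"C": [1.0, 1.5, 2.0], "class_weight": [None, "balanced"]}
grid = GridSearchCV(LogisticRegression(max_iter=500), param_grid, cv=5)

# Randomized search: samples n_iter settings from a distribution over C.
param_dist = {"C": uniform(1.0, 1.0), "class_weight": [None, "balanced"]}
rand = RandomizedSearchCV(LogisticRegression(max_iter=500), param_dist,
                          n_iter=10, cv=5, random_state=42)

# grid.fit(x_train, y_train); rand.fit(x_train, y_train)
```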
2.3.2 BERT and data pre-processing
We mainly pre-process the data in two ways. Firstly, we use LabelEncoder() to transform
"discourse_type" into N integer labels and treat the result as a new feature when
training the model. Secondly, we extract the content of each essay from its text
document and concatenate the whole essay with the current sentence as a new feature,
because BERT can attend to context on both the left and the right, not only to the
content before the current position.
We implement this part based on pytorch_lightning and a pre-trained BERT network.
First, we use pandas to read the training and test sets, and then use a get_essay
function to join together the sentences that share the same essay_id. The point of this
is to let the model observe the dataset from a more global perspective: not just a
single sentence but the entire essay, so that the network can learn better with
context.
Additionally, we use LabelEncoder to digitise the labels. The reason is not that the
computer cannot learn from non-numeric labels, but that non-numeric labels carry more
redundant information, which is an unnecessary burden during training.
We design an EssayDataset class to encapsulate the pre-processed dataset and tokenise
the paragraph content with tokenizer.encode_plus inside __getitem__ (the tokenizer is
the automatic tokenizer that comes with the pre-trained BERT network). Although we
could design the model to read the data letter by letter, that training would obviously
be very inefficient. The tokenizer groups those letters into whole units beforehand,
which is equivalent to encapsulating the data at a higher level for training rather
than starting from everything unknown.
As mentioned in the previous section when we introduced the network model, some special
tokens such as [CLS] and [SEP] are added in the BERT model to help it read features
better. In our data encapsulation, we go a step further by setting "dense_feature" =
discourse_type. The reasoning is that discourse_type is a higher-level feature that
should logically sit in parallel with the feature for the whole text.
After that, we build our own FeedBackModel. The data is first fed into a pre-trained
BERT network, then dropout is applied. The concat function then merges dense_feature
into the current feature layer, and the feature layer is finally mapped to the target
labels as the output.
The Classifier class is a packaged classifier that inherits from
pytorch_lightning.LightningModule; we only need to plug in the FeedBackModel we just
designed, together with the training-related hyperparameters, to use it. A sketch of
this architecture is given below.
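A minimal sketch of the architecture described above, assuming the Hugging Face
transformers library. The class names (EssayDataset, FeedBackModel, Classifier) follow
the description in the text; the column names discourse_type_id, essay_text and label,
the dropout rate and the learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset
import pytorch_lightning as pl
from transformers import AutoModel, AutoTokenizer

class EssayDataset(Dataset):
    def __init__(self, df, tokenizer, max_len=512):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # Pack the current discourse text and the whole essay as a sentence pair,
        # so BERT sees both the sentence and its surrounding context.
        enc = self.tokenizer.encode_plus(
            row["discourse_text"], row["essay_text"], truncation=True,
            max_length=self.max_len, padding="max_length", return_tensors="pt")
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            # Encoded discourse_type, kept parallel to the text features.
            "dense_feature": torch.tensor([row["discourse_type_id"]], dtype=torch.float),
            "label": torch.tensor(row["label"], dtype=torch.long),
        }

class FeedBackModel(nn.Module):
    def __init__(self, model_name="bert-base-uncased", n_classes=3):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.2)
        # +1 for the concatenated dense_feature.
        self.fc = nn.Linear(self.bert.config.hidden_size + 1, n_classes)

    def forward(self, input_ids, attention_mask, dense_feature):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = self.dropout(out.last_hidden_state[:, 0])   # [CLS] representation
        return self.fc(torch.cat([cls, dense_feature], dim=1))

class Classifier(pl.LightningModule):
    def __init__(self, model, lr=2e-5):
        super().__init__()
        self.model = model
        self.lr = lr
        self.criterion = nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        logits = self.model(batch["input_ids"], batch["attention_mask"],
                            batch["dense_feature"])
        loss = self.criterion(logits, batch["label"])
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# Typical usage: tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased"),
# dataset = EssayDataset(train_df, tokenizer), then train Classifier(FeedBackModel())
# with a pytorch_lightning Trainer.
```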
2.3.3 LSTM and data pre-processing
LSTM has a long-term memory function, is simple to implement, and works well on natural
language tasks that need to memorise context. We use the LSTM model in Keras for this
part.
Before building the model, we imported the glove.6B.100d word vectors, which contain
400,000 words with a dimension of 100. According to the data analysis, a sentence's
type and its text together determine the effectiveness of its argumentative component,
and the word vectors correspond to individual words. Therefore, we first clean the text
data, removing punctuation, numbers and capitalisation and handling the different parts
of speech, and then concatenate the text and the text type into a single feature
(text + text type) used as the training input. We use LabelEncoder() to encode the
argumentative effectiveness labels (discourse_effectiveness) as 0, 1 and 2, and use
them as the y labels. We then use the tokenizer to tokenise the training-set text and
convert it to the corresponding indices.
According to our analysis, the average length of each text is about 500, so the maximum
length maxlen is set to 512. We use pad_sequences from Keras to pad texts shorter than
512 with zeros and to truncate sentences longer than 512. We then obtain each word's
index through the tokenizer's word_index. The dictionary size is len(word_index) + 1,
where 1 is added because index 0 is reserved for padding. Finally, we read the
glove.6B.100d file, look up every recognised word in the word vectors, and construct a
new embedding matrix, as sketched below.
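A minimal sketch of this tokenisation and embedding-matrix construction, assuming
`texts` is the list of cleaned text+type strings and the GloVe file sits in the working
directory:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen, embed_dim = 512, 100

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x = pad_sequences(sequences, maxlen=maxlen)   # pad with 0 / truncate to 512

vocab_size = len(tokenizer.word_index) + 1    # +1 because index 0 is the padding value

# Load the GloVe vectors and build the embedding matrix for recognised words.
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

embedding_matrix = np.zeros((vocab_size, embed_dim))
for word, idx in tokenizer.word_index.items():
    vec = embeddings.get(word)
    if vec is not None:
        embedding_matrix[idx] = vec
```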
Next, we build the model and set its important parameters. We first create a
Sequential() model in Keras and add an embedding layer, passing in the dictionary size
and the embedding dimension. In the current project the dictionary size is 27,746, the
number of distinct words recognised in the training set, and the input length is 512,
the length of each input sequence. A dropout layer is added after the embedding layer
to prevent overfitting. We then use a two-layer LSTM: the first LSTM layer has
dimension 128 and the second has dimension 64. This is followed by a fully connected
(dense) layer of dimension 256 with ReLU activation, and another dropout layer to
further prevent overfitting. Finally, the output layer has dimension 3 with softmax
activation, one unit for each of the three argumentative effectiveness classes. Since
the model needs to give a probability for each of the three classes, the loss is
sparse_categorical_crossentropy and the optimiser is Adam; the metric for evaluating
the model is accuracy. We use 30% of the training data as a validation set to evaluate
the model. A sketch of this model is shown below; this completes the LSTM
implementation.
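A minimal Keras sketch of the model described above, reusing embedding_matrix,
vocab_size, embed_dim and maxlen from the previous sketch; the dropout rates are
illustrative, since the exact values are not stated above:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential([
    # Embedding layer initialised with the pre-trained GloVe vectors.
    Embedding(vocab_size, embed_dim, weights=[embedding_matrix], input_length=maxlen),
    Dropout(0.2),
    LSTM(128, return_sequences=True),   # first LSTM layer, dimension 128
    LSTM(64),                           # second LSTM layer, dimension 64
    Dense(256, activation="relu"),
    Dropout(0.2),
    Dense(3, activation="softmax"),     # three effectiveness classes
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
# model.fit(x, y, validation_split=0.3, epochs=..., batch_size=...)
```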
2.3.4 GRU and data pre-processing
After implementing the LSTM model, we implemented a GRU model, which we expected to
perform similarly. The GRU is a variant of the LSTM and in some natural language
projects it even performs better, because it only uses an update gate and a reset gate
to judge the importance of the text and runs faster than an LSTM on large amounts of
data. We used the same hyperparameters as for the LSTM so that the performance of the
two models can be compared directly. Since the GRU is a variant of the LSTM, it should
also work well on this project.
We use the GRU model in Keras for this part. Before building the model, we imported the
glove.6B.100d word vectors (400,000 words, 100 dimensions). The data pre-processing is
the same as for the LSTM: we first clean the text data, removing punctuation, numbers
and capitalisation and handling the different parts of speech, then concatenate the
text and the text type into a single feature (text + text type) as the training input,
and use LabelEncoder() to encode discourse_effectiveness as 0, 1 and 2 for the y
labels. The conversion of words into word vectors is described in detail in the LSTM
section, and the steps here are identical. The only change is replacing the two LSTM
layers with a two-layer GRU; all other hyperparameters are the same as for the LSTM
(see the sketch below). Finally, we obtain the corresponding results.
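The GRU variant only swaps the recurrent layers; the rest of the model mirrors the LSTM
sketch above, under the same assumptions about embedding_matrix, vocab_size, embed_dim
and maxlen:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense, Dropout

gru_model = Sequential([
    Embedding(vocab_size, embed_dim, weights=[embedding_matrix], input_length=maxlen),
    Dropout(0.2),
    GRU(128, return_sequences=True),    # first GRU layer, dimension 128
    GRU(64),                            # second GRU layer, dimension 64
    Dense(256, activation="relu"),
    Dropout(0.2),
    Dense(3, activation="softmax"),
])
gru_model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam", metrics=["accuracy"])
```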
2.4 Model result and comparison
Looking at the three deep models: for LSTM and GRU we use the glove.6B.100d word
vectors. Because these external word vectors were not trained on the essays used in
this project, their internal weights may not fit our data well. In addition, our
dataset is small, and we only performed a simple concatenation of text and text type;
we know that text type plays an important role in determining the argumentative
components of an essay, but simple concatenation may not express this well. These
factors contribute to the relatively poor performance of LSTM and GRU in this project.
From the perspective of model structure, BERT is ahead of LSTM (and GRU) because it
adopts a stack of Transformer blocks and introduces the self-attention mechanism. The
attention mechanism replaces the step-by-step temporal structure of the LSTM, the
amount of sequential computation is greatly reduced, and BERT is much faster for the
same computing power. Second, BERT applies attention to the context after time t as
well as before it, which means BERT can attend to the overall information of the text.
So in this competition we treat the whole essay as an additional feature and train it
together with the current sentence and the sentence category to achieve a better
result. From this point of view, the data volume of our BERT experiments is much larger
than that of LSTM (GRU), yet the difference in total training time stays within two
hours, which also reflects the speed advantage of BERT mentioned above.
Figures 11, 12 and 13 in the appendix show the performance of BERT, LSTM and GRU. As
can be seen from the first of these, BERT reaches the highest accuracy of 73%, while
the accuracy of LSTM and GRU is around 67%.
3. Conclusions, learning and future work
In this project we tested traditional machine learning algorithms and deep learning
algorithms, using cross-validation to select the optimal hyperparameters; this is the
work we carried out for the Kaggle project Feedback Prize - Predicting Effective
Arguments. After comparison, logistic regression performed best among the five
traditional machine learning models, with an accuracy of 66%. Among the deep learning
models, the performance of GRU and LSTM is comparable, with an accuracy of about 67%.
The BERT model, which introduces the attention mechanism, performs even better, with an
accuracy of about 73%. We therefore chose the BERT model as our final model. Our score
on the Kaggle leaderboard is 0.839.
When external word vectors are introduced, if they are not trained on the data of the
current task, LSTM and GRU may perform poorly; BERT may perform better still if better
features are extracted.
The dataset is not large, so the accuracy is not as good as we expected; for future
work we may train the models on a bigger dataset. Due to time limitations we could not
try more parameter settings, so in the future we may explore a wider range of
hyperparameters to improve the models. We can also look for more data to train
task-specific word vectors, which would improve the performance of the LSTM and GRU
models. We have not tried more complex model structures, and we will do so in the
future as circumstances allow. From data pre-processing we know that the discourse type
plays a big role in determining the effectiveness of argumentative components. We also
noticed that some identical texts behave differently in different essays, so the
features may be related to the essay id as well. Simply concatenating type and text
does not increase the relative weight of the type in the model; it may be forgotten or
filtered out, even though it is a very important feature when judging argumentative
components. In the future we will take these considerations into account to improve the
accuracy of our model.
Reference:
[1]. Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural computation,
1997, 9(8): 1735-1780.
[2]. https://medium.datadriveninvestor.com/how-do-lstm-networks-solve-the-
problem-of-vanishing-gradients-a6784971a577
[3]. https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
[4]. Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations
using RNN encoder-decoder for statistical machine translation[J]. arXiv preprint
arXiv:1406.1078, 2014.
[5]. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in
neural information processing systems, 2017, 30.
[6]. Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional
transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
Appendix
Figure 10 The tested values of hyperparameters with their corresponding best accuracy in all machine
learning algorithms. The values in red are the best parameters.
Figure 11 Result of BERT. The top left is the accuracy of the validation set, the top right is the loss of the
validation set, the bottom left is the accuracy of the training set, and the bottom right is the loss of the
training set.
Figure 12 Result of LSTM. The top left is the accuracy of the validation set, the top right is the loss of the
validation set, the bottom left is the accuracy of the training set, and the bottom right is the loss of the
training set.
Figure 13 Result of GRU. The top left is the accuracy of the validation set, the top right is the loss of the
validation set, the bottom left is the accuracy of the training set, and the bottom right is the loss of the
training set.