COMP30027 Report – Book Rating Predictions
Anonymous
1. Introduction
In today's highly connected world, the literary landscape has largely migrated from physical books to online platforms. These platforms provide a treasure trove of books for book lovers, and the reviews left by thousands of readers allow us to study and analyse important information such as book ratings, descriptions and publishers.
In recent years, machine learning techniques have been used to predict book ratings, which can help authors, publishers and marketers identify potential audiences and tailor marketing strategies to maximise reader engagement.
The aim of this report is to analyse different features, such as book titles, authors, descriptions and other attributes, and to build a supervised machine learning model that predicts the rating of a book. The author names, descriptions and titles are extracted for sentiment analysis. The project uses correlated attributes and sentiment analysis of the 'Text' field, which contains the book name, authors, description, publisher and the language of the book, to predict the book rating at one of three levels: 3, 4 or 5. The report trains classifiers using several techniques and analyses the results with respect to these attributes.
2. Methodology
2.1 Data Pre-processing
Different features are provided in the training and testing CSV files. Upon manual inspection, the data contains unwanted stop words, words in different languages, non-words and similar noise. To enhance the performance of the classifiers, the pre-processing steps described below were carried out.
2.1.1 Case-folding
The extracted raw data contains alphabetical features in both upper and lower case. In this step, all upper-case characters are converted to lower case.
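As a minimal sketch in Python, assuming the text fields are held in a pandas DataFrame with a hypothetical column named 'Text', case-folding can be done as follows:

import pandas as pd

def case_fold(df: pd.DataFrame, column: str = "Text") -> pd.DataFrame:
    # Convert every character in the text column to lower case.
    df[column] = df[column].astype(str).str.lower()
    return df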
2.1.2 Removing punctuation and
numbers
After case-folding, numerical values, punctuation marks and other non-ASCII symbols remain in the text. These characters convey no particular meaning in the data and are treated as low-value information, so they are removed.
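A minimal sketch of this step, assuming the text has already been case-folded:

import re

def strip_punctuation_and_digits(text: str) -> str:
    # Replace anything that is not a lower-case letter or whitespace
    # (digits, punctuation and non-ASCII symbols) with a space.
    return re.sub(r"[^a-z\s]", " ", text)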
2.1.3 Removing stop words
Common English stop words are removed, as these words do not convey any specific meaning. Removing words that carry little information reduces the dataset size, and training time decreases accordingly because fewer tokens are involved.
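A minimal sketch using NLTK's English stop-word list (nltk.download('stopwords') is assumed to have been run beforehand):

from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def remove_stop_words(text: str) -> str:
    # Keep only the tokens that are not in the English stop-word list.
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)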
2.1.4 Missing values
Upon inspection of the raw data, around 17,202 records do not specify the language the book is written in, and around 148 records do not provide a publisher. Filling these missing values with 'Unknown' ensures a complete dataset to work with, which prevents unnecessary errors and bias.
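A minimal sketch of this step; the file name and the column names 'Language' and 'Publisher' are assumptions for illustration:

import pandas as pd

train = pd.read_csv("book_rating_train.csv")  # assumed file name
train["Language"] = train["Language"].fillna("Unknown")
train["Publisher"] = train["Publisher"].fillna("Unknown")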
2.1.5 Lemmatizing
In this step, words are restored to their base form. Using the WordNetLemmatizer from the NLTK package, words in different tenses or forms are converted to a common base form, which reduces the complexity of the training process.
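A minimal sketch using NLTK's WordNetLemmatizer (nltk.download('wordnet') is assumed to have been run beforehand):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize(text: str) -> str:
    # Restore each whitespace-separated token to its base form.
    return " ".join(lemmatizer.lemmatize(tok) for tok in text.split())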
3. Feature Selection – Count
Vectorizer
After pre-processing the raw data, the CountVectorizer produced 23,063 distinct terms from the training dataset. The CountVectorizer converts the text into vectors of word counts and represents them as sparse matrices. To evaluate the models more reliably, the text data was split into training and testing parts using the train_test_split module from sklearn. SelectKBest was used as an additional feature-selection function for inspection purposes. The top 20 words found in the text show some similarity: the majority relate to spiritual and abstract topics such as 'bible', 'god', 'poems' and 'spiritual'.
Figure 1. Top 20 words relating to rating_label.
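A minimal sketch of the feature-extraction pipeline described above; the column names 'Text' and 'rating_label', the 80/20 split and the chi-squared scoring function are assumptions for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

# `train` is the pre-processed DataFrame from Section 2 (assumed loaded).
X_train, X_test, y_train, y_test = train_test_split(
    train["Text"], train["rating_label"], test_size=0.2, random_state=42)

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)  # sparse word-count matrix
X_test_counts = vectorizer.transform(X_test)

# Inspect the 20 terms most strongly associated with the rating label.
selector = SelectKBest(chi2, k=20).fit(X_train_counts, y_train)
top_terms = vectorizer.get_feature_names_out()[selector.get_support(indices=True)]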
4. Training
4.1 Baseline model – 0R
0R is one of the most well-known and simplest baseline models in machine learning. The algorithm always returns the most common class among all classes; it is extremely simple to implement and makes minimal assumptions, which makes it a reasonable baseline for other models to be compared against. 0R is implemented using the DummyClassifier from the sklearn library.
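A minimal sketch of the baseline, reusing the count-vectorised split from the Section 3 sketch:

from sklearn.dummy import DummyClassifier

zero_r = DummyClassifier(strategy="most_frequent")  # always predict the majority class
zero_r.fit(X_train_counts, y_train)
baseline_accuracy = zero_r.score(X_test_counts, y_test)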
4.2 Multinomial
Multinomial Naïve Bayes is a variant of the Naïve Bayes algorithm that is well suited to text classification and is commonly used for document classification and sentiment analysis. Its simplicity makes training and prediction fast, and it computes with great efficiency. Although the algorithm assumes that features are independent, it delivers relatively good and solid performance, and its accuracy is relatively high.
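A minimal sketch of the classifier on the word-count features from the Section 3 sketch:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

nb = MultinomialNB()
nb.fit(X_train_counts, y_train)
print(classification_report(y_test, nb.predict(X_test_counts)))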
4.3 Linear SVC
The Support Vector Machine is a type of supervised machine learning method (Sulistyono et al., 2021). The LinearSVC model is primarily used to classify instances into one of two classes based on the input data; for a multi-class task such as this one, sklearn's LinearSVC applies a one-vs-rest scheme by default.
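A minimal sketch of the classifier; C=1.0 is sklearn's default and max_iter=40000 mirrors the setting discussed in Section 6(c):

from sklearn.svm import LinearSVC

svc = LinearSVC(C=1.0, max_iter=40000)
svc.fit(X_train_counts, y_train)
svc_accuracy = svc.score(X_test_counts, y_test)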
5. Results
The heatmaps and classification reports for the different models are presented in Figures 2 to 8. The MultinomialNB classifier is the most effective and suitable: it achieves relatively high accuracy and precision across all rating classes. The baseline model, on the other hand, shows a high overall accuracy but never predicts books with rating 3 or 5, which makes its classification report less useful. Furthermore, MultinomialNB and LinearSVC tend to score highest on books with rating 4; this is discussed in a later section.
Figure 2. 0R heatmap
Figure 3. Classification report for 0R
Figure 4. MultinomialNB heatmap
Figure 5. Classification report for MultinomialNB
Figure 6. Validation curve for LinearSVC
Figure 7. LinearSVC heatmap
Figure 8. Classification report for LinearSVC
6. Critical Analysis
a) Pre-processing
From the pre-processing step, it is observed that quite a few records are written in languages other than English. Packages such as langdetect and googletrans were used to detect the exact language of each book description and translate it into English to make the dataset more complete. However, this method is not reliable enough to turn the data into useful information, and given how few such records there are, neglecting them has no major impact.
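A minimal sketch of this step, under the assumption that failed detections or translations simply keep the original text:

from langdetect import detect
from googletrans import Translator

translator = Translator()

def to_english(text: str) -> str:
    try:
        if detect(text) != "en":
            # Translate non-English descriptions into English.
            return translator.translate(text, dest="en").text
    except Exception:
        pass  # keep the original text if detection or translation fails
    return text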
b) Multinomial classifier Error
Analysis
The heatmap of the Naïve Bayes MultinomialNB classifier in Figure 4 shows that a large number of books that should be rated 3 are predicted as 4; upon inspection, there are around 709 such books. This is largely caused by the lack of data for rating labels 3 and 5 in comparison to the dominant rating 4 class.
c) Parameter Tuning
Parameters contribute differently to the LinearSVC model; here the maximum number of iterations was set to 40000. The default max_iter is 1000, which may raise a 'ConvergenceWarning' when the algorithm reaches the maximum number of iterations without meeting the convergence condition. Raising max_iter to 40000 gives the algorithm sufficient iterations to reach convergence without the warning.
Furthermore, a validation curve was drawn for the model as a graphical representation of its performance, as shown in Figure 6. The test accuracy is highest when C (on a log scale) is between 10^-4 and 10^-3, changes only slightly as C increases, and then drops drastically when C reaches 1.
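A minimal sketch of how the validation curve over C can be computed with sklearn; the grid of C values and 5-fold cross-validation are assumptions for illustration:

import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.svm import LinearSVC

param_range = np.logspace(-4, 1, 6)  # C from 10^-4 up to 10^1
train_scores, test_scores = validation_curve(
    LinearSVC(max_iter=40000), X_train_counts, y_train,
    param_name="C", param_range=param_range, cv=5)
mean_test_accuracy = test_scores.mean(axis=1)  # mean accuracy per value of C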
7. Conclusions
Many more machine learning algorithms could be used to predict book ratings from the name, authors and book descriptions. From all the observations of the results, MultinomialNB is the most suitable model in this case, as it is designed to handle tasks such as text classification and is commonly used for document classification and sentiment analysis. The baseline model, although achieving a high accuracy, only ever predicts the rating 4 class. The LinearSVC model works adequately as a classifier but is not as efficient at solving this classification task.
8. References
Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern Recognition and Machine Learning, volume 4. Springer.
Sulistyono, A., Mulyani, S., Yossy, E. H. and Khalida, R. (2021). Sentiment Analysis on Social Media (Twitter) about Vaccine-19 Using Support Vector Machine Algorithm. 2021 4th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI). IEEE.
Word count: 1353
