2021/4/17 Assignment 3 | COMP9321 21T1 | WebCMS3
https://webcms3.cse.unsw.edu.au/COMP9321/21T1/resources/59350 1/8
Assignment 3
Introduction
In this assignment you will use the provided Movie dataset and the machine learning algorithms you have
learned in this course to find out, knowing only things you could know before a film was released,
what the rating and revenue of the film would be. The rationale is that your client is a movie theater that
would like to decide how long they should reserve the movie theater to show a movie when it is released.
Datasets
In this assignment, you will be given two datasets: training.csv
(https://github.com/mysilver/COMP9321-Data-Services/raw/master/20t1/assign3/training.csv) and validation.csv
(https://github.com/mysilver/COMP9321-Data-Services/raw/master/20t1/assign3/validation.csv).
You can use the training dataset (but not the validation dataset) for training machine learning models, and
you can use the validation dataset to evaluate your solutions and avoid over-fitting.
Please Note:
This assignment specification is deliberately left open to encourage students to submit innovative
solutions.
You can only use Scikit-learn to train your machine learning algorithm
Your model will be evaluated against a third dataset (available for tutors, but not for students)
You must submit your code and a report
The due date is 21/04/2021 18:00
Part-I: Regression (10 Marks)
In the first part of the assignment, you are asked to predict the "revenue" of movies based on the information in
the provided dataset. More specifically, you need to predict the revenue of a movie based on a subset (or all)
of the following attributes (make sure you DO NOT use rating):
cast,crew,budget,genres,homepage,keywords,original_language,original_title,overview,production_companies,
production_countries,release_date,runtime,spoken_languages,status,tagline
Part-II: Classification (10 Marks)
Using the same datasets, you must predict the rating of a movie based on a subset (or all) of the following
attributes (make sure you DO NOT use revenue):
cast,crew,budget,genres,homepage,keywords,original_language,original_title,overview,production_companies,
production_countries,release_date,runtime,spoken_languages,status,tagline
Submission
You must submit two files:
A python script z{id}.py
A report named z{id}.pdf
Python Script and Expected Output files
Your code must be executed on CSE machines using the following command with two arguments:
$ python3 z{id}.py path1 path2
path1 : indicates the path for the dataset which should be used for training the model (e.g.,
~/training.csv)
path2 : indicates the path for the dataset which should be used for reporting the performance of the
trained model (e.g., ~/validation.csv); we may use different datasets for evaluation
For example, the following command will train your models for the first part of the assignment and use the
validation dataset to report the performance:
$ python3 YOUR_ZID.py training.csv validation.csv
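One way to read the two paths without hard-coding dataset names is sketched below. This is only an illustration: the function name parse_paths and the usage message are assumptions, not part of the specification.

```python
def parse_paths(argv):
    """Return (training_path, evaluation_path) taken from the command line."""
    # argv[0] is the script name, so two dataset paths give len(argv) == 3.
    if len(argv) != 3:
        raise SystemExit(f"usage: python3 {argv[0]} path1 path2")
    return argv[1], argv[2]


# In the real script this would be called as parse_paths(sys.argv);
# an explicit list is used here so the sketch runs on its own.
train_path, eval_path = parse_paths(["z1234567.py", "training.csv", "validation.csv"])
print(train_path, eval_path)
```

Because the paths come only from the command line, the same script works unchanged when a different evaluation dataset is supplied.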
Your program should create 4 files in the same directory as the script:
z{id}.PART1.summary.csv
z{id}.PART1.output.csv
z{id}.PART2.summary.csv
z{id}.PART2.output.csv
For the first part of the assignment:
" z{id}.PART1.summary.csv " contains the evaluation metrics (MSE, correlation) for the model trained in the
first part of the assignment. Use the given validation dataset to compute the metrics. The file should be
formatted exactly as follows:
zid,MSE,correlation
YOUR_ZID,6.13,0.73
MSE : the mean_squared_error in the regression problem
correlation : The Pearson correlation coefficient in the regression problem (a floating number
between -1 and 1)
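The two regression metrics above could be computed and written roughly as follows. This is a sketch under assumptions: the function names are hypothetical, NumPy is assumed to be permitted, and np.corrcoef is used for the Pearson coefficient because a tutor reply in the comments declines the use of scipy. Rounding to 2 decimal places follows the FAQ.

```python
import csv

import numpy as np
from sklearn.metrics import mean_squared_error


def regression_summary_row(zid, y_true, y_pred):
    """Return [zid, MSE, correlation] with the metrics rounded to 2 decimal places."""
    mse = mean_squared_error(y_true, y_pred)
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
    correlation = np.corrcoef(y_true, y_pred)[0, 1]
    return [zid, round(float(mse), 2), round(float(correlation), 2)]


def write_summary(path, header, row):
    """Write a header line and a single data row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(header)
        writer.writerow(row)
```

Note that the correlation is undefined (NaN) when every prediction is the same value, so a constant predictor would not produce a usable coefficient.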
" z{id}.PART1.output.csv " stores the predicted revenues for all of the movies in the evaluation dataset (not the
training dataset), and the file should be formatted exactly as:
movie_id,predicted_revenue
1,7655555
2,75875765
...
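A minimal sketch of writing a file in this two-column layout is shown below. The function name write_predictions is hypothetical, and sorting the rows by movie_id and rounding to whole numbers are assumptions made for the example, not requirements stated in the specification.

```python
import csv


def write_predictions(path, value_column, movie_ids, predictions):
    """Write one row per movie: movie_id and the predicted value."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["movie_id", value_column])
        # Assumption: rows are ordered by movie_id and values are whole numbers.
        for movie_id, prediction in sorted(zip(movie_ids, predictions)):
            writer.writerow([movie_id, int(round(prediction))])
```

The same helper can be reused for both parts by passing "predicted_revenue" or "predicted_rating" as value_column.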
For the second part of the assignment:
" z{id}.PART2.summary.csv " contains the evaluation metrics (average_precision, average_recall, accuracy -
the unweighted mean ) for the model trained in the second part of the assignment. Use the given validation
dataset to compute the metrics. The file should be formatted exactly as:
zid,average_precision,average_recall,accuracy
YOUR_ZID,0.69,0.71,0.89
average_precision : the average precision for all classes in the classification problem (a number
between 0 and 1)
average_recall : the average recall for all classes in the classification problem (a number between 0
and 1)
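The classification metrics could be computed with scikit-learn as sketched below, using the unweighted ('macro') mean for precision and recall. The function name is hypothetical, and zero_division=0 is an assumption chosen so that a class that is never predicted contributes 0 rather than triggering a warning.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score


def classification_summary_row(zid, y_true, y_pred):
    """Return [zid, average_precision, average_recall, accuracy], rounded to 2 places."""
    # average="macro" takes the unweighted mean of the per-class scores.
    precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
    recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
    accuracy = accuracy_score(y_true, y_pred)
    return [zid, round(float(precision), 2), round(float(recall), 2), round(float(accuracy), 2)]


print(classification_summary_row("z1234567", [1, 1, 2, 2], [1, 2, 2, 2]))
```

In this toy example class 1 has precision 1.0 and recall 0.5, class 2 has precision 2/3 and recall 1.0, so the macro averages are 0.83 and 0.75.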
" z{id}.PART2.output.csv " stores the predicted ratings for all of the movies in the evaluation dataset (not the
training dataset) and it should be formatted exactly as follows:
movie_id,predicted_rating
1,1
2,4
...
Marking Criteria
For EACH of the parts, you will be marked based on:
(3 marks) Your code must run and perform the designated tasks on CSE machines without problems
and create the expected files.
(3 marks) How well your model (trained on the training dataset) performs on the test dataset
(2 marks) You must correctly calculate the evaluation metrics (e.g., average_precision - 2 decimal
places ) in the output files (e.g., z{id}.PART2.summary.csv)
(2 marks) One page report containing:
Performance of your model on the validation dataset and how you evaluated the performance and
improved it (e.g., relying on feature selection, switching from one machine learning model to a
more suitable one,...etc.)
Problems you have faced in predicting (e.g., JSON formatted columns, keywords, missing data)
and how you tried to solve the problems.
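As an illustration of handling the JSON formatted columns mentioned above, the sketch below pulls one field out of each object in a cell. It assumes that a cell holds a JSON list of objects with a "name" key (for example a genres cell); that layout, the function name, and the choice to return an empty list for unusable cells are all assumptions. The standard json module is used because a tutor reply in the comments declines the use of ast.

```python
import json


def extract_names(cell, key="name"):
    """Return the `key` values from a JSON-encoded list of objects; [] when unusable."""
    if not isinstance(cell, str):
        return []  # missing value, e.g. NaN read from the CSV
    try:
        items = json.loads(cell)
    except json.JSONDecodeError:
        return []  # malformed cell: fall back to "no values"
    if not isinstance(items, list):
        return []
    return [item[key] for item in items if isinstance(item, dict) and key in item]


print(extract_names('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'))
```

Returning an empty list instead of raising keeps the script from failing on a dataset whose cells are missing or malformed.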
The minimum Pearson correlation coefficient value in the regression model is 0.3 in the test dataset (not
validation). As listed above, you will be marked on different aspects (e.g., report); and your submission
will be compared to the rest of the students to adjust marks and be fair to all. Do your best in improving
your models and make sure you do not overfit because you will be marked based on a third dataset,
called "test dataset". In the classification problem, your accuracy should be more than a baseline. The
baseline model labels all movies with the most frequent class (e.g., assuming all movie ratings are 3).
You will be penalized if your models take more than 3 minutes to train and generate output.
Your assignment will not be marked (zero marks) if any of the following occur:
If it generates hard-coded predictions
If it also uses the second dataset (test/validation) to train the model
If it does not run on CSE machines with the given command (e.g., python3 zid.py
training_dataset.csv test_dataset.csv)
Do NOT hard-code the dataset names
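The classification baseline described above (labelling every movie with the most frequent class) can be reproduced for comparison with scikit-learn's DummyClassifier. This is a sketch; the function name and the placeholder feature column are assumptions for illustration only.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score


def baseline_accuracy(y_train, y_eval):
    """Accuracy obtained by always predicting the most frequent training class."""
    baseline = DummyClassifier(strategy="most_frequent")
    # DummyClassifier ignores the features, so a placeholder column is enough.
    baseline.fit([[0]] * len(y_train), y_train)
    predictions = baseline.predict([[0]] * len(y_eval))
    return accuracy_score(y_eval, predictions)


print(baseline_accuracy([3, 3, 3, 2, 4], [3, 2, 3, 3]))
```

Comparing your model's accuracy on the validation dataset against this number gives a quick indication of whether it beats the baseline.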
FAQ
Can we define our own feature set?
Yes, you can define any features; make sure your features do not rely on the validation (or test) datasets
Resource created 20 days ago (Sunday 28 March 2021, 07:53:59 AM), last modified 8 days ago (Friday 09 April 2021, 08:38:22
AM).
What is the difference between validation and test datasets?
The validation dataset is provided for you to tune your models; the test dataset will not be provided to
students, instead, it will be used to evaluate your model.
For the average precision/recall functions, should we use the unweighted ('macro') mean or the
weighted mean?
Use the unweighted ('macro') mean.
Should we calculate metrics to 1 Decimal Place?
2 Decimal Places
Can we use any machine learning algorithm?
Yes, as long as it is provided in sklearn.
What python modules can we use for developing our solutions?
You can use any modules presented in the lab activities; if a module is not in the labs, you may get
permission by asking ...
How should we calculate the Pearson correlation coefficient?
It is calculated between your predictions and the real values for the validation (or test) dataset.
Plagiarism
This is an individual assignment. The work you submit must be your own work. Submission of work partially or
completely derived from any other person or jointly written with any other person is not permitted. The
penalties for such offense may include negative marks, automatic failure of the course, and possibly other
academic disciplines. Assignment submissions will be checked using plagiarism detection tools for both code
and the report and then the submission will be examined manually.
Do not provide or show your assignment work to any other person - apart from the teaching staff of this course.
If you knowingly provide or show your assignment work to another person for any reason, and work derived
from it is submitted, you may be penalized, even if the work was submitted without your knowledge or consent.
Note that it is also your duty to protect your code artifacts. If you are using an online solution to
store your code artifacts (e.g., GitHub), make sure to keep the repository private and do not share access
with anyone.
Reminder: Plagiarism is defined as (https://student.unsw.edu.au/plagiarism) using the words or ideas of others
and presenting them as your own. UNSW and CSE treat plagiarism as academic misconduct, which means
that it carries penalties as severe as being excluded from further study at UNSW. There are several online
sources to help you understand what plagiarism is and how it is dealt with at UNSW:
Plagiarism and Academic Integrity (https://student.unsw.edu.au/plagiarism)
UNSW Plagiarism Procedure (https://www.gs.unsw.edu.au/policy/documents/plagiarismprocedure.pdf)
Make sure that you read and understand this. Ignorance is not accepted as an excuse for plagiarism. In
particular, you are also responsible for ensuring that your assignment files are not accessible by anyone but
you by setting the correct permissions in your CSE directory and code repository, if using one (e.g., Github and
similar). Note also that plagiarism includes paying or asking another person to do a piece of work for you and
then submitting it as your own work.
UNSW has an ongoing commitment to fostering a culture of learning informed by academic integrity. All UNSW
staff and students have a responsibility to adhere to this principle of academic integrity. Plagiarism undermines
academic integrity and is not tolerated at UNSW.
Comments
Chengbin Zhang (/users/z5252388) about 4 hours ago (Sat Apr 17 2021 03:02:59 GMT+0800 (China Standard Time))
Hi,
Just wondering can we assume that the data you used to run our code is similar to the data
you give. Like, for example, in training.csv only homepage & tagline contains empty values,
so can we assume that in the data you use, only homepage and tagline contains null values
and other attributes are clean ( no empty values or wrong values, like 2090-12-21)
Thanks
Mohammadali Yaghoubzadehfard (/users/z5138589) 6 minutes from now (Sat Apr 17 2021 07:13:30 GMT+0800 (China Standard Time))
Hi,
Please consider a constant value for each column in these cases, to make sure your
code does not throw exceptions in any case.
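One way to apply a constant value per column, as suggested in the reply above, is sketched below with pandas. The column names come from the attribute list in the specification, but the default values and the function name are placeholders chosen for illustration, not recommendations.

```python
import pandas as pd

# Hypothetical per-column defaults; the values are placeholders only.
DEFAULTS = {"homepage": "", "tagline": "", "runtime": 0.0, "budget": 0.0}


def fill_missing(frame, defaults=DEFAULTS):
    """Return a copy of `frame` with missing cells replaced by a constant per column."""
    present = {column: value for column, value in defaults.items() if column in frame.columns}
    if not present:
        return frame.copy()
    # fillna with a dict fills each listed column with its own constant.
    return frame.fillna(value=present)
```

Since the constants are fixed in advance, they do not depend on the validation or test data.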
Austin Vuong (/users/z5205456) about 10 hours ago (Fri Apr 16 2021 21:37:03 GMT+0800 (China Standard Time))
How should we handle nan values?
Mohammadali Yaghoubzadehfard (/users/z5138589) 7 minutes from now (Sat Apr 17 2021 07:14:43 GMT+0800 (China Standard Time))
It is up to you
Stefan Yin (/users/z5230358) about 10 hours ago (Fri Apr 16 2021 21:02:36 GMT+0800 (China Standard Time))
Hi, just want to make sure that modules like
sklearn.ensemble.RandomForestRegressor
(https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html?highlight=randomforestregressor#sklearn.ensemble.RandomForestRegressor)
sklearn.ensemble.GradientBoostingRegressor
(https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html?highlight=gradientboostingregressor#sklearn.ensemble.GradientBoostingRegressor)
are allowed.
Mohammadali Yaghoubzadehfard (/users/z5138589) 7 minutes from now (Sat Apr 17 2021 07:14:59 GMT+0800 (China Standard Time))
Yes, they are fine
Austin Vuong (/users/z5205456) about 11 hours ago (Fri Apr 16 2021 19:43:55 GMT+0800 (China Standard Time))
Please correct me if I'm wrong. If we are creating new features extracted from other columns
in the training dataset, can we do so in the validation set as well in the same way we did for
the training set? Since we have to put in the independent variables for the validation set when
we predict? This is for q1 regression.
Mohammadali Yaghoubzadehfard (/users/z5138589) 10 minutes from now (Sat Apr 17 2021 07:17:29 GMT+0800 (China Standard Time))
You cannot analyse the validation set and rely on it for selecting or tuning anything. You
should build your model, then apply any required preprocessing individually to each
sample of the validation dataset, and predict.
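The reply above can be illustrated with a scikit-learn pipeline, where preprocessing is fitted on the training rows only and then merely applied to the evaluation rows. This is a sketch: StandardScaler and LinearRegression are placeholder choices, and the function name is hypothetical.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_predict(X_train, y_train, X_eval):
    """Fit preprocessing and model on training data, then predict for evaluation rows."""
    model = make_pipeline(StandardScaler(), LinearRegression())
    # The scaler's mean and variance are learned from X_train only.
    model.fit(X_train, y_train)
    # At prediction time the stored statistics are applied to X_eval unchanged.
    return model.predict(X_eval)


print(fit_and_predict([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0], [[4.0]]))
```

Keeping every fitted step inside one pipeline makes it harder to accidentally compute statistics from the validation or test data.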
Kan-Lin Lu (/users/z3417618) about 12 hours ago (Fri Apr 16 2021 18:48:50 GMT+0800 (China Standard Time)), last modified about 12 hours ago (Fri Apr 16 2021 19:07:59 GMT+0800 (China Standard Time))
Is it considered data leakage if we use the Y value in selecting the features? Or overfitting?
Kan-Lin Lu (/users/z3417618) about 13 hours ago (Fri Apr 16 2021 18:20:50 GMT+0800 (China Standard Time))
Can we import category_encoders library?
Mohammadali Yaghoubzadehfard (/users/z5138589) about 13 hours ago (Fri Apr 16 2021 18:35:01 GMT+0800 (China Standard Time))
Yes
Ahmad El Majzoub (/users/z5292964) about 13 hours ago (Fri Apr 16 2021 17:49:37 GMT+0800 (China Standard Time)), last modified about 13 hours ago (Fri Apr 16 2021 18:36:55 GMT+0800 (China Standard Time))
Hello,
Do we need to sort the output CSV by movie id or rating/revenues? or keep the order as it is.
Thanks
Mohammadali Yaghoubzadehfard (/users/z5138589) about 13 hours ago (Fri Apr 16 2021 18:36:25 GMT+0800 (China Standard Time))
Hi,
It is sorted by id. sklearn should give you what you need
Ahmad El Majzoub (/users/z5292964) about 13 hours ago (Fri Apr 16 2021 18:28:48 GMT+0800 (China Standard Time))
Also, what format do you want for the revenues? also 2 decimal place? In the specs, it
only mentions for the metrics
Mohammadali Yaghoubzadehfard (/users/z5138589) about 13 hours ago (Fri Apr 16 2021 18:37:16 GMT+0800 (China Standard Time))
It is up to you
Xiaolong Li (/users/z5155298) about 15 hours ago (Fri Apr 16 2021 15:53:49 GMT+0800 (China Standard Time))
Hi, could we use scipy and ast library?
Mohammadali Yaghoubzadehfard (/users/z5138589) about 15 hours ago (Fri Apr 16 2021 16:05:37 GMT+0800 (China Standard Time))
Hi, No
Yukun Yin (/users/z5199930) about 14 hours ago (Fri Apr 16 2021 16:54:08 GMT+0800 (China Standard Time))
How about using ast in ASS2?
how to mark?
Mengfei Wu (/users/z5268735) about 17 hours ago (Fri Apr 16 2021 14:34:13 GMT+0800 (China Standard Time))
Hi, is the 3 min running time restriction only limited to running time of model algorithm or also
including data cleansing, preprocessing and data exploration?
Thx
Mohammadali Yaghoubzadehfard (/users/z5138589) about 15 hours ago (Fri Apr 16 2021 16:05:53 GMT+0800 (China Standard Time))
Hi, All together
Aiden Lee (/users/z5291420) about 17 hours ago (Fri Apr 16 2021 14:24:02 GMT+0800 (China Standard Time))
Hi, is one page rule strict?
It would be just 1.5 page or 2 pages as it should contain all below contents...
(2 marks) One page report containing:
Performance of your model on the validation dataset and how you evaluated the
performance and improved it (e.g., relying on feature selection, switching from one machine
learning model to a more suitable one,...etc.)
Problems you have faced in predicting (e.g., JSON formatted columns, keywords, missing
data) and how you tried to solve the problems.
Mohammadali Yaghoubzadehfard (/users/z5138589) about 15 hours ago (Fri Apr 16 2021 16:07:06 GMT+0800 (China Standard Time))
Just try to keep it brief, informative, useful, and to-the-point