Python assignment writing service - MATH5836 | 学霸联盟

Date: 2021-11-29

Sample - Assessment 4 (Final Exam)
Introduction
MATH5836 Final Examination - Practice (3 Hours)
Instructions
There are three parts to this examination:
Part A (Quiz): 10 marks
Part B (Short answers): 24 marks
Part C (Optional Questions): 16 marks
All answers must be submitted online using the provided instructions in the respective
questions.
Answer all the questions in Parts A and B, and one question in Part C.
Questions may not be worth equal marks.
Questions may be answered in any order.
Ensure you submit all answers.
R users are free to use their laptops in case Ed has any missing libraries.
Note that since this is an open book exam, plagiarism rules will apply if you refer to online
sources without proper referencing.
Note that you need to get a minimum of 40% in the final exam to pass the course; i.e., you will
need to get at least 20/50 marks to pass the course.
Part A: Online quiz (10 marks)
You need to answer 10 multiple choice questions in this section. Each question is worth 1 mark.
All the questions are compulsory.
Question 1
Which activation function in the output layer of a neural network would be most suited for a
multiclass classification problem?
Softmax
ReLU
Hyperbolic tangent
Linear
None of the above
Question 2
Which of the following statements is correct?
Adam is generally faster than SGD.
Achieving excellent training performance on the training dataset implies that you have an
excellent model.
It is best to randomly assign the number of hidden neurons irrespective of the dataset.
Keras employs scikit-learn in its core framework.
None of the above
Question 3
What would be a major difference between the role of a data scientist and a data engineer?
They do not have any differences in roles at major companies.
Data scientists typically use machine learning models to develop solutions and compile
reports while data engineers work with databases/datasets to organise, process and
visualise data.
Data engineers are database managers and data scientists are programmers.
They both do similar work, but data scientists present mostly while data engineers develop
models.
None of the above.
Question 4
What would be the best model for a highly non-linear and chaotic time series prediction problem?
Linear regression model
Logistic regression model
Neural network model with sigmoid activation function in output layer
Neural network model with linear activation function in output layer
All of the above
Question 5
Given the ROC curve and AUC (0.7) in the figure below, which of the following statements is true?
An AUC of 0.7 implies that there is a 70% chance that the model will distinguish between positive
and negative classes.
The AUC shows that the model is perfectly able to distinguish between positive and negative classes.
The AUC shows that the model is predicting the negative class as the positive class and vice versa.
None of the above
Question 6
Assume you trained a decision tree model that offers very low training error but very large test error.
Select the most likely cause of this problem.
The decision tree is overfitting.
The decision tree is underfitting.
There is too little training data.
There is too little test data.
Question 7
Which one of the following statements is true?
In bagging, models are trained sequentially, and the aim is to reduce errors in every
subsequent step.
In boosting, models are trained in parallel independent of each other and the outcomes are
combined.
In stacking, models are trained in parallel independent of each other and the outcomes are
combined.
In bagging, models are trained in parallel independent of each other and the outcomes are
combined.
None of the above.
Question 8
Suppose you want to cluster the following data set into two clusters. Which one of the following
algorithms is the most suitable for your task?
K-Means Algorithm
DBSCAN Algorithm
Agglomerative Clustering Algorithm
Random Forest Algorithm
Question 9
Which one of the following sentences is correct?
Model-based collaborative filtering uses descriptions of items for recommendations, and is
similar to Amazon-style recommender systems.
Collaborative filtering works well even with very limited past recommendations.
Memory-based collaborative filtering uses descriptions of items for recommendations, and
is similar to Amazon-style recommender systems.
Model-based collaborative filtering uses well-understood techniques from information
retrieval.
None of the above.
Question 10
Which one of the following statements is not true about Principal Component Analysis (PCA)?
PCA is an unsupervised method.
PCA searches for the directions in which the data have the smallest variance.
Maximum number of principal components <= number of features.
All principal components are orthogonal to each other.
Part B: Short answer questions (24 marks)
Please provide brief answers to the eight questions (next six slides) in this section and submit them.
All the questions are compulsory.
Part B: Q1 (3 marks)
If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth? Briefly
explain your answer.
Type your response in the Challenge workspace (in the file answer.txt) and then click on the Submit
button at the bottom right of the screen.
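As an illustration of the setting in this question, the following minimal sketch compares an unrestricted tree with a depth-limited one. It assumes scikit-learn is available and uses synthetic, noisy data from make_classification rather than any course dataset; the depth value 3 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (an assumption; not an exam dataset).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each depth setting.
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, "
          f"test={results[depth][1]:.3f}")
```

The unrestricted tree memorises the noisy training labels, while the shallow tree cannot; comparing the two train/test gaps is one way to reason about the question.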
Part B: Q2 (3 marks)
Briefly explain the most important difference between the AdaBoost and the Gradient Boosting
methods.
Type your response in the Challenge workspace (in the file answer.txt) and then click on the Submit
button at the bottom right of the screen.
Part B: Q3 (3 marks)
Briefly describe two techniques to select the right number of clusters when using K-Means. (1.5
marks)
In the content for the course, what methods would be most suitable for class imbalance problems
and why? (1.5 marks)
Type your response in the Challenge workspace (in the file answer.txt) and then click on the Submit
button at the bottom right of the screen.
Part B: Q4 (3 marks)
In a multi-layer perceptron, does increasing the number of hidden layers improve performance? (1
Mark)
Explain your answer with reference to any dataset example from lessons or assignment. (2 Marks)
Type your response in the Challenge workspace (in the file answer.txt) and then click on the Submit
button at the bottom right of the screen.
Part B: Q5 (3 marks)
Explain the key differences between neural collaborative filtering and matrix factorisation for
recommender systems. (2 Marks)
Which would be better for YouTube and why? (1 Mark)
Type your response in the Challenge workspace (in the file answer.txt) and then click on the Submit
button at the bottom right of the screen.
Part B: Q6 (3 marks)
What are the major similarities and differences between Adam and AdaGrad? (1 Mark)
If given a regression or classification problem, which one would perform better and why? (1 Mark)
How would you evaluate them? (1 Mark)
Type your response in the Challenge workspace (in the file answer.txt) and then click on the Submit
button at the bottom right of the screen.
Part B: Q7 (3 marks)
Explain what advantages Langevin MCMC has over the random-walk MCMC method. (1.5 Marks)
Why does the Q-ratio in Langevin MCMC not cancel out and what is done to ensure that the detailed
balance condition holds? (1.5 Marks)
Type your response in the Challenge workspace (in the file answer.txt) and then click on the Submit
button at the bottom right of the screen.
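For orientation only, here is a minimal sketch of a Langevin-style (Metropolis-adjusted) sampler on a one-dimensional standard normal target, assuming NumPy. The target, the step size of 0.9, the starting point and the helper names (log_target, grad_log_target, log_q) are illustrative assumptions, not the course implementation. The proposal mean is shifted by the gradient, so the forward and reverse proposal densities differ and both appear in the acceptance ratio.

```python
import numpy as np

def log_target(theta):
    # Log density of a standard normal, up to an additive constant.
    return -0.5 * theta ** 2

def grad_log_target(theta):
    return -theta

def log_q(to, frm, step):
    # Log density (up to a constant) of proposing `to` when standing at `frm`.
    mean = frm + 0.5 * step ** 2 * grad_log_target(frm)
    return -0.5 * ((to - mean) / step) ** 2

rng = np.random.default_rng(0)
step, theta = 0.9, 3.0
samples, accepted = [], 0

for _ in range(6000):
    # Gradient-informed proposal: drift towards higher density plus noise.
    mean = theta + 0.5 * step ** 2 * grad_log_target(theta)
    proposal = mean + step * rng.standard_normal()
    # The proposal is not symmetric, so the q terms stay in the ratio.
    log_alpha = (log_target(proposal) + log_q(theta, proposal, step)
                 - log_target(theta) - log_q(proposal, theta, step))
    if np.log(rng.uniform()) < log_alpha:
        theta = proposal
        accepted += 1
    samples.append(theta)

samples = np.array(samples[1000:])  # discard burn-in
print("mean:", samples.mean(), "variance:", samples.var(),
      "acceptance rate:", accepted / 6000)
```

Setting the gradient term to zero makes the proposal symmetric, which recovers a random-walk sampler in which the two log_q terms cancel.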
Part B: Q8 (3 marks)
Explain the difference between SVD and Matrix Factorization for recommender systems. Why is SVD
not used?
Type your response in the Challenge workspace (in the file answer.txt) and then click on the Submit
button at the bottom right of the screen.
Part C: Practical Question (16 marks)
For Part C, you have the option of doing one of the three questions.
Part C: Option 1 (16 marks)
Ensure you do Part C: 1A and 1B.
Part C: Option 1A (10 marks)
For the following tasks, you need to write Python code in the file answer.py and submit your
solution.
Note: the dataset here is quite large and may not run on Ed. In your actual exam, you will be given
a dataset with fewer than 50 features and fewer than 2000 instances, which will run easily on Ed.
You can use your laptop during the exam. If your code does not run on Ed, you can run it on your laptop and
upload it with a screenshot and the outputs. No marks will be deducted.
Task-1:
Train a Random Forest classifier on the data set loaded in the given file answer.py and then evaluate
the resulting model on the test set.
You need to explore the possible alternative approaches before selecting the most appropriate model
for the data set.
In your comments, provide brief justifications, with clearly articulated reasons, for the alternatives
you explored to build the model you submitted.
Your best model must be saved in the variable named best_model_task1.
Task-2:
Next, use PCA to reduce the data set’s dimensionality, with an explained variance ratio of 95%. Train
a new Random Forest classifier on the reduced data set and evaluate the classifier on the test set.
You need to explore the possible alternative approaches before selecting the most appropriate
model for the data set.
In your comments, provide brief justifications, with clearly articulated reasons, for the alternatives
you explored to build the model you submitted.
Your best model must be saved in the variable named best_model_task2.
Task-3:
Compare the performance of the above two models, and briefly explain the difference in your
comments.
How to submit
Type your solution (Python code and comments) in the Challenge workspace (in the file answer.py)
and then click on the Submit button at the bottom right of the screen.
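The three tasks above could be approached along the following lines. This is a minimal sketch, assuming scikit-learn and using its bundled digits data as a stand-in, since the exam's answer.py and dataset are not shown here. The small parameter grid, the 80/20 split and the names search1, search2, acc_task1 and acc_task2 are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Stand-in data (assumption): the digits set bundled with scikit-learn.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# A deliberately small grid of alternatives; a real search would be wider.
param_grid = {"rf__n_estimators": [50, 100], "rf__max_depth": [None, 10]}

# Task 1: Random Forest on the original features, tuned by cross-validation.
search1 = GridSearchCV(
    Pipeline([("rf", RandomForestClassifier(random_state=42))]),
    param_grid, cv=3)
search1.fit(X_train, y_train)
best_model_task1 = search1.best_estimator_
acc_task1 = best_model_task1.score(X_test, y_test)

# Task 2: PCA keeping 95% of the variance, followed by a Random Forest.
# Keeping PCA inside the pipeline fits it on training folds only.
search2 = GridSearchCV(
    Pipeline([("pca", PCA(n_components=0.95)),
              ("rf", RandomForestClassifier(random_state=42))]),
    param_grid, cv=3)
search2.fit(X_train, y_train)
best_model_task2 = search2.best_estimator_
acc_task2 = best_model_task2.score(X_test, y_test)
n_components = best_model_task2.named_steps["pca"].n_components_

# Task 3: compare the two models on the same held-out test set.
print(f"Task 1 test accuracy: {acc_task1:.3f} (all {X.shape[1]} features)")
print(f"Task 2 test accuracy: {acc_task2:.3f} ({n_components} components)")
```

Passing a float between 0 and 1 as n_components asks PCA to keep the smallest number of components whose cumulative explained variance ratio reaches that fraction.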
Part C: Option 1B (6 marks)
For the following tasks, you need to write Python code in the file answer.py and submit your
solution.
The starter code in the file answer.py loads the dataset, and then splits it into a training set, a
validation set, and a test set.
Your task is to cluster the dataset using K-Means. You need to use silhouette scores to select a
suitable number of clusters and store that value in the variable named best_k and the
corresponding model should be stored in the variable named best_model .
In your comments, provide brief justifications, with clearly articulated reasons, for the alternatives
you explored to build the model you submitted.
How to submit
Type your solution (Python code and comments) in the Challenge workspace (in the file answer.py)
and then click on the Submit button at the bottom right of the screen.
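Selecting the number of clusters by silhouette score can be sketched as follows. This assumes scikit-learn and uses synthetic blobs with four well-separated centres in place of the exam's starter data; the candidate range of 2 to 8 clusters is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): four well-separated 2-D blobs.
centers = [(-6, -6), (-6, 6), (6, -6), (6, 6)]
X_train, _ = make_blobs(n_samples=400, centers=centers,
                        cluster_std=1.0, random_state=0)

best_k, best_score, best_model = None, -1.0, None
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    score = silhouette_score(X_train, model.labels_)
    print(f"k={k}: silhouette={score:.3f}")
    # Keep the model whose clusters are most compact and best separated.
    if score > best_score:
        best_k, best_score, best_model = k, score, model

print("best_k =", best_k)
```

A silhouette score close to 1 indicates compact, well-separated clusters, so the candidate with the highest score is kept in best_k and best_model.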
Part C: Option 2 (16 marks)
Part C: Option 2A (6 marks)
Data processing for machine learning: Given the attached dataset, use either R or Python with the
needed libraries to process the data. Normalise the input features between 0 and 1 and use
one-hot encoding for the target classes.
Type your response in the Challenge workspace (in the file process.r or process.py) and then click on
the Submit button at the bottom right of the screen.
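The two processing steps can be sketched with NumPy alone. The tiny feature matrix and the class labels below are made-up values for illustration, not the attached dataset.

```python
import numpy as np

# Made-up example data (assumption): 3 instances, 2 features, string classes.
X = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 20.0]])
y = np.array(["cat", "dog", "cat"])

# Min-max normalisation: map each feature column onto the range [0, 1].
x_min, x_max = X.min(axis=0), X.max(axis=0)
span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant columns
X_scaled = (X - x_min) / span

# One-hot encoding: one column per class, with a single 1 in each row.
classes, class_index = np.unique(y, return_inverse=True)
y_onehot = np.eye(len(classes))[class_index]

print(X_scaled)
print(classes)
print(y_onehot)
```

When the data are split into training and test sets, the minimum and maximum are usually computed on the training portion only and then reused for the test portion.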
Part C: Option 2B (10 marks)
Machine learning using neural networks (Note that the code and examples from Week 3 and
Week 4 would be available for you to tackle this problem)
You can use either R or Python with needed libraries for this task.
Using the processed data from the previous step, train a neural network with a 60/40 percent train/test
split.
1. Using Adam/SGD, report performance on the training and test datasets (percentage correctly
classified and RMSE).
2. Provide a ROC and AUC graph.
3. Write a paragraph to interpret your results.
Submit your code (in the file nnmodel.r or nnmodel.py) together with answer.txt and then click on the Submit button
at the bottom right of the screen.
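One possible shape for this task is sketched below, assuming scikit-learn's MLPClassifier in place of the Week 3 and Week 4 code, which is not reproduced here. The binary synthetic data, the single hidden layer of 16 units and the learning rate are illustrative assumptions. Changing solver="adam" to solver="sgd" gives the SGD variant.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary stand-in data (assumption; not the exam dataset).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0, stratify=y)

# Scale features to [0, 1] using statistics from the training split only.
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = MLPClassifier(hidden_layer_sizes=(16,), solver="adam",
                      learning_rate_init=0.01, max_iter=1000, random_state=0)
model.fit(X_train, y_train)

def report(X_part, y_part):
    """Return (percent correctly classified, RMSE, positive-class scores)."""
    prob = model.predict_proba(X_part)[:, 1]
    acc = 100.0 * accuracy_score(y_part, model.predict(X_part))
    rmse = float(np.sqrt(np.mean((prob - y_part) ** 2)))
    return acc, rmse, prob

train_acc, train_rmse, _ = report(X_train, y_train)
test_acc, test_rmse, test_prob = report(X_test, y_test)

# Points of the ROC curve and the area under it, from the test-set scores.
fpr, tpr, _ = roc_curve(y_test, test_prob)
auc = roc_auc_score(y_test, test_prob)

print(f"train: {train_acc:.1f}% correct, RMSE {train_rmse:.3f}")
print(f"test:  {test_acc:.1f}% correct, RMSE {test_rmse:.3f}, AUC {auc:.3f}")
```

Plotting fpr against tpr with any plotting library gives the ROC graph. For a multiclass target, one ROC curve per class against the rest is a common choice.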
Part C: Option 3 (16 marks)
Part C: Option 3A (6 marks)
Programming question
For the following tasks, you need to write working Python (or R) code, along with the required
comments, in the file answer.py (or answer.r) and submit your solution.
R users are free to use their laptops in case Ed has any missing libraries. You must upload
your outputs in comments.txt and explain.
Analyse the given dataset and report any important features you discover in it. Use AdaBoost with a
60/40 train test split (first 60 percent for training).
In your comments, interpret your results. You can use additional classi�ers if you wish to provide a
comparison.
Dataset: https://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset.
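The split and model described above can be sketched as follows, assuming scikit-learn. Synthetic data stands in for the UCI dataset, which would first need to be downloaded and its categorical columns encoded numerically. The 100 estimators and the choice to list five features are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data (assumption): the real dataset is not loaded here.
X, y = make_classification(n_samples=500, n_features=16, n_informative=6,
                           random_state=1)

# First 60 percent of the rows for training, the remaining 40 percent for testing.
n_train = int(0.6 * len(X))
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

model = AdaBoostClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# Rank features by the importance the boosted ensemble assigns to them.
top_features = np.argsort(model.feature_importances_)[::-1][:5]

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print("most important feature indices:", top_features)
```

Because the split takes the first 60 percent of rows rather than a random sample, it is worth checking that the file is not ordered by class before relying on the result.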
Part C: Option 3B (5 marks)
Note that since this is an open book exam, plagiarism rules will apply if you refer to online
sources and copy, without proper referencing.
Machine learning project scoping
Suppose that Australian scientists have discovered a system that can determine if someone has an
infectious disease from a 3D scan of the face. Scientists need a system that automatically processes the
scan and then a machine learning algorithm decides if it is a positive case.
1. Highlight the major components you would have in the system. (1 mark)
2. Discuss the key components that will require machine learning and which machine learning
methods you will use. (1.5 marks)
3. Discuss how you will create or use existing data for the training set. (1 mark)
4. Discuss how you will test your machine learning component. (1 mark)
5. How would you extend it further? (1 mark)
Part C: Option 3C (5 marks)
Ethics in AI and data science
Suppose that the government rolls out an app that can be installed on computers to assist video
surveillance using AI to monitor people with past COVID-19 infections. The software will be used in
shops and public places.
1. What privacy and ethical issues need to be considered during implementation? (2.5 Marks)
2. List the advantages and disadvantages of such a system (2.5 Marks)
Note that since this is an open book exam, plagiarism rules will apply if you refer to online
sources and copy, without proper referencing.