COMP9417 Project: Multitask Machine Learning (MML)
March 4, 2024
Project Description
As a Data Scientist at Predictive Solutions Inc., you have become comfortable working with any type of dataset that
comes your way. Your newest client, a medical researcher in an undisclosed branch at the local hospital, is interested
in utilizing machine learning to understand data obtained from a recent clinical trial they conducted. In this particular
dataset, there are n = 1000 observations and p = 111 features. To ensure privacy of patient data, the features have been
anonymized (that is, the features are generically labelled X1, X2, . . .). The features are a mix of binary, categorical
and continuous-valued data which contain information about each patient. In this problem, the outcome is
multivariate, which means that there are multiple target variables to predict as opposed to the usual case in which we
have a single target variable. Each target is a specific medical condition. This sort of problem is known as Multitask
Learning. The data will be released on March 25, 2024.
Description of the Data
The client has provided you with the following data sets in numpy.array format: X_train, X_test, Y_train. You will
need to use best practices to come up with a model that generates predictions for X_test, which will be submitted for
evaluation and will count towards your final grade. The X variable is comprised of tabular data of dimension
1000 × 111. The Y variable is comprised of tabular data of dimension 1000 × 11, so that there are 11
binary targets (tasks) that need to be predicted. The loss function used for this problem is the average binary cross
entropy loss, i.e. if $Y_{ij}$ denotes the j-th target for the i-th observation, and $\hat{Y}_{ij}$ is the corresponding prediction from
your model, then the total loss is:

$$\frac{1}{n}\sum_{i=1}^{n} \frac{1}{11}\sum_{j=1}^{11} L_{XE}(Y_{ij}, \hat{Y}_{ij}),$$

where

$$L_{XE}(Y_{ij}, \hat{Y}_{ij}) = -Y_{ij}\log(\hat{Y}_{ij}) - (1 - Y_{ij})\log(1 - \hat{Y}_{ij})$$

is the usual binary cross entropy loss.
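As a sanity check for your own evaluation pipeline, this average loss can be computed directly with numpy. The sketch below is illustrative only: the function name average_bce and the clipping constant eps are our own choices, not part of the provided materials.

```python
import numpy as np

def average_bce(Y, Y_hat, eps=1e-15):
    """Average binary cross entropy over all n observations and all targets.

    Y     : (n, 11) array of 0/1 labels.
    Y_hat : (n, 11) array of predicted probabilities in (0, 1).
    """
    # Clip predictions away from exactly 0 and 1 so the logarithms stay finite.
    Y_hat = np.clip(Y_hat, eps, 1 - eps)
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    # Taking the mean over both axes is the (1/n)(1/11) double sum above.
    return losses.mean()
```

For example, predicting probability 0.5 for every target gives a loss of log 2 ≈ 0.693 regardless of the labels, which is a useful baseline to verify against.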
Important Aspects
The following problems should be considered and discussed in detail in your report:
• Data: Perform an extensive exploratory data analysis (EDA). This should include a pre-processing step in which
the data is cleaned. You should pay particular attention to the following questions:
1. Which features are most likely to be predictive of each target variable?
2. What, if any, are the relationships between the target variables?
• Research: Provide a summary of the multi-task learning literature. Be sure to explain rigorously some of the
algorithms that are used. It is a good idea to pick one or two areas to explore further here. The report should
be well written and well referenced.
• Modelling: The approach to modelling is open ended and you should think carefully about the types of models
you wish to deploy. It is generally a bad idea to build a large number of generic models. Instead, you should
think carefully about the models you want to use and how best to build them. Regardless of the models you
choose, you need to:
1. Construct a model that performs well in terms of the loss function described earlier.
2. Compare your model to the naive approach to multi-task learning in which you would construct 11 separate
models.
• Discussion: Provide a detailed discussion of the problem, your approach and your results. Explain whether your
final approach was better than the naive one, and why you think that might be the case. Discuss what you could
have improved on.
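For reference, the naive baseline in point 2 above can be sketched with scikit-learn's MultiOutputClassifier, which simply fits one independent classifier per target column. This is a minimal illustration, not a recommended final model: logistic regression is an arbitrary choice here, and the random arrays stand in for the real X_train and Y_train.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 111))         # stand-in for the real X_train
Y_train = rng.integers(0, 2, size=(200, 11))  # stand-in for the real Y_train

# One independent logistic regression per target: the "11 separate models" baseline.
naive = MultiOutputClassifier(LogisticRegression(max_iter=1000))
naive.fit(X_train, Y_train)

# predict_proba returns a list of 11 (n, 2) arrays; keep P(Y_j = 1) for each task.
proba = np.column_stack([p[:, 1] for p in naive.predict_proba(X_train)])
print(proba.shape)  # (200, 11)
```

A genuinely multi-task model would instead share parameters or representations across the 11 targets, which is the comparison the report should make.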
Overview of Guidelines
• The deadline to submit the report is 5pm April 22. The deadline to submit your predictions, your code (and
the documentation), and a 2-min presentation is 5pm April 19 for both the Internal Challenge project (MML)
and External Challenge (Berrijam project).
• Submission will be via the Moodle page
• You must complete this work in a group of 4-5, and this group must be declared on Moodle under Group Project
Member Selection
• The project will contribute 30% of your final grade for the course.
• Recall the guidance regarding plagiarism in the course introduction: this applies to all aspects of this project as
well, and if evidence of plagiarism is detected it may result in penalties ranging from loss of marks to suspension.
• Late submissions will incur a penalty of 5% per day from the maximum achievable grade. For example,
if you achieve a grade of 80/100 but you submitted 3 days late, then your final grade will be 80 − 3 × 5 = 65.
Submissions that are more than 5 days late will receive a mark of zero. The late penalty applies to all group
members.
Project Proposal
Each group must submit their project choice and a 1 page proposal by Friday, March 15th, 5 PM. The plan should
not exceed 1 page and should include the following:
1. Approach: Briefly describe your approach and techniques you want to explore
2. Owners and Collaborators: Nominate the team member who will work on each part of the project.
3. 4 week Plan: A list of weekly milestones leading to the final project deliverable.
Your actual project may deviate from the proposal. Changes from the original plan will NOT impact team scores.
The goal of the proposal is to help teams self-organize.
Objectives
In this project, your group will use what you have learned in COMP9417 to construct a predictive model for the
specific task described above as well as write a detailed report outlining your exploration of the data and approach to
modelling. The report is expected to be a maximum of 12 pages long (with a single column, 1.5 line spacing), and easy
to read. The body of the report should contain the main parts of the presentation, and any supplementary material
should be deferred to the appendix. For example, only include a plot if it is important to get your message across.
The guidelines for the report are as follows:
1. Title Page: title of the project, name of the group and all group members (names and zIDs).
2. Introduction: a brief summary of the task, the main issues for the task and a short description of how you
approached these issues.
3. Exploratory Data Analysis and Literature review: this is a crucial aspect of this project and should be done
carefully given the lack of domain information. Some (potential) questions for consideration: are all features
relevant? How can we represent the data graphically in a way that is informative? What is the distribution of
the targets? What are the relationships between the features? What are the relationships between the targets?
How has this sort of task been approached in the literature? etc.
4. Methodology: A detailed explanation and justification of methods developed, method selection, feature selection,
hyper-parameter tuning, evaluation metrics, design choices, etc. State which method has been selected for the
final test and its hyper-parameters.
5. Results: Include the results achieved by the different models implemented in your work, with a focus on the F1
score. Be sure to explain how each of the models was trained, and how you chose your final model.
6. Discussion: Compare different models, their features and their performance. What insights have you gained?
7. Conclusion: Give a brief summary of the project and your findings, and what could be improved on if you had
more time.
8. References: list of all literature that you have used in your project if any. You are encouraged to go beyond the
scope of the course content for this project.
You must follow this outline, and each section should be standalone. This means, for example, that you should not
display results in your methodology section.
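As one concrete starting point for the EDA questions in point 3 above, the distribution of each target and the pairwise relationships between targets can be computed directly. The variable names and the random stand-in array below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
Y_train = rng.integers(0, 2, size=(1000, 11)).astype(float)  # stand-in for the real Y_train

prevalence = Y_train.mean(axis=0)          # proportion of positives for each binary target
corr = np.corrcoef(Y_train, rowvar=False)  # (11, 11) matrix of target-target correlations
print(prevalence.round(2))
print(corr.shape)  # (11, 11)
```

Strong off-diagonal correlations would suggest the targets share structure, which both motivates multi-task learning and informs model design.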
Project implementation
Each group must implement a model and generate predictions for the provided test set. You are free to select the types
of models, features and tune the methods for best performance as you see fit, but your approach must be outlined in
detail in the report. You may also make use of any machine learning algorithm, even if it has not been covered in the
course, as long as you provide an explanation of the algorithm in the report, and justify why it is appropriate for the
task. You can use any open-source libraries for the project, as long as they are cited in your work. You can use all the
provided features or a subset of features; however, you are expected to give a justification for your choice. You may run
some exploratory analysis or some feature selection techniques to select your features. There is no restriction on how
you choose your features as long as you are able to justify it. In your justification of selecting methods, parameters
and features you may refer to published results of similar experiments.
Code submission
Code files should be submitted as a separate .zip file along with the report, which must be in .pdf format. Penalties
will apply if you do not submit a pdf file (do not put the pdf file in the zip).
Peer review
Individual contribution to the project will be assessed through a peer-review process which will be announced later,
after the reports are submitted. This will be used to scale marks based on contribution. Anyone who does not complete
the peer review by 5pm on Friday of Week 11 (26 April) will be deemed to have not contributed to the assignment.
Peer review is a confidential process and group members are not allowed to disclose their review to their peers.
Project help
Consult Python package online documentation for using methods, metrics and scores. There are many other resources
on the Internet and in literature related to classification. When using these resources, please keep in mind the guidance
regarding plagiarism in the course introduction. General questions regarding group project should be posted in the
Group project forum in the course Moodle page.