COMP9417 Project: Can I Speak to the Manager? Machine Learning for Customer Feedback Classification

April 7, 2025

Project Description

In modern manufacturing, efficiently managing customer feedback is essential for improving products and addressing concerns. This project focuses on developing a machine learning model that automatically classifies customer comments relating to 28 different products and directs them to the appropriate departments within the company. The dataset consists of 10,000 training instances, with each comment represented by 300 features extracted using natural language processing (NLP) techniques. These features capture key linguistic and contextual elements to enhance classification accuracy. The goal is to build a robust multiclass classification model that assigns each comment to the correct department, streamlining the feedback management process. As a data scientist, your role is to develop a solution that enhances response efficiency and optimizes workflow within the company by accurately classifying and directing customer feedback to the appropriate departments.

The dataset will be made available at 5pm Monday 31st March.

Description of the Data

The dataset is provided in CSV format and consists of three subsets:

• Training set: D_train = {(x_i, y_i) | i = 1, ..., 10000}
• Test set 1: D_test1 = {x_j | j = 1, ..., 1000}
• Test set 2: D_test2 = {(x_k, y_k) | k = 1, ..., 202} ∪ {x_k | k = 1, ..., 1818}

Note that Test set 2 only needs to be used for the last part of the project, on "Unexpected Model Performance" (see below). In Test set 2, you are given access to 202 labeled points and 1818 unlabeled points.

Each instance x_i ∈ R^300 represents a customer comment transformed into a feature vector of dimension 300 using NLP techniques. The corresponding label y_i is drawn from a set of 28 distinct categories:

    y_i ∈ C = {c_1, c_2, ..., c_28}

where each category c_k (k = 1, ..., 28) corresponds to the department responsible for handling feedback on a specific product. The task is to learn a function

    f : R^300 → C

that maps each feature vector x_i to the correct category y_i, ensuring accurate classification of customer feedback. The final model will be evaluated on its classification accuracy in assigning unseen test instances x_j to the correct category in C.
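To make this setup concrete, the following minimal sketch shows one way to load the training data and inspect the class balance. The file names here are placeholders (the actual names will be those of the files released on Moodle), and the labels are assumed to be encoded as integers 0-27.

import numpy as np
import pandas as pd

# NOTE: file names below are assumptions for illustration only; substitute
# the names of the files actually released on Moodle.
X_train = pd.read_csv("X_train.csv").to_numpy()          # expected shape (10000, 300)
y_train = pd.read_csv("y_train.csv").to_numpy().ravel()  # assumed integer labels in {0, ..., 27}

# Inspect the class distribution to gauge imbalance before modeling
classes, counts = np.unique(y_train, return_counts=True)
for c, n in zip(classes, counts):
    print(f"class {c}: {n} samples ({100 * n / len(y_train):.1f}%)")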
Important Aspects

The following problems should be considered and discussed in detail in your report:

• Data: Perform exploratory data analysis (EDA). This should include a pre-processing step in which the data is cleaned. Pay particular attention to the following questions:
  1. Which features are most likely to be predictive of each target class?
  2. How does the class imbalance affect the learning process, and what methods can mitigate this issue?
  3. What are appropriate evaluation metrics for this task, and why do accuracy-based metrics fail in imbalanced classification?

• Research: Provide a summary of state-of-the-art methods for handling imbalanced multi-class classification tasks. Be sure to rigorously explain some of the algorithms that are used. It is a good idea to pick one or two areas to explore further. The report should be well written and well referenced.

• Modeling: The approach to modeling is open-ended, and you should think carefully about the types of models you wish to deploy. Instead of building a large number of generic models, focus on well-justified choices and their impact. Regardless of the models you choose, you need to:
  1. Construct a model that performs well in terms of classification metrics suitable for imbalanced data.
  2. Compare different strategies for handling class imbalance within machine learning models.
  3. Investigate ensemble techniques to improve performance over individual models.
  4. Evaluate models not only on overall performance but also on per-class metrics to assess minority-class performance.

• Discussion: Provide a detailed discussion of the problem, your approach, and your results. Explain whether your final approach was better than a simple baseline classifier and justify why. Discuss limitations and potential future improvements.

• Unexpected Model Performance in New Test Data Deployment: After deploying a customer feedback classification model, the manufacturing departments report a significant drop in classification accuracy for newly received feedback messages. Some departments are mistakenly receiving messages unrelated to their operations, leading to inefficiencies in workflow and customer service. This issue suggests that the data distribution encountered during deployment differs from the one used during training, a phenomenon known as distribution shift. To better understand and address this problem, a new test dataset, D_test2, has been created to evaluate model performance under these changing conditions. Your mission includes the following tasks:
  1. Diagnose the problem: Investigate potential reasons for the observed performance drop. Do you think a distribution shift is occurring? If so, what type of distribution shift is it?
  2. Discuss why traditional machine learning models trained under one distribution may fail when applied to data from a different distribution, and explore techniques for handling the distribution shift in this particular problem.
  3. Read about and implement methods to detect and mitigate distribution shifts. Summarize key techniques and explain how they could be applied in this scenario to improve model generalization. One simple diagnostic along these lines is sketched after this list.

For background reading, you may refer to this resource: Distribution shift and defenses. You are strongly encouraged to look further into the literature on this topic as part of the project.
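Many detection approaches are possible; as one concrete illustration (a sketch, not a required method), a "domain classifier" can be used to test for covariate shift: train a classifier to distinguish training-set feature vectors from Test set 2 feature vectors. A cross-validated AUC near 0.5 suggests the covariate distributions are similar, while an AUC well above 0.5 is evidence of shift. The sketch below assumes X_train and X_test2 are NumPy feature matrices of shapes (10000, 300) and (2020, 300).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def domain_classifier_auc(X_source, X_target):
    """Train a classifier to distinguish source from target features.
    AUC ~ 0.5 suggests similar covariate distributions; AUC well above
    0.5 is evidence of covariate shift."""
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.zeros(len(X_source), dtype=int),
                        np.ones(len(X_target), dtype=int)])
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    return scores.mean()

# Example usage (X_train and X_test2 are assumed to be loaded already):
# auc = domain_classifier_auc(X_train, X_test2)
# print(f"domain classifier AUC: {auc:.3f}")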
Overview of Guidelines

• The deadline to submit the report, code and presentation is 5pm April 28th.
• You must complete this work in a group of 4-5, and this group must be declared on Moodle under Group Project Member Selection.
• Submission will be via the Moodle page. Only one student in the group needs to make a submission.
• The project will contribute 20% of your final grade for the course.
• Recall the guidance regarding plagiarism in the course introduction: this applies to all aspects of this project as well, and if evidence of plagiarism is detected it may result in penalties ranging from loss of marks to suspension.
• Late submissions will incur a penalty of 5% per day from the maximum achievable grade. For example, if you achieve a grade of 80/100 but submit 3 days late, your final grade will be 80 − 3 × 5 = 65. Submissions that are more than 5 days late will receive a mark of zero. The late penalty applies to all group members.
• All group members must submit a peer-review survey by 5pm 2nd May. Failure to complete the survey will result in a 10% penalty to that student.

Objectives

In this project, your group will use what you have learned in COMP9417 to construct a predictive model for the specific task described above, and write a detailed report outlining your exploration of the data and your approach to modeling. The report is expected to be a maximum of 6 pages long (12 pt font size, single column, 1.5 line spacing) and easy to read. The body of the report should contain the main parts of your presentation, and any supplementary material should be deferred to the appendix. For example, only include a plot if it is important to get your message across. The guidelines for the report are as follows:

1. Title Page: title of the project, name of the group and all group members (names and zIDs), and a link to the OneDrive folder containing the video presentation. The title page is not counted in the page count.
2. Introduction: a brief summary of the task, the main issues for the task, and a short description of how you approached these issues.
3. Exploratory Data Analysis and Literature Review: this is a crucial aspect of the project and should be done carefully given the lack of domain information. Some (potential) questions for consideration: Are all features relevant? What is the distribution of the targets? What are the relationships between the features? What are the relationships between the targets? How has this sort of task been approached in the literature?
4. Methodology: a detailed explanation and justification of the methods developed, method selection, feature selection, hyper-parameter tuning, evaluation metrics, design choices, etc. State which method has been selected for the final test and its hyper-parameters.
5. Results: include the results achieved by the different models implemented in your work, with a focus on the weighted cross-entropy loss (see Predictions submission below). Be sure to explain how each of the models was trained, and how you chose your final model.
6. Discussion: compare different models, their features and their performance. What insights have you gained?
7. Conclusion: give a brief summary of the project and your findings, and what could be improved if you had more time.
8. References: a list of all literature that you have used in your project, if any. You are encouraged to go beyond the scope of the course content for this project. References are not counted in the page count.

You must follow this outline, and each section should be standalone. This means, for example, that you should not display results in your Methodology section.

Project implementation

Each group must implement a model and generate predictions for the provided test sets. You are free to select the types of models and features, and to tune the methods for best performance as you see fit, but your approach must be outlined in sufficient detail in the report. You may also make use of any machine learning algorithm, even if it has not been covered in the course, as long as you provide an explanation of the algorithm in the report and justify why it is appropriate for the task. You can use any open-source libraries for the project, as long as they are cited in your work. You can use all the provided features or a subset of features; however, you are expected to justify your choice. You may run some exploratory analysis or some feature selection techniques to select your features. There is no restriction on how you choose your features as long as you are able to justify it. In your justification of selecting methods, parameters and features, you may refer to published results of similar experiments. One possible starting point for a model is sketched below.
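As one illustration of a well-justified starting point (a sketch, not a prescribed model), a linear classifier with inverse-frequency class weights gives a simple imbalance-aware baseline, and a per-class report exposes the minority-class performance that plain accuracy hides. It assumes X_train and y_train are loaded as in the earlier sketch.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hold out a stratified validation split so every class appears in it
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)

# class_weight="balanced" reweights each class inversely to its frequency,
# counteracting imbalance without resampling
baseline = LogisticRegression(class_weight="balanced", max_iter=2000)
baseline.fit(X_tr, y_tr)

# Per-class precision, recall and F1 for the 28 departments
print(classification_report(y_val, baseline.predict(X_val)))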
Video Presentation

Each team is required to submit a 2-minute video presentation that outlines the problem and the group's approach to modeling. The purpose is to provide a high-level summary of the project and highlight key insights, rather than focusing on technical details or minutiae. Please ensure that the video is exactly 2 minutes long in real time; videos played at faster speeds will incur penalties. Place your video presentation in a OneDrive folder, and include a link to this folder on the title page of your submitted report. It is your responsibility to check that the video file is not corrupted (double-check that audio and video are working, and check that the link works).

Code and report submission

You should submit this on Moodle under the Moodle object Group Project - Reports. Only one member of the group needs to submit. Please submit the code files as a separate .zip file alongside the report, which must be in .pdf format. Your project should consist of multiple .py files (e.g., separate files for different models and/or specific processing steps) containing well-commented and easy-to-read code, with a README that provides instructions on how to run the code. While you may use Jupyter notebooks for exploratory data analysis, they should not be the primary method for running your code (e.g., you can extract .py files from a Jupyter notebook). Penalties will apply if the .pdf file is not submitted separately (do not include the PDF within the zip file).

Predictions submission

You should submit this on Moodle under the Moodle object Group Project - Predictions. Only one member of the group needs to submit. You are to submit predictions for both test set 1 (the 1000 unlabeled points) and test set 2 (the 1818 unlabeled points). These predictions should be provided in a zip file containing two .npy files. The zip file should be named 'GROUPNAME'.zip. The two .npy files should be named preds_1.npy and preds_2.npy, respectively. preds_1.npy must be a NumPy array of predictions for test set 1, of size 1000 × 28. preds_2.npy must be a NumPy array of predictions for test set 2, of size 1818 × 28. Failure to follow these instructions may lead to a grade of zero for the model predictive performance portion of the grading criteria.

Your predictions will be evaluated using a weighted cross-entropy loss, where the weights are determined by the inverse frequency of the classes in the respective test sets. Mathematically, the weighted cross-entropy loss is given by:

    L = − ∑_{i=1}^{N} w_{y_i} log p̂_i

where:
• N is the number of samples,
• y_i is the true class label of the i-th sample,
• p̂_i is the predicted probability for the true class,
• w_{y_i} is the weight for class y_i, typically defined as w_{y_i} = 1 / f_{y_i}, where f_{y_i} is the frequency of class y_i in the test set.

This formulation ensures that less frequent classes receive higher weights, helping to mitigate class imbalance issues.
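As a small worked illustration (the numbers are invented for this example): suppose a test set has two classes with frequencies f_A = 90 and f_B = 10. The raw weights are w_A = 1/90 and w_B = 1/10; after normalizing them to sum to 1, as the snippet below does, w_A = 0.1 and w_B = 0.9. A confident mistake on a minority-class sample therefore contributes nine times as much to the loss as the same mistake on a majority-class sample, so a model that ignores class B will score poorly under this metric even if its plain accuracy is high.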
For your benefit, the following code snippet is provided to ensure your submission meets the requirements and to provide familiarity with the loss function.

import numpy as np
import zipfile

# Open the zip file containing preds_1.npy and preds_2.npy
with zipfile.ZipFile('GROUPNAME.zip', 'r') as zip_ref:
    zip_ref.extractall('extracted_files')  # Extract all files into the 'extracted_files' folder

preds_1 = np.load('extracted_files/preds_1.npy')
preds_2 = np.load('extracted_files/preds_2.npy')

# Check that preds_1 is of size 1000 x 28 and preds_2 is of size 1818 x 28
if preds_1.shape != (1000, 28):
    raise ValueError(f"preds_1 has size {preds_1.shape}, but expected 1000 x 28")

if preds_2.shape != (1818, 28):
    raise ValueError(f"preds_2 has size {preds_2.shape}, but expected 1818 x 28")

def weighted_log_loss(y_true, y_pred):
    """
    Compute the weighted cross-entropy (log loss) given true labels and predicted probabilities.

    Parameters:
    - y_true: (N, C) one-hot encoded true labels
    - y_pred: (N, C) predicted probabilities

    Returns:
    - Weighted log loss (scalar).
    """
    # Compute class frequencies
    class_counts = np.sum(y_true, axis=0)   # Sum over samples to get counts per class
    class_weights = 1.0 / class_counts      # Inverse-frequency weights
    class_weights /= np.sum(class_weights)  # Normalize weights to sum to 1

    # Compute weighted loss
    sample_weights = np.sum(y_true * class_weights, axis=1)  # Weight for each sample
    loss = -np.mean(sample_weights * np.sum(y_true * np.log(y_pred), axis=1))

    return loss

# y_test_1_ohe is the one-hot encoded array of true labels in test set 1
# y_test_2_ohe is the one-hot encoded array of true labels in test set 2
# You do not have access to either; RANDOMLY generated one-hot labels are used here so the code runs
y_test_1_ohe = (np.arange(28) == np.random.choice(28, size=1000)[:, None]).astype(int)
y_test_2_ohe = (np.arange(28) == np.random.choice(28, size=1818)[:, None]).astype(int)

loss_1 = weighted_log_loss(y_test_1_ohe, preds_1)
loss_2 = weighted_log_loss(y_test_2_ohe, preds_2)
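In the other direction, here is a minimal sketch of producing a compliant submission. The names are placeholders: model is assumed to be any fitted classifier exposing predict_proba (e.g., from scikit-learn), and X_test1 and X_test2 are the unlabeled test feature matrices.

import numpy as np
import zipfile

# Hypothetical export step: each row is a probability distribution over
# the 28 classes, matching the required (N, 28) shapes
preds_1 = model.predict_proba(X_test1)   # shape (1000, 28)
preds_2 = model.predict_proba(X_test2)   # shape (1818, 28)
np.save("preds_1.npy", preds_1)
np.save("preds_2.npy", preds_2)

# Package both arrays into the required zip file (rename to your group name)
with zipfile.ZipFile("GROUPNAME.zip", "w") as zf:
    zf.write("preds_1.npy")
    zf.write("preds_2.npy")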
Peer review

Individual contributions to the project will be assessed through a peer-review process, which will be announced after the reports are submitted. This will be used to scale the mark based on contribution, and 80% of the final group project mark will be weighted based on individual contributions. Anyone who does not complete the peer review by 5pm 2nd May will be deemed to have not contributed to the assignment. Peer review is a confidential process, and group members are not allowed to disclose their reviews to their peers.

Project help

Consult the online documentation of Python packages for using methods, metrics and scores. There are many other resources on the Internet and in the classification literature. When using these resources, please keep in mind the guidance regarding plagiarism in the course introduction. General questions regarding the group project should be posted in the Group project forum on the course Moodle page.