Page 1 of 8 | MIE1624: Introduction to Data Science, Analytics, and Artificial Intelligence – Assignment 2

MIE 1624 Introduction to Data Science, Analytics and AI – Winter 2026
Assignment 2
Due Date: 11:59pm, March 9, 2026
Submit via Quercus

Background:

For this assignment, your task is to analyze the provided dataset and answer the questions outlined in this document. You will then write a 5-page report to present the results of your analysis. In your report, make use of visual aids to effectively convey your findings. Explain how you arrive at the answers to the questions and justify why your answers are reasonable for the given data/question. You must interpret your results in the context of the dataset for your problem. You are also required to submit an IPython Notebook (.ipynb file) containing all the code of the analysis you performed to answer the questions in the report. Please ensure that the notebook is saved as a ‘.ipynb’ file, not as PDF or HTML.

In this assignment, we will work with the “2025 Stack Overflow Annual Developer Survey” dataset, which was also used for Assignment 1. This survey, conducted annually by Stack Overflow for the past 15 years, aims to gather data on the current state of the software development community. Participants were surveyed from 177 countries, working in various industries and roles, including software developers, analysts, and students.

When preparing for a career in software development and data science, deciding which skills to acquire and which technologies to learn is a multi-criteria decision-making problem with several objectives (e.g., salary level, job satisfaction, and work-life balance). In this assignment, we will focus specifically on training, validating, and tuning a model that can predict a survey respondent’s self-reported job satisfaction (the target variable is contained in column ‘QID26’, “How satisfied are you in your current professional developer role?”, or “JobSat” in the dataset).
Classification is a supervised machine learning approach used to assign a discrete value of one variable given the values of others. Many types of machine learning models can be trained for classification, such as logistic regression, decision trees, kNN, SVM, random forests, gradient-boosted decision trees, and neural networks. In this assignment, you are required to implement the ordinal logistic regression algorithm, but feel free to experiment with other algorithms.

The original dataset contains the results of many different question types, including multiple-choice, text-entry, and ranked-order questions. The total number of participants surveyed was 49,123; however, not all questions were answered by each participant. The results of the original survey are available in the file survey_results_public.csv (found at https://survey.stackoverflow.co/2025/). The original data (survey_results_public.csv) has been transformed into the file processed_data_assignment2.csv – this file should be read into the Jupyter Notebook and is the data that should be used for the assignment. Rows with null values of ‘JobSat’ have been dropped, and the data has been limited to primarily multiple-choice questions with select text-entry questions. Additionally, the salaries reported in the dataset have been converted into Canadian Dollars (CAD).

For this assignment, any subset of the data in the processed_data_assignment2.csv file can be used for data exploration and for classification purposes. For example, you may discard some rows (data samples) for data cleaning purposes. You must justify and explain why you are selecting a subset of the data and how it may affect the model. Data is often split into training and testing data.
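A minimal sketch of such a hold-out split is shown below, using a tiny synthetic frame with hypothetical column names in place of the real processed_data_assignment2.csv:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for processed_data_assignment2.csv (columns are hypothetical).
df = pd.DataFrame({
    "YearsCode":  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "SalaryCAD":  [55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110],
    "JobSat":     [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],  # ordinal target
})

X, y = df.drop(columns="JobSat"), df["JobSat"]

# Hold out a test set up front; it must not be touched during training/tuning.
# Stratifying keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # -> (9, 2) (3, 2)
```

The held-out test set is used only once, for the final evaluation in Question 5; validation happens inside the training portion.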
The training data is typically further divided to create validation sets, either by simply splitting, if enough data points exist, or by using cross-validation within the training set. The model can be iteratively improved by tuning the hyperparameters or by feature selection and feature engineering.

You may get started with this assignment using the assignment2_template.ipynb file. The template contains some basic data analysis procedures that may be helpful for you, e.g., reading the dataset and the skeleton for implementing ordinal logistic regression. Note that the file must be properly renamed before submission (see later sections for details).

Learning objectives:

1. Understand how to clean and prepare data for machine learning algorithms, including working with multiple data types, incomplete data, and categorical data. Perform data standardization/normalization, if necessary, prior to modeling.
2. Understand how to apply machine learning algorithms (ordinal logistic regression) to the task of classification.
3. Improve on skills and competencies required to compare the performance of classification algorithms, including the application of performance measurements and the visualization of comparisons.
4. Understand how to improve the performance of your model.
5. Improve on skills and competencies required to collate and present domain-specific, evidence-based insights.

Questions:

The following sections should be included, but the order does not need to be followed. The discussion for each section is included in that section’s marks.

1. [2 pts] Data cleaning: While the data is made ready for analysis, several values are missing, some features are categorical, and some questions allow for multiple responses. For the data cleaning step, handle missing values however you see fit and justify your approach.
Suggestions include filling the missing values with a certain value (e.g., the mode for categorical data) or completely removing the features with missing values, with sufficient justification/support for this decision. Another method could be to fill them with a separate value indicating missingness, e.g., “unknown” or “none”. Secondly, convert categorical data into numerical data by encoding, and explain why you used this particular encoding method. Lastly, handle columns where multiple responses are recorded. In your PDF report, for the features you cleaned, provide some insight on why you think the values are missing and how your approach might impact the overall analysis. You can choose a subset of features to use in later questions if you think that is reasonable and can work on data cleaning for only those features; however, you should NOT discard features without sufficient justification!

Your submission must include the following:
● Data cleaning code that handles missing values and categorical features (in .ipynb file);
● Explanation of each of your data cleaning steps and justification of your approach (in PDF report).
   o You don't need to explain the data cleaning steps for each feature. You can group the features with similar cleaning steps and explain the cleaning and encoding you applied to each group, and why – that will be sufficient.

Hint: Take a close look at the dataset before attempting to answer this question. What is the meaning of each column? Be cautious when you interpret missing values within the dataset.

2. [3.5 pts] Exploratory data analysis and feature selection: In this question, you explore how feature selection and feature engineering are useful tools in machine learning in the context of the tasks in this assignment. Conduct exploratory data analysis to identify features that appear to be important.
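One way to sketch such an importance ranking is with mutual information; the feature names and the target construction below are invented stand-ins for illustration, not the real survey columns:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-ins: one feature that drives the target, one pure noise.
work_life = rng.integers(0, 5, n)                 # hypothetical encoded feature
noise = rng.normal(size=n)
y = (work_life >= 3).astype(int) + (work_life >= 4).astype(int)  # ordinal-ish target

X = pd.DataFrame({"WorkLifeBalance": work_life, "Noise": noise})

# Mutual information handles discrete and continuous features; mark which is which.
mi = mutual_info_classif(X, y, discrete_features=[True, False], random_state=0)
importance = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(importance)            # WorkLifeBalance ranks first, Noise near zero
# importance.plot.barh()     # the ranking plot to include in the report
```

On the real data, the same ranking plotted as a bar chart gives the "order of feature importance" visualization asked for here.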
Then, apply feature engineering and then select the features to be used for analysis, either manually or through some feature selection algorithm (e.g., regularized regression). For the exploratory data analysis, visualize the order of feature importance. Based on the feature importance plot, conclude which of the original attributes in the data are most related to a survey respondent’s job satisfaction. Not all features need to be used in the later analysis; features can be removed or added as desired with sufficient justification. If the resulting number of features is very high, dimensionality reduction can also be used (e.g., PCA is a dimensionality reduction technique, but it is not effective for categorical features – think about what other types of techniques can be used). Use at least one feature selection technique – describe the technique and provide justification for why you selected that set of features.

Your submission must include the following:
● Exploratory data analysis code (in .ipynb file) that includes a visualization of the order of feature importance, along with insights on the most important features for predicting a respondent’s job satisfaction (in PDF report). (1 pt)
● Feature engineering/generation code that creates new feature(s) from existing ones. You may incorporate domain knowledge or external data; applying feature generation techniques is also a valid approach. Features don't have to improve the model, but you should explain why you believed that they might help with prediction. (0.5 pts)
● Implementation of the feature selection technique of your choice (in .ipynb file). (1 pt)
● Explanation of the feature selection technique implemented above and justification of your approach (in PDF report). (1 pt)

3. [3.5 pts] Model implementation:

3.1. Implement the ordinal logistic regression algorithm. (The skeleton is provided in the notebook.)
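The notebook's skeleton is authoritative; purely as an illustration, one standard way to build an ordinal classifier from binary logistic regressions is the cumulative-threshold reduction (Frank & Hall, 2001), fitting one P(y > k) model per threshold. A hedged sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class OrdinalLogisticRegression:
    """Ordinal classifier built from K-1 binary LogisticRegressions,
    one per cumulative event P(y > k). A sketch, not the course skeleton."""

    def __init__(self, C=1.0):
        self.C = C  # inverse regularization strength, passed through

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        self.models_ = {}
        for k in self.classes_[:-1]:               # one model per threshold
            clf = LogisticRegression(C=self.C, max_iter=1000)
            clf.fit(X, (y > k).astype(int))
            self.models_[k] = clf
        return self

    def predict_proba(self, X):
        # Stack P(y > k) per threshold, then difference into class probabilities:
        # P(y = k) = P(y > k-1) - P(y > k), with P(y > min-1) = 1, P(y > max) = 0.
        gt = np.column_stack([self.models_[k].predict_proba(X)[:, 1]
                              for k in self.classes_[:-1]])
        cum = np.hstack([np.ones((len(X), 1)), gt, np.zeros((len(X), 1))])
        probs = np.clip(cum[:, :-1] - cum[:, 1:], 0, None)  # guard non-monotonicity
        return probs / probs.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# Tiny smoke test on perfectly ordered 1-D data.
X = np.arange(30, dtype=float).reshape(-1, 1)
y = np.repeat([0, 1, 2], 10)
model = OrdinalLogisticRegression(C=1.0).fit(X, y)
print((model.predict(X) == y).mean())
```

Exposing `C` in `__init__` is the kind of hyperparameter pass-through Question 3.3 asks you to add for your chosen bias-variance knob.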
(1 pt)

3.2. Perform 10-fold cross-validation on the training data. How does your model accuracy compare across the folds? Report the average and variance of accuracy across folds. You can use default hyperparameter values. (1 pt)

3.3. Identify one hyperparameter that has a direct impact on the bias-variance trade-off. By tweaking this hyperparameter (add it to the OrdinalLogisticRegression class), provide an analysis of the model performance based on the bias-variance trade-off. It may be helpful to include a table/plot to report the model performance. Conclude which value of the hyperparameter is best in terms of the bias-variance trade-off. You can leave the other hyperparameters at their defaults. (1 pt)

3.4. Is scaling/normalization of features needed for our task? Apply scaling/normalization if necessary, and justify why scaling/normalization is (not) needed. If you are applying scaling/normalization to the data, make sure you apply the technique to the testing and training data separately. (0.5 pts)

4. [3 pts] Model tuning:

4.1. Selecting a proper criterion for determining the “best-performing” model requires choosing appropriate performance metrics, such as precision, recall, and F1-score. A description of these metrics is provided at the end of this file. Explain why accuracy may not be a suitable metric for this problem and suggest a more appropriate alternative. (0.5 pts)

4.2. The ordinal logistic regression model has several hyperparameters that can be tuned to improve performance (see the logistic regression model implementation in scikit-learn). Select two hyperparameters for model tuning, explain what each hyperparameter does, and justify your choices. Improve the performance of the ordinal logistic regression model and select a final best-performing model using grid search based on a metric (or metrics) chosen in Question 4.1. (1.5 pts)
● You can choose any hyperparameters of this model, including the one you identified in 3.3.
Your goal is to improve the model performance through hyperparameter tuning.
● There is no requirement for a minimum model performance, as long as your model implementation and tuning are done correctly and well explained.

4.3. Create the feature importance graph of your model to see which features were the most determining in the model’s predictions. Compare this graph with the feature importance graph obtained in Section 2. (1 pt)

Hint: How do you extract feature importance from your ordinal logistic regression model? Think about a reasonable representation that highlights which features your model relied on for prediction.

5. [3 pts] Testing & discussion:

5.1. Use your best-performing model to make classifications on the test set. (Note that the test set should not be used in any form during the training process, even as a validation set.) Report the performance on the test set vs. the training set. (0.5 pts)

5.2. Assess the overall fit of the model and discuss whether it is overfitting or underfitting, along with your reasoning. How would you further improve its performance on the training and/or test set? (1 pt)

5.3. Plot the distribution of true target variable values and their predictions on both the training set and the test set. (0.5 pts)

5.4. What insight have you gained from the dataset and your trained classification model? Generate a figure that summarizes at least one key finding of the model/analysis. (1 pt)

Insufficient discussion will lead to the deduction of marks.

Recommended steps to get started:

1) Download assignment2_template.ipynb from Quercus.
2) Go to Google Colab (https://colab.google/), click on ‘Open Colab’, and upload the assignment2_template.ipynb file.
3) Start working on the #TODO items in the template (note that the #TODO items do not cover every step from Questions 1-5; read each question carefully and make sure you answer all the questions).
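The cross-validation of Question 3.2 and the grid search of Question 4.2 can be sketched together as follows, using scikit-learn's plain LogisticRegression as a stand-in for the ordinal model; the synthetic data, grid values, and metric choice are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the cleaned training split (the real data comes
# from processed_data_assignment2.csv after Questions 1-2).
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Question 3.2-style check: 10-fold CV accuracy, with its mean and variance.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(f"accuracy: mean={scores.mean():.3f}, variance={scores.var():.4f}")

# Question 4.2-style tuning: an illustrative grid over two hyperparameters,
# scored by macro-F1 rather than plain accuracy.
grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid,
                      scoring="f1_macro", cv=10)
search.fit(X, y)
print("best params:", search.best_params_)
print("best macro-F1:", round(search.best_score_, 3))
```

For your own OrdinalLogisticRegression class, the same pattern applies once the class exposes its hyperparameters, or you can loop over the grid manually with your own cross-validation.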
Submission:

1) Produce an IPython Notebook (.ipynb file) containing your implementation and the analyses you performed to answer the questions for the given data set. Make sure you have brief comments for every step of your analysis so that we know what analysis you did. Your Jupyter notebook should run on Google Colab! Add the following two lines at the beginning of your notebook so you can work with the provided data on Colab:

from google.colab import files
uploaded = files.upload()

When you check your code before submission, select the ‘Kernel’ tab and then ‘Restart & Run All’ to make sure that all the code runs without errors. If your code does not run properly on Google Colab, substantial marks will be deducted.

2) Produce a 5-page report explaining your response to each question for the given data set and detailing the analysis you performed. When writing the report, make sure that you explain each step: what you are doing, why it is important, and the pros and cons of that approach. You can have an appendix for additional figures and tables, and cite them in the report if you need more space. Figures and tables should be properly formatted. While there is no specific grading criterion for writing quality, unclear/inaccurate statements and figures that fail to convey the justification of your answers may lead to a partial deduction of points.

What to submit:

1. Submit via Quercus a Jupyter (IPython) notebook with the following naming convention:

lastname_studentnumber_assignment2.ipynb

Make sure that you comment on your code appropriately and describe each step in sufficient detail. Respect the above convention when naming your file, making sure that all letters are lowercase and underscores are used as shown. If a program cannot be evaluated because of errors or because it varies from specifications, you will receive zero marks.

2.
Submit a report in PDF (up to 5 pages + appendix) including the findings from your analysis. Use the following naming convention: lastname_studentnumber_assignment2.pdf

Tools:

● Software:
   ○ Python Version 3.X is required for this assignment. Make sure that your Jupyter notebook runs on the Google Colab (https://colab.research.google.com) portal. All libraries are allowed, but here is a list of the major libraries you might consider: NumPy, SciPy, scikit-learn, Matplotlib, Pandas.
   ○ No other tool or software besides Python and its component libraries can be used to touch the data files. For instance, using Microsoft Excel to clean the data is not allowed.
   ○ Upload the required data file to your notebook on Google Colab – for example:
      from google.colab import files
      uploaded = files.upload()
● Data file:
   ○ processed_data_assignment2.csv: file to be read in the notebook for this assignment.
      ■ The data file cannot be altered by any means. The notebook will be run using the local version of this data file. Do not save anything to file within the notebook and read it back.
● Auxiliary files:
   ○ survey_results_public.csv: original survey responses.
   ○ survey_results_schema.csv: summary of questions.
   ○ 2025 Developer Survey Tool .pdf: the survey itself.

Late submissions will receive a standard penalty:
● up to one hour late - no penalty
● one day late - 15% penalty
● two days late - 30% penalty
● three days late - 45% penalty
● more than three days late - 0 mark

Other requirements and tips:

1. A large portion of marks is allocated to analysis and justification. Full marks will not be awarded for the code alone.
2. Output must be shown and readable in the notebook. The only files that can be read into the notebook are the files posted in the assignment, without modification. All work must be done within the notebook.
3. Ensure the code runs in full before submitting.
Open the code in Google Colab and navigate to Runtime -> Restart runtime and Run all Cells. Ensure that there are no errors.
4. You may not want to re-run cross-validation (it can run for a very long time). When cross-validation is finished, output (print) the results (optimal model parameters). Hard-code the results in the model parameters and comment out the cross-validation code used to generate the optimal parameters.
5. You have a lot of freedom in how you want to approach each step and in whichever library or function you want to use. As open-ended as the problem seems, the emphasis of the assignment is for you to be able to explain the reasoning behind each step.
6. When evaluating the performance of your algorithm, keep in mind that there can be an inherent trade-off between the results on various performance measures.

Policy for using LLMs:

You may complete any part of this assignment (including both code and report text generation) using an LLM such as ChatGPT, Claude, or Gemini. Please specify how the LLM was used. For example, if ChatGPT was used for code generation, structuring, idea generation, or troubleshooting, that should be stated within the report. Additionally, if an LLM is used to generate large sections of code or the writing of your report, you can include examples of the prompts used within the appendix of the report, as well as comment on how you validated the LLM responses. There is no requirement that you implement exactly what the LLM produced, and if you modified a response, you can comment on why/how you changed it. If you use ChatGPT, you could also provide a public link to your chat.
Appendix A

Brief Introduction to the Most Common Performance Metrics:
(source: https://medium.com/@MohammedS/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b)

Accuracy: refers to the total number of correct predictions over the total number of predictions.
Precision: refers to the total number of true positive predictions over the total number of datapoints predicted as positive.
Recall or Sensitivity: refers to the total number of true positive predictions over the total number of all datapoints with actual positive labels.
Specificity: refers to the total number of true negative predictions over the total number of all datapoints with actual negative labels.
F1-score: refers to the harmonic mean of precision and recall.
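These definitions can be checked numerically on a toy binary example (confusion counts TP=2, FN=1, FP=1, TN=4), cross-validated against scikit-learn:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 3 actual positives, 5 actual negatives
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # TP=2, FN=1, FP=1, TN=4

tp, fn, fp, tn = 2, 1, 1, 4
# Each hand-computed ratio matches the corresponding sklearn helper.
print("accuracy   ", (tp + tn) / (tp + tn + fp + fn), accuracy_score(y_true, y_pred))
print("precision  ", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall     ", tp / (tp + fn), recall_score(y_true, y_pred))
print("specificity", tn / (tn + fp))   # no dedicated sklearn one-liner
print("f1         ", 2 * tp / (2 * tp + fp + fn), f1_score(y_true, y_pred))
```

Note that F1 = 2·precision·recall / (precision + recall) simplifies to 2TP / (2TP + FP + FN), which is the form used above.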