Page 1 of 8 | MIE1624: Introduction to Data Science, Analytics, and Artificial Intelligence – Assignment 2

MIE 1624 Introduction to Data Science, Analytics and AI – Winter 2026
Assignment 2
Due Date: 11:59pm, March 9, 2026
Submit via Quercus

Background:

For this assignment, your task is to analyze the provided dataset and answer the questions outlined in this document. You will then write a 5-page report to present the results of your analysis. In your report, make use of visual aids to effectively convey your findings. Explain how you arrive at the answers to the questions and justify why your answers are reasonable for the given data/question. You must interpret your results in the context of the dataset for your problem. You are also required to submit an IPython Notebook (.ipynb file) containing all the code of the analysis you performed to answer the questions in the report. Please ensure that the notebook is saved as a ‘.ipynb’ file, not as PDF or HTML.

In this assignment, we will work with the “2025 Stack Overflow Annual Developer Survey” dataset, which was also used for Assignment 1. This survey, conducted annually by Stack Overflow for the past 15 years, aims to gather data on the current state of the software development community. Participants were surveyed from 177 countries, working in various industries and roles, including software developers, analysts, and students.

When preparing for a career in software development and data science, deciding which skills to acquire and which technologies to learn is a multi-criteria decision-making problem with several objectives (e.g., salary level, job satisfaction, and work-life balance). In this assignment, we will focus specifically on training, validating, and tuning a model that can predict a survey respondent’s self-reported job satisfaction (the target variable is contained in column ‘QID26’, “How satisfied are you in your current professional developer role?”, or “JobSat” in the dataset).
Classification is a supervised machine learning approach used to assign a discrete value of one variable given the values of others. Many types of machine learning models can be trained for classification, such as logistic regression, decision trees, kNN, SVM, random forests, gradient-boosted decision trees, and neural networks. In this assignment, you are required to implement the ordinal logistic regression algorithm, but feel free to experiment with other algorithms.

The original dataset contains the results of many different question types, including multiple-choice, text-entry, and ranked-order questions. The total number of participants surveyed was 49,123; however, not all questions were answered by each participant. The results of the original survey are available in the file survey_results_public.csv (found at https://survey.stackoverflow.co/2025/). The original data (survey_results_public.csv) has been transformed into the file processed_data_assignment2.csv – this file should be read into the Jupyter Notebook and is the data that should be used for the assignment. Rows with null values of ‘JobSat’ have been dropped, and the data has been limited to primarily multiple-choice questions with select text-entry questions. Additionally, the salaries reported in the dataset have been converted into Canadian Dollars (CAD).

For this assignment, any subset of the data in the processed_data_assignment2.csv file can be used for data exploration and for classification purposes. For example, you may discard some rows (data samples) for data cleaning purposes. You must justify and explain why you are selecting a subset of the data and how it may affect the model. Data is often split into training and testing data.
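A minimal sketch of such a hold-out split is shown below, using a tiny synthetic frame with hypothetical column names in place of the real processed_data_assignment2.csv:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for processed_data_assignment2.csv (columns are hypothetical).
df = pd.DataFrame({
    "YearsCode":  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "SalaryCAD":  [55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110],
    "JobSat":     [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],  # ordinal target
})

X, y = df.drop(columns="JobSat"), df["JobSat"]

# Hold out a test set up front; it must not be touched during training/tuning.
# Stratifying keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # -> (9, 2) (3, 2)
```

The held-out test set is used only once, for the final evaluation in Question 5; validation happens inside the training portion.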
The training data is typically further divided to create validation sets, either by simply splitting, if enough data points exist, or by using cross-validation within the training set. The model can be iteratively improved by tuning the hyperparameters or by feature selection and feature engineering.

You may get started with this assignment using the assignment2_template.ipynb file. The template contains some basic data analysis procedures that may be helpful for you, e.g., reading the dataset and the skeleton for implementing ordinal logistic regression. Note that the file must be properly renamed before submission (see later sections for details).

Learning objectives:

1. Understand how to clean and prepare data for machine learning algorithms, including working with multiple data types, incomplete data, and categorical data. Perform data standardization/normalization, if necessary, prior to modeling.
2. Understand how to apply machine learning algorithms (ordinal logistic regression) to the task of classification.
3. Improve on skills and competencies required to compare the performance of classification algorithms, including the application of performance measurements and the visualization of comparisons.
4. Understand how to improve the performance of your model.
5. Improve on skills and competencies required to collate and present domain-specific, evidence-based insights.

Questions:

The following sections should be included, but the order does not need to be followed. The discussion for each section is included in that section’s marks.

1. [2 pts] Data cleaning: While the data is made ready for analysis, several values are missing, some features are categorical, and some questions allow for multiple responses. For the data cleaning step, handle missing values however you see fit and justify your approach.
Suggestions include filling the missing values with a certain value (e.g., the mode for categorical data) or completely removing the features with missing values, with sufficient justification/support for this decision. Another method could be to fill them with a separate value indicating missingness, e.g., “unknown” or “none”. Secondly, convert categorical data into numerical data by encoding, and explain why you used this particular encoding method. Lastly, handle columns where multiple responses are recorded. In your PDF report, for the features you cleaned, provide some insight on why you think the values are missing and how your approach might impact the overall analysis. You can choose a subset of features to use in later questions if you think that is reasonable and can work on data cleaning for only those features; however, you should NOT discard features without sufficient justification!

Your submission must include the following:
● Data cleaning code that handles missing values and categorical features (in .ipynb file);
● Explanation of each of your data cleaning steps and justification of your approach (in PDF report).
   o You don't need to explain the data cleaning steps for each feature. You can group the features with similar cleaning steps and explain the cleaning and encoding you applied to each group, and why – that will be sufficient.

Hint: Take a close look at the dataset before attempting to answer this question. What is the meaning of each column? Be cautious when you interpret missing values within the dataset.

2. [3.5 pts] Exploratory data analysis and feature selection: In this question, you explore how feature selection and feature engineering are useful tools in machine learning in the context of the tasks in this assignment. Conduct exploratory data analysis to identify features that appear to be important.
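One way to sketch such an importance ranking is with mutual information; the feature names and the target construction below are invented stand-ins for illustration, not the real survey columns:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-ins: one feature that drives the target, one pure noise.
work_life = rng.integers(0, 5, n)                 # hypothetical encoded feature
noise = rng.normal(size=n)
y = (work_life >= 3).astype(int) + (work_life >= 4).astype(int)  # ordinal-ish target

X = pd.DataFrame({"WorkLifeBalance": work_life, "Noise": noise})

# Mutual information handles discrete and continuous features; mark which is which.
mi = mutual_info_classif(X, y, discrete_features=[True, False], random_state=0)
importance = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(importance)            # WorkLifeBalance ranks first, Noise near zero
# importance.plot.barh()     # the ranking plot to include in the report
```

On the real data, the same ranking plotted as a bar chart gives the "order of feature importance" visualization asked for here.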
Then, apply feature engineering and then select the features to be used for analysis, either manually or through some feature selection algorithm (e.g., regularized regression). For the exploratory data analysis, visualize the order of feature importance. Based on the feature importance plot, conclude which of the original attributes in the data are most related to a survey respondent’s job satisfaction. Not all features need to be used in the later analysis; features can be removed or added as desired with sufficient justification. If the resulting number of features is very high, dimensionality reduction can also be used (e.g., PCA is a dimensionality reduction technique, but it is not effective for categorical features – think about what other types of techniques can be used). Use at least one feature selection technique – describe the technique and provide justification for why you selected that set of features.

Your submission must include the following:
● Exploratory data analysis code (in .ipynb file) that includes a visualization of the order of feature importance, along with insights on the most important features for predicting a respondent’s job satisfaction (in PDF report). (1 pt)
● Feature engineering/generation code that creates new feature(s) from existing ones. You may incorporate domain knowledge or external data; applying feature generation techniques is also a valid approach. Features don't have to improve the model, but you should explain why you believed that they might help with prediction. (0.5 pts)
● Implementation of the feature selection technique of your choice (in .ipynb file). (1 pt)
● Explanation of the feature selection technique implemented above and justification of your approach (in PDF report). (1 pt)

3. [3.5 pts] Model implementation:

3.1. Implement the ordinal logistic regression algorithm. (The skeleton is provided in the notebook.)
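The notebook's skeleton is authoritative; purely as an illustration, one standard way to build an ordinal classifier from binary logistic regressions is the cumulative-threshold reduction (Frank & Hall, 2001), fitting one P(y > k) model per threshold. A hedged sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class OrdinalLogisticRegression:
    """Ordinal classifier built from K-1 binary LogisticRegressions,
    one per cumulative event P(y > k). A sketch, not the course skeleton."""

    def __init__(self, C=1.0):
        self.C = C  # inverse regularization strength, passed through

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        self.models_ = {}
        for k in self.classes_[:-1]:               # one model per threshold
            clf = LogisticRegression(C=self.C, max_iter=1000)
            clf.fit(X, (y > k).astype(int))
            self.models_[k] = clf
        return self

    def predict_proba(self, X):
        # Stack P(y > k) per threshold, then difference into class probabilities:
        # P(y = k) = P(y > k-1) - P(y > k), with P(y > min-1) = 1, P(y > max) = 0.
        gt = np.column_stack([self.models_[k].predict_proba(X)[:, 1]
                              for k in self.classes_[:-1]])
        cum = np.hstack([np.ones((len(X), 1)), gt, np.zeros((len(X), 1))])
        probs = np.clip(cum[:, :-1] - cum[:, 1:], 0, None)  # guard non-monotonicity
        return probs / probs.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# Tiny smoke test on perfectly ordered 1-D data.
X = np.arange(30, dtype=float).reshape(-1, 1)
y = np.repeat([0, 1, 2], 10)
model = OrdinalLogisticRegression(C=1.0).fit(X, y)
print((model.predict(X) == y).mean())
```

Exposing `C` in `__init__` is the kind of hyperparameter pass-through Question 3.3 asks you to add for your chosen bias-variance knob.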
(1 pt)

3.2. Perform 10-fold cross-validation on the training data. How does your model accuracy compare across the folds? Report the average and variance of accuracy across folds. You can use default hyperparameter values. (1 pt)

3.3. Identify one hyperparameter that has a direct impact on the bias-variance trade-off. By tweaking this hyperparameter (add it to the OrdinalLogisticRegression class), provide an analysis of the model performance based on the bias-variance trade-off. It may be helpful to include a table/plot to report the model performance. Conclude which value of the hyperparameter is best in terms of the bias-variance trade-off. You can leave the other hyperparameters at their defaults. (1 pt)

3.4. Is scaling/normalization of features needed for our task? Apply scaling/normalization if necessary, and justify why scaling/normalization is (not) needed. If you are applying scaling/normalization to the data, make sure you apply the technique to the testing and training data separately. (0.5 pts)

4. [3 pts] Model tuning:

4.1. Selecting a proper criterion for determining the “best-performing” model requires choosing appropriate performance metrics, such as precision, recall, and F1-score. A description of these metrics is provided at the end of this file. Explain why accuracy may not be a suitable metric for this problem and suggest a more appropriate alternative. (0.5 pts)

4.2. The ordinal logistic regression model has several hyperparameters that can be tuned to improve performance (see the logistic regression model implementation in scikit-learn). Select two hyperparameters for model tuning, explain what each hyperparameter does, and justify your choices. Improve the performance of the ordinal logistic regression model and select a final best-performing model using grid search based on a metric (or metrics) chosen in Question 4.1. (1.5 pts)
● You can choose any hyperparameters of this model, including the one you identified in 3.3.
Your goal is to improve the model performance through hyperparameter tuning.
● There is no requirement for a minimum model performance, as long as your model implementation and tuning are done correctly and well explained.

4.3. Create the feature importance graph of your model to see which features were the most determining in the model’s predictions. Compare this graph with the feature importance graph obtained in Section 2. (1 pt)

Hint: How do you extract feature importance from your ordinal logistic regression model? Think about a reasonable representation that highlights which features your model relied on for prediction.

5. [3 pts] Testing & discussion:

5.1. Use your best-performing model to make classifications on the test set. (Note that the test set should not be used in any form during the training process, even as a validation set.) Report the performance on the test set vs. the training set. (0.5 pts)

5.2. Assess the overall fit of the model and discuss whether it is overfitting or underfitting, along with your reasoning. How would you further improve its performance on the training and/or test set? (1 pt)

5.3. Plot the distribution of true target variable values and their predictions on both the training set and the test set. (0.5 pts)

5.4. What insight have you gained from the dataset and your trained classification model? Generate a figure that summarizes at least one key finding of the model/analysis. (1 pt)

Insufficient discussion will lead to the deduction of marks.

Recommended steps to get started:

1) Download assignment2_template.ipynb from Quercus.
2) Go to Google Colab (https://colab.google/), click on ‘Open Colab’, and upload the assignment2_template.ipynb file.
3) Start working on the #TODO items in the template (note that the #TODO items do not cover every step from Questions 1-5; read each question carefully and make sure you answer all the questions).
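The cross-validation of Question 3.2 and the grid search of Question 4.2 can be sketched together as follows, using scikit-learn's plain LogisticRegression as a stand-in for the ordinal model; the synthetic data, grid values, and metric choice are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the cleaned training split (the real data comes
# from processed_data_assignment2.csv after Questions 1-2).
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Question 3.2-style check: 10-fold CV accuracy, with its mean and variance.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(f"accuracy: mean={scores.mean():.3f}, variance={scores.var():.4f}")

# Question 4.2-style tuning: an illustrative grid over two hyperparameters,
# scored by macro-F1 rather than plain accuracy.
grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid,
                      scoring="f1_macro", cv=10)
search.fit(X, y)
print("best params:", search.best_params_)
print("best macro-F1:", round(search.best_score_, 3))
```

For your own OrdinalLogisticRegression class, the same pattern applies once the class exposes its hyperparameters, or you can loop over the grid manually with your own cross-validation.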
Submission:

1) Produce an IPython Notebook (.ipynb file) containing your implementation and the analyses you performed to answer the questions for the given data set. Make sure you have brief comments for every step of your analysis so that we know what analysis you did. Your Jupyter notebook should run on Google Colab! Add the following two lines at the beginning of your notebook so you can work with the provided data on Colab:

from google.colab import files
uploaded = files.upload()

When you check your code before submission, select the ‘Kernel’ tab and then ‘Restart & Run All’ to make sure that all the code runs without errors. If your code does not run properly on Google Colab, substantial marks will be deducted.

2) Produce a 5-page report explaining your response to each question for the given data set and detailing the analysis you performed. When writing the report, make sure that you explain each step: what you are doing, why it is important, and the pros and cons of that approach. You can have an appendix for additional figures and tables, and cite them in the report if you need more space. Figures and tables should be properly formatted. While there is no specific grading criterion for writing quality, unclear/inaccurate statements and figures that fail to convey the justification of your answers may lead to a partial deduction of points.

What to submit:

1. Submit via Quercus a Jupyter (IPython) notebook with the following naming convention:

lastname_studentnumber_assignment2.ipynb

Make sure that you comment on your code appropriately and describe each step in sufficient detail. Respect the above convention when naming your file, making sure that all letters are lowercase and underscores are used as shown. If a program cannot be evaluated because of errors or because it varies from specifications, you will receive zero marks.

2.
Submit a report in PDF (up to 5 pages + appendix) including the findings from your analysis. Use the following naming convention: lastname_studentnumber_assignment2.pdf

Tools:

● Software:
   ○ Python Version 3.X is required for this assignment. Make sure that your Jupyter notebook runs on the Google Colab (https://colab.research.google.com) portal. All libraries are allowed, but here is a list of the major libraries you might consider: NumPy, SciPy, scikit-learn, Matplotlib, Pandas.
   ○ No other tool or software besides Python and its component libraries can be used to touch the data files. For instance, using Microsoft Excel to clean the data is not allowed.
   ○ Upload the required data file to your notebook on Google Colab – for example:
      from google.colab import files
      uploaded = files.upload()
● Data file:
   ○ processed_data_assignment2.csv: file to be read in the notebook for this assignment.
      ■ The data file cannot be altered by any means. The notebook will be run using the local version of this data file. Do not save anything to file within the notebook and read it back.
● Auxiliary files:
   ○ survey_results_public.csv: original survey responses.
   ○ survey_results_schema.csv: summary of questions.
   ○ 2025 Developer Survey Tool .pdf: the survey itself.

Late submissions will receive a standard penalty:
● up to one hour late - no penalty
● one day late - 15% penalty
● two days late - 30% penalty
● three days late - 45% penalty
● more than three days late - 0 mark

Other requirements and tips:

1. A large portion of marks is allocated to analysis and justification. Full marks will not be awarded for the code alone.
2. Output must be shown and readable in the notebook. The only files that can be read into the notebook are the files posted in the assignment, without modification. All work must be done within the notebook.
3. Ensure the code runs in full before submitting.
Open the code in Google Colab and navigate to Runtime -> Restart runtime and Run all Cells. Ensure that there are no errors.
4. You may not want to re-run cross-validation (it can run for a very long time). When cross-validation is finished, output (print) the results (optimal model parameters). Hard-code the results in the model parameters and comment out the cross-validation code used to generate the optimal parameters.
5. You have a lot of freedom in how you want to approach each step and in whichever library or function you want to use. As open-ended as the problem seems, the emphasis of the assignment is for you to be able to explain the reasoning behind each step.
6. When evaluating the performance of your algorithm, keep in mind that there can be an inherent trade-off between the results on various performance measures.

Policy for using LLMs:

You may complete any part of this assignment (including both code and report text generation) using an LLM such as ChatGPT, Claude, or Gemini. Please specify how the LLM was used. For example, if ChatGPT was used for code generation, structuring, idea generation, or troubleshooting, that should be stated within the report. Additionally, if an LLM is used to generate large sections of code or the writing of your report, you can include examples of the prompts used within the appendix of the report, as well as comment on how you validated the LLM responses. There is no requirement that you implement exactly what the LLM produced, and if you modified a response, you can comment on why/how you changed it. If you use ChatGPT, you could also provide a public link to your chat.
Appendix A

Brief Introduction to the Most Common Performance Metrics:
(source: https://medium.com/@MohammedS/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b)

Accuracy: refers to the total number of correct predictions over the total number of predictions.
Precision: refers to the total number of true positive predictions over the total number of datapoints predicted as positive.
Recall or Sensitivity: refers to the total number of true positive predictions over the total number of all datapoints with actual positive labels.
Specificity: refers to the total number of true negative predictions over the total number of all datapoints with actual negative labels.
F1-score: refers to the harmonic mean of precision and recall.
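These definitions can be checked numerically on a toy binary example (confusion counts TP=2, FN=1, FP=1, TN=4), cross-validated against scikit-learn:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 3 actual positives, 5 actual negatives
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # TP=2, FN=1, FP=1, TN=4

tp, fn, fp, tn = 2, 1, 1, 4
# Each hand-computed ratio matches the corresponding sklearn helper.
print("accuracy   ", (tp + tn) / (tp + tn + fp + fn), accuracy_score(y_true, y_pred))
print("precision  ", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall     ", tp / (tp + fn), recall_score(y_true, y_pred))
print("specificity", tn / (tn + fp))   # no dedicated sklearn one-liner
print("f1         ", 2 * tp / (2 * tp + fp + fn), f1_score(y_true, y_pred))
```

Note that F1 = 2·precision·recall / (precision + recall) simplifies to 2TP / (2TP + FP + FN), which is the form used above.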