INFS4203/7203 Project (20 marks)
Semester 2, 2025
Due date: 13:00 on 20th October 2025 (Brisbane Time)

Important Assignment Submission Guidelines:
1. All assignments must be submitted exclusively through the UQ Blackboard. No other forms of submission will be accepted.
2. Failure to submit an assignment appropriately before the due date will result in a penalty, as outlined in the ECP.
3. It is your responsibility to ensure that your assignment is successfully submitted before the designated deadline.
4. Please note that email submissions will not be accepted under any circumstances.

This task has been designed to be challenging, authentic, and complex. While you may use generative AI and/or machine translation (MT) technologies, successful completion of this assessment will require critical engagement with the specific context and task, for which such tools will offer only limited support. Failure to appropriately reference the use of generative AI or MT tools may constitute student misconduct under the Student Code of Conduct. To pass this assessment, students must be able to demonstrate clear and independent understanding of their submission, beyond the use of AI or MT tools.

Overview
The assignment aims to assess your ability to apply data mining techniques to solve real-world problems. This is an individual task, and its completion should be based on your own design. You can choose either:
• Data-oriented project: Apply the data mining techniques learned in this course to train a classifier (or an ensemble of classifiers) on the provided training data, aiming for strong performance on the test data. Your project report should include predictions on the test data and evaluation results on the training data obtained by cross-validation.
• Competition-oriented project: Select and participate in an external data mining or machine learning competition. This option is limited to 10 students, allocated on a first-come, first-served basis via Expression of Interest (EOI), subject to the suitability of the competition to the course. You must submit a short result report of your Public Leader Board performance (screenshot + URL).
For both options, you are required to submit all source code and a README file that documents how to reproduce your results.

Track 1: Data-oriented project

1. Dataset Description
In this data-oriented project, the dataset is designed to closely simulate real-world scenarios, reflecting the inherent complexities found in naturally occurring data. In real applications, an important challenge is deciding how and which data mining techniques should be applied to the given data in order to make reliable predictions. This dataset provides an excellent opportunity to study and develop robust solutions applicable to real-world data analysis.
You will be provided with a dataset named train.csv. Except for the first row, which gives the feature names, each row in the file corresponds to one data point. The dataset contains 10,853 training instances, each described by 43 attributes and one label column.
• Among the 43 attributes:
  o 25 columns are numerical features, denoted Num_Col1 to Num_Col25.
  o 18 columns are nominal (categorical) features, denoted Nom_Col26 to Nom_Col43. In the file, nominal values have been replaced with string tokens of the form Cx_cN (where N is an integer index).
• The final column, Target (Col44), is the label indicating the class of each data point: 0 or 1.
• NaN denotes that a feature value is missing at that position.
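As a minimal illustration only (assuming Python with pandas and the column names listed above; nothing in this sketch is a required part of your solution), the training data could be loaded and the numerical and nominal columns separated as follows:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Column groups as named in the dataset description.
num_cols = [f"Num_Col{i}" for i in range(1, 26)]    # Num_Col1 .. Num_Col25
nom_cols = [f"Nom_Col{i}" for i in range(26, 44)]   # Nom_Col26 .. Nom_Col43

X = train[num_cols + nom_cols]
y = train["Target"]

# NaN marks missing values in both numerical and nominal columns.
print(X.isna().sum().sort_values(ascending=False).head())
print(y.value_counts())   # class balance between labels 0 and 1
```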
In addition, you will be provided with a test dataset named test_data.csv. The first row gives the feature names. Each subsequent row (rows 2–2,714) represents one test instance, for a total of 2,713 instances, each described by the same 43 attributes as in the training data (no label column). The labels for the test data will not be released and will be used by the teaching team for marking only.

2. Main Objective
Your task is to classify the test data given in test_data.csv. During marking, the F1 score on the test data will be used for grading. For the purpose of F1 calculation, label "1" is treated as the positive class and label "0" as the negative class. Your primary objective is to develop a classifier, trained on the provided training data, that achieves the highest possible F1 score on the test data.
Important restriction: You must use only the techniques covered in INFS4203/7203 (Weeks 2–8). This specifically excludes any advanced content introduced within those weeks or beyond. The use of techniques outside the specified scope creates unfairness to other students and will result in a zero mark.
You are required to submit:
• Result report: including the test predictions and evaluation results.
• Code and README file: sufficient for reproducing your results.
Important note on submission: Details on the submission format and requirements are provided below. You must carefully follow the specified guides and submit both files with the correct content, format, and file names. Failing to submit either file in the correct form or with the correct name will result in your submission not being accepted or marked.

3. Result Report Requirement
You are required to submit a result report that includes:
• Test result: Predictions on the test data (integer type).
• Evaluation result: Accuracy and F1 on the training data, evaluated by cross-validation (float type).
File naming and submission
• The result report must be named sXXXXXXX.infs4203 (where sXXXXXXX is your student username: an "s" followed by seven digits).
• The Submission Title in Turnitin must be identical to the file name.
• Example: if your student username is s1234567, the file should be named s1234567.infs4203 and submitted with the Submission Title s1234567.infs4203.
File content
• The file must contain 2,714 rows in total:
  o Rows 1–2,713: each row corresponds to one test instance, in the same order as in test_data.csv, and gives the predicted label (0 or 1, integer type).
  o Row 2,714: two values, the accuracy (first column) and F1 score (second column) on the training data, computed via cross-validation. Both must be reported as floats, rounded to the third decimal place.
• Values in each row must be separated by commas, and each row must end with a comma (an illustrative sketch of the expected layout appears at the end of this section).
Reference example
• A sample file result_report_example.infs4203 is provided for reference. Note: this file does not contain the ground truth.
Important notes
• Only your best prediction (based on cross-validation results) should be submitted. Multiple submissions of test predictions will not be marked.
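For illustration only, the sketch below (assuming Python with pandas and scikit-learn; the one-hot encoding, zero imputation, random forest, and the username s1234567 are placeholders, not recommendations) shows one way to obtain the cross-validated accuracy and F1 on the training data and to write a report file in the required layout:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

train = pd.read_csv("train.csv")
test = pd.read_csv("test_data.csv")

y_train = train["Target"]
X_train = pd.get_dummies(train.drop(columns=["Target"])).fillna(0)   # placeholder preprocessing only
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0).fillna(0)

clf = RandomForestClassifier(random_state=42)   # placeholder model, not a recommendation

# Cross-validated accuracy and F1 on the training data (label 1 is the positive class for F1).
scores = cross_validate(clf, X_train, y_train, cv=10, scoring=["accuracy", "f1"])
cv_acc = scores["test_accuracy"].mean()
cv_f1 = scores["test_f1"].mean()

# Fit on the full training data and predict the 2,713 test instances.
clf.fit(X_train, y_train)
preds = clf.predict(X_test).astype(int)

# Write 2,714 rows: one predicted label per test instance, then accuracy and F1
# rounded to three decimal places; every row ends with a comma.
with open("s1234567.infs4203", "w") as f:        # use your own student username
    for p in preds:
        f.write(f"{p},\n")
    f.write(f"{cv_acc:.3f},{cv_f1:.3f},\n")
```

Whatever approach you take, check before submitting that the generated file contains exactly 2,714 rows and that every row ends with a comma.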
4. Code and README File Requirements
You must submit all source code and a README file, compressed into a single ZIP file.
README file
The README should include:
• Final choices: The preprocessing methods, classification model(s), and hyperparameters used to produce your reported test results.
• Environment description: A clear specification of your coding environment (operating system, programming language and version, and additional installed packages).
• Reproduction instructions: Step-by-step instructions for reproducing your reported results, including preprocessing, model selection, hyperparameter tuning, testing, and evaluation.
• Additional justifications: Any extra explanations or references (including references to AI tools) for the methods you implemented.
• File format: The README may be submitted in .md or .txt format.
Training, evaluation, and testing code
• Must include all code related to preprocessing, training, prediction on the test data, and generation of the result report.
• The code must include a main function in a main file (e.g., main.py) that executes the overall process.
• Random seeds must be fixed to ensure reproducibility.
Preprocessing, model selection, and tuning procedure code
• Include detailed code for the procedures you used to select preprocessing techniques, models, and hyperparameters (a minimal illustration of one such pipeline appears at the end of this section).
• Additional explanations (if needed) may be placed in the README file.
• Random seeds must again be fixed to guarantee reproducibility.
Additional requirements
• Include the provided training and test files in your submitted ZIP file so that your results can be reproduced during marking.
• The generated result report file (the same one you submit separately) must appear in the root directory of the ZIP.
• Any programming language may be used.
  o If you use Python, you must submit .py files only; .ipynb notebooks will not be accepted. If you work in Jupyter Notebook or Google Colab, export your notebook as .py before submission.
  o If you use another programming language (e.g., R, Java, C++), submit your code files in the standard source format for that language (e.g., .R, .java, .cpp). Do not submit notebooks or binary/compiled files.
Submission
• Together with the result report, submit your README file and all code.
• Compress them into a single ZIP file named sXXXXXXX.zip (where sXXXXXXX is your student username).
• The Submission Title in Turnitin must match the file name.
Style recommendation
• For good practice, we recommend following the Google Style Guides. This is not mandatory, but adopting consistent style conventions will benefit your future career as a data scientist.
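As a minimal illustration of how preprocessing code might be organised (assuming Python with scikit-learn; the median/mode imputation, standardisation, one-hot encoding, and k-NN classifier are example choices only, not required ones), a single pipeline can be used so that exactly the same preprocessing is applied to the training and test data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = [f"Num_Col{i}" for i in range(1, 26)]
nom_cols = [f"Nom_Col{i}" for i in range(26, 44)]

# Example preprocessing: median imputation + standardisation for numerical features,
# most-frequent imputation + one-hot encoding for nominal features.
preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("nom", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), nom_cols),
])

# Bundling preprocessing and the classifier keeps training and test data treated
# identically and avoids leaking test information into the preprocessing step.
model = Pipeline([("prep", preprocessor),
                  ("clf", KNeighborsClassifier(n_neighbors=5))])

train = pd.read_csv("train.csv")
model.fit(train[num_cols + nom_cols], train["Target"])
```

Keeping preprocessing inside one pipeline also makes it straightforward to reuse the same steps during cross-validation and final testing.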
5. Submission Guide
The previous section specifies the required file contents, formats, and naming conventions. This section explains how and where to submit your files, as well as the late submission policy.
• Final version: Only your last submitted version will be marked.
• Deadline: All required files must be submitted before the due time. Otherwise, penalties will be applied according to the ECP:
  o A penalty of 10% of the maximum possible mark will be deducted per 24 hours after the due time, for up to 7 days.
  o Submissions received more than 7 days late will receive a mark of 0.
Submission links:
• Result report: Submit via the "Report submission" Turnitin link on Blackboard → Assessment → Project → Report submission. The Submission Title must be sXXXXXXX.infs4203.
• Code and README (compressed file): Submit via the "Readme and code submission" Turnitin link on Blackboard → Assessment → Project → Readme and code submission. The Submission Title must be sXXXXXXX.zip.
For the efficiency of marking, please carefully follow the required format and naming conventions for both files. Submissions that do not meet these requirements cannot be accepted or marked.

6. Marking Standard
Submissions satisfying the following five conditions will be accepted and marked:
1. The selected preprocessing, model, and hyperparameter choices can be reproduced from the submitted README file and code.
2. The classifiers used for classification can be reproduced from the submitted README file and code.
3. The classifiers are generated using only techniques delivered in the INFS4203/7203 lectures.
4. The test and evaluation results can be reproduced from the submitted README file and code.
5. The test and evaluation results are generated by applying the learned classifiers to the data.
When the above five conditions are satisfied, the result report will be marked according to the F1 score on the test data as follows (rounded to one decimal place):
• F1 less than or equal to 0.5: Mark = 0
• F1 greater than 0.5 but less than 0.6: Mark = (F1 − 0.5) ÷ 0.01
• F1 greater than or equal to 0.6 but less than 0.65: Mark = 10 + (F1 − 0.6) ÷ 0.005
• F1 greater than or equal to 0.65: Mark = 20
Examples:
F1      Mark
0.50    0
0.52    2
0.54    4
0.56    6
0.58    8
0.60    10
0.61    12
0.62    14
0.63    16
0.64    18
0.65    20
Training and prediction time will not be considered in marking.

7. Additional Notes
For this assignment, your mark will be based on the submitted results and the reproducibility of your code. Achieving strong results usually requires a comprehensive implementation strategy. You are expected to systematically explore and justify your choices in:
• Preprocessing (e.g., handling missing values, normalization, outliers), including selection and tuning.
• Model selection and rationale.
• Hyperparameter tuning and training strategies.
• Evaluation design.
This comprehensive approach not only increases your chance of identifying the best-performing classifier but also prepares you for the presentation (a separate task later in the semester), where you will be assessed on how you reasoned about and communicated your solution strategy.
The following dimensions should guide your thinking:
a. Pre-processing Techniques
• Key considerations: Outlier detection, normalization, imputation, handling categorical features.
• How to decide: Use cross-validation on the training data to compare results with and without specific techniques. Consider the distribution of your features and the impact of preprocessing on different classifiers.
• Integration into strategy: Explicitly outline how preprocessing supports your overall model design and why you selected specific techniques to optimize predictive performance.
b. Application of Classification Techniques
• Baseline methods: Apply the four core techniques taught in class: decision tree, random forest, k-nearest neighbour, and naïve Bayes.
• Hyperparameter tuning: Define reasonable ranges for hyperparameters (e.g., tree depth, number of neighbours). Use cross-validation to search these ranges systematically (a search of this kind is sketched after part c below). Explain how the ranges were chosen and why they are relevant to optimization.
• Ensembles: Consider combining classifiers to improve performance (e.g., majority voting). Reflect on why an ensemble may or may not be effective for your dataset.
c. Model Evaluation
• Evaluation metric: F1 is required. Justify why F1 is particularly important given your dataset.
• How to evaluate: Report both the mean and standard deviation from cross-validation. This shows not just performance but also the stability of each model.
• Comparative analysis: Use evaluation results to guide your final model choice. Discuss why you preferred one model (or ensemble) over the others.
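As an example of points b and c (a sketch only, assuming scikit-learn; the decision tree, the parameter ranges, the simplified numeric-only preprocessing, and the fold count are all illustrative), hyperparameters can be searched with cross-validation and the mean and standard deviation of F1 reported for each setting:

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("train.csv")
# For brevity this sketch uses only the numerical columns; in practice the full
# preprocessing pipeline (including nominal features) would be plugged in here.
num_cols = [f"Num_Col{i}" for i in range(1, 26)]
X = train[num_cols].fillna(train[num_cols].median())
y = train["Target"]

# Illustrative hyperparameter ranges for a decision tree.
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 10]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # fixed seed for reproducibility

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring="f1", cv=cv)
search.fit(X, y)

# Report mean and standard deviation of F1 across folds for every setting tried.
for mean, std, params in zip(search.cv_results_["mean_test_score"],
                             search.cv_results_["std_test_score"],
                             search.cv_results_["params"]):
    print(f"F1 = {mean:.3f} +/- {std:.3f} for {params}")
print("Best:", search.best_params_, round(search.best_score_, 3))
```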
Other Hints for Achieving a Comprehensive Solution
• Synergy of techniques: Some preprocessing choices may work better with specific classifiers. Use cross-validation to explore these combinations.
• Comparing results: Consider both performance (mean) and stability (standard deviation) when evaluating cross-validation results.
• Beyond single models: Explore ensembles, or even "ensembles of ensembles" (e.g., combining random forest and k-NN with majority voting); see the sketch after this list.
• Consistency: Ensure the preprocessing steps used in training are exactly replicated during testing.
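To illustrate the "beyond single models" hint (again only a sketch; the member classifiers, their settings, and the simplified numeric-only preprocessing are placeholders under the same scikit-learn assumptions as above), several course classifiers could be combined by majority voting and compared with their individual cross-validated F1 scores:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

train = pd.read_csv("train.csv")
num_cols = [f"Num_Col{i}" for i in range(1, 26)]
X = train[num_cols].fillna(train[num_cols].median())  # simplified preprocessing for brevity
y = train["Target"]

members = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=7)),
    ("nb", GaussianNB()),
]
ensemble = VotingClassifier(estimators=members, voting="hard")  # majority voting

# Compare each individual classifier against the voting ensemble on cross-validated F1.
for name, clf in members + [("voting", ensemble)]:
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```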
(End of Track 1. Track 2 specifications follow.)

Track 2: Competition-oriented project

1. Overview
In this project, you are required to participate in an online data mining competition that aligns with the learning objectives of this course.
• The competition must:
  o Offer monetary rewards (to ensure it is a genuine, competitive challenge).
  o Conclude no later than October 1, 2025.
  o Have a minimum of 10 competitors.
• Entry-level Kaggle competitions labelled "Getting Started," "Playground," or "Community" are NOT eligible.
• Availability for this track is limited to 10 students. Places are allocated on a first-come, first-served basis, subject to approval of your chosen competition by the teaching team.
• To join this track, you must first complete the Expression of Interest (EOI) Form.
Unlike the data-oriented track, in the competition-oriented track you may apply any data mining or machine learning techniques, including those beyond the scope of INFS4203/7203, provided they are appropriate to the chosen competition. Any programming language can be used.

2. Project Requirements
You must submit the following:
a. Result report
  o A brief report documenting your Public Leader Board results, including:
    - A screenshot of your rank on the Public Leader Board.
    - The URL of the Public Leader Board page.
  o Your Public Leader Board username must be your student username (sXXXXXXX).
b. Code and README file
  o The README file should include:
    - Environment description (OS, hardware, programming language and version, additional packages).
    - Reproduction instructions (how to run the code to generate your competition submission).
  o The code should include:
    - All tuning, training, and testing code used to generate your final competition submission.
    - A main function in a main file (e.g., main.py) to execute the overall process.
  o The README file and all code must be compressed into a single ZIP file.

3. File Format and Naming
• Result report: must be named sXXXXXXX.pdf or sXXXXXXX.docx (where sXXXXXXX is your student username).
• Code and README: must be compressed into a single ZIP file named sXXXXXXX.zip.
• Files submitted in any other form or with incorrect names will not be accepted or marked.

4. Submission Guide
• Only your last submitted version will be marked.
• All required files must be submitted before the deadline.
• Late submissions will incur penalties according to the ECP:
  o A penalty of 10% of the maximum mark per 24 hours after the due time, up to 7 days.
  o Submissions more than 7 days late will receive a mark of 0.
• Submission links on Blackboard:
  o Result report → "Report submission" Turnitin link (Assessment → Project → Report submission).
  o Code and README ZIP → "Readme and code submission" Turnitin link (Assessment → Project → Readme and code submission).
  o The ZIP file must be under 100MB. If your file is larger, contact infs4203@eecs.uq.edu.au before the due time.

5. Marking Standard
The following marking standard will apply unless otherwise agreed with the teaching team for exceptionally challenging competitions.
You must submit evidence of your achievements on the Public Leader Board by the project deadline to earn your marks. Your username on the Public Leader Board must be your student username (sXXXXXXX, where each X is a digit).
If your chosen competition ends before the project deadline, you may instead show by cross-validation, before the project deadline, that you have achieved performance comparable to a particular competitor on the Public Leader Board. Your project will then be assessed using that competitor's corresponding rank percentage on the Public Leader Board.
Your Public Leader Board top-ranking index is your rank divided by the total number of competitors at the project deadline. Marks are then awarded as:
Earned marks = max(20 − max(public_LB_top_ranking_index − 0.4, 0) × 30, 0)
That is, you earn the full 20 marks if your Public Leader Board ranking is within the top 40% of all competitors. For example, a rank of 50 out of 100 competitors gives an index of 0.5 and earns max(20 − 0.1 × 30, 0) = 17 marks.

End of Specification for Project