INFS4203/7203 Project (20 marks)
Semester 2, 2025
Due date: 13:00 on 20th October 2025 (Brisbane Time)

Important Assignment Submission Guidelines:
1. All assignments must be submitted exclusively through the UQ Blackboard. No other forms of submission will be accepted.
2. Failure to submit an assignment appropriately before the due date will result in a penalty, as outlined in the ECP.
3. It is your responsibility to ensure that your assignment is successfully submitted before the designated deadline.
4. Please note that email submissions will not be accepted under any circumstances.

This task has been designed to be challenging, authentic, and complex. While you may use generative AI and/or machine translation (MT) technologies, successful completion of this assessment will require critical engagement with the specific context and task, for which such tools will offer only limited support. Failure to appropriately reference the use of generative AI or MT tools may constitute student misconduct under the Student Code of Conduct. To pass this assessment, students must be able to demonstrate clear and independent understanding of their submission, beyond the use of AI or MT tools.

Overview
The assignment aims to assess your ability to apply data mining techniques to solve real-world problems. This is an individual task, and its completion should be based on your own design. You can choose either:
• Data-oriented project: Apply the data mining techniques learned in this course to train a classifier (or an ensemble of classifiers) on the provided training data, aiming for strong performance on the test data. Your project report should include predictions on the test data and evaluation results on the training data obtained by cross-validation.
• Competition-oriented project: Select and participate in an external data mining or machine learning competition. This option is limited to 10 students, allocated on a first-come, first-served basis via Expression of Interest (EOI), subject to the suitability of the competition to the course. You must submit a short result report of your Public Leader Board performance (screenshot + URL).
For both options, you are required to submit all source code and a README file that documents how to reproduce your results.

Track 1: Data-oriented project

1. Dataset Description
In this data-oriented project, the dataset is designed to closely simulate real-world scenarios, reflecting the inherent complexities found in naturally occurring data. In real applications, an important challenge is deciding how and which data mining techniques should be applied to the given data in order to make reliable predictions. This dataset provides an excellent opportunity to study and develop robust solutions applicable to real-world data analysis.
You will be provided with a dataset named train.csv. Except for the first row, which gives the feature names, each row in the file corresponds to one data point. The dataset contains 10,853 training instances, each described by 43 attributes and one label column.
• Among the 43 attributes:
  o 25 columns are numerical features, denoted Num_Col1 to Num_Col25.
  o 18 columns are nominal (categorical) features, denoted Nom_Col26 to Nom_Col43. In the file, nominal values have been replaced with string tokens of the form Cx_cN (where N is an integer index).
• The final column, Target (Col44), is the label indicating the class of each data point: 0 or 1.
• NaN denotes that a feature value is missing at that position.
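As a minimal illustration only (assuming Python with pandas and the column names listed above; nothing in this sketch is a required part of your solution), the training data could be loaded and the numerical and nominal columns separated as follows:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Column groups as named in the dataset description.
num_cols = [f"Num_Col{i}" for i in range(1, 26)]    # Num_Col1 .. Num_Col25
nom_cols = [f"Nom_Col{i}" for i in range(26, 44)]   # Nom_Col26 .. Nom_Col43

X = train[num_cols + nom_cols]
y = train["Target"]

# NaN marks missing values in both numerical and nominal columns.
print(X.isna().sum().sort_values(ascending=False).head())
print(y.value_counts())   # class balance between labels 0 and 1
```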
In addition, you will be provided with a test dataset named test_data.csv. The first row gives the feature names. Each subsequent row (rows 2–2,714) represents one test instance, for a total of 2,713 instances, each described by the same 43 attributes as in the training data (no label column). The labels for the test data will not be released and will be used by the teaching team for marking only.

2. Main Objective
Your task is to classify the test data given in test_data.csv. During marking, the F1 score on the test data will be used for grading. For the purpose of F1 calculation, label "1" is treated as the positive class and label "0" as the negative class. Your primary objective is to develop a classifier, trained on the provided training data, that achieves the highest possible F1 score on the test data.
Important restriction: You must use only the techniques covered in INFS4203/7203 (Weeks 2–8). This specifically excludes any advanced content introduced within those weeks or beyond. The use of techniques outside the specified scope creates unfairness to other students and will result in a zero mark.
You are required to submit:
• Result report: including the test predictions and evaluation results.
• Code and README file: sufficient for reproducing your results.
Important note on submission: Details on the submission format and requirements are provided below. You must carefully follow the specified guides and submit both files with the correct content, format, and file names. Failing to submit either file in the correct form or with the correct name will result in your submission not being accepted or marked.

3. Result Report Requirement
You are required to submit a result report that includes:
• Test result: Predictions on the test data (integer type).
• Evaluation result: Accuracy and F1 on the training data, evaluated by cross-validation (float type).
File naming and submission
• The result report must be named sXXXXXXX.infs4203 (where sXXXXXXX is your student username: an "s" followed by seven digits).
• The Submission Title in Turnitin must be identical to the file name.
• Example: if your student username is s1234567, the file should be named s1234567.infs4203 and submitted with the Submission Title s1234567.infs4203.
File content
• The file must contain 2,714 rows in total:
  o Rows 1–2,713: each row corresponds to one test instance, in the same order as in test_data.csv, and gives the predicted label (0 or 1, integer type).
  o Row 2,714: two values, the accuracy (first column) and F1 score (second column) on the training data, computed via cross-validation. Both must be reported as floats, rounded to the third decimal place.
• Values in each row must be separated by commas, and each row must end with a comma (an illustrative sketch of the expected layout appears at the end of this section).
Reference example
• A sample file result_report_example.infs4203 is provided for reference. Note: this file does not contain the ground truth.
Important notes
• Only your best prediction (based on cross-validation results) should be submitted. Multiple submissions of test predictions will not be marked.
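For illustration only, the sketch below (assuming Python with pandas and scikit-learn; the one-hot encoding, zero imputation, random forest, and the username s1234567 are placeholders, not recommendations) shows one way to obtain the cross-validated accuracy and F1 on the training data and to write a report file in the required layout:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

train = pd.read_csv("train.csv")
test = pd.read_csv("test_data.csv")

y_train = train["Target"]
X_train = pd.get_dummies(train.drop(columns=["Target"])).fillna(0)   # placeholder preprocessing only
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0).fillna(0)

clf = RandomForestClassifier(random_state=42)   # placeholder model, not a recommendation

# Cross-validated accuracy and F1 on the training data (label 1 is the positive class for F1).
scores = cross_validate(clf, X_train, y_train, cv=10, scoring=["accuracy", "f1"])
cv_acc = scores["test_accuracy"].mean()
cv_f1 = scores["test_f1"].mean()

# Fit on the full training data and predict the 2,713 test instances.
clf.fit(X_train, y_train)
preds = clf.predict(X_test).astype(int)

# Write 2,714 rows: one predicted label per test instance, then accuracy and F1
# rounded to three decimal places; every row ends with a comma.
with open("s1234567.infs4203", "w") as f:        # use your own student username
    for p in preds:
        f.write(f"{p},\n")
    f.write(f"{cv_acc:.3f},{cv_f1:.3f},\n")
```

Whatever approach you take, check before submitting that the generated file contains exactly 2,714 rows and that every row ends with a comma.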
4. Code and README File Requirements
You must submit all source code and a README file, compressed into a single ZIP file.
README file
The README should include:
• Final choices: The preprocessing methods, classification model(s), and hyperparameters used to produce your reported test results.
• Environment description: A clear specification of your coding environment (operating system, programming language and version, and additional installed packages).
• Reproduction instructions: Step-by-step instructions for reproducing your reported results, including preprocessing, model selection, hyperparameter tuning, testing, and evaluation.
• Additional justifications: Any extra explanations or references (including references to AI tools) for the methods you implemented.
• File format: The README may be submitted in .md or .txt format.
Training, evaluation, and testing code
• Must include all code related to preprocessing, training, prediction on the test data, and generation of the result report.
• The code must include a main function in a main file (e.g., main.py) that executes the overall process.
• Random seeds must be fixed to ensure reproducibility.
Preprocessing, model selection, and tuning procedure code
• Include detailed code for the procedures you used to select preprocessing techniques, models, and hyperparameters (a minimal illustration of one such pipeline appears at the end of this section).
• Additional explanations (if needed) may be placed in the README file.
• Random seeds must again be fixed to guarantee reproducibility.
Additional requirements
• Include the provided training and test files in your submitted ZIP file so that your results can be reproduced during marking.
• The generated result report file (the same one you submit separately) must appear in the root directory of the ZIP.
• Any programming language may be used.
  o If you use Python, you must submit .py files only; .ipynb notebooks will not be accepted. If you work in Jupyter Notebook or Google Colab, export your notebook as .py before submission.
  o If you use another programming language (e.g., R, Java, C++), submit your code files in the standard source format for that language (e.g., .R, .java, .cpp). Do not submit notebooks or binary/compiled files.
Submission
• Together with the result report, submit your README file and all code.
• Compress them into a single ZIP file named sXXXXXXX.zip (where sXXXXXXX is your student username).
• The Submission Title in Turnitin must match the file name.
Style recommendation
• For good practice, we recommend following the Google Style Guides. This is not mandatory, but adopting consistent style conventions will benefit your future career as a data scientist.
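As a minimal illustration of how preprocessing code might be organised (assuming Python with scikit-learn; the median/mode imputation, standardisation, one-hot encoding, and k-NN classifier are example choices only, not required ones), a single pipeline can be used so that exactly the same preprocessing is applied to the training and test data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = [f"Num_Col{i}" for i in range(1, 26)]
nom_cols = [f"Nom_Col{i}" for i in range(26, 44)]

# Example preprocessing: median imputation + standardisation for numerical features,
# most-frequent imputation + one-hot encoding for nominal features.
preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("nom", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), nom_cols),
])

# Bundling preprocessing and the classifier keeps training and test data treated
# identically and avoids leaking test information into the preprocessing step.
model = Pipeline([("prep", preprocessor),
                  ("clf", KNeighborsClassifier(n_neighbors=5))])

train = pd.read_csv("train.csv")
model.fit(train[num_cols + nom_cols], train["Target"])
```

Keeping preprocessing inside one pipeline also makes it straightforward to reuse the same steps during cross-validation and final testing.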
5. Submission Guide
The previous section specifies the required file contents, formats, and naming conventions. This section explains how and where to submit your files, as well as the late submission policy.
• Final version: Only your last submitted version will be marked.
• Deadline: All required files must be submitted before the due time. Otherwise, penalties will be applied according to the ECP:
  o A penalty of 10% of the maximum possible mark will be deducted per 24 hours after the due time, for up to 7 days.
  o Submissions received more than 7 days late will receive a mark of 0.
Submission links:
• Result report: Submit via the "Report submission" Turnitin link on Blackboard → Assessment → Project → Report submission. The Submission Title must be sXXXXXXX.infs4203.
• Code and README (compressed file): Submit via the "Readme and code submission" Turnitin link on Blackboard → Assessment → Project → Readme and code submission. The Submission Title must be sXXXXXXX.zip.
For the efficiency of marking, please carefully follow the required format and naming conventions for both files. Submissions that do not meet these requirements cannot be accepted or marked.

6. Marking Standard
Submissions satisfying the following five conditions will be accepted and marked:
1. The selected preprocessing, model, and hyperparameter choices can be reproduced from the submitted README file and code.
2. The classifiers used for classification can be reproduced from the submitted README file and code.
3. The classifiers are generated using only techniques delivered in the INFS4203/7203 lectures.
4. The test and evaluation results can be reproduced from the submitted README file and code.
5. The test and evaluation results are generated by applying the learned classifiers to the data.
When the above five conditions are satisfied, the result report will be marked according to the F1 score on the test data as follows (rounded to one decimal place):
• F1 less than or equal to 0.5: Mark = 0
• F1 greater than 0.5 but less than 0.6: Mark = (F1 − 0.5) ÷ 0.01
• F1 greater than or equal to 0.6 but less than 0.65: Mark = 10 + (F1 − 0.6) ÷ 0.005
• F1 greater than or equal to 0.65: Mark = 20
Examples:
F1      Mark
0.50    0
0.52    2
0.54    4
0.56    6
0.58    8
0.60    10
0.61    12
0.62    14
0.63    16
0.64    18
0.65    20
Training and prediction time will not be considered in marking.

7. Additional Notes
For this assignment, your mark will be based on the submitted results and the reproducibility of your code. Achieving strong results usually requires a comprehensive implementation strategy. You are expected to systematically explore and justify your choices in:
• Preprocessing (e.g., handling missing values, normalization, outliers), including selection and tuning.
• Model selection and rationale.
• Hyperparameter tuning and training strategies.
• Evaluation design.
This comprehensive approach not only increases your chance of identifying the best-performing classifier but also prepares you for the presentation (a separate task later in the semester), where you will be assessed on how you reasoned about and communicated your solution strategy.
The following dimensions should guide your thinking:
a. Pre-processing Techniques
• Key considerations: Outlier detection, normalization, imputation, handling categorical features.
• How to decide: Use cross-validation on the training data to compare results with and without specific techniques. Consider the distribution of your features and the impact of preprocessing on different classifiers.
• Integration into strategy: Explicitly outline how preprocessing supports your overall model design and why you selected specific techniques to optimize predictive performance.
b. Application of Classification Techniques
• Baseline methods: Apply the four core techniques taught in class: decision tree, random forest, k-nearest neighbour, and naïve Bayes.
• Hyperparameter tuning: Define reasonable ranges for hyperparameters (e.g., tree depth, number of neighbours). Use cross-validation to search these ranges systematically (a search of this kind is sketched after part c below). Explain how the ranges were chosen and why they are relevant to optimization.
• Ensembles: Consider combining classifiers to improve performance (e.g., majority voting). Reflect on why an ensemble may or may not be effective for your dataset.
c. Model Evaluation
• Evaluation metric: F1 is required. Justify why F1 is particularly important given your dataset.
• How to evaluate: Report both the mean and standard deviation from cross-validation. This shows not just performance but also the stability of each model.
• Comparative analysis: Use evaluation results to guide your final model choice. Discuss why you preferred one model (or ensemble) over the others.
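As an example of points b and c (a sketch only, assuming scikit-learn; the decision tree, the parameter ranges, the simplified numeric-only preprocessing, and the fold count are all illustrative), hyperparameters can be searched with cross-validation and the mean and standard deviation of F1 reported for each setting:

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("train.csv")
# For brevity this sketch uses only the numerical columns; in practice the full
# preprocessing pipeline (including nominal features) would be plugged in here.
num_cols = [f"Num_Col{i}" for i in range(1, 26)]
X = train[num_cols].fillna(train[num_cols].median())
y = train["Target"]

# Illustrative hyperparameter ranges for a decision tree.
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 10]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # fixed seed for reproducibility

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring="f1", cv=cv)
search.fit(X, y)

# Report mean and standard deviation of F1 across folds for every setting tried.
for mean, std, params in zip(search.cv_results_["mean_test_score"],
                             search.cv_results_["std_test_score"],
                             search.cv_results_["params"]):
    print(f"F1 = {mean:.3f} +/- {std:.3f} for {params}")
print("Best:", search.best_params_, round(search.best_score_, 3))
```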
Other Hints for Achieving a Comprehensive Solution
• Synergy of techniques: Some preprocessing choices may work better with specific classifiers. Use cross-validation to explore these combinations.
• Comparing results: Consider both performance (mean) and stability (standard deviation) when evaluating cross-validation results.
• Beyond single models: Explore ensembles, or even "ensembles of ensembles" (e.g., combining random forest and k-NN with majority voting); see the sketch after this list.
• Consistency: Ensure the preprocessing steps used in training are exactly replicated during testing.
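To illustrate the "beyond single models" hint (again only a sketch; the member classifiers, their settings, and the simplified numeric-only preprocessing are placeholders under the same scikit-learn assumptions as above), several course classifiers could be combined by majority voting and compared with their individual cross-validated F1 scores:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

train = pd.read_csv("train.csv")
num_cols = [f"Num_Col{i}" for i in range(1, 26)]
X = train[num_cols].fillna(train[num_cols].median())  # simplified preprocessing for brevity
y = train["Target"]

members = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=7)),
    ("nb", GaussianNB()),
]
ensemble = VotingClassifier(estimators=members, voting="hard")  # majority voting

# Compare each individual classifier against the voting ensemble on cross-validated F1.
for name, clf in members + [("voting", ensemble)]:
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```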
(End of Track 1. Track 2 specifications follow.)

Track 2: Competition-oriented project

1. Overview
In this project, you are required to participate in an online data mining competition that aligns with the learning objectives of this course.
• The competition must:
  o Offer monetary rewards (to ensure it is a genuine, competitive challenge).
  o Conclude no later than October 1, 2025.
  o Have a minimum of 10 competitors.
• Entry-level Kaggle competitions labelled "Getting Started," "Playground," or "Community" are NOT eligible.
• Availability for this track is limited to 10 students. Places are allocated on a first-come, first-served basis, subject to approval of your chosen competition by the teaching team.
• To join this track, you must first complete the Expression of Interest (EOI) Form.
Unlike the data-oriented track, in the competition-oriented track you may apply any data mining or machine learning techniques, including those beyond the scope of INFS4203/7203, provided they are appropriate to the chosen competition. Any programming language can be used.

2. Project Requirements
You must submit the following:
a. Result report
  o A brief report documenting your Public Leader Board results, including:
    - A screenshot of your rank on the Public Leader Board.
    - The URL of the Public Leader Board page.
  o Your Public Leader Board username must be your student username (sXXXXXXX).
b. Code and README file
  o The README file should include:
    - Environment description (OS, hardware, programming language and version, additional packages).
    - Reproduction instructions (how to run the code to generate your competition submission).
  o The code should include:
    - All tuning, training, and testing code used to generate your final competition submission.
    - A main function in a main file (e.g., main.py) to execute the overall process.
  o The README file and all code must be compressed into a single ZIP file.

3. File Format and Naming
• Result report: must be named sXXXXXXX.pdf or sXXXXXXX.docx (where sXXXXXXX is your student username).
• Code and README: must be compressed into a single ZIP file named sXXXXXXX.zip.
• Files submitted in any other form or with incorrect names will not be accepted or marked.

4. Submission Guide
• Only your last submitted version will be marked.
• All required files must be submitted before the deadline.
• Late submissions will incur penalties according to the ECP:
  o A penalty of 10% of the maximum mark per 24 hours after the due time, up to 7 days.
  o Submissions more than 7 days late will receive a mark of 0.
• Submission links on Blackboard:
  o Result report → "Report submission" Turnitin link (Assessment → Project → Report submission).
  o Code and README ZIP → "Readme and code submission" Turnitin link (Assessment → Project → Readme and code submission).
  o The ZIP file must be under 100MB. If your file is larger, contact infs4203@eecs.uq.edu.au before the due time.

5. Marking Standard
The following marking standard will apply unless otherwise agreed with the teaching team for exceptionally challenging competitions.
You must submit evidence of your achievements on the Public Leader Board by the project deadline to earn your marks. Your username on the Public Leader Board must be your student username (sXXXXXXX, where each X is a digit).
If your chosen competition ends before the project deadline, you may instead show by cross-validation, before the project deadline, that you have achieved performance comparable to a particular competitor on the Public Leader Board. Your project will then be assessed using that competitor's corresponding rank percentage on the Public Leader Board.
Your Public Leader Board top-ranking index is your rank divided by the total number of competitors at the project deadline. Marks are then awarded as:
Earned marks = max(20 − max(public_LB_top_ranking_index − 0.4, 0) × 30, 0)
That is, you earn the full 20 marks if your Public Leader Board ranking is within the top 40% of all competitors. For example, a rank of 50 out of 100 competitors gives an index of 0.5 and earns max(20 − 0.1 × 30, 0) = 17 marks.

End of Specification for Project