Date: 2023-05-15
INFS4205/7205 Advanced Techniques for High Dimensional Data Semester 1, 2023
INFS4205/7205 Individual Project
Due: 16:00 AEST on 19 May 2023
Weighting: 20%
All assignments should be submitted to the UQ Blackboard. If any assignment fails to be submitted
appropriately before the due date, late penalties will be applied as detailed in the ECP. It is your
own responsibility to ensure your submission is successful on time. Email submission will not be
accepted.
Updates (v1)
Clarification on number of algorithms to implement.
• You need to implement at least three different algorithms in total.
• For each query task, you need to implement at least two different algorithms so that you
can compare different methods on the same query task.
• One algorithm (e.g., R-Tree) can be implemented for multiple tasks if applicable.
• Students who complete more than two methods for each query task will get
bonus marks for both Completeness and Innovation.
• Linear scan as a basic baseline method also counts.
Clarification on Correctness.
• You need to test different cases for each query task to prove the correctness of your methods.
• For example, to test the query task: “find all data points in a given rectangular area and within
a certain time window”, you need to test different rectangular areas and different time
windows. You need to make sure that, in all cases, your algorithms return the same results as
the DBMS.
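The check described above can be sketched as follows. This is a minimal illustration only, under stated assumptions: an in-memory SQLite table stands in for the spatial DBMS (a real spatial DBMS such as PostgreSQL with a spatial extension would use its own spatial predicates), the sample rows are made up, and the names POINTS, pts, build_db, dbms_range_query, linear_scan_range_query, and validate are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (id, lon, lat, timestamp). Not taken from any real dataset.
POINTS = [
    (1, 153.01, -27.50, 100),
    (2, 153.02, -27.48, 200),
    (3, 153.10, -27.40, 300),
    (4, 153.03, -27.47, 400),
    (5, 152.90, -27.60, 250),
]

def build_db(points):
    """Load the points into an in-memory SQLite table (assumed stand-in for the DBMS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pts (id INTEGER PRIMARY KEY, lon REAL, lat REAL, ts INTEGER)")
    conn.executemany("INSERT INTO pts VALUES (?, ?, ?, ?)", points)
    return conn

def dbms_range_query(conn, rect, window):
    """Ground truth: ids inside the rectangle and time window, computed by SQL."""
    min_lon, min_lat, max_lon, max_lat = rect
    t0, t1 = window
    rows = conn.execute(
        "SELECT id FROM pts "
        "WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ? AND ts BETWEEN ? AND ?",
        (min_lon, max_lon, min_lat, max_lat, t0, t1),
    )
    return {row[0] for row in rows}

def linear_scan_range_query(points, rect, window):
    """Method under test: a linear scan over all points."""
    min_lon, min_lat, max_lon, max_lat = rect
    t0, t1 = window
    return {
        pid
        for pid, lon, lat, ts in points
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat and t0 <= ts <= t1
    }

# Several rectangles and time windows, including a boundary case and an empty result.
TEST_CASES = [
    ((153.00, -27.55, 153.05, -27.45), (0, 500)),
    ((153.00, -27.55, 153.05, -27.45), (150, 450)),
    ((152.00, -28.00, 154.00, -27.00), (250, 300)),
    ((0.0, 0.0, 1.0, 1.0), (0, 500)),
]

def validate(points, cases):
    """Return the list of cases where the method and the DBMS disagree."""
    conn = build_db(points)
    mismatches = []
    for rect, window in cases:
        expected = dbms_range_query(conn, rect, window)
        actual = linear_scan_range_query(points, rect, window)
        if expected != actual:
            mismatches.append((rect, window, expected, actual))
    conn.close()
    return mismatches
```

An empty list from validate(POINTS, TEST_CASES) means the method agreed with the SQL results on every tested rectangle and time window.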
Overview
The project consists of two sections: (1) Implementation and (2) Report. In this assignment, you are
asked to implement a set of query scenarios utilising spatial / spatial-temporal data as well as
computational geometry algorithms wherever suitable. You are required to find spatial datasets
that are suitable for this project and implement at least three appropriate algorithms (e.g., k-d tree,
R tree indexing). You need to set up a spatial DBMS (e.g., PostgreSQL, Oracle, MySQL) to validate
the correctness of your implementation. Finally, you need to present your problem statement,
methodology, outcomes, and analysis in the project report. You will need to present your findings
in a clear and concise manner, with a focus on the insights gained from the project.
This assignment is designed to assess your ability to apply advanced techniques for high
dimensional data manipulation to solve real-world problems. This is an individual assignment. The
completion of the assignment should be based on your own design.
Language requirements: You are allowed to use any programming language (e.g., Python or Java)
for implementing the project. You are also allowed to use existing libraries (citation required).
Datasets Selection
Any open-source dataset is allowed as long as it fits the topic of spatial / spatial-temporal data
manipulation. We provide some example datasets for reference, including but not limited to:
Example Datasets        Size        Attributes                                Difficulty   Marks Capped
Chipotle Locations      2,629       Coordinates                               Easy         15
Satellite Data          419,438     Coordinates                               Easy         15
Traffic Accident        2,845,342   Coordinates, Timestamps                   Moderate     17
FourSquare              38,333      Coordinates, Timestamps                   Moderate     17
Taxi Trajectory Data    1,703,650   Coordinate Sequences, Timestamps          Hard         20
Gowalla                 6,442,890   Coordinates, Timestamps, Relationships    Hard         20
Note that, for datasets you find that are not listed above, we evaluate the difficulty by considering
both dataset size and attributes. For ‘moderate’ datasets, the size is greater than 10,000 and
the attributes contain at least coordinates and timestamps. For ‘hard’ datasets, the size is
greater than 100,000 and the attributes should be more complicated and informative.
Marks Capped (as shown in the last column): If you choose to work with an easy dataset, the
maximum marks you can obtain for this project is 15. This means that any marks beyond 15
will not be counted towards your final grade.
Implementation [10 marks]
1. Once you have chosen your datasets, you need to conceptualize at least five query tasks
from the real world. Some example query tasks are listed below:
a. find all data points in a given rectangular area and within a certain time window.
b. find all data points within a certain distance of a trajectory emerging on the same day.
c. find k nearest neighbours (data points) of a given trajectory for a given date.
d. find the skyline data points.
e. find the shortest and fastest trajectory from one given data point to another.
f. find the trajectory that is most similar to a given trajectory.
(Note: the distance should be great-circle distance, which can be computed e.g., by geopandas.)
2. You should implement at least three algorithms (e.g., k-d tree, R-tree) taught in this course to
solve the query tasks you defined. You are encouraged to improve the taught methods with
your own ideas and/or try novel methods proposed in recent research literature.
3. You need to design and build a spatial(-temporal) database for the selected datasets, then
write SQL code for each query task you proposed and verify the correctness of your algorithm
by comparing the ground truth results returned by spatial DBMS and the results returned by
your implemented algorithms.
4. You need to use fair and reasonable metrics to evaluate the various methods you implement.
For each query task, you need to compare e.g., the time cost, memory cost, and I/O cost of the
system, when a) building the index and b) executing the query.
5. You must upload your source code for both a) algorithm implementation and b) database
construction and query, otherwise, no marks will be given for this section.
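As an illustration of the great-circle distance noted in item 1 and the linear-scan baseline, the following is a minimal sketch using the haversine formula with only the Python standard library. It is one possible approach under stated assumptions, not a required implementation: the function names haversine_km and knn_linear_scan are hypothetical, the Earth is assumed to be a sphere with a mean radius of 6371.0088 km, and points are assumed to be (id, lon, lat) tuples in degrees.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius (spherical model)

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def knn_linear_scan(points, query, k):
    """Baseline k-nearest-neighbour search: compute every distance, keep the k smallest.

    points: iterable of (id, lon, lat); query: (lon, lat).
    Returns a list of (distance_km, id) pairs, nearest first.
    """
    qlon, qlat = query
    distances = ((haversine_km(qlon, qlat, lon, lat), pid) for pid, lon, lat in points)
    return heapq.nsmallest(k, distances)
```

A linear scan like this costs one distance computation per point for every query, which makes it a natural reference point when comparing against an index such as a k-d tree or R-tree.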
The marking criteria are summarized as follows:
Completeness [4 marks]: The selected high-dimensional database was adequately processed and
cleaned. At least three algorithms taught in this course should be implemented, or methods from
recent scientific research can be reproduced. At least five query tasks from real-world scenarios
need to be given to test your implementation. The testing scenarios should cover different types
of spatial query tasks and make full use of the special attributes (e.g., sequence, relationships) of
datasets, reflecting the completeness of the methods. The implemented methods should be
evaluated and compared in a comprehensive and fair manner.
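One way to collect comparable time and memory figures for index building and query execution is sketched below. It is a rough illustration using Python's standard time and tracemalloc modules, not a prescribed measurement protocol: the helper name measure and its return keys are hypothetical, tracemalloc only counts memory allocated through Python, and I/O cost is not covered here.

```python
import statistics
import time
import tracemalloc

def measure(fn, *args, repeats=5):
    """Time fn(*args) over several runs, then record peak Python memory in a separate run.

    Memory is traced in its own run because tracemalloc slows execution
    and would otherwise distort the timings.
    """
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        timings.append(time.perf_counter() - start)

    tracemalloc.start()
    fn(*args)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "result": result,
        "best_s": min(timings),
        "median_s": statistics.median(timings),
        "peak_bytes": peak_bytes,
    }
```

For a fair comparison, the same helper would be applied to each method with the same data and the same queries, once for building the index and once for executing the query.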
Correctness [4 marks]: Your implementation correctly addresses the query tasks and is validated
using a spatial DBMS. You need to show the SQL code used for generating the ground truth query
results to validate the correctness of the implemented algorithms. The implemented code or
program runs without errors and bugs and all the functionalities and features work as intended.
The code is well-structured, easy to understand and maintain, and follows good programming
practices and standards.
Effectiveness and Innovation [3 marks]: The project should present a unique, innovative and
improved approach to solving a problem or meeting a need, rather than purely utilizing existing
libraries or online code.
Report [10 marks]
This report should cogently (1) introduce the task or problem being proposed and elucidate its
practical application value in industry or its potential contribution to scientific research (e.g., why
R tree falls short in facilitating fast queries and how it can be enhanced). (2) Then, you need to
explicate the approach employed in a precise and explicit manner, encompassing the overall
algorithm, the technical intricacies of each step or module, as well as any improvements or
innovations you made. (3) You need to show the correctness (precision, recall, F1-score, etc.) of
the results returned by the implemented method for different query tasks, compared to the
ground truth query results. (4) The report must also address the reasonable verification of
method performance (e.g., time cost, memory cost, and I/O cost) or equitable comparison with
alternative methods. (5) Lastly, experimental results must be comprehensively presented by
tables, plots, and/or with some visualization tools. The results should be analysed deeply to
unearth insightful findings.
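The correctness measures named in point (3) can be computed from the set of ids a method returns and the set of ground truth ids from the DBMS. The following is a minimal sketch; the function name precision_recall_f1 is hypothetical, and the handling of empty result sets (both empty counts as a perfect match) is an assumed convention.

```python
def precision_recall_f1(returned_ids, ground_truth_ids):
    """Compare a method's result ids against the ground truth ids.

    Returns (precision, recall, f1). Assumed convention: if both sets are
    empty, all three are 1.0; if only one is empty, all three are 0.0.
    """
    returned, truth = set(returned_ids), set(ground_truth_ids)
    if not returned and not truth:
        return 1.0, 1.0, 1.0
    true_positives = len(returned & truth)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

An exact method should score 1.0 on all three measures for every query; values below 1.0 point to missing or spurious results.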
The report must not exceed four pages in length and should be written using the given IEEE Word or LaTeX
template. The marking criteria are summarized as follows:
Definition and Scope [2 marks]: The definition of a substantial and significant topic, problem
and/or hypothesis (including statement of purpose and relevance) and scope (including context,
boundaries, and assumptions) should be clearly presented.
Methodology and Algorithm [3 marks]: The methodology should be described in a systematic
and logical way. You can enrich your descriptions by drawing detailed flowcharts and/or using
rigorous mathematical formulas.
Results Analysis [3 marks]: The project results are complete and comprehensively presented and
analysed, using tables, plots, and/or some visualization tools. If the source code is not uploaded,
no marks will be given for this criterion.
Writing and Presentation [2 marks]: The report should have an excellent logical structure,
physical layout, and scientific and technical style, with no spelling or grammar errors. You
need to reference sources appropriately in a correctly formatted bibliography.
Submission
You are required to submit all of the following files.
− A compressed file (zip) consisting of all source code:
o algorithm implementation in any language,
o a SQL file including the database construction, manipulation, and task queries.
− A 4-page project report in PDF format.
Only your submitted version will be marked. A penalty will be applied to late submissions
according to the ECP.
Use of AI Tool
Artificial Intelligence (AI) provides emerging tools that may support students in completing this
assessment task. Students may appropriately use AI in completing this assessment task. Students
must clearly reference any use of AI in each instance. A failure to reference AI use may
constitute student misconduct under the Student Code of Conduct. This task has been designed to
be challenging, authentic and complex. Whilst students may use AI technologies, successful
completion of assessment in this course will require students to critically engage in specific
contexts and tasks for which artificial intelligence will provide only limited support and guidance.
To pass this assessment, students are required to demonstrate detailed comprehension
of their written submission independent of AI tools.
When you use generative AI (e.g., ChatGPT) in this assessment, you should:
− Not provide any private information when using these tools.
− Verify any information provided by generative AI tools with credible sources and check
for missing information.
− Acknowledge any generative tools that you use for your assignments or work and how
you used them. For example, include the name, model or version, date used and how you
used it in your assignment or work.
Useful Tools
− Visualizing spatial(-temporal) data on Google Maps: [link]
− Import CSV file into PostgreSQL table: [link]