Date: 2024-03-15
COMP5310 Project Stage 1
Explore, clean, summarise and analyse the data
Due: 11:59PM on 28th of March 2024 (Week 6)
This assignment is worth 15% of the final mark of the unit of study.
GROUPS
This assignment is done in groups of 2 or 3. All students in a group must be attending the
same lab session.
Note: there is work required from each member separately, but the project is handed in as a
combined effort, and it is marked as a whole: there will be individual and group components
to the marks, all based on the single submitted document.
Group formation procedure
In the Week 2 lab session there will be an opportunity to meet other students and form a group
with help from the tutor. Students must be in project groups with others who are all
timetabled in the same lab session.
In the Week 2 lab:
• Exchange names and contact information (e.g., which social media platforms you
prefer for coordinating).
• Arrange when to get together: at least one meeting per week (in addition to your
scheduled lab session) is vital, but more frequent coordination is even better.
Dispute resolution
If during the course of the assignment work there is a dispute among group members that
you can’t resolve or that will impact your group’s capacity to complete the task well, you
need to inform the unit coordinator (maryam.khaniannajafabadi@sydney.edu.au) or the TA
(daniela.rivasromero@sydney.edu.au). Make sure that your email specifies the lab session
and group name, and is explicit about the difficulty; also make sure this email is copied to all
group members (including anyone you are complaining about) and your lab tutor.
We need to know about problems in time to help fix them, so set early deadlines for group
members, and deal with non-performance promptly (don’t wait till a few days before the
work is due to complain that someone is not delivering on their tasks). If necessary, the
coordinator will split a group and leave anyone who didn’t participate effectively in a group
by themselves (they will need to achieve all the outcomes on their own). This option is only
available up until Friday Week 5, which is the last day with time to resolve the issue before
the due date. For any group issues that arise after this time, you will need to try to resolve
the problem on your own, and you will continue to be treated as a single group whose members all
receive the same mark for this stage, based on whatever is submitted (though you should still let
the coordinator, TA and lab tutor know about the issues). In this case, groups may be changed
after stage 1 is finished.
PROJECT
Overview
The objective of stage 1 of the project is to acquire and meticulously clean the dataset,
followed by a comprehensive analysis to derive meaningful insights about the data, and
effectively prepare the data to build a predictive model in stage 2. Additionally, you will
define your research question, based on a research/business requirement, which you aim to
answer in stage 2.
Identify the topic
Each member needs to choose a different dataset and different topic. The dataset each
member chooses must be relevant to the topic and research question they define. We
realize that you may not find data that completely resolves the problem you have defined,
but all the data should at least be potentially able to provide some insights. For example, if
your topic is “what influences the level of wealth in a community?”, you might look at
datasets that relate to the economy, climate, education, type of government, etc. Please
make sure that your question or issue is not simply a factual matter, but instead looks at
relationships where insights might be impactful for some stakeholder groups (for example, it
is not a good choice of question to ask just “which country has the highest level of
wealth?”).
Obtain the dataset and metadata
Each member needs to obtain a different dataset that can contribute to the exploration of
their own topic. We prefer that you use publicly available data (so we can check your work if
we need to) but it is OK for you to work on privately-owned data as long as you have
permission to use it, and permission to reveal it to the markers.
Each dataset must have a sufficient volume of data. For this assignment, a dataset is
considered sufficient volume if it contains at least 1000 rows/objects, and each
row/object has at least 15 attributes/columns. We recommend you choose a dataset that
is not already cleaned: you need to demonstrate that you can clean a dataset in Project
Stage 1 or prove that it is already cleaned (which may be harder). Consider your research
question when choosing your dataset, and make sure it has a range of attribute types and
data size that will help you answer your research question.
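The size requirement above can be checked in a couple of lines once the data is loaded. The following is a minimal sketch, assuming the data is held in a Pandas data frame; the function name `meets_minimum_size` and the synthetic `demo` frame are illustrative and not part of the brief.

```python
import pandas as pd

MIN_ROWS = 1000  # minimum rows/objects stated in the brief
MIN_COLS = 15    # minimum attributes/columns stated in the brief


def meets_minimum_size(df: pd.DataFrame) -> bool:
    """Return True if the frame has at least MIN_ROWS rows and MIN_COLS columns."""
    n_rows, n_cols = df.shape
    return n_rows >= MIN_ROWS and n_cols >= MIN_COLS


# Synthetic stand-in for a real dataset: 1200 rows and 15 columns.
demo = pd.DataFrame({f"attr_{i}": range(1200) for i in range(15)})
print(demo.shape, meets_minimum_size(demo))
```

Passing this check only confirms the volume requirement; whether the attributes suit the research question still has to be judged separately.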
We will keep track of the datasets chosen by every student to make sure there are no
repetitions within the group and tutorial session. You can submit your chosen dataset by
filling in this form, where you will be asked for your personal details, tutorial Activity code (find
it here), Group number (ask your tutor if unsure), a short description of your dataset and a
link to your dataset. If by any chance two students within the same group or the same tutorial
session have chosen the same dataset, the student who entered their dataset first will
have priority and the other student will be contacted by their tutor to choose a different
dataset.
Note: You can choose to store your data using a Pandas data frame or by creating a
database on PostgreSQL, whichever option works best for you.
Note 2: It is your responsibility to make sure the dataset you have chosen meets the
minimum requirements; there will be no exceptions. Also, finding an appropriate dataset
to build a predictive model is part of the learning process, so we won’t be “approving” or
“rejecting” datasets: it is your job to check whether the dataset is appropriate for the task.
Ensure data quality
Each member needs to work with their dataset to ensure high-quality data that can be
analyzed; we expect you to do whatever transforming and cleaning is appropriate. The
details of this aspect vary a lot, depending on the data you obtained. For example, the work
needed may be removing instances that have corrupted or missing values or filling in those
missing values in some sensible way; you may be correcting obvious spelling mistakes or
bringing different date formats to a common standard; maybe you need to remove
duplicate rows, or deal with inconsistent information (e.g., two different values for the
population of the same country). In any case, you are required to get the data reasonably
clean. For some datasets you will need to clean the data yourself; in other cases, where your data
sources were already carefully curated, you should at least write a script that checks that the
data is clean (for example, by showing there is no missing data, or that every entry has an
appropriate value for that attribute). At the end of this part of the work, you should have a
high-quality dataset.
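The kinds of cleaning mentioned above (spelling fixes, date standardisation, missing values, duplicate rows) and the "prove it is clean" check might look like the sketch below. It assumes Pandas; the column names, the sample values, and the choices made (dropping rows with unparseable dates, filling missing values with a per-country median) are hypothetical and would need to be justified for a real dataset.

```python
import pandas as pd

# Hypothetical raw data containing the problems described above.
raw = pd.DataFrame({
    "country": ["Australia", "Australia", "Austrlia", "Chile", "Chile", "Peru"],
    "recorded": ["2024-03-01", "2024-03-01", "2024-03-02",
                 "2024-03-04", "2024-03-05", "not a date"],
    "population": [26.0, 26.0, 26.0, 19.6, None, 34.0],
})


def quality_report(df: pd.DataFrame) -> dict:
    """Count missing cells and fully duplicated rows."""
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Correct a known spelling mistake with an explicit mapping.
    out["country"] = out["country"].replace({"Austrlia": "Australia"})
    # Parse dates to a common type; unparseable entries become NaT.
    out["recorded"] = pd.to_datetime(out["recorded"], format="%Y-%m-%d", errors="coerce")
    # Drop rows whose date could not be parsed (one possible choice).
    out = out.dropna(subset=["recorded"])
    # Fill missing populations with the median for the same country.
    country_median = out.groupby("country")["population"].transform("median")
    out["population"] = out["population"].fillna(country_median)
    # Remove exact duplicate rows.
    return out.drop_duplicates().reset_index(drop=True)


cleaned = clean(raw)
print(quality_report(raw))
print(quality_report(cleaned))
```

Running `quality_report` on both the raw and the cleaned frame gives a before/after comparison that can be quoted in the report.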
Exploratory data analysis (EDA)
Each member needs to perform exploratory data analysis of their data in order to obtain
relevant information for the next stage of the project. This analysis must include at least
TWO supporting figures and a detailed discussion of the results obtained, indicating what
they tell you about your data and how they could impact the results of your modelling in
the next stage of the project. Do not include a matrix of figures covering multiple analyses of all
attributes; instead, select and highlight the most important results from your analysis.
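One way to decide which results to highlight, rather than plotting every attribute, is to rank attributes by how strongly they relate to the attribute of interest. The sketch below assumes Pandas and uses Pearson correlation as the ranking criterion; the dataset, the column names, and the helper `rank_by_correlation` are hypothetical, and correlation is only one of several reasonable criteria.

```python
import pandas as pd

# Hypothetical dataset: `income` is the attribute of interest.
df = pd.DataFrame({
    "education_years": [8, 10, 12, 12, 14, 16, 16, 18],
    "rainfall_mm": [700, 520, 810, 640, 560, 750, 600, 680],
    "income": [21, 25, 30, 31, 38, 45, 47, 55],
})


def rank_by_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric attributes by absolute Pearson correlation with `target`."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


ranking = rank_by_correlation(df, "income")
print(ranking)
print(df["income"].describe())
```

The top-ranked attributes are candidates for the two supporting figures, for example a scatter plot against the target via `df.plot.scatter(x=..., y=...)`, which requires matplotlib to be installed.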
DELIVERABLES
Report
The report should have a maximum of 3 pages for each individual section and maximum 2
pages for the group section. It should use high-level headings to indicate the different
sections and sub-sections of the report and use line spacing of at least 1.15 and body font
size of at least 10pt. The goal is to convey the problem clearly and concisely.
The report should be targeted at a tutor whose goal is to see what you did, so they can
allocate a mark. It should have a front page that gives the group name and lists the
members involved (giving their SIDs and unikeys, NOT their names), and then the body of
the report has a structure as follows (this corresponds to the marking scheme):
Individual Component
The report should begin with a section per group member (state the member’s unikey),
with:
1. Topic and research question: Describe the problem from a general perspective,
highlighting the business/research need, clearly state your research question, and
indicate some groups of stakeholders and how they could be helped by answering the
research question.
2. Data description: Provide a description of the data, including:
2.1. Provenance of the data: Indicate the provenance of your data, giving the whole
chain, from the original source of the data through any intermediate collections up
to the place where you got it from, and the date you obtained it.
2.2. Data license: Indicate any license or other restrictions on the use of the data.
2.3. Data structure and metadata: Indicate the number of attributes and instances, and
state the relevant metadata about this dataset, including a data dictionary which
indicates the attributes in your dataset, a description of each attribute, and the
data type of each attribute (int, float, string, date, etc.). Note: The data dictionary
can be included as an appendix and will not be counted towards the page limit.
3. Data quality and cleaning: Describe any data ingestion steps, indicating if you used a
Pandas data frame or a database in PostgreSQL, and briefly describe the data structure
or schema. Describe how you ensured data quality: if there were any quality problems,
describe what they were and how you cleaned the data; even if there were no problems,
you need to describe what problems you checked for, and how you did the checks.
Remember to justify your decisions, for example, if you decide to remove any rows with
missing data, explain why you decided to do this and how your decision might impact
data quality. Indicate which tools you used to acquire, ingest and clean the data, for
example, indicate which Python functions you used to clean your data.
Note: You don’t have to include the code in the report, as you will submit it separately,
but you can include snapshots of small sections of your code to support your explanation.
4. Exploratory data analysis (EDA): Describe in detail any exploratory data analysis you
performed that provided information relevant to answering your research question.
This analysis must include at least TWO supporting figures and a detailed discussion of
the results obtained, indicating what they tell you about your data and how these results
could impact the modelling results in the next stage of the project. Do not include a
matrix of figures covering multiple analyses of all attributes; instead, select and highlight
the most important results from your analysis.
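The data dictionary requested in item 2.3 can be started from the data frame itself and then completed by hand. This is a minimal sketch, assuming Pandas; the helper `build_data_dictionary`, the sample columns, and the descriptions are hypothetical, and the Pandas dtype names would still need mapping to the plain types listed in the brief (int, float, string, date, etc.).

```python
import pandas as pd


def build_data_dictionary(df: pd.DataFrame, descriptions: dict) -> pd.DataFrame:
    """One row per attribute: name, Pandas dtype, missing count, description."""
    rows = []
    for col in df.columns:
        rows.append({
            "attribute": col,
            "dtype": str(df[col].dtype),
            "missing": int(df[col].isna().sum()),
            "description": descriptions.get(col, "TODO: describe"),
        })
    return pd.DataFrame(rows)


# Hypothetical two-column example.
sample = pd.DataFrame({"age": [34, 51], "city": ["Sydney", None]})
data_dictionary = build_data_dictionary(sample, {"age": "Age in years"})
print(data_dictionary)
```

Attributes without a supplied description are flagged with a placeholder so that gaps are visible before the appendix is finalised.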
Group Component
The group section of the report should include:
1. Discussion: Discuss your thoughts on the strengths and limitations of each dataset, for
the purpose of investigating the question of interest. Discuss the exploratory data
analysis performed in the individual sections, highlighting the strengths and limitations
of each approach, and compare each approach to at least one alternative.
2. Conclusion: Include a recommendation, with reasons, on which dataset to use for the
next stage of the project and indicate the most important outcomes from the
exploratory data analysis performed.
Code and Dataset
You must also submit a copy of the dataset each member used for stage 1 of the project,
alongside the Python code each member used for exploring, cleaning, summarising and
analysing their data. This should be submitted as a single zip or tar.gz archive. The
archive should contain a subfolder for each member of the group, named with their
unikey. Each subfolder should then contain the raw dataset as it was
obtained from the source, the Python code used for exploring, cleaning, summarising and
analysing their dataset, and the clean dataset which will be used for the next stage of the
project.
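The folder layout described above can be packaged with Python's standard library. The sketch below assumes the zip option; the unikeys, file names, and the helper `build_submission` are placeholders for illustration, and a temporary directory stands in for the real project folder.

```python
import tempfile
import zipfile
from pathlib import Path


def build_submission(root: Path, archive: Path) -> list:
    """Zip every file under `root`, keeping the <unikey>/<file> layout."""
    names = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                arcname = path.relative_to(root).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "submission"
    # Hypothetical unikeys; each member gets a subfolder with three files.
    for unikey in ("abcd1234", "efgh5678"):
        folder = root / unikey
        folder.mkdir(parents=True)
        for filename in ("raw.csv", "analysis.py", "clean.csv"):
            (folder / filename).write_text("placeholder\n")
    # The archive is written outside `root` so it does not include itself.
    archive = Path(tmp) / "group_submission.zip"
    names = build_submission(root, archive)
    with zipfile.ZipFile(archive) as zf:
        archived = sorted(zf.namelist())

print(archived)
```

Listing the archive contents after building it is a quick way to confirm that each unikey folder holds the raw dataset, the code, and the clean dataset.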
MARKING
Marking Criteria                 Marks
Individual Component
  Topic and research question    1
  Data description               2
  Data complexity                1
  Data quality and cleaning      3
  Exploratory data analysis      3
Group Component
  Discussion                     3
  Conclusion                     2
TOTAL                            15
Deductions
• 10% of the overall individual mark will be deducted if your section of the report
exceeds the maximum number of pages. If the group section exceeds the maximum
number of pages, the deduction will apply to all group members.
• 5% of the maximum awardable mark will be deducted per day of late submission.
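As a worked example of the late deduction, the sketch below takes the "maximum awardable mark" to be the 15 marks in the marking table, so each late day costs 0.75 marks. That reading, the floor at zero, and the function names are assumptions made for illustration; the brief does not spell them out, and the page-limit deduction is not modelled here.

```python
# Marks per criterion, copied from the marking table.
MARKS = {
    "Topic and research question": 1,
    "Data description": 2,
    "Data complexity": 1,
    "Data quality and cleaning": 3,
    "Exploratory data analysis": 3,
    "Discussion": 3,
    "Conclusion": 2,
}
MAX_MARK = sum(MARKS.values())  # total in the marking table
LATE_PERCENT_PER_DAY = 5        # percent of MAX_MARK deducted per late day


def late_penalty(days_late: int) -> float:
    """Marks deducted for lateness (assumed to be based on MAX_MARK)."""
    return MAX_MARK * LATE_PERCENT_PER_DAY * max(days_late, 0) / 100


def mark_after_late_penalty(raw_mark: float, days_late: int) -> float:
    """Apply the late penalty; flooring at zero is an assumption."""
    return max(raw_mark - late_penalty(days_late), 0.0)


print(MAX_MARK, late_penalty(2), mark_after_late_penalty(12.0, 2))
```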