COMP5310 - Python assignment help | 学霸联盟

Date: 2023-03-10

COMP5310 Project Stage 1
Explore, Clean, Define

Due: 11:59pm on March 19th, 2023 (week 4)
Value: 10% of the unit
Note: these instructions are long and somewhat complicated, but the
work you need to do is not actually very much. Don’t wait until near the
due date to start! If anything in the instructions is unclear or confusing,
please ask about it on Edstem.

GROUPS

RULES
This assignment is done in groups of 2 or 3. Under exceptional circumstances a group
of 4 members may be created by the unit coordinator. Similarly, a smaller group may be
created by the coordinator when dealing with group disputes as described below, or
when a group is reduced in size due to a member discontinuing this unit. All students in a
group must attend the same lab session and will report progress to their lab
demonstrator. Note: there is work required from each member separately, but the project
is handed in as a combined effort, and it is marked as a whole: there will be an individual and
a group component to the marks, all based on the single submitted document.

GROUP FORMATION PROCEDURE
In week 1 lab, you should form a group by joining the same Project Group on Canvas.
In on-campus labs: there will be an opportunity to meet other students and form a
group with help from the tutor.
In remote labs: the tutor will help assign you to a group, but you may ask to join a
group with specific people in the same remote lab.
Students must be in project groups with others who are all timetabled in the same lab class.
See announcements in Canvas for exceptions.
In week 2 lab:
- Exchange names and contact information (e.g., which social media platforms you
prefer for coordinating).
- Arrange when to get together (virtually): at least one meeting per week (in addition
to your scheduled lab session) is vital, but more frequent coordination is even better.
- If necessary, the lab demonstrator may rearrange group membership; this is
most often needed when someone who is left out of any other group gets added to
an existing group, but the demonstrator is also allowed to split a group if this is
necessary.

DISPUTE RESOLUTION
If during the course of the assignment work, there is a dispute among group members
that you can’t resolve, or that will impact your group’s capacity to complete the task
well, you need to inform the unit coordinator, nazanin.borhan@sydney.edu.au. Make
sure that your email names the group, and is explicit about the difficulty; also make sure
this email is copied to all the members of the group (including anyone you are
complaining about). We need to know about problems in time to help fix them, so set
early deadlines for group members, and deal with non-performance promptly (don’t wait
till a few days before the work is due, to complain that someone is not delivering on
their tasks). If necessary, the coordinator will split a group, and leave anyone who didn’t
participate effectively, in a group by themselves (they will need to achieve all the
outcomes on their own). This option is only available up until Friday Week 3, which is
the last day with time to resolve the issue before the due date. For any group issues that
arise after this time, you will need to try to resolve the problem on your own, and you
will continue to be treated as a single group whose members all get the same mark for this
Stage, based on whatever is submitted (though you should still let the coordinator know
about the issues). In this case, groups may be changed after Stage 1 is finished.

THE PROJECT WORK FOR THIS STAGE:

SUMMARY
• [Done separately by each member] Identify a topic you want to understand
through data.
• [Done separately by each member] Obtain a suitable dataset, and identify
relevant metadata associated with it.
• [Done separately by each member] Ensure that each dataset is of good quality
and is free from serious errors.
• [Done separately by each member] Produce a few summaries (aggregates) of some
attributes in each set.
• [Done separately by each member] Contribute to the final report, describing your
dataset, the topic you sought to address, and the procedure you used to clean the
data.
• [Done together] In the final report, describe the pros and cons of the topic and
dataset contributed by each member, and a recommendation of which topic and
dataset your group should use in Stage 2 onwards, in relation to the objectives of
Project Stage 2A and 2B.

IDENTIFY TOPIC [INDIVIDUAL]:
Each member needs to choose a different dataset and a different topic (in special
circumstances a group can work on the same dataset but on different topics, with approval
from the teaching staff). The dataset each member chooses must be relevant to that single
topic or question, which you are investigating because it matters to some stakeholders.
We realize that you may not find data that completely resolves the issue you are
targeting, but all the data should at least be potentially able to provide some insights. For
example, if your topic is “what influences the level of wealth in a community?”, you might
look at datasets that relate to the economy, climate, education, type of government etc.
Please make sure that your question or issue is not simply a factual matter, but instead
looks at relationships where insights might be impactful for some stakeholder groups
(for example, it is not a good choice of question to ask just “which country has the highest
level of wealth?”).

OBTAIN DATASETS AND METADATA [INDIVIDUAL]:
Each member needs to obtain a different dataset that can contribute to the exploration of their
own topic. We prefer that you use publicly available data (so we can check your work if we
need to) but it is OK for you to work on privately-owned data so long as you have
permission to use it, and permission to reveal it to the markers.

Deliverables:
• Keep (and provide to us) a copy of the data as you originally obtained it.
• State relevant metadata about this dataset, including:
o A data dictionary (which indicates which attributes there are, and what each
means).
o The data provenance (giving the whole chain, from the original source of the
data, through any intermediate collections, up to the place where you got it
from [and the date you obtained it]).
• To qualify for a Pass grade, each dataset must have a “sufficient volume of data”, so
that automation of processing becomes crucial. For this assignment, we define
“sufficient volume of data” as follows:
o For defining volume, we will consider the number of “values”. For the most
common case, rectangular data e.g., CSV, the contents of a field for an item
would be a value. So, if you have 100 rows of data, each with 5 attributes, that
would be 500 values. For JSON data, the keys don’t count, and the values count
based on their atomic (string, number etc) components: so, if one attribute’s
value somewhere is a list of 5 numbers, that counts as 5 values; if it is a
dictionary with 7 keys, each associated with a string, that counts as 7 values.
o A dataset is considered to have sufficient volume if it contains at least 1000
values and has at least 30 attributes. Datasets with fewer values or fewer
attributes might be considered with special permission from your tutor.
• We recommend you choose a dataset that is not already cleaned: you need to
demonstrate that you can clean a dataset in Project Stage 1 or prove that it is already
cleaned (which may be harder).
• Consider the objectives of Stage 2A and 2B when choosing your dataset, and make
sure it has a range of attribute types, and data size that will help you meet those
objectives.
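
As an illustration of the "sufficient volume" rule above, here is a minimal Python sketch of how the count could be automated. It is not an official checker. The function names are hypothetical, and the sketch assumes that the CSV header row is not counted as data, that empty fields still count as values, and that a JSON null counts as one atomic value. The instructions do not settle these points, so check with your tutor if they matter for your dataset.

```python
import csv
import io
import json


def count_csv_values(text):
    """Return (number of values, number of attributes) for rectangular CSV text.

    Assumption: the first row is a header and is not counted as data.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return sum(len(row) for row in body), len(header)


def count_json_values(node):
    """Count atomic values in parsed JSON; dictionary keys are not counted."""
    if isinstance(node, dict):
        return sum(count_json_values(child) for child in node.values())
    if isinstance(node, list):
        return sum(count_json_values(child) for child in node)
    return 1  # string, number, boolean or null


def has_sufficient_volume(n_values, n_attributes, min_values=1000, min_attributes=30):
    """Apply the thresholds stated in the instructions."""
    return n_values >= min_values and n_attributes >= min_attributes


# The example from the instructions: 100 rows of data, each with 5 attributes.
sample = "a,b,c,d,e\n" + "\n".join("1,2,3,4,5" for _ in range(100))
n_values, n_attributes = count_csv_values(sample)
print(n_values, n_attributes)                          # 500 5
print(has_sufficient_volume(n_values, n_attributes))   # False

# A list of 5 numbers counts as 5 values.
print(count_json_values(json.loads('{"readings": [1, 2, 3, 4, 5]}')))  # 5
```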

ENSURE DATA QUALITY [INDIVIDUAL]:
Each member then needs to work with their dataset to ensure high-quality data that can
be usefully analyzed; we expect you to do whatever transforming and cleaning is
appropriate. The details of this work will vary a lot, depending on the data you obtained.
For example, the work needed may be removing instances that have corrupted or
missing values or filling in those missing values in some sensible way; you may be
correcting obvious spelling mistakes or bringing different date formats to a common
standard; maybe you need to remove duplicate rows, or deal with inconsistent
information (e.g., two different values for population of the same country). In any case,
you are required to get the data to be fairly clean: for some data sets, you need to clean
the data; in other cases, where your data sources were carefully curated already, you
would at least write a program that checks that the data is clean (for example, by
showing there is no missing data, or that every entry has an appropriate value for that
attribute). At the end of this part of the work, you will have a dataset which should be
high-quality.
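
The kinds of cleaning described above could be automated along the following lines. This is only a sketch using the Python standard library: the column names, the sample rows, and the list of accepted date formats are all made up for illustration, and dropping problem rows (rather than filling in values) is just one of the options the instructions allow.

```python
import csv
import io
from datetime import datetime

# Assumed input formats; extend this tuple to match your own data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")


def normalise_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_rows(rows, date_field):
    """Drop rows with missing values, unparseable dates, or exact duplicates.

    Returns the cleaned rows plus a count of each problem found.
    """
    seen = set()
    cleaned = []
    report = {"missing": 0, "bad_date": 0, "duplicate": 0}
    for row in rows:
        if any(value is None or value.strip() == "" for value in row.values()):
            report["missing"] += 1
            continue
        iso_date = normalise_date(row[date_field])
        if iso_date is None:
            report["bad_date"] += 1
            continue
        fixed = {**row, date_field: iso_date}
        key = tuple(fixed.items())
        if key in seen:  # duplicate once dates share a common format
            report["duplicate"] += 1
            continue
        seen.add(key)
        cleaned.append(fixed)
    return cleaned, report


# Made-up sample data, for illustration only.
raw_text = """region,recorded,count
Alpha,2021-06-30,100
Alpha,30/06/2021,100
Beta,2021-06-30,
Gamma,June 2021,300
Delta,2021/06/30,400
"""
rows = list(csv.DictReader(io.StringIO(raw_text)))
cleaned, report = clean_rows(rows, "recorded")
print(report)  # {'missing': 1, 'bad_date': 1, 'duplicate': 1}

# A final check that the cleaned data really is clean.
assert all(value.strip() for row in cleaned for value in row.values())
```

Keeping the counts in `report` gives you something concrete to quote when you describe the quality problems you found and how you handled them.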

WRITE A REPORT [IN YOUR GROUP WITH INDIVIDUAL COMPONENT]:
Working together as a group, you need to produce a report. The structure of the report is
described below in detail, as the report is the main deliverable for grading in this project.
The report has sections for each member’s separate work, as well as a combined
discussion of the strengths and limitations of each dataset and a recommendation, with
reasons, on which to use for the next stage of the group project. The length of the report
should not exceed 1400 words for groups of 2, and 2000 words for groups of 3.

WHAT TO SUBMIT, AND HOW:

There are two deliverables in this Stage of the Project, and both should be submitted
by ONE PERSON on behalf of the whole group.

SUBMIT A STAGE 1 WRITTEN REPORT ON YOUR WORK, AS A PDF
This should be submitted via the Stage 1 Report link in Canvas. The report should be
targeted at a tutor or lecturer whose goal is to see what you did, so they can allocate a
mark. The report should have a front page that gives the group name and lists the
members involved (giving their SIDs and unikeys, not their names). The body
of the report is then structured as follows (this corresponds to the marking scheme):

Report length
There is a maximum length for the report of 1400 words for groups of 2 and 2000 words
for groups of 3.

Individual component
The report should begin with a section per group member (state the member’s unikey),
with:
1. A subsection that (i) describes the topic or question that you are interested in exploring,
(ii) identifies some groups of stakeholders and says how they will be helped by
understanding this topic or answering this question, and (iii) includes a short
discussion of the relevance to this issue of the data you have obtained.
2. A subsection that explains the dataset, including clear statements of the relevant
metadata:
a. The provenance of the data.
b. Any licence or other restrictions on use of the data.
c. A description of all the changes you made between the original dataset and the
final dataset.
d. A data dictionary indicating the meaning of each attribute, what format or units
are used, etc.
3. A subsection that describes how you ensured data quality in the dataset. If there were
any quality problems, describe what they were and how you cleaned the data; even
if there were no problems, you need to describe what problems you checked for, and
how you did the checks. If the cleaning was done with a spreadsheet, describe the steps
clearly; if the cleaning or checking was done by Python code, then include the code in
your report.
4. A subsection that describes and explains some simple analysis that you have done
using Python code (show the code and the output of the analysis).
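
For the simple analysis in item 4, a per-group aggregate is often enough. The sketch below shows one possible shape for it, using only the Python standard library; the function name, the field names, and the sample rows are hypothetical and stand in for your own cleaned dataset.

```python
from collections import defaultdict
from statistics import mean


def summarise(rows, group_field, value_field):
    """Group rows by one attribute and aggregate a numeric attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(float(row[value_field]))
    return {
        name: {
            "count": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for name, values in groups.items()
    }


# Made-up rows, for illustration only.
rows = [
    {"region": "north", "income": "10"},
    {"region": "north", "income": "20"},
    {"region": "south", "income": "30"},
]
for region, stats in summarise(rows, "region", "income").items():
    print(region, stats)
```

In the report, show both the code and its printed output, and say in a sentence or two what the aggregate tells you about your topic.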

Group component
This subsection should:
• Give your thoughts on the strengths or limitations of each dataset, for the purpose of
investigating the topic or question of interest.
• Make a recommendation, with reasons, on which dataset to use for Stage 2A onwards.

SUBMIT A COPY OF THE STAGE 1 PER-MEMBER DATASETS
This should be submitted through the Canvas system, as a single zip or tar.gz file. You
should have a single folder, with subfolders for each member. The subfolder for a
member should contain the raw dataset as it was obtained from the source, any
spreadsheet or Python code used for cleaning/checking, the Python code to calculate
some summaries, and the clean version of the dataset (if you have used Grok or some
other browser-based approach to running your code, make sure you download a copy of
the code to have a file you can include in your submission). You then compress the top
folder (with all these subfolders and their contents), then submit the single compressed
file.
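
The folder layout and compression step described above could be scripted as follows. This is a sketch, not a required tool: the folder name, the unikeys, and the file names are placeholders, and the demonstration runs inside a temporary directory so it does not touch your real files.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def make_member_folders(top_folder, unikeys):
    """Create the top folder with one subfolder per group member."""
    top = Path(top_folder)
    for unikey in unikeys:
        (top / unikey).mkdir(parents=True, exist_ok=True)
    return top


def compress_submission(top_folder, archive_base):
    """Zip the top folder (keeping the folder itself inside the archive).

    archive_base is the output path without the .zip extension; it should
    live outside top_folder so the archive does not include itself.
    """
    top = Path(top_folder)
    return shutil.make_archive(
        str(archive_base), "zip", root_dir=str(top.parent), base_dir=top.name
    )


# Demonstration with placeholder names in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    top = make_member_folders(Path(tmp) / "stage1_group", ["abcd1234", "efgh5678"])
    for member in ("abcd1234", "efgh5678"):
        for name in ("raw_data.csv", "clean_data.csv", "clean.py", "summary.py"):
            (top / member / name).write_text("placeholder\n")
    archive = compress_submission(top, Path(tmp) / "stage1_submission")
    with zipfile.ZipFile(archive) as zf:
        for entry in sorted(zf.namelist()):
            print(entry)
```

Listing the archive contents before uploading is a cheap way to confirm that every member's raw data, cleaned data, and code made it into the single file.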

MARKING

Here is the mark scheme for this assignment. The score (out of ten) is the sum of
separate scores for each of four components. Note that the final score is a combination of
individual and group components.

IDENTIFYING THE TOPIC [2 POINTS]
This component of assessment is based on the corresponding individual section of the
report.
Full marks: a clear and exciting statement of a topic, question, or problem, along with a
clear account of several distinct stakeholder groups who would be impacted in different
ways by improved understanding or solutions, and also a convincing explanation of how
the dataset can be useful in resolving the issue.
Distinction: a clear statement of a topic, question, or problem, along with a clear account
of at least one stakeholder group who would be impacted by improved understanding or
solutions, and also a clear explanation of how the dataset can be useful in resolving the
issue.
Pass: a clear statement of a topic, question, or problem, to which the dataset is potentially
relevant.
Flawed: A solid attempt to describe the topic.

DATASETS AND METADATA [2 POINTS]
This component is assessed based on the corresponding individual subsections of the
report; the uploaded data and code may be checked by the marker as supporting
evidence for claims made in the report.
Full marks: the group member's dataset has a detailed and thorough statement of
appropriate metadata that describes the data structure, the data meaning, and the data
provenance, including evidence that you are authorized to use the data as you have done.
The dataset must be of sufficient volume.
Distinction: the group member’s dataset has a statement of appropriate metadata
including the data structure, the data meaning, and the data provenance. The dataset must
be of sufficient volume.
Pass: the group member’s dataset has a statement of metadata that describes some
significant aspect of the data. The dataset must be of sufficient volume.
Flawed: the group member’s dataset does not have a statement of metadata, or has an
incomplete statement of metadata. The dataset is not appropriate for the exercise.

ENSURING DATA QUALITY [2 POINTS]
This component is assessed based on the corresponding individual subsections of the
report; the uploaded data and code may be checked by the marker as supporting
evidence for claims made in the report.
Full marks: In the group member’s dataset, several distinct aspects of data quality have
been checked in an automated way by Python code, and if any problems were found, they
have been all handled in an automated way by Python code (that is, the “clean” dataset
from this member does not suffer from any of these particular quality problems). The
dataset must be of sufficient volume.
Distinction: In the group member’s dataset, some aspect of data quality has been
checked in an automated way by Python code and if any problems were found, they have
been handled (that is, the “clean” dataset from this member does not suffer from this
particular quality problem). The dataset must be of sufficient volume.
Pass: In the group member’s dataset, some aspect of data quality has been checked and if
any problems were found, they have been handled (that is, the “clean” dataset from this
member does not suffer from this particular quality problem). The dataset must be
of sufficient volume.
Flawed: Some reasonable attempts to improve or check data quality, or the dataset is not
of sufficient volume.

DISCUSSION AND RECOMMENDATION [4 POINTS]
This component is assessed based on the group section of the report.
Full marks:
• An insightful discussion of the strengths and limitations of each dataset, linking these
to the corresponding topic or research question.
• A recommendation, with reasons, on which dataset to use in Project Stage 2A.

Distinction:
• A discussion of the strengths and limitations of each dataset.
• A recommendation, with reasons, on which dataset to use in Project Stage 2A.

Pass:
• A discussion of the strengths or limitations of some of the datasets.
• A recommendation on which dataset to use in Project Stage 2A.

Flawed:
• Some discussion of the datasets or a recommendation on which to use in Project Stage
2A.

LATE WORK
As announced in the unit outline, late work (without approved special consideration or
arrangements) suffers a penalty of 5% of the maximum marks, for each calendar day
after the due date. Note: No late work will be accepted more than 10 calendar days after
the due date.
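
To make the late penalty arithmetic concrete, here is a small worked sketch. The function name is hypothetical, and several details are assumptions rather than stated policy: lateness is passed in as a whole number of calendar days, the penalised mark is floored at zero, and work that is too late to be accepted is reported as None. Refer to the unit outline for the authoritative rule.

```python
def late_adjusted_mark(raw_mark, days_late, max_mark, daily_rate=0.05, cutoff_days=10):
    """Apply a penalty of daily_rate * max_mark per calendar day late.

    Returns None when the work is more than cutoff_days late, since such
    work is not accepted at all.
    """
    if days_late <= 0:
        return raw_mark
    if days_late > cutoff_days:
        return None
    penalty = daily_rate * max_mark * days_late
    return max(0.0, raw_mark - penalty)


# Example: work worth 8 out of a maximum of 10, handed in 2 days late,
# loses 2 * 5% of 10 = 1 mark.
print(late_adjusted_mark(8.0, 2, max_mark=10.0))
```

Note that the penalty is a fraction of the maximum marks, not of the mark earned, so it costs the same number of marks regardless of how well the work scored.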