Date: 2021-08-25
FIT1043 Assignment 1 Specifications
Due date: Friday 27th August 2021 - 11:59 pm
Objective
The objective of this assignment is to investigate and visualise data using Python in the
Jupyter Notebook environment. This assignment will test your ability to:
● Read data from files in Python,
● Manipulate the data,
● Describe the data using basic statistics,
● Produce non-graphical and graphical visualisation to explore the data,
● Communicate your findings as insights, and
● Self-learn new techniques from other resources to complement what is taught in this
unit.
Data
The data is provided in three comma-separated values (CSV) files, which are sourced from Kaggle,
the World Bank, and the United Nations. The files are to be downloaded and kept in a folder
(directory) called “data” in the same location as your Jupyter notebook. The data are:
● “Country-Vaccinations.csv” contains information about the progress of Covid-19
vaccinations around the world (source: http://www.kaggle.com/gpreda/covid-world-vaccination-progress).
As part of the exercise, you can get the description of
the fields (columns) on the Kaggle site.
● “2020-GDP.csv” is the recorded Gross Domestic Product (GDP) of almost all
countries in the world for the year 2019. There are 4 columns in there but you will
only need the last 2 columns which are the country name and the GDP stated in US
Dollars (source: http://datacatalog.worldbank.org/dataset/gdp-ranking).
● “2020-Population.csv” contains information about country and region population
from 1950 to 2020 (source: http://population.un.org/wpp/Download/Standard/CSV/).
Most of the columns are self-explanatory but do participate in the Moodle forum to ask for
clarifications or discussion on the data.
Note: For this assignment, do not download the latest data from the sources. Use the provided
data on Moodle.
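As a minimal sketch of reading a file from the “data” folder and checking that it was read correctly, something like the following could be used. The helper names load_csv and preview are hypothetical (they are not part of the assignment), and the sketch assumes the pandas library is installed.

```python
from pathlib import Path

import pandas as pd

# Assumption: the "data" folder sits next to the notebook, as the brief requires.
DATA_DIR = Path("data")


def load_csv(filename, data_dir=DATA_DIR):
    """Read one CSV file from the data folder and return it as a DataFrame."""
    return pd.read_csv(Path(data_dir) / filename)


def preview(df, n=3, seed=0):
    """Return the head, the tail and a random sample of a DataFrame."""
    k = min(n, len(df))  # never ask for more sample rows than exist
    return df.head(n), df.tail(n), df.sample(k, random_state=seed)


# Typical use inside the notebook (not executed here):
# vaccinations = load_csv("Country-Vaccinations.csv")
# head, tail, sample = preview(vaccinations)
```

Keeping the folder name in one place (DATA_DIR) means the file location stays as specified and is easy to verify.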
Submission
This assignment has to be done using the Jupyter Notebook only. Your Jupyter Notebook
has to use the Markdown language for proper formatting of the report and answers, with
inline Python code and graphs.
You are to hand in two files:
1. The Jupyter Notebook file (.ipynb) that contains a working copy of your report
(using Markdown) and Python code that answers the questions.
2. A PDF file that is generated from your Jupyter Notebook. Execute your Python code
and then download it as a PDF document. To do so (in Windows), you can do a
“Print Preview”, then “Print” the document, and then select “Save as PDF”. Note that
there are other ways to do this, depending on the environment that you are in.
Alternatively, you can download as HTML and then “Print” that to PDF. Again,
participate in the Moodle forum if you need assistance on this.
Clarifications
This assignment is not meant to provide step-by-step instructions, and I expect there will be some
clarification questions. Please post these questions on the Moodle Forum; I
strongly encourage interaction between all of you in the forum. You can also attend
consultation times if you need more clarification. Some of the questions probably don’t have
a single or correct answer and are up to each individual’s interpretation. Just make
sure that you do not post answers in the forum.
Assignment
This assignment is worth 20 marks, which makes up 10% of this Unit’s assessment. This
assignment has to be done using the Python programming language in the Jupyter
Notebook environment. It should also be formatted properly using the Markdown
language. As an example, your Jupyter file should produce something like below (image taken from
http://stackoverflow.com/questions/36288670/how-to-programmatically-generate-markdown-output-in-jupyter-notebooks).
For each section, you are to write about
your approach, then your code and the output (can be non-graphical or graphical).
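Markdown can be typed directly into Markdown cells, but it can also be generated from Python, which is what the Stack Overflow question linked above discusses. The following is only a sketch: markdown_section and show_markdown are hypothetical helper names, and show_markdown assumes it is run inside Jupyter, where IPython is available.

```python
def markdown_section(title, approach, level=2):
    """Build a Markdown heading followed by a short description of the approach."""
    heading = "#" * level
    return f"{heading} {title}\n\n**Approach:** {approach}\n"


def show_markdown(text):
    """Render a Markdown string as formatted output in a notebook cell."""
    # Imported lazily so that markdown_section also works outside Jupyter.
    from IPython.display import Markdown, display

    display(Markdown(text))


# Typical use inside the notebook (not executed here):
# show_markdown(markdown_section("Read the files", "Use pandas to load each CSV."))
```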
Tasks
You should start your assignment by providing the title of the assignment and unit code, your
name and student ID, e.g.
The tasks will involve:
1. Importing the necessary libraries,
1.1. ensure you explain each step before executing the code and functions, so we know
what you want to do (like the “hidden” example above)
2. Read the files,
2.1. do not change the location of the intended files, i.e. they should be in a folder called
“data”
2.2. make sure you show that you have read the data correctly. Hint: show the head, tail
and randomly some parts of data.
3. Wrangle the data,
3.1. sub-setting the necessary data: For the Vaccination related DataFrame, you are to
keep the columns: country, people_fully_vaccinated, total vaccinations (not
from the column total_vaccinations but aggregated from the column
daily_vaccinations, as the two are inconsistent), and vaccines. For the GDP
and Population datasets, select only the population and GDP values for the countries
and save them as two separate datasets.
3.2. Ensure proper renaming of the columns (and indexing),
3.3. In this assignment you only need to work with a subset of data that includes data for
Indonesia, Malaysia, Singapore, Thailand, Philippines, and Australia. Create a
list or tuple or other data structure to store the names of the countries and
explain why you selected the data structure.
4. Select the information for the above-mentioned countries for all datasets in 3.1 and
merge the datasets correctly. Hint: you should merge the three small datasets created for
the six requested countries.
5. Manage any data type issues or data issues,
6. Feature engineer (create) the column “perCapitaGDP”, and
6.1. as a guide, your final DataFrame should have 6 rows (the countries) and 7 columns
(country, vaccines, people_fully_vaccinated, total_vaccinations,
population, GDP, perCapitaGDP). You may have extra fields, and that will be
ok.
7. Provide some statistical description of the final data that you have.
7.1. Interpret the data that you have obtained using basic statistics.
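The wrangling steps above (sub-setting, aggregating, merging, type casting and creating “perCapitaGDP”) could be sketched roughly as follows. This is one possible approach, not the expected solution. It assumes pandas, assumes the population and GDP frames have already been renamed to the columns country, population and GDP, assumes population is expressed in persons, and assumes people_fully_vaccinated is cumulative so that its maximum is the latest value. All numbers in the toy frames are made-up placeholders, not real data, and the function names are hypothetical.

```python
import pandas as pd

# A tuple is used in this sketch because the selection is fixed and not meant to change.
COUNTRIES = ("Indonesia", "Malaysia", "Singapore",
             "Thailand", "Philippines", "Australia")


def summarise_vaccinations(vacc, countries):
    """Keep the selected countries and aggregate to one row per country."""
    subset = vacc[vacc["country"].isin(countries)]
    return subset.groupby("country", as_index=False).agg(
        vaccines=("vaccines", "first"),
        people_fully_vaccinated=("people_fully_vaccinated", "max"),
        # Total is rebuilt from the daily figures, as the brief asks.
        total_vaccinations=("daily_vaccinations", "sum"),
    )


def build_final(vacc_summary, population, gdp):
    """Merge the three small datasets and add the perCapitaGDP column."""
    merged = (vacc_summary
              .merge(population, on="country", how="inner")
              .merge(gdp, on="country", how="inner"))
    merged["GDP"] = pd.to_numeric(merged["GDP"], errors="coerce")  # type casting
    merged["perCapitaGDP"] = merged["GDP"] / merged["population"]
    return merged


# Toy frames with placeholder values, only to show the shape of the result.
toy_vacc = pd.DataFrame({
    "country": ["Australia", "Australia", "Malaysia", "France"],
    "vaccines": ["X, Y", "X, Y", "Z", "X"],
    "people_fully_vaccinated": [None, 5, 3, 9],
    "daily_vaccinations": [10, 20, 7, 100],
})
toy_population = pd.DataFrame({"country": ["Australia", "Malaysia"],
                               "population": [100, 50]})
toy_gdp = pd.DataFrame({"country": ["Australia", "Malaysia"],
                        "GDP": ["1000", "200"]})  # stored as text on purpose

final = build_final(summarise_vaccinations(toy_vacc, COUNTRIES),
                    toy_population, toy_gdp)
print(final.describe())
```

An inner merge is used here so that a country missing from any one dataset is dropped rather than filled with missing values; whether that is appropriate depends on how the country names match across the three files.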
You are then to select the appropriate plots (graphs) and provide some basic insights to the
following questions (referred to as Question 1, 2 & 3 in the rubrics):
1. Some countries currently use only 1 vaccine, while others use more than 1 vaccine
type. If a country uses more than 1 vaccine type, assume that the vaccines are
equally distributed across the country’s population. With this in mind, how would you
visualise the estimated number of people vaccinated with each vaccine type in the
selected countries? Provide some form of insight (although it may be
straightforward and easily understood from the visualisation).
2. For each country, plot a bar graph with side-by-side bars for population, total
vaccinations, and people_fully_vaccinated. There are two challenges here:
firstly, the default graph will be difficult to read due to large differences in the
numbers, and secondly, this information may not give a good visualisation. These 2
challenges are for you to figure out; be creative in making the appropriate code
changes for a better visualisation.
3. For the final question, you will probably need the non-aggregated data from
“Country-Vaccinations.csv”. You are to extract the data that relates only to
Australia and then plot a line graph of daily_vaccinations over time. As
before, you are to discard the original total_vaccinations and create the total
vaccinations for each day using the cumulative sum of the daily vaccinations (up to
that day). Again, plot a line graph to visualise the cumulative vaccinations over time.
Explain in what circumstances the first line graph would be useful (if at all) and in
what circumstances the second (cumulative) line graph would be useful.
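Two of the computations described in the questions above, the equal split across vaccine types and the cumulative sum of daily vaccinations, could be sketched as below. This assumes pandas and assumes the vaccines column holds a comma-separated list of vaccine names; the function names and the toy values are hypothetical placeholders.

```python
import pandas as pd


def split_equally_by_vaccine(df):
    """Return one row per (country, vaccine), sharing the people equally."""
    out = df[["country", "vaccines", "people_fully_vaccinated"]].copy()
    out["vaccine"] = out["vaccines"].str.split(",")   # list of names per country
    out["n_vaccines"] = out["vaccine"].str.len()      # how many types are in use
    out = out.explode("vaccine")                      # one row per vaccine type
    out["vaccine"] = out["vaccine"].str.strip()
    out["estimated_people"] = out["people_fully_vaccinated"] / out["n_vaccines"]
    return out[["country", "vaccine", "estimated_people"]].reset_index(drop=True)


def add_cumulative(daily):
    """Sort by date and add a running total of daily_vaccinations."""
    out = daily.copy()
    out["date"] = pd.to_datetime(out["date"])
    out = out.sort_values("date")
    # Missing daily values are treated as zero in this sketch.
    out["cumulative_vaccinations"] = out["daily_vaccinations"].fillna(0).cumsum()
    return out.reset_index(drop=True)


# Toy frames with placeholder values.
toy_countries = pd.DataFrame({
    "country": ["A", "B"],
    "vaccines": ["X, Y", "Z"],
    "people_fully_vaccinated": [100, 30],
})
toy_daily = pd.DataFrame({
    "date": ["2021-03-02", "2021-03-01", "2021-03-03"],
    "daily_vaccinations": [10, None, 20],
})

shares = split_equally_by_vaccine(toy_countries)
running = add_cumulative(toy_daily)
```

The long format returned by split_equally_by_vaccine (one row per country and vaccine) is convenient for grouped or stacked bar charts; treating missing daily values as zero is a choice that should be stated in the report if it is used.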
All visualisations (graphs) are to be labelled and formatted appropriately. Some
clarifications will be needed as each of you may approach the tasks differently. As such, do
use the Discussion Forum for this purpose as it will encourage peer-to-peer learning and
also may benefit others who are taking a similar approach.
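As an illustration of labelling and formatting, the sketch below draws side-by-side bars with a title, axis labels and a legend, and uses a logarithmic y-axis as one possible way to cope with large differences in magnitude. It assumes pandas and matplotlib are installed; plot_grouped_bars is a hypothetical helper name and the toy values are placeholders, not real data.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_grouped_bars(df, columns, title):
    """Draw side-by-side bars per country with labels and a log-scaled y-axis."""
    fig, ax = plt.subplots(figsize=(8, 4))
    df.set_index("country")[columns].plot.bar(ax=ax)
    ax.set_yscale("log")  # one option when values differ by orders of magnitude
    ax.set_title(title)
    ax.set_xlabel("Country")
    ax.set_ylabel("Number of people (log scale)")
    ax.legend(title="Measure")
    fig.tight_layout()
    return ax


# Toy frame with placeholder values.
toy = pd.DataFrame({
    "country": ["A", "B"],
    "population": [1_000_000, 50_000],
    "total_vaccinations": [200_000, 30_000],
    "people_fully_vaccinated": [80_000, 10_000],
})

ax = plot_grouped_bars(
    toy,
    ["population", "total_vaccinations", "people_fully_vaccinated"],
    "Population and vaccinations by country (toy data)",
)
```

A log scale makes small and large countries visible on the same axes but hides absolute differences; expressing vaccinations as a share of population is another option worth comparing.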
There will be penalties for late submission (as per University policy), incorrect submission
format and/or unreadable submissions.
Marking Rubrics (Guideline ONLY)
Report: Appropriately formatted using Markdown (and HTML) and content
● 1 mark - Using at least 2 formatting codes (Markdown or HTML)
● 1 mark - Good and easy to read submission, including introduction and conclusion.
Code: Reading and describing the file content
● 1 mark - Importing libraries, reading files and showing that they are read correctly, and basic statistics of the values in the files.
Code: Wrangling, merging the files into one DataFrame
● 2 marks - Using a list/tuple/other data structure to store the selected countries, and explaining the choice.
● 5 marks - Aggregating, sub-setting, renaming, re-indexing, type manipulation (type casting), and merging (some evidence of the use of any of them)
● 3 marks - Feature engineered “per capita GDP”, neat and no duplicated fields in final DataFrame, describe data.
Question 1
● 1 mark - Appropriately explained choice of graphing.
● 1 mark - Code and graph (logical and executable with explanation)
Question 2
● 1 mark - Code and graph (logical and executable)
● 1 mark - Explain the issue with the basic bar graph
● 1 mark - Challenge to create a more appropriate bar graph and also explain why purely using this data may not be appropriate.
Question 3
● 1 mark - Code and graph for daily vaccinations (logical and executable)
● 1 mark - Code and graph for cumulative vaccinations (logical and executable with explanation)
Have Fun!
Upon completion of this assignment, you should have high-level experience of bits and
pieces of Drew Conway’s Venn Diagram. By completing this assignment, you will have
shown that you have “hacking skills” (your Python code), you should have touched upon
some basic statistics (although you have not yet used them effectively for understanding Machine
Learning) and, hopefully, you have managed to convince us that you have some domain
knowledge (e.g. GDP of countries and Covid-19 vaccination, which is useful to know if
you don’t already!).