COMP42315 Programming for Data Science 1
COMP42315 Assignment – Web Scraping, Data Analysis, and Visualization
Module/Lecture Course: Programming for Data Science
Deadline for submission: 17th Feb 2022 16:30
Deadline for marks and feedback to be returned to students: 30th March 2022 16:30
Submission instructions: Submit all files via Blackboard
Submission file type(s) required: Word/PDF document for the individual reports, Jupyter
notebook
Format: Report as a Word or PDF document. Accompanying
software implementation for the individual report as a
Jupyter notebook (i.e. the .ipynb file). Do not put your
identity on your report.
Contribution: The assignment contributes 100% to the final mark for the
module.
In accordance with University procedures, submissions that are up to 5 working days late will be subject
to a cap of the module pass mark, and later submissions will receive a mark of zero.
Content and skills covered by the assignment
• Understand advanced concepts of programming in Python.
• Have a critical appreciation of the main strengths and weaknesses of a range of Python
packages and understand how to use them.
• Have a critical appreciation of how to acquire and clean datasets for analysis.
• Understand how to manipulate potentially large datasets efficiently.
• Be able to write computer programs in Python using industry-standard packages.
• Be able to select appropriate data structures for modelling various data science scenarios.
• Be able to select the appropriate algorithm and programming package for a given problem.
• Be able to write a computer program in Python to collect or read data from available
sources, and clean these datasets using the appropriate packages.
• Effective written communication.
• Planning, organising, and time-management.
• Problem solving and analysis.
Requirements
Students are expected to work on the coursework individually.
In this assignment, you are asked to scrape data from a website and to perform data analysis and
visualization. You will implement a programming solution and write a report that explains the
implementation and justifies the design.
What the examiners expect from program implementation:
• Your program must be runnable on the Durham NCC server – a program that partially
works or does not run at all will receive no mark.
• You are asked to use Python and the Python libraries taught in this module to complete this
part. If you wish to use other libraries, you should ask for permission from your tutors first
and provide a strong reason.
• Your source code should be documented with comments, making it as easy to follow as
possible.
• Apart from performing the requested functionality, your design should aim at a clear
programming logic. Your proposed solution should also be as robust as possible, such that it
works in different situations and would hopefully work in the future when the site owner
updates the webpage (i.e. as future-proof as possible).
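The robustness point above can be illustrated with a minimal sketch. It selects links by what their address looks like rather than by where they sit in the page, so a layout change is less likely to break it. Everything specific here is an assumption made for illustration: the sample HTML, the `example.org` base address, and the `pub_` substring are hypothetical and do not describe the target website, and the standard-library `html.parser` is used only to keep the sketch self-contained.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, wherever it sits in the page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def unique_links(html, base_url, must_contain):
    """Return unique absolute URLs whose address contains `must_contain`.

    Relative links are resolved against `base_url`, fragments such as
    "#top" are dropped, and the order of first appearance is preserved.
    """
    parser = LinkCollector()
    parser.feed(html)
    seen = {}
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if must_contain in absolute:
            seen.setdefault(absolute, None)
    return list(seen)


# Hypothetical page content: the same paper is linked twice in different places.
SAMPLE_HTML = """
<div><a href="pub_a.html">Paper A</a></div>
<ul><li><a href="./pub_a.html#top">Paper A again</a></li>
<li><a href="pub_b.html">Paper B</a></li>
<li><a href="about.html">About</a></li></ul>
"""

links = unique_links(SAMPLE_HTML, "https://example.org/site/", "pub_")
print(links)
```

Because the filter depends only on the resolved address, moving a link from a `<div>` to a list item does not change the result. A real solution would still need to choose a selection rule that matches the actual site.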
What the examiners expect from the report:
• Your report should explain your solution with reference to your source code. You are NOT
encouraged to copy the whole source code to your report, but you may refer to/quote
important lines if you believe that is helpful.
• If there are any features that you wish to highlight, you are encouraged to do so, so that
your examiner can pay attention to them.
• You are welcome to use visualizations, figures, tables, organization structures, etc. to help
you explain your design ideas and showcase the results.
• You should also provide support and justification for your design.
Questions
You are asked to perform the following tasks based on the following target website, which contains
artificial content designed for this assignment: https://community.dur.ac.uk/hubert.shum/comp42315/
1. Please design and implement the solution to crawl all the unique URLs for the detailed
publication pages. Explain your design and highlight any features with no more than 150 words.
(10%)
2. Please design and implement the solution to crawl all the text-based information of each
publication from the website, to convert such information into a suitable data format, and to
store it in a data file. Explain your design and highlight any features with no more than 250
words. (20%)
3. Please design and implement a solution to find the 100 most popular words used in the titles
and abstracts of the publications. You should define what a “word” means under your design.
For example, such “words” can be of an arbitrary length (single word/double word) and/or they
should be as meaningful as possible. Explain your design and highlight any features with no more
than 250 words. (20%)
4. Please design and implement the solution to use data analysis and visualization for analysing
which authors collaborate (or appear) as co-authors in the publications. Explain your design,
highlight any features, and showcase your findings with no more than 300 words. (20%)
5. Please design and implement the solution to use data analysis and visualization for analysing
how the features of a publication would affect its “citation” (a value that can be found in the
publication detail pages). Explain your design, highlight any features, and showcase your findings
with no more than 400 words. (30%)
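As an illustration of the kind of design decision question 3 asks for, the following minimal sketch counts single words and two-word phrases together. It is one possible definition of a “word”, not a model answer. The stop-word list, the function names `tokenize` and `count_terms`, and the sample titles are all hypothetical, and two-word phrases are formed after stop words are removed, which is itself a choice that would need justifying.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept short for illustration only.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def count_terms(texts, top_n=100):
    """Count single words and two-word phrases over all texts.

    Two-word phrases are built from neighbouring tokens *after* stop-word
    removal, so "analysis of motion" contributes the phrase "analysis motion".
    """
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(tokens)
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts.most_common(top_n)


# Hypothetical titles, not taken from the target website.
sample_texts = [
    "Deep learning for motion analysis",
    "Motion analysis with deep learning",
]

top_terms = count_terms(sample_texts)
for term, frequency in top_terms:
    print(f"{frequency:3d}  {term}")
```

Counting phrases alongside single words lets a term such as "deep learning" surface as one unit, at the cost of also counting its parts separately. Whether that double counting is acceptable is part of the design to be explained.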
Word Limit policy
The word count as mentioned in individual questions will:
• Include all the text, including title, preface, introduction, in-text citations, quotations,
footnotes, and any other item not specifically excluded below.
• Exclude diagrams, tables (including tables/lists of contents and figures), equations, executive
summary/abstract, acknowledgments, declaration, bibliography/list of references, and
appendices. However, it is not appropriate to use diagrams or tables merely as a way of
circumventing the word limit. If a student uses a table or figure as a means of presenting
his/her own words, then this is included in the word count.
Examiners will stop reading once the word limit has been reached, and work beyond this point will not be
assessed. Checks of word counts may be carried out on submitted work. Checks may take place manually
and/or with the aid of the word count provided via electronic submission.
Plagiarism and collusion
Your assignment will be put through the plagiarism detection service on Learn Ultra.
Students suspected of plagiarism, either of published work or work from unpublished sources, including
the work of other students, or of collusion will be dealt with according to the Computer Science
Department and University guidelines.