COMPSCI4077 Web Science H - Course Work
Semester 2 (January 2021 - March 2021)
Academic Year 2020-2021
School of Computing Science
University of Glasgow
Social Media Analytics
Total Marks – 100 marks
Weightage – 20%
Coursework deadline – Monday, 22 March 2021, 4:30 PM
INTRODUCTION

Part I – Individual project (100% weightage)
The objective of this coursework is to develop a Twitter crawler for collecting data in English and to conduct social media analytics. We recommend that students use Python and MongoDB for data storage. It is very important that students provide a working version of the software, as we need to validate it. Students must submit their code and report on or before the specified deadline, and must also provide a sample of the data set collected. Submission is through the Moodle page for the Web Science course. The coursework will be marked out of 100 and carries 20% of the final mark. As is the usual practice across the School, numerical marks will be converted into bands. The final written exam, held in April/May 2021, carries the remaining 80%.

We are interested in tweets posted in the United Kingdom. Collect data for 1 hour of any day. In addition, a sample of the multimedia content attached to tweets with media objects should be downloaded.

Part II – Group project (Part I has 70% weightage; Part II has 30%)
This can be done in groups of two (not more than 2). For a group project, you should also do the following:
a. Identify important events in the clusters
b. Investigate the geo-localisation of the events
Specific tasks to do
Develop a crawler to access as much Twitter data as possible and group the tweets based on similarity. Use important activity-specific data to crawl additional data. During this process you will identify Twitter data access APIs along with their access constraints.
1. [10 marks] Use the Twitter Streaming API for collecting data. Use the Streaming API with a United Kingdom geographical filter along with selected words (provide clear documentation of the crawling process). Count the amount of data collected, considering all the data you collected, and count the retweets and quotes. (A minimal streaming sketch is given after this task list.)
2. [20 marks] Group the tweets based on similarity: count the number of groups; count the elements in each group; identify prominent groups; prioritise terms in each group; and identify entities in each group. Use this information to develop a REST API based crawler for activity-specific data.
   a. [10 marks] Data structures to handle a large stream of data. As the amount of data increases, you may need better strategies to manage it.
3. [20 marks] Enhance the crawling using a hybrid architecture of the Twitter Streaming and REST APIs, for example topic-based or user-based streaming (provide justification for why/how you chose certain words or users to follow). This should be based on the identified groups and concepts. Provide statistics: count the redundant data present in the collection (you may end up collecting the same tweets again through the various APIs); redundancy can be counted using the tweet id (see the deduplication sketch after this list).
4. [10 marks] Analyse the geo-tagged data for the UK for the collection period: count the amount of geo-tagged data collected from the UK; how many tweets have geo codes (actual GPS locations), how many have generic locations, and how many have Twitter place objects. Measure whether there is any overlap between the REST and Streaming APIs. (A geo classification sketch is given after this list.)
5. [10 marks] Download multimedia content, including videos and pictures, for tweets with media objects (you do not need to download all multimedia data, but show the ability to download different types of media files; highlight this in the report and the software). Provide a basic analysis of the collected data: how many images, videos, etc. (A media download sketch is given after this list.)
6. [10 marks] Discuss your data access strategies and how you addressed Twitter data access restrictions. Clearly specify the Twitter API specific restrictions you encountered and how you addressed them while collecting as much Twitter data as possible.
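The following is a minimal sketch of the UK-filtered streaming crawl referred to in task 1. It assumes Tweepy 3.x and pymongo; the credentials, the database and collection names, and the bounding-box coordinates (a rough box around the UK) are placeholders, not values prescribed by this brief.

    import tweepy
    from pymongo import MongoClient

    # Placeholder application credentials - replace with your own keys.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

    db = MongoClient("mongodb://localhost:27017")["web_science"]   # assumed local MongoDB

    class UKListener(tweepy.StreamListener):
        def on_status(self, status):
            tweet = status._json
            db.tweets.insert_one(tweet)                            # store the raw tweet JSON
            if "retweeted_status" in tweet:                        # count retweets
                db.stats.update_one({"_id": "counts"}, {"$inc": {"retweets": 1}}, upsert=True)
            if tweet.get("is_quote_status"):                       # count quote tweets
                db.stats.update_one({"_id": "counts"}, {"$inc": {"quotes": 1}}, upsert=True)

        def on_error(self, status_code):
            return status_code != 420                              # stop on rate-limit disconnects

    stream = tweepy.Stream(auth, UKListener())
    # Rough UK bounding box: SW corner (lon, lat) then NE corner (lon, lat).
    # Note that Twitter ORs the locations and track parameters if keywords are added.
    stream.filter(locations=[-10.85, 49.82, 1.77, 60.86])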
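For the redundancy count in task 3, one possible approach (a sketch, assuming tweets from all APIs are written to the same MongoDB collection as above) is to put a unique index on the tweet id and count the inserts that are rejected as duplicates; the collection and counter names follow the previous sketch.

    from pymongo import MongoClient, errors

    db = MongoClient("mongodb://localhost:27017")["web_science"]   # assumed local MongoDB
    db.tweets.create_index("id_str", unique=True)                  # tweet id is the deduplication key

    stats = {"unique": 0, "redundant": 0}

    def store(tweet):
        """Insert a tweet once; count it as redundant if its id was already collected."""
        try:
            db.tweets.insert_one(tweet)
            stats["unique"] += 1
        except errors.DuplicateKeyError:
            stats["redundant"] += 1

    # store(status._json) would be called from both the Streaming and the REST crawlers.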
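For task 4, a small sketch of how a tweet's geo information could be classified, based on the standard v1.1 tweet JSON fields; the category names are illustrative.

    def classify_geo(tweet):
        """Classify a tweet as geo-coded, place-tagged, generically located, or untagged."""
        if tweet.get("coordinates"):                    # actual GPS point attached to the tweet
            return "geo_coded"
        if tweet.get("place"):                          # Twitter Place object (named area + bounding box)
            return "place_object"
        if tweet.get("user", {}).get("location"):       # generic, user-declared profile location
            return "generic_location"
        return "no_geo"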
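For task 5, a hedged sketch of downloading the media attached to a tweet. It assumes the standard v1.1 tweet JSON, where photo URLs appear under extended_entities.media and video variants under video_info; the output directory is an arbitrary choice.

    import os
    import requests

    def download_media(tweet, out_dir="media"):
        """Download the photos and videos referenced by a tweet's media objects, if any."""
        os.makedirs(out_dir, exist_ok=True)
        for media in tweet.get("extended_entities", {}).get("media", []):
            if media["type"] == "photo":
                url = media["media_url_https"]
            else:   # video or animated_gif: pick the highest-bitrate mp4 variant
                variants = [v for v in media["video_info"]["variants"]
                            if v.get("content_type") == "video/mp4"]
                url = max(variants, key=lambda v: v.get("bitrate", 0))["url"]
            filename = os.path.join(out_dir, url.split("/")[-1].split("?")[0])
            with open(filename, "wb") as f:
                f.write(requests.get(url, timeout=30).content)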
Report structure [10 marks for the report]
The report should be written in 11pt font with a maximum length of 10 pages. It should be organised in the following way:
2. Section 1: Introduction
   a. Describe the software developed with appropriate details; if you have used code from elsewhere, please specify it.
   b. Specify the time and duration of the data collection.
3. Section 2: Data crawl
   a. Use the Twitter Streaming API for collecting 1% data.
      i. Specify the APIs used.
         1. Please do not include the entire code here; just a main description of the functions, along with a short description/justification.
      Describe the seed crawl data used (some is given in the sample code): users, hashtags, words, location, etc.
      Provide in tabular form:
         Data collected (total)
         Streaming API data
         No. of retweets
         No. of quotes
         No. of images
         No. of videos
         How many verified
         No. of geo-tagged data
         How many with locations / Place objects
   b. Specify the data grouping methods and associated statistics.
      Similarity measure used: briefly explain the method.
      Grouping strategy: explain the method. You may use pseudo-code if it is convenient.
      Provide in tabular form:
         Data collected (total)
         Groups formed
         Min size
         Max size
         Avg size
      Discuss the effectiveness of the similarity measure, the grouping strategy, the nature of Twitter data, etc. How did you overcome issues with the large amount of data and the groups? What data structures did you use? How effective are they?
   c. Enhance the crawling using the hybrid architecture of the Twitter Streaming and REST APIs.
      How did you prioritise the groups? How did you prioritise the words?
      Results in tabular form:
         Total
         Streaming API data
         REST API data
         Redundant
         No. of quotes
         No. of retweets
         No. of geo-tagged data
         No. of images, videos, etc.
      Discuss the results: the effectiveness of the approach, any efficiency issues encountered, and how you addressed them. What strategies did you choose to deal with efficiency issues?
   d. Discuss your download strategies for tweets with media objects. Explain how you downloaded such data. What level of storage was required? (Only download a sample of the data to reduce space/storage issues; however, do the analytics, and discuss the variations of data involved.)
4. Scheduler/ranker
   How did you design the scheduler to address Twitter access restrictions? Maybe provide pseudo-code and justify it. Hint: remember that the objective is to get as much data as possible, while Twitter has access restrictions; how do you negotiate these two conflicting constraints? (A rate-limit-aware scheduling sketch is given below.)
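One possible scheduling approach is sketched below. It assumes that each identified group has already been given a priority score and reduced to a search query elsewhere, and that REST requests are issued through Tweepy 3.x's standard search endpoint; the 180-requests-per-15-minutes figure matches the documented search rate limit, while the function names and credentials are placeholders.

    import time
    import heapq
    import tweepy

    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")      # placeholder keys
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

    # wait_on_rate_limit makes Tweepy sleep until the 15-minute window resets
    # instead of failing once the REST quota is exhausted.
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    def schedule_rest_crawl(group_queries, handle_tweet,
                            requests_per_window=180, window=15 * 60):
        """Drain a priority queue of (negative_score, query) pairs, spreading the
        searches evenly across the rate-limit window so the highest-priority
        groups are crawled first and the quota is never exceeded."""
        heapq.heapify(group_queries)
        delay = window / requests_per_window          # e.g. one search every 5 seconds
        while group_queries:
            _, query = heapq.heappop(group_queries)
            for status in api.search(q=query, count=100, result_type="recent"):
                handle_tweet(status._json)            # e.g. the deduplicating store() above
            time.sleep(delay)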
FOR PART II (only for group-project students)
[20 marks] Event selection technique – describe the techniques you employed to select the events (more details in the event detection lecture).
[10 marks] Event geo-localisation technique – provide the details of the geo-localisation, in tabular form:
   Total clusters
   Total events
   Events with geo-coordinates
   Events with location info
   Events without any geo-tags
What to submit
1) The report as a PDF file (please submit this via the report link).
2) A zip file (it should be less than 100 MB; please submit this via a separate link) containing:
   a. Software (a runnable version, readme info, and properly commented code). It is important that the software is runnable with minimum effort for the markers.
   b. Data – provide sample data for about 5 minutes. You can decide the format (such as JSON or a plain data file). Importantly, your software should be able to run without much hassle and collect and organise data in a similar structure and format. Multimedia content is excluded from the submission, though the required provisions should be in the software.
   c. Make sure that a & b together are less than 100 MB, as Moodle does not allow uploading files over 100 MB.

Tips on how to compute Similarity
1. Remove stop words to reduce the size and noise.
2. To reduce the data size, you may consider implementing the interestingness of tweets (Lecture 4 – L04); this will reduce the data by a huge amount.
3. The idea is to compute the cosine similarity between two vectors (see the notes below).
4. How to construct a document vector (tweet vector):
   a. The words remaining after stop-word removal constitute the vector.
   b. Assign each word a weight equal to 1/length of the vector (that is, the number of words in the tweet after removing stop words). All other words, which are not in the vector, have a weight of zero; zero-weighted words will not contribute to similarity.
   c. The first vector in the stream constitutes a cluster.
   d. When the next document arrives, you check its similarity against the existing clusters.
   e. If that similarity is above a threshold, you add the new document to that cluster; otherwise you create a new cluster.
      i. If you add a new document to a cluster, you update the cluster vector: sum all word weights and normalise.
      ii. You should decide what similarity threshold to use (try out different values between 0 and 1). A minimal clustering sketch is given after these tips.
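The incremental clustering described above could be sketched as follows, assuming tweets arrive as plain text strings; the stop-word list and the 0.5 threshold are illustrative values you would tune, not ones prescribed here.

    import math
    import re

    STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "on", "rt"}   # placeholder list
    THRESHOLD = 0.5                                                              # tune between 0 and 1

    def normalise(vec):
        """Scale a sparse vector so that its length is 1."""
        length = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / length for t, w in vec.items()} if length else vec

    def tweet_vector(text):
        """Build a normalised bag-of-words vector from a tweet's text."""
        words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
        if not words:
            return {}
        weight = 1.0 / len(words)
        vec = {}
        for w in words:
            vec[w] = vec.get(w, 0.0) + weight
        return normalise(vec)

    def cosine(a, b):
        """Dot product of two normalised sparse vectors (the denominator is already 1)."""
        return sum(w * b.get(t, 0.0) for t, w in a.items())

    clusters = []   # each cluster: {"vector": sparse vector, "tweets": [texts]}

    def add_tweet(text):
        """Assign a tweet to the most similar cluster, or start a new one."""
        vec = tweet_vector(text)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["vector"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= THRESHOLD:
            best["tweets"].append(text)
            for t, w in vec.items():                      # merge into the cluster vector
                best["vector"][t] = best["vector"].get(t, 0.0) + w
            best["vector"] = normalise(best["vector"])    # re-normalise after the update
        else:
            clusters.append({"vector": vec, "tweets": [text]})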
How to prioritise the groups of tweets for crawling?
This can be done in many different ways, but one way is to look at how fast a group is growing; in other words, this shows that some activity is gaining traction. We could also look for certain word distributions, for example a set of words (coronavirus, vaccination, …), and we could remove groups with certain word distributions from further crawling. It is important that you look through the groups before you finalise the strategy.

How to prioritise words for crawling within groups?
We want to crawl on certain words representing that activity. You can fill in the details. (A small prioritisation sketch is given below.)
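One way the growth-based prioritisation above could look in code, assuming each cluster keeps a list of per-time-window tweet counts alongside its vector; the growth measure, the blocked-word set, and the field names are illustrative assumptions only.

    BLOCKED_WORDS = {"giveaway", "follow4follow"}       # example words whose groups we stop crawling

    def growth_rate(cluster):
        """Tweets added in the latest window relative to the cluster's average window."""
        counts = cluster["window_counts"]
        return counts[-1] / (sum(counts) / len(counts))

    def top_terms(cluster, k=3):
        """The k highest-weighted terms in the cluster vector: candidate crawl words."""
        return sorted(cluster["vector"], key=cluster["vector"].get, reverse=True)[:k]

    def prioritise_clusters(clusters, top_n=10):
        """Rank clusters by growth rate, dropping those whose top terms are blocked."""
        eligible = [c for c in clusters if not BLOCKED_WORDS & set(top_terms(c, 5))]
        return sorted(eligible, key=growth_rate, reverse=True)[:top_n]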
Cosine Similarity
Cosine similarity calculates similarity by measuring the cosine of the angle between two vectors A and B:

   similarity = cos(θ) = (A · B) / (||A|| ||B||) = Σ_i (A_i * B_i) / ( sqrt(Σ_i A_i²) * sqrt(Σ_i B_i²) )

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at 90° relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

The denominator normalises each vector's components; if you make all vectors normalised, then you can ignore the denominator. A_i and B_i are the weights of the same word in vectors A and B; for implementation purposes, you multiply the weights of the same terms in the two vectors and sum them up over all terms. If a term is not present in one of the vectors, the product A_i * B_i is zero.

How to normalise a vector A:
First compute the length of the vector, L = sqrt(Σ_i A_i²). Then divide all elements by the length to get the normalised vector; that is, A_i/L for all i. A normalised vector has length 1.