COMPSCI4077 Web Science H - Course Work
Semester 2 (January 2021 - March 2021)
Academic Year 2020-2021
School of Computing Science
University of Glasgow
Social Media Analytics
Total marks: 100. Weightage: 20%.
Coursework deadline: Monday, 22 March 2021, 4:30 PM
INTRODUCTION

Part I: Individual project (100% weightage)

The objective of this coursework is to develop a Twitter crawler that collects English-language data and to conduct social media analytics on it. We recommend Python for the implementation and MongoDB for data storage. It is very important that you provide a working version of the software, as we need to validate it. Submit your code and report on or before the specified deadline, together with a sample of the data set collected. Submission is through the Moodle page for the Web Science course.

The coursework will be marked out of 100 and carries 20% of the final mark. As is usual practice across the School, numerical marks will be converted into bands. The final written exam, which will take place in April/May 2021, carries the remaining 80%.

We are interested in tweets posted in the United Kingdom. Collect data for 1 hour on any day. In addition, download sample multimedia content for tweets with media objects.

Part II: Group project (Part I carries 70% weightage; Part II carries 30 marks)

This can be done in groups of two (no more than two). For a group project, you should also do the following:
a. Identify important events in the clusters
b. Investigate the geo-localisation of events
Specific tasks to do

Develop a crawler to access as much Twitter data as possible and group the tweets based on similarity. Use important activity-specific data to crawl additional data. During this process you will identify the Twitter data access APIs along with their access constraints.
1. [10 marks] Use the Twitter Streaming API to collect data, applying a United Kingdom geographical filter along with selected words (provide clear documentation of the crawling process). Count the amount of data collected, considering all the data you collected. Count the retweets and quotes.
2. [20 marks] Group the tweets based on similarity: count the number of groups; count the elements in each group; identify prominent groups; prioritise terms in each group; identify entities in each group. Use this information to develop a REST API based crawler for activity-specific data.
   a. [10 marks] Data structures to handle a large stream of data. As the amount of data increases, you may need better strategies to manage it.
3. [20 marks] Enhance the crawling using a hybrid architecture of the Twitter Streaming and REST APIs, for example topic-based or user-based streaming (justify why and how you chose particular words or users to follow). This should be based on the identified groups and concepts. Provide statistics: count the redundant data present in the collection (you may end up collecting the same tweets again through different APIs); redundancy can be counted using the tweet id.
4. [10 marks] Analyse geo-tagged data for the UK for the period. Count the amount of geo-tagged data collected from the UK: how many tweets have geo codes (actual GPS locations), how many have generic locations, and how many have Twitter place objects. Measure whether there is any overlap between the REST and Streaming APIs.
5. [10 marks] Download multimedia content, including videos and pictures, for tweets with media objects. (You don't need to download all multimedia data, but show the ability to download different types of media files; highlight this in the report and software.) Provide a basic analysis of the collected data: how many images, videos, etc.
6. [10 marks] Discuss your data access strategies and how you addressed Twitter data access restrictions. Clearly specify the Twitter API specific restrictions you encountered and how you addressed them in order to collect as much Twitter data as possible.
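As an illustration of the counts asked for in tasks 1, 3, 4 and 5, here is a minimal Python sketch. It assumes each tweet is a dict in the Twitter API v1.1 JSON layout (fields `id_str`, `retweeted_status`, `is_quote_status`, `coordinates`, `place`, `user.location` and `extended_entities.media`). The function name `summarise` and the sample tweets are hypothetical. The geo categories are counted independently, so one tweet can fall into several of them. A retweet of a quote tweet is counted as a retweet only; that is one possible convention, not a requirement.

```python
from collections import Counter


def summarise(tweets):
    """Count duplicates, retweets, quotes, geo categories and media types."""
    stats = Counter()
    seen_ids = set()
    for tweet in tweets:
        tweet_id = tweet["id_str"]
        if tweet_id in seen_ids:
            stats["redundant"] += 1      # same tweet id collected more than once
            continue
        seen_ids.add(tweet_id)
        stats["unique"] += 1

        if "retweeted_status" in tweet:
            stats["retweets"] += 1
        elif tweet.get("is_quote_status"):
            stats["quotes"] += 1

        if tweet.get("coordinates"):     # exact GPS point
            stats["geo_exact"] += 1
        if tweet.get("place"):           # Twitter place object
            stats["geo_place"] += 1
        if (tweet.get("user") or {}).get("location"):  # free-text profile location
            stats["geo_profile_location"] += 1

        media = (tweet.get("extended_entities") or {}).get("media", [])
        for item in media:               # type is photo, video or animated_gif
            stats["media_" + item["type"]] += 1
    return stats


# Hypothetical sample data; the last entry repeats tweet id "2".
sample = [
    {"id_str": "1",
     "coordinates": {"type": "Point", "coordinates": [-4.25, 55.86]},
     "place": {"full_name": "Glasgow, Scotland"},
     "user": {"location": "Glasgow"}},
    {"id_str": "2", "retweeted_status": {"id_str": "1"}, "user": {"location": ""}},
    {"id_str": "3", "is_quote_status": True, "user": {},
     "extended_entities": {"media": [{"type": "photo"}, {"type": "video"}]}},
    {"id_str": "2", "retweeted_status": {"id_str": "1"}},
]
stats = summarise(sample)
print(dict(stats))
```

Keeping the seen ids in a set makes the redundancy check a constant-time lookup per tweet. With MongoDB, a unique index on the tweet id could play the same role.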
Report structure [10 marks for the report]
The report should be written in 11pt font and be at most 10 pages long. It should be organised in the following way:
2. Section 1: Introduction
   a. Describe the software developed with appropriate details; if you have used code from elsewhere, please specify it
   b. Specify the time and duration of the data collected
3. Section 2: Data crawl
   a. Use the Twitter Streaming API for collecting 1% data
      i. Specify the APIs used
         1. Please do not include the entire code here; just the main description of the function
         2. Along with a short description/justification
      Describe the seed crawl data used (some is given in the sample code): users, hashtags, words, location, etc.
      Provide in tabular form:
         Data collected; total Streaming API
         No of retweets
         No of quotes
         No of images
         No of videos
         How many verified?
         No of geo-tagged data
         How many with locations/Place Object
   b. Specify data grouping methods and associated statistics
      Similarity measure used: briefly explain the method.
      Grouping strategy: explain the method. You may use pseudocode if it is convenient.
      Provide in tabular form:
         … | Data collected total | Groups formed | Min size | Max size | Avg size | … | …
      Discuss the effectiveness of the similarity measure, the grouping strategy, the nature of Twitter data, etc. How did you overcome issues with the large amount of data and the groups? What data structures did you use? How effective are they?
   c. Enhance the crawling using the hybrid architecture of Twitter Streaming and REST APIs
      How did you prioritise the groups? How did you prioritise the words?
      Results in tabular form
         Total, Streaming API, REST API data
         Redundant
         No of Quotes
         No of Retweets
         No of geo-tagged data
         No of images … videos etc.
      Discuss the results: the effectiveness of the approach and any efficiency issues encountered. How did you address these issues? What strategies did you choose to deal with efficiency issues?
   d. Discuss your download strategies for tweets with media objects. Explain how you downloaded such data. What level of storage is required? (Only download a sample of the data to reduce space/storage issues. However, do the analytics. Discuss the variations of data involved.)
4. Scheduler/ranker
   How did you design the scheduler to address Twitter access restrictions? You may provide pseudocode and justify it.
   Hint: Remember that the objective is to get as much data as possible, but Twitter has access restrictions too. How do you negotiate these two conflicting constraints?
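One possible way to reconcile the two constraints in the scheduler hint is a priority queue of queries behind a sliding-window request counter. The sketch below is illustrative only, not a prescribed design. The class name `RateLimitedScheduler`, its methods, the query words other than "vaccination" and the quota of 2 requests are all hypothetical. The 900-second window mirrors the 15-minute windows used by Twitter's standard REST rate limits, but you should check the actual limits for the endpoints you call. The clock is injectable so the behaviour can be demonstrated without real waiting.

```python
import heapq
import time
from collections import deque


class RateLimitedScheduler:
    """Serve the highest-priority query while staying under a request quota."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent = deque()   # timestamps of requests inside the current window
        self._queue = []       # min-heap of (-priority, sequence, query)
        self._sequence = 0     # tie-breaker: equal priorities are served FIFO

    def add(self, query, priority):
        heapq.heappush(self._queue, (-priority, self._sequence, query))
        self._sequence += 1

    def seconds_until_free(self):
        """Return 0.0 if a request may be sent now, else the wait in seconds."""
        now = self.clock()
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()          # forget requests that left the window
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._sent[0])

    def next_query(self):
        """Pop the best query if the quota allows; return None otherwise."""
        if not self._queue or self.seconds_until_free() > 0:
            return None
        self._sent.append(self.clock())
        return heapq.heappop(self._queue)[2]


# Demonstration with a fake clock (hypothetical queries and priorities).
fake_now = [0.0]
scheduler = RateLimitedScheduler(max_requests=2, window_seconds=900,
                                 clock=lambda: fake_now[0])
scheduler.add("vaccination", priority=5)
scheduler.add("lockdown", priority=9)
scheduler.add("football", priority=1)

first = scheduler.next_query()          # highest priority goes first
second = scheduler.next_query()
blocked = scheduler.next_query()        # quota for this window is used up
wait = scheduler.seconds_until_free()
fake_now[0] = 900.0                     # the window has passed
third = scheduler.next_query()
print(first, second, blocked, wait, third)
```

In a real crawler the priorities could come from the group and word ranking, and the Streaming API connection could keep running while the REST queue waits for the next window.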
FOR PART II (only for group project students)
[20 marks] Event selection technique: describe the techniques you employed to select the events (more details in the event detection lecture).
[10 marks] Event geo-localisation technique: provide the details of geo-localisation.
   Total clusters | Total events | Events with geo-coordinates | Events with location info | Events without any geo-tags
What to submit
1) The report as a PDF file. (Please submit this through the report link only.)
2) A zip file (it should be less than 100 MB; please submit this through a separate link) containing:
   a. Software (runnable version, readme info, and properly commented). It is important that the software is runnable with minimum effort for the markers.
   b. Data: provide sample data covering about 5 minutes. You can decide the format (such as JSON or a plain data file). Importantly, your software should be able to run without much hassle and collect and organise data in a similar structure and format. Multimedia contents are excluded from the submission, though the required provisions should be in the software.
   c. Make sure that a and b together are less than 100 MB, as Moodle doesn't allow uploading files over 100 MB.

Tips on how to compute Similarity
1. Remove stop words to reduce the size and noise
2. To reduce the data size, you may consider implementing interestingness of tweets (Lecture 4, L04); this will remove a huge amount of data
3. The idea is to compute the cosine similarity between two vectors (see notes below)
4. How to construct a document vector (tweet vector)
   a. The words remaining after stop word removal constitute a vector
   b. Assign each word a weight equal to 1/length of the vector (that is, the number of words in the tweet after removing stop words). All other words, which are not in the vector, have a weight of zero; zero-weighted words do not contribute to similarity.
   c. The first vector in the stream constitutes a cluster
   d. When the next document arrives, you check its similarity against the existing cluster vectors
   e. If that similarity is above a threshold, you add the new document to that cluster; otherwise you create a new cluster
      i. If you add a new document to a cluster, you update its cluster vector. For this you sum all word weights and normalise.
      ii. You should decide what similarity threshold to use (try out different values between 0 and 1).

How to prioritise the groups of tweets for crawling?
There are many ways to do this, but one is to look at how fast the group is growing; fast growth shows that some activity is gaining traction. We could also look for certain word distributions, for example a set of words (coronavirus, vaccination, …). We could also remove groups with certain word distributions from further crawling. It is important that you look through the groups before you finalise the strategy.

How to prioritise words for crawling within groups?
We want to crawl on certain words representing that activity. You can fill in the details.
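The growth-based prioritisation of groups mentioned above could be sketched as follows. This is only one possible reading of "how fast the group is growing". The function names `growth_rate` and `rank_groups`, the 5-minute window and the sample groups are hypothetical, and each group is assumed to keep the arrival time (in seconds) of every tweet added to it.

```python
def growth_rate(timestamps, now, window_seconds):
    """Tweets per minute that joined a group during the last window."""
    recent = sum(1 for t in timestamps if now - window_seconds <= t <= now)
    return recent / (window_seconds / 60)


def rank_groups(groups, now, window_seconds=300):
    """Return group ids sorted from fastest to slowest growing."""
    rates = {gid: growth_rate(ts, now, window_seconds)
             for gid, ts in groups.items()}
    # Sort by descending rate; ties are broken by group id for determinism.
    return sorted(rates, key=lambda gid: (-rates[gid], gid))


# Hypothetical groups: arrival times in seconds since the crawl started.
groups = {
    "vaccine": [10, 200, 400, 580, 590],
    "weather": [5, 20, 30, 40, 50, 60],   # largest group, but no recent tweets
    "sport": [500, 550],
}
ranking = rank_groups(groups, now=600)
print(ranking)
```

Ranking by recent growth rather than total size puts a small but active group ahead of a large group that has gone quiet. Whether that is the right trade-off depends on what you see when you look through your own groups.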
Cosine Similarity

Cosine similarity calculates similarity by measuring the cosine of the angle between two vectors:

$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$$

where A and B are vectors. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at 90° relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Since tweet vectors only have non-negative weights, their similarity lies between 0 and 1.

The denominator normalises each vector; if all vectors are already normalised, you can ignore the denominator. A_i and B_i are the weights of the same word in vectors A and B. For implementation purposes, this means you multiply the weights of the same terms in the two vectors and sum the products over all terms. If a term is not present in one of the vectors, the product A_i * B_i is zero.

How to normalise a vector A
1. First compute the length of the vector: $L = \sqrt{\sum_i A_i^2}$
2. Divide all elements by the length to get a normalised vector, that is, A_i / L for all i
A normalised vector has length 1.
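Combining the normalisation steps with the single-pass clustering tips given earlier, a minimal Python sketch might look like this. It is one way to realise those steps, under stated assumptions. All function names, the tiny `STOP_WORDS` list, the threshold of 0.5 and the sample tweets are hypothetical. Vectors are stored sparsely as dicts from word to weight. Weights are scaled to unit length rather than set to 1/(number of words), so the denominator of the cosine formula can be dropped.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "in", "of", "to", "and", "for", "on"}  # illustrative only


def normalise(vector):
    """Scale a sparse vector (dict word -> weight) to unit length."""
    length = math.sqrt(sum(w * w for w in vector.values()))
    return {word: w / length for word, w in vector.items()} if length else {}


def tweet_vector(text):
    """Build a unit-length term vector from the words left after stop word removal."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return normalise(Counter(words))


def cosine(a, b):
    """Dot product of two unit vectors; only shared words contribute."""
    if len(a) > len(b):
        a, b = b, a                      # iterate over the shorter vector
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def cluster_stream(texts, threshold=0.5):
    """Single-pass clustering: join the most similar cluster or start a new one."""
    clusters = []
    for text in texts:
        vec = tweet_vector(text)
        if not vec:
            continue                     # nothing left after stop word removal
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["sum"].update(vec)                    # sum the word weights
            best["centroid"] = normalise(best["sum"])  # then re-normalise
            best["members"].append(text)
        else:
            clusters.append({"sum": Counter(vec), "centroid": vec,
                             "members": [text]})
    return clusters


# Hypothetical sample stream.
texts = [
    "vaccination centre opens in glasgow",
    "glasgow vaccination centre opens today",
    "rain forecast for the weekend",
]
clusters = cluster_stream(texts)
sizes = [len(c["members"]) for c in clusters]
print(sizes)
```

Comparing each tweet against every cluster becomes slow as the number of clusters grows. An inverted index from words to the clusters containing them is one data structure that could limit the comparisons to clusters sharing at least one word.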