Date: 2021-03-29
COMP3210/COMP6210 - Big Data
Assignment 1: Map-Reduce
Semester 1, 2021
Macquarie University, Department of Computing
 Dataset:
 10000 Tweets dataset included in “tweets.zip” on iLearn
 Programming environment:
 You should use PyMongo and mrjob in Python to implement your MapReduce algorithms.
 Task 1 (60%):
 MapReduce: Calculate the total number of tweets posted on each day by personal accounts (i.e.,
accounts whose actor objectType is “person”) that appear in the dataset.
{
"_id" : ObjectId("603d975915cb074610ddb000"),
"id" : "tag:My number 1 tweet",
"objectType" : "activity",
"actor" : {
"objectType" : "person",
"id" : "id:twitter.com:123123",
"link" : "http://www.twitter.com/Intelledox",
"displayName" : "Intelledox",
"postedTime" : "2008-12-11T23:47:55.000Z",
"image" : "https://pbs.twimg.com/profile_images/485981380585603072/inMuMtJ7_normal.png",
"summary" : "Intelledox's mobile-ready digitalization software helps over 1 million people to do business faster, smarter & efficiently Digitalize your business process now!",
"links" : [
{
"href" : "http://www.intelledox.com",
"rel" : "me"
}
],
"friendsCount" : NumberInt(486),
"followersCount" : NumberInt(549),
"listedCount" : NumberInt(24),
"statusesCount" : NumberInt(1188),
"twitterTimeZone" : "Canberra",
"verified" : false,
"utcOffset" : "39600",
"preferredUsername" : "Intelledox",
"languages" : [
"en"
],
"location" : {
"objectType" : "place",
"displayName" : "Canberra, Australia"
},
"favoritesCount" : NumberInt(55)
},
"verb" : "post",
"postedTime" : "2016-04-01T00:00:00.000Z",
"generator" : {
"displayName" : "HubSpot",
"link" : "http://www.hubspot.com/"
},
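The two fields Task 1 depends on in a document like the sample above are the actor's objectType and the tweet's postedTime. The following is a minimal sketch of that per-document logic, not a reference solution. It assumes the tweet-level postedTime is the one to use (actor.postedTime looks like the account creation time), a MongoDB server on localhost, and the hypothetical database and collection names "bigdata" and "tweets".

```python
def posted_date(tweet):
    """Return 'YYYY-MM-DD' for a tweet from a personal account, else None."""
    actor = tweet.get("actor") or {}
    if actor.get("objectType") != "person":
        return None
    posted_time = tweet.get("postedTime")  # tweet-level field (assumption)
    if not isinstance(posted_time, str) or len(posted_time) < 10:
        return None
    return posted_time[:10]  # first 10 characters form the posted date


def export_posted_dates(tweets, path="postedDates.txt"):
    """Write one posted date per line and return how many were written."""
    written = 0
    with open(path, "w", encoding="utf-8") as out:
        for tweet in tweets:
            date = posted_date(tweet)
            if date is not None:
                out.write(date + "\n")
                written += 1
    return written


def main():
    # Imported here so the helpers above can be used without MongoDB.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")  # assumed local server
    collection = client["bigdata"]["tweets"]  # hypothetical names
    cursor = collection.find(
        {"actor.objectType": "person"},
        {"_id": 0, "postedTime": 1, "actor.objectType": 1},
    )
    print(export_posted_dates(cursor), "posted dates written")
```

Calling main() once the dataset has been imported would produce 'postedDates.txt'. The filter is applied both in the query and in posted_date() so that the helper stays correct when it is fed unfiltered documents.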

The partial results are similar to the following,



 Task 2 (40%):
 MapReduce: Implement either the Merge Sort [1] or the Bucket Sort [2] algorithm using MapReduce
to sort the posted dates (from Task 1) by number of tweets, in ascending order.
The partial results are similar to the following,
[figure not reproduced; it lists each posted date with its number of tweets]

[1] https://en.wikipedia.org/wiki/Merge_sort
[2] https://en.wikipedia.org/wiki/Bucket_sort
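One way to picture Bucket Sort as a MapReduce job: the mapper assigns each (date, count) record to a bucket by its count, the shuffle groups records per bucket, and the reducer sorts each bucket, with buckets emitted in ascending order. The sketch below simulates these phases in plain Python rather than with mrjob. The function names, the number of buckets, and the assumption that the minimum and maximum counts are known before mapping are all illustrative choices, not part of the assignment.

```python
from collections import defaultdict


def bucket_of(count, low, high, n_buckets):
    """Map a count in [low, high] to a bucket index in [0, n_buckets - 1]."""
    return (count - low) * n_buckets // (high - low + 1)


def map_phase(records, low, high, n_buckets):
    """Emit (bucket index, (count, date)) for every input record."""
    for date, count in records:
        yield bucket_of(count, low, high, n_buckets), (count, date)


def shuffle(pairs):
    """Group mapper output by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sort inside each bucket and emit buckets in ascending key order."""
    for bucket in sorted(groups):
        for count, date in sorted(groups[bucket]):
            yield date, count


def bucket_sort(records, n_buckets=4):
    """Return (date, count) pairs in ascending order of count."""
    records = list(records)
    if not records:
        return []
    counts = [count for _, count in records]
    low, high = min(counts), max(counts)
    pairs = map_phase(records, low, high, n_buckets)
    return list(reduce_phase(shuffle(pairs)))
```

Because bucket_of() never decreases as the count grows, concatenating the sorted buckets yields a globally sorted list. Records with equal counts are ordered by date, since the reducer sorts (count, date) tuples.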
 Workflow for Tasks 1 and 2:
 Task 1 (60%):
 Step 0: Import the JSON file '10000 Tweets' into MongoDB.
 Step 1 (30%): Connect to MongoDB from a Python application, extract the posted dates of the
tweets and save them in a text file, such as 'postedDates.txt'. To form the posted date, take the
first 10 characters of a tweet's posted time (e.g., "2016-04-01T00:00:00.000Z" gives "2016-04-01").
 Step 2 (30%): Implement the MapReduce algorithm for Task 1 to calculate the number of
tweets posted on each day, that is, count the occurrences of each posted date in
'postedDates.txt', and save the results in 'task1_output.txt'.
 Task 2 (40%):
 Implement the Merge Sort or Bucket Sort MapReduce algorithm to sort posted dates in
'task1_output.txt' in ascending order according to the count and save the results in
'task2_output.txt'.
 Submission:
Submit a zip file “Firstname_LastName_Assignment1.zip” to iLearn, including:
 A Word or PDF document of 2-4 pages including the flowchart and pseudocode for each task;
 Source code for Tasks 1 & 2, including the code for data curation, the mapper(s), the reducer(s) and, optionally, a combiner;
 The output files, i.e., 'postedDates.txt', 'task1_output.txt' and 'task2_output.txt'.