Java Assignment Writing - COMP9313 | 学霸联盟

Date: 2022-07-02

COMP9313 21T3 Project 1 (12 marks)
Problem statement:
Detecting popular and trending topics in news articles is an important task for
public opinion monitoring. In this project, your task is to perform text data analysis
over a dataset of Australian news from ABC (Australian Broadcasting Corporation)
using MapReduce. The problem is to compute the weight of each term for
each year in the news articles dataset.
Input files:
The dataset you are going to use contains news headlines published over
several years. In this text file, each line is the headline of a news article, in the format
"date,term1 term2 ... ... ". The date and the text are separated by a comma, and the terms
are separated by a space character. A sample file is shown below (note that stop
words such as “to”, “the”, and “in” have already been removed from the dataset):
20191124,woman stabbed adelaide shopping centre
20191204,economy continue teetering edge recession
20200401,coronanomics learnt coronavirus economy
20200401,coronavirus home test kits selling chinese community
20201015,coronavirus pacific economy foriegn aid china
20201016,china builds pig apartment blocks guard swine flu
20211216,economy starts bounce unemployment
20211224,online shopping rise due coronavirus
20211229,china close encounters elon musks
This small sample file can be downloaded at:
https://webcms3.cse.unsw.edu.au/COMP9313/22T2/resources/76308
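As an illustration of the input format, the following minimal sketch splits one headline line into its year and its terms. The class and method names (HeadlineParser, yearOf, termsOf) are hypothetical and not part of the specification, and the sketch assumes every line starts with an 8-digit date followed by a comma.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the specification.
class HeadlineParser {

    /** Returns the year, i.e. the first four characters of the date (assumes "yyyymmdd,..."). */
    static String yearOf(String line) {
        return line.substring(0, 4);
    }

    /** Returns the terms that follow the first comma, split on spaces. */
    static List<String> termsOf(String line) {
        int comma = line.indexOf(',');
        String text = line.substring(comma + 1).trim();
        if (text.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(text.split(" +"));
    }

    public static void main(String[] args) {
        String line = "20191124,woman stabbed adelaide shopping centre";
        System.out.println(yearOf(line) + " -> " + termsOf(line));
    }
}
```

In a mapper, the same split would typically be applied to each input value before emitting anything keyed by term or by year.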
Term weights computation:
To compute the weight of a term for a given year, please use the TF/IDF model.
Specifically, the TF and IDF are computed as:
TF(term t, year y) = the frequency of t in y
IDF(term t, dataset D) = log10 (the number of years in D/the number of years having t)
Finally, the term weight of term t regarding the year y is computed as:
Weight(term t, year y, dataset D) = TF(term t, year y)* IDF(term t, dataset D)
Please use java.lang.Math.log10 to compute the term weights.
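The formulas above can be sketched in plain Java as follows. The class and method names (TermWeight, idf, weight) are hypothetical, and the sketch assumes the counts are held as int values, which is why the division is cast to double before calling Math.log10. The example uses the sample file: it covers 3 years (2019, 2020, 2021), and "china" occurs twice in 2020 and once in 2021, so it appears in 2 of the 3 years.

```java
// Hypothetical helper, not part of the specification.
class TermWeight {

    /** IDF(t, D) = log10(number of years in D / number of years having t). */
    static double idf(int totalYears, int yearsWithTerm) {
        // Cast first: int / int would truncate 3 / 2 to 1 and give an IDF of 0.
        return Math.log10((double) totalYears / yearsWithTerm);
    }

    /** Weight(t, y, D) = TF(t, y) * IDF(t, D). */
    static double weight(int tf, int totalYears, int yearsWithTerm) {
        return tf * idf(totalYears, yearsWithTerm);
    }

    public static void main(String[] args) {
        System.out.println("china 2020: " + weight(2, 3, 2));
        System.out.println("china 2021: " + weight(1, 3, 2));
        System.out.println("adelaide 2019: " + weight(1, 3, 1));
    }
}
```

A term that occurs in every year gets an IDF of log10(1) = 0, so its weight is 0 for every year regardless of its frequency.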
Output format:
If there are N terms in the dataset, you should output exactly N lines in your final
output file on HDFS, and these lines must be sorted by term in alphabetical order. In each
line, you need to output a list of <year, weight> pairs, sorted by
year in ascending order. Specifically, the format of each line is:
“term\tYear1,Weight1;Year2,Weight2;… …;Yeark,Weightk”. For example, given the above
dataset, the first few lines of the output should be (there is no need to remove the
quotation marks):
adelaide\t2019,0.47712125471966244
aid\t2020,0.47712125471966244
apartment\t2020,0.47712125471966244
blocks\t2020,0.47712125471966244
bounce\t2021,0.47712125471966244
builds\t2020,0.47712125471966244
centre\t2019,0.47712125471966244
china\t2020,0.3521825181113625;2021,0.17609125905568124
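The line format can be illustrated with a small formatting helper. This is only a sketch: the names OutputLine and format are hypothetical, it assumes the weights of one term are already available as a map from year to weight, and it treats "\t" as a tab character. It says nothing about how a reducer should collect those weights.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical helper, not part of the specification.
class OutputLine {

    /** Builds "term<TAB>Year1,Weight1;Year2,Weight2;..." with years in ascending order. */
    static String format(String term, Map<Integer, Double> weightByYear) {
        StringJoiner pairs = new StringJoiner(";");
        // Copying into a TreeMap iterates the years in ascending order.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(weightByYear).entrySet()) {
            pairs.add(e.getKey() + "," + e.getValue());
        }
        return term + "\t" + pairs;
    }

    public static void main(String[] args) {
        Map<Integer, Double> weights = new TreeMap<>();
        weights.put(2021, 0.17609125905568124);
        weights.put(2020, 0.3521825181113625);
        System.out.println(format("china", weights));
    }
}
```

Using a StringJoiner avoids a trailing ";" after the last pair, which the required format does not contain.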
Code format:
Name your package “comp9313.proj1” and your driver class
“Project1.java”. To reduce the difficulty of the project, you are allowed to pass the
number of years to your job. We will also use more than 1 reducer to test your code.
Your program should receive 4 parameters: the input folder, the output folder, the
number of years, and the number of reducers. Finally, package all your Java files into a
zip file named “zID_proj1.zip” (e.g. z5123456_proj1.zip).
Command for running your code:
Your Java code will be compiled and packaged as a jar file. Assuming there are 20
years and we use 2 reducers, we will use the following command to run your code:
$ hadoop jar YOURJAR.jar comp9313.proj1.Project1 input output 20 2
In your main function, after you receive the number of years, you can pass this value
to the reducer by using a Configuration object. You can use the set() method to set
the value in the main function, and then use the get() method to get the value in the
reducer.
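As a minimal sketch of handling the four parameters, the plain-Java class below parses and validates them before any Hadoop code runs. The class name JobArgs, its fields, and the configuration key mentioned in the comment are hypothetical. The Hadoop calls themselves are deliberately left out so that the example runs without Hadoop on the classpath.

```java
// Hypothetical helper, not part of the specification.
class JobArgs {
    final String inputDir;
    final String outputDir;
    final int numYears;
    final int numReducers;

    JobArgs(String inputDir, String outputDir, int numYears, int numReducers) {
        this.inputDir = inputDir;
        this.outputDir = outputDir;
        this.numYears = numYears;
        this.numReducers = numReducers;
    }

    /** Parses "input output numYears numReducers" as given on the command line. */
    static JobArgs parse(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "Usage: Project1 <input> <output> <numYears> <numReducers>");
        }
        int years = Integer.parseInt(args[2]);
        int reducers = Integer.parseInt(args[3]);
        if (years <= 0 || reducers <= 0) {
            throw new IllegalArgumentException("numYears and numReducers must be positive");
        }
        return new JobArgs(args[0], args[1], years, reducers);
    }

    public static void main(String[] args) {
        JobArgs a = parse(new String[] {"input", "output", "20", "2"});
        System.out.println(a.numYears + " years, " + a.numReducers + " reducers");
        // In the real driver, a.numYears would next be stored in the Configuration
        // with set() under a key of your choice (for example "proj1.num.years",
        // a made-up name) and read back in the reducer with get().
    }
}
```

Because get() on a Configuration returns a string, the reducer would need to convert the value back to a number before using it in the IDF formula.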
Please ensure that the code you submit can be compiled. Any solution that has
compilation errors will receive no more than 4 points.
Marking Criteria:
Your source code will be inspected and marked based on readability and ease of
understanding. The documentation (comments in your code) is
also important. Below is an indicative marking scheme:
Result correctness: 6
Algorithm design (the use of design patterns learned to reduce memory consumption and to improve efficiency): 5
Code structure, readability, and documentation: 1
Submission:
Deadline: Sunday 3rd July 11:59:59 PM
You can submit through Moodle.
If you submit your assignment more than once, the last submission will replace the
previous one. To prove successful submission, please take a screenshot as shown in the
assignment submission instructions and keep it for your records. If you have any problems
with your submission, please email yufan.sheng@unsw.edu.au.
Late submission penalty:
5% reduction of your marks for up to 5 days
Plagiarism:
The work you submit must be your own work. Submission of work partially or
completely derived from any other person or jointly written with any other person is
not permitted. The penalties for such an offence may include negative marks,
automatic failure of the course and possibly other academic discipline. Assignment
submissions will be examined manually.

Relevant scholarship authorities will be informed if students holding scholarships are
involved in an incident of plagiarism or other misconduct.

Do not provide or show your assignment work to any other person - apart from the
teaching staff of this subject. If you knowingly provide or show your assignment
work to another person for any reason, and work derived from it is submitted you
may be penalized, even if the work was submitted without your knowledge or
consent.