Date: 2021-05-05

COURSEWORK for XJCO2121 Data Mining, Semester 2 2020/21
Lecturer: Prof Eric Atwell, School of Computing, University of Leeds http://www.comp.leeds.ac.uk/eric
Assessment:
ONLINE TESTS: cw1: Test 1 (20%); cw2: Test 2 (20%)

COURSEWORKS:

Week 1-5: cw0: (0%, formative): Build a corpus of a national English dialect, cooperating in a Team.

Week 6-10: cw3: Individual research paper (60%): Key features and classifier for a national dialect.

English is an international language, used in (probably) every country in the world. For cw0 and cw3, each
student will contribute to a joint research project to create and analyse a new resource for text analytics research: the CNE Corpus of National Englishes.
For cw0, students will use SketchEngine and WebBootCat, cooperating in teams, to collate samples of 4-6
national dialects of English, and submit the plain text (.txt) data file via Minerva.
For cw3, each student will apply data mining and text analytics tools including Weka and SketchEngine to
compare their national sub-corpus with the other 3-5 collected by their team, to identify features of the
national dialect which distinguish it from the other national dialects; and experiment with machine learning
classifiers to classify samples of dialect text. Each student will write up their methods, results and
conclusions in a short research conference paper, submitted via Minerva for assessment.
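
As a rough illustration of what "identifying features which distinguish a national dialect" can mean in practice, the sketch below ranks words by how much more frequent they are in one sub-corpus than in another. It is only a minimal example: the function names, the add-one smoothing and the `min_count` threshold are assumptions made for illustration, not the method used by SketchEngine or Weka, and the two toy strings stand in for real sub-corpora.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def log_ratio(target_tokens, reference_tokens, min_count=2):
    """Score each target word by log2 of its relative-frequency ratio.

    Add-one smoothing (an assumption of this sketch) keeps words that
    are absent from the reference corpus from causing division by zero.
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n_target = sum(target.values())
    n_reference = sum(reference.values())
    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # skip rare words, whose ratios are unreliable
        rel_target = (count + 1) / (n_target + 1)
        rel_reference = (reference[word] + 1) / (n_reference + 1)
        scores[word] = math.log2(rel_target / rel_reference)
    # Highest score first: words most typical of the target corpus.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy strings standing in for two national sub-corpora.
estonia = "tallinn is the capital and tallinn has a port in the bay"
latvia = "riga is the capital and riga has a port on the river"
ranked = log_ratio(tokenize(estonia), tokenize(latvia))
print(ranked[0][0])  # tallinn
```

Words that score near zero (here "the") are equally common in both samples and so are unlikely to be useful dialect features; with real 50,000+ word sub-corpora you would probably raise `min_count`.
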
After teaching is finished, in the last week of summer term, we will host an online International Conference
on Data Mining and Text Analytics: students can choose to submit their papers to be published in the online
conference proceedings, and a selection of students will be invited to present their research work via Zoom.

Learning objectives: this exercise will enable you to
- Investigate theory, methods and terminology used in Data Mining and Text Analytics;
- Experience how to apply algorithms, resources and techniques for implementing and evaluating
Data Mining and Text Analytics in a practical research exercise;
- Summarise and present your achievements to a peer audience, in a research conference paper.

You should work in a team of 4-6 students. To find partners, look at the class list in Minerva to find others
without a team, and ask them to join you. You can use the module discussion forums to find partners and
discuss the coursework. If you cannot work in a team and need to work alone, you can do so, but you must
notify the lecturer: email e.s.atwell@leeds.ac.uk

With your team partners, choose a team name, and choose a set of 4-6 countries and TLDs (Top-Level
Domains) where the English language is used. For example, each student could decide to collect one
50,000+ word sample of English language texts from one of the 4-6 countries, such as Estonia (EE), Latvia
(LV), Lithuania (LT), Belarus (BY), and Ukraine (UA). Then write your TEAM NAME, COUNTRY and
TLD next to your name in the online class list. Do this in weeks 1-3; after this, I will identify students on the
class list who have not registered a team name, and assign them to a team.
Each student in a team must produce a cw0 submission. Each student must also take an individual
cw1:test1 and individual cw2:test2, and submit an individual cw3 research paper.

Lectures will include advice on use of SketchEngine for text analytics. You may also use other tools for data preparation and modelling, and report on these in your research paper.

cw0 submission (formative, 0% of module grading):
(1) A .txt file containing 50,000+ words, your sample of national dialect text. The filename must
name the dialect, for example EnglishEE.txt for English of Estonia. In the text comment box, write
your TEAM NAME, COUNTRY and TLD.
(2) A 1-page PDF document describing how you collected this data: what you tried before arriving at the final version; what seed terms you used and how you chose these; other
parameter settings in WebBootCat.
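
Before uploading, it may help to check the two mechanical requirements of item (1) yourself. The sketch below is one possible way to do so, not an official checker: the function names are hypothetical, the pattern "English" plus a two-letter upper-case TLD is an assumption generalised from the EnglishEE.txt example, and counting whitespace-separated tokens is only a rough proxy for however words are counted in assessment.

```python
import re

MIN_WORDS = 50_000  # threshold taken from the cw0 brief


def check_filename(filename):
    """Return the TLD if the name looks like English<TLD>.txt, else None."""
    match = re.fullmatch(r"English([A-Z]{2})\.txt", filename)
    return match.group(1) if match else None


def count_words(text):
    """Count whitespace-separated tokens, a rough proxy for words."""
    return len(text.split())


def check_submission(filename, text, min_words=MIN_WORDS):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if check_filename(filename) is None:
        problems.append("filename should look like EnglishEE.txt")
    words = count_words(text)
    if words < min_words:
        problems.append(f"only {words} words, need at least {min_words}")
    return problems


# Usage with in-memory stand-ins for the corpus file contents.
sample = "word " * 50_000
print(check_submission("EnglishEE.txt", sample))  # []
print(check_submission("english_ee.txt", "too short"))
```

For a real file you would read the text first, for example with `open("EnglishEE.txt", encoding="utf-8").read()`, and pass the result as `text`.
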

cw3 submission (summative, 60% of module grading):
4-page short conference paper (ACL format PDF file) on collection and analysis of your national English
corpus, covering CRISP-DM Data Mining phases: Business Understanding, Data Understanding, Data
Preparation, Modelling and Evaluation. Your cw3 paper must comply with ACL conference paper format:
Microsoft Word http://www.acl2019.org/medias/341-acl2019-word.zip
or LaTeX http://www.acl2019.org/medias/340-acl2019-latex.zip

You MUST keep to limits: 4 pages main contents, PLUS 1 or more additional page(s) for references. You
may also add 1 or more Appendices at the end, but these will not be assessed.

Marking schemes:
In your cw0 submission, I will assess:
1. Contents and format of the corpus: 50,000+ words of the national English dialect (0-5)
2. Description of data collection procedure (0-5)
TOTAL: up to 10 marks

In your cw3 research conference paper, I will assess:
1. Business Understanding: state objectives & requirements, and data mining problem definition (0-2)
2. Data Understanding: explain data format and content, note data quality issues (0-2)
3. Data Preparation: how the data was converted for mining tools, with evidence (0-2)
4. Modelling: Classifiers, features and parameter settings investigated, with example outputs (0-6)
5. Evaluation: evaluation methods, tables of results, best features and classifiers (0-4)
6. References: you must include at least 4 references by Leeds University researchers (0-2)
7. In addition: Format: conforming to ACL format and organisation (0-2)
TOTAL: up to 20 marks
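
For criterion 3 (Data Preparation), one common route into Weka is an ARFF file with one text sample per instance. The sketch below shows one way this conversion might look; it is an illustration under stated assumptions rather than a required procedure. The helper names, the relation name, the chunk size and the toy texts are all hypothetical, and the backslash escaping of quotes follows my understanding of Weka's ARFF conventions, so check the output loads correctly in your Weka version.

```python
def chunk_words(text, size):
    """Split text into consecutive samples of `size` words each.

    A shorter trailing remainder is dropped.
    """
    words = text.split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - size + 1, size)]


def arff_quote(value):
    """Single-quote a string, backslash-escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def to_arff(samples_by_label, relation="national_englishes"):
    """Build ARFF text with a string attribute and a nominal class."""
    labels = sorted(samples_by_label)
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute dialect {" + ",".join(labels) + "}",
        "",
        "@data",
    ]
    for label in labels:
        for sample in samples_by_label[label]:
            lines.append(f"{arff_quote(sample)},{label}")
    return "\n".join(lines) + "\n"


# Toy texts standing in for two sub-corpora, cut into tiny samples.
corpora = {
    "EE": chunk_words("one two three four five six seven", 3),
    "LV": chunk_words("it's a b c", 2),
}
arff = to_arff(corpora)
print(arff)
```

With real data you would use much larger chunks, write `arff` to a file, and in Weka convert the string attribute to word features (for example with the StringToWordVector filter) before training classifiers.
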
References
ACL format templates: Microsoft Word http://www.acl2019.org/medias/341-acl2019-word.zip
or LaTeX http://www.acl2019.org/medias/340-acl2019-latex.zip
Alshutayri A; Atwell E; Alosaimy A; Dickins J; Ingleby M; Watson J. 2016. Arabic language WEKA-based
dialect classifier for Arabic automatic speech recognition transcripts. Proceedings of VarDial'2016
Workshop on Natural Language Processing for Similar Languages, Varieties and Dialects.
http://aclweb.org/anthology/W16-4826
Witten I; Frank E; Hall M. 2011. Data Mining (3rd edition). Morgan Kaufmann, Elsevier.