Assignment 2 – CSC2062 “AIDA”
Worth 25% of the module assessment. Assignment is marked out of 100 marks.
Deadline: 11pm Friday, 19th March 2021.
This version: 2021-01-22.
Changelog:
2021-01-22: corrected a few minor typos

Introduction
In this assignment, you will:
(a) Create a dataset of handwritten symbols (which you will use for your analyses and experiments
in the rest of Assignment 2, and in Assignment 3).
(b) Perform feature engineering, i.e. calculate features (variables) from the handwritten symbols
which may be useful for identifying the handwritten symbols automatically.
(c) Perform statistical analysis of the datasets, using methods of statistical inference.
(d) Implement and evaluate some introductory machine learning models that perform classification
on the dataset.
When you use a procedure that has an element of randomness, please use the seed value 42 (your
code should give the same results each time it runs). This assignment must be completed in R. You
may not use Microsoft Excel to complete any part of this assignment.
Please read carefully the information about the assessment criteria and marking process at the end of
this document.

Section 1 (10 marks): Creating a dataset
This section asks you to build a dataset of images composed of written numbers, letters and
mathematical symbols. Each image is represented by a black & white matrix with size 25 rows by 25
columns. In the matrix, the number “1” represents black pixels and “0” represents white pixels. As
such, one image can be stored in a plaintext “.csv” file containing the matrix (and no headers), as in
these examples:
Class a b one three
Example
Image

Page 2 of 8

Image
Matrix
csv file
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0
0,0,0,0,0,1,1,1,0,0,0,0,1,1,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
0,0,0,1,1,0,0,0,0,0,0,0,1,1,0,0
0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0
0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0
0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0
0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0
0,0,0,0,0,0,1,1,0,1,1,1,0,1,0,0
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0
0,0,0,1,1,0,0,1,1,1,0,0,0,0,0,0
0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0
0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0
0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0
0,0,1,0,0,1,1,1,0,0,0,1,0,0,0,0
0,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0
0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0
0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Figure 1: Examples of handwritten images and their 25x25 matrix representation in plaintext.

The goal is to create a dataset containing eight handwritten images of each of the digits {1,2,3,4,5,6,7},
eight handwritten images of each of the letters {a,b,c,d,e,f,g}, and eight handwritten images of each of the
mathematical symbols {<,>,=,≤,≥,≠,≈}. We will refer to these as the digit, letter and math
datasets, respectively. Each image should be obtained by writing the handwritten symbol yourself (as
described below). The quality of the drawing is not essential, as long as each symbol can
easily be read by a human. The images will vary from sample to sample, due to your handwriting;
however, each character should fit reasonably well in the 25 x 25 box (i.e. do not draw a tiny character
in one corner of the 25 x 25 box; this will make your life easier when it comes to doing analyses!). In
total, you are creating 168 images.
Each image is represented by a black & white matrix with size 25 rows by 25 columns. In the matrix,
the number “1” represents black pixels and “0” represents white pixels. As such, one image can be
stored in a comma-delimited plaintext “.csv” file containing the matrix (and no header row); see Fig.
1 above.
You may use whatever means you prefer to obtain the 168 .csv files, provided they are handwritten
and are in the comma-delimited .csv format specified above. However, it is strongly recommended
that you use the software GIMP (http://www.gimp.org). GIMP is available for free for all PC OSs, and
is also installed on the lab machines and the EEECS virtual machines. Using GIMP, you can create a
new image with 25 by 25 points (px), advanced options 1 pixel/pt, color space grayscale, fill with
background colour. This will give you a small white square, which you can magnify to e.g. 2000% in
order to make it easier to draw on. To draw on the image, you can select the pencil tool and adjust
the brush size to 1 pixel.

Figure 2: Creating the blank canvas of size 25 x 25 pixels in the GIMP interface.

The standard file formats of GIMP are useful to save the images, but we need a more easily readable
format. One good option is to export as PGM, type ASCII. This PGM file can be opened in GIMP, but it
is also simply a text file that can be opened in a text editor (or read as text file by R code). The PGM
text file has a header consisting of the following four lines:
P2
# CREATOR: ...
25 25
255

The third and fourth lines of the header above specify the pixel array size and the maximum allowed
pixel value, respectively. (The images are greyscale, with 0 representing fully black and 255
representing fully white.)¹

The remaining lines of the file specify the pixel values, with one value on each line; the total number
of pixel values should correspond to the specified array size (i.e. 25*25=625).

For our purposes, a number < 128 represents a black pixel, while a number >= 128 represents a white
one. Such a format can be easily converted into a matrix containing ones and zeros, as presented in
Figure 1 above (you can write some R code to do this; reading in the PGM file and writing the .csv file).
You shall save each image matrix as a csv file following the specification above, and using the filename
STUDENTNR_LABEL_INDEX.csv, where STUDENTNR is your student number (e.g. 4123456), INDEX is a
two-digit code from ‘01’ to ‘08’, indexing the set of 8 images you must create for each symbol, and
LABEL is the name of the symbol in the image, as specified in Fig. 3 below.

Symbol Label
1 one
2 two
3 three
4 four
5 five
6 six
7 seven

Symbol Label
a a
b b
c c
d d
e e
f f
g g

Symbol Label
< less
> greater
= equal
≤ lessequal
≥ greaterequal
≠ notequal
≈ approxequal

Figure 3: labels to use for the created images

For example, if your student number is 4123456, then 4123456_notequal_08.csv would be the
eighth image you created for the ≠ symbol. (As well as creating the csv files, you may also want to
keep the PGM files, in case you need to inspect the data later on).
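
The PGM-to-csv conversion described above could be sketched in R roughly as follows. This is a minimal sketch, not a reference solution: the function name `pgm_to_csv` and its arguments are assumptions made for illustration, and it assumes an ASCII (P2) file with the header layout shown earlier.

```r
# Hypothetical helper (name and arguments are assumptions, not part of the spec):
# convert one ASCII PGM ("P2") file into a matrix of 0s and 1s and save it as csv.
pgm_to_csv <- function(pgm_path, csv_path, size = 25, threshold = 128) {
  lines <- readLines(pgm_path)
  lines <- lines[!grepl("^\\s*#", lines)]   # drop comment lines such as "# CREATOR: ..."
  stopifnot(trimws(lines[1]) == "P2")       # magic number of the ASCII greyscale format

  # Everything after "P2" is numeric: width, height, maximum value, then the pixels.
  tokens <- scan(text = paste(lines[-1], collapse = " "), quiet = TRUE)
  stopifnot(tokens[1] == size, tokens[2] == size)
  pixels <- tokens[-(1:3)]
  stopifnot(length(pixels) == size * size)

  # PGM stores pixels row by row, so fill the matrix by row.
  grey <- matrix(pixels, nrow = size, ncol = size, byrow = TRUE)
  binary <- ifelse(grey < threshold, 1L, 0L)  # < 128 is black (1), >= 128 is white (0)

  write.table(binary, file = csv_path, sep = ",",
              row.names = FALSE, col.names = FALSE)
  invisible(binary)
}

# Building a file name in the STUDENTNR_LABEL_INDEX.csv pattern (example values only).
csv_name <- sprintf("%s_%s_%02d.csv", "4123456", "notequal", 8)
```

Here `csv_name` evaluates to `4123456_notequal_08.csv`, matching the example file name above.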

As part of your submission, upload the csv files that you create in a directory called “section1_images”,
along with any code you wrote to create the csv files, in a folder called “section1_code” (see
submission instructions at the end of this document).

It is very important to upload the images in the correct csv format, as these files will be used to verify
your calculations in the next section. The .csv files should be comma-delimited, not tab-delimited or
anything else. The file encoding should be UTF-8 (not UTF-8 with BOM or anything else). You can check the
encoding in Notepad++.
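
As a complement to checking by eye, a file can also be sanity-checked from R. The sketch below is only an illustration under assumptions: `check_image_csv` is a hypothetical name, and the checks (no byte order mark, comma-delimited, 25 by 25, only 0s and 1s) are one reading of the format rules above, not an official marking script.

```r
# Hypothetical checker (an assumption, not an official marking script).
check_image_csv <- function(path, size = 25) {
  # A UTF-8 byte order mark is the three bytes EF BB BF at the start of the file.
  first_bytes <- readBin(path, what = "raw", n = 3)
  if (identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))) return(FALSE)

  # Read as comma-delimited with no header; a tab-delimited file would
  # collapse into a single column and fail the shape check below.
  m <- as.matrix(read.table(path, sep = ",", header = FALSE))
  right_shape <- all(dim(m) == c(size, size))
  only_binary <- all(m %in% c(0, 1))

  right_shape && only_binary
}
```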

In your report, very briefly (2-3 sentences) explain in your own words how you created the images and
obtained the matrices from them.

¹ For further information about this image format, see https://en.wikipedia.org/wiki/Netpbm_format

Section 2 (30 marks): Feature Engineering
Using each 25x25 matrix obtained from an image as described above, you must create an array of
characteristics that describe some features of the image. Each feature will be a number (i.e. each
feature is a numeric variable). There are 14 features in total.
Before we describe the features, let us define some vocabulary. An n-tile is an n×n selection from the
image array. This is an example of a 2-tile:
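
In code terms, an n-tile can be read as an n-by-n sub-matrix of the image matrix. The following is a minimal sketch in R under that assumption; the helper name `get_tile` is hypothetical and the image is made up for illustration.

```r
# Illustration only: the helper name get_tile is an assumption.
# Return the n-tile whose top-left corner sits at (row, col) of the image matrix.
get_tile <- function(img, row, col, n = 2) {
  img[row:(row + n - 1), col:(col + n - 1), drop = FALSE]
}

img <- matrix(0L, nrow = 25, ncol = 25)  # blank (all-white) 25x25 image
img[3, 4:5] <- 1L                        # two black pixels side by side in row 3

tile <- get_tile(img, row = 3, col = 4)  # covers rows 3-4 and columns 4-5

# Overlapping 2-tiles can start at any of 24 rows and 24 columns.
n_positions <- (nrow(img) - 2 + 1) * (ncol(img) - 2 + 1)
```

In this made-up image, `tile` has a black top row and a white bottom row, and `n_positions` is 576.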

Features to be calculated (corresponding to columns of your features output file):
Feature Index | Feature Short Name | Feature Description
- | label | The true name of the symbol in the image (i.e. one of the 21 symbol names given in Fig. 3). The label is not a true feature, and should not be used as a feature for statistical tests or during model training.
- | index | The index of this image instance (a number from 1 to 8). The index is not a true feature, and should not be used as a feature for statistical tests or during model training.
1 | nr_pix | The number of black pixels in the image.
2 | rows_with_2 | Number of rows with exactly 2 black pixels.
3 | cols_with_2 | Number of columns with exactly 2 black pixels.
4 | rows_with_3p | Number of rows with 3 or more black pixels.
5 | cols_with_3p | Number of columns with 3 or more black pixels.
6 | height | The vertical distance, in pixels, between the topmost and bottommost black pixels in the image (measuring from the pixel centre).
7 | width | The horizontal distance, in pixels, between the leftmost and rightmost black pixels in the image (measuring from the pixel centre).
8 | left2tile | The number of unique 2-tiles in the image where the leftmost two entries are black and the rightmost two entries are white: ◧. Tiles may overlap.
9 | right2tile | The number of unique 2-tiles in the image where the rightmost two entries are black and the leftmost two entries are white: ◨.
10 | verticalness | The sum of the previous two features, divided by the number of black pixels in the image.
11 | top2tile | The number of unique 2-tiles in the image where the top two entries are black and the bottom two entries are white.
12 | bottom2tile | The number of unique 2-tiles in the image where the bottom two entries are black and the top two entries are white.
13 | horizontalness | The sum of the previous two features, divided by the number of black pixels in the image.
14 | [your label] | Define a custom feature based on 3-tiles. Explain and justify the rationale for your feature. (It should capture information not captured by other features.)

Your task in this section is to write code to calculate each of the features above.
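
One possible shape for such code is sketched below. It is a hedged illustration rather than a model answer: the helper names `count_tiles` and `calc_features` are assumptions, the image is assumed to be a numeric 0/1 matrix, height and width are read as "last black index minus first black index", and the handling of an all-white image (returning 0) is a choice the brief does not specify. The custom feature 14 is deliberately left out.

```r
# Hypothetical helpers; names other than the feature short names are assumptions.

# Count the positions at which an n x n tile of the image equals `pattern`.
# Tiles are allowed to overlap, so every top-left position is tried.
count_tiles <- function(img, pattern) {
  n <- nrow(pattern)
  count <- 0L
  for (i in 1:(nrow(img) - n + 1)) {
    for (j in 1:(ncol(img) - n + 1)) {
      tile <- img[i:(i + n - 1), j:(j + n - 1)]
      if (all(tile == pattern)) count <- count + 1L
    }
  }
  count
}

calc_features <- function(img) {
  row_sums <- rowSums(img)
  col_sums <- colSums(img)
  nr_pix <- sum(img)
  black_rows <- which(row_sums > 0)
  black_cols <- which(col_sums > 0)

  # 2x2 patterns written row by row: 1 = black, 0 = white.
  left   <- count_tiles(img, matrix(c(1, 0, 1, 0), nrow = 2, byrow = TRUE))
  right  <- count_tiles(img, matrix(c(0, 1, 0, 1), nrow = 2, byrow = TRUE))
  top    <- count_tiles(img, matrix(c(1, 1, 0, 0), nrow = 2, byrow = TRUE))
  bottom <- count_tiles(img, matrix(c(0, 0, 1, 1), nrow = 2, byrow = TRUE))

  data.frame(
    nr_pix = nr_pix,
    rows_with_2 = sum(row_sums == 2),
    cols_with_2 = sum(col_sums == 2),
    rows_with_3p = sum(row_sums >= 3),
    cols_with_3p = sum(col_sums >= 3),
    # Assumed reading of "measuring from the pixel centre": last index minus first index.
    height = if (nr_pix > 0) max(black_rows) - min(black_rows) else 0,
    width = if (nr_pix > 0) max(black_cols) - min(black_cols) else 0,
    left2tile = left,
    right2tile = right,
    verticalness = if (nr_pix > 0) (left + right) / nr_pix else 0,
    top2tile = top,
    bottom2tile = bottom,
    horizontalness = if (nr_pix > 0) (top + bottom) / nr_pix else 0
  )
}
```

For a quick check, an image containing only a vertical stroke five pixels tall should give a large verticalness and a horizontalness of zero.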

Save your calculated features in a file called STUDENTNR_features.csv, where STUDENTNR is
your student number. This file will consist of 169 rows. The first row gives the column names (i.e. the
strings in the Feature Short Name column in the table above, comma-delimited). The remaining 168
rows list the comma-separated feature values for each of your 168 images. The first entry in the row
will be the LABEL word, the second will be the image INDEX, and the remaining 14 entries will be the
calculated features.
For example, the features for your first “=” image may be as follows:
equal,1,24,0,12,2,0,10,15,0,0,0,24,24,48,0
The 8 rows that correspond to the 8 instances of a particular symbol should be grouped together in
the features file, and the order of those 8 rows should correspond to the INDEX used in the image
filenames. In other words, the 168 data rows of STUDENTNR_features.csv should be sorted
first by the label (alphabetical order) and secondly by the index.
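
The sorting and writing step could look roughly like this in R. The data frame below uses made-up values and only one feature column for brevity, and it writes to a temporary directory so that the sketch runs anywhere; the real file belongs in the folder named in the instructions that follow.

```r
# Sketch with made-up values; only one feature column is shown for brevity.
features <- data.frame(
  label  = c("two", "a", "a", "approxequal"),
  index  = c(1, 2, 1, 1),
  nr_pix = c(30, 41, 38, 52)
)

# Sort by label (alphabetical order) and then by index within each label.
features <- features[order(features$label, features$index), ]

# Hypothetical output location for this sketch only.
out_file <- file.path(tempdir(), "4123456_features.csv")

# The column names become the header row; row names and quotes are suppressed.
write.csv(features, out_file, row.names = FALSE, quote = FALSE)
```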
If you cannot calculate a particular feature, you may use a random integer between 0 and 10 for the
feature values instead. (You will lose marks for not calculating the feature, but you can use the random
values in the analyses that follow in the subsequent section. You should report that you have done
this in the assignment report).
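
A minimal sketch of this fallback, assuming one placeholder value is needed per image (168 in total) and using the seed value required by the assignment; the variable name `fallback_values` is hypothetical.

```r
set.seed(42)  # seed value required by the assignment for any random procedure

# Hypothetical placeholder column for a feature that could not be calculated:
# one random integer between 0 and 10 (inclusive) for each of the 168 images.
fallback_values <- sample(0:10, size = 168, replace = TRUE)
```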
In your report, very briefly describe and explain the code you have written to calculate the features
above. If you ran into difficulties, you should still explain your thought processes and attempts to
calculate the features. In the case of the custom feature, you should explain your rationale for
choosing the feature you did, as well as how they are calculated (i.e. you should give a justification for
why you think this feature should be useful).
You should put the file STUDENTNR_features.csv in a folder called section2_features. Put code
for this section in a folder called section2_code. The working directory for the code should be the
section2_code directory. Your code should use relative paths; i.e. it should read the image matrices
from “../section1_images” and save the feature file to “../section2_features”.
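
Reading the image matrices through a relative path might be sketched as follows. The function name `read_images` and the structure of its return value are assumptions for illustration; the sketch relies only on the STUDENTNR_LABEL_INDEX.csv naming scheme from Section 1.

```r
# Hypothetical loader; function name and return structure are assumptions.
read_images <- function(image_dir = "../section1_images") {
  files <- sort(list.files(image_dir, pattern = "\\.csv$", full.names = TRUE))

  # Each file holds one image matrix with no header row.
  images <- lapply(files, function(f) as.matrix(read.csv(f, header = FALSE)))

  # File names follow STUDENTNR_LABEL_INDEX.csv, so split the base name on "_".
  parts <- strsplit(sub("\\.csv$", "", basename(files)), "_")

  list(
    images = images,
    label  = vapply(parts, function(p) p[2], character(1)),
    index  = vapply(parts, function(p) as.integer(p[3]), integer(1))
  )
}
```

With the working directory set to section2_code, calling `read_images()` with its default argument would pick up the files from the sibling folder.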

Section 3 (35 marks): Statistical analyses of feature data
In this section, you will perform statistical analyses of the feature data, in order to explore which
features are important for distinguishing between different kinds of handwritten symbols.
You shall use descriptive statistics (mean, variance, etc.), hypothesis testing, and suitable visualisation
to perform your analysis of the data. You are encouraged to provide tables, figures, and/or graphs in
the report to support your discussions and findings. When performing tests, always consider whether
multiple test correction is needed.
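
To illustrate what multiple test correction does, the sketch below adjusts three made-up p-values with R's built-in `p.adjust`. The values and the choice of Bonferroni and Holm are assumptions for illustration only; whether and how to correct in your own analysis is for you to decide and justify.

```r
# Made-up p-values standing in for several tests run on the same data.
p_raw <- c(nr_pix = 0.004, height = 0.030, width = 0.200)

# Bonferroni multiplies each p-value by the number of tests (capped at 1).
p_bonferroni <- p.adjust(p_raw, method = "bonferroni")

# Holm is a step-down procedure that is at least as powerful as Bonferroni.
p_holm <- p.adjust(p_raw, method = "holm")

significant <- p_holm < 0.05
```

In this example the raw p-value for height (0.030) is below 0.05, but its Holm-adjusted value (0.060) is not, so only nr_pix remains significant after correction.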
It is your responsibility to define the appropriate assumptions to run the tests, and to choose an
appropriate test according to the data characteristics and the question that you are studying. In
general, you will not be told what tests or R functions to use; it is up to you to explain and justify your
choice. You are not necessarily restricted to the hypothesis tests that were discussed in the lectures.
You may assume a significance level of 0.05 for the analyses when running hypothesis testing.
In particular, in the report you should address each of the following subtasks, using appropriate
statistical tests, tables, graphs, etc.
1. Construct suitable histograms for the nr_pix, height, and cols_with_3p features, for the full
set of 168 items. Briefly describe the shape of the distributions and comment on any
interesting patterns across the datasets.
2. Suppose you randomly sample a digit image from the full set of images. What is the probability
that the number of black pixels in the image (nr_pix) is greater than 20?
3. Present useful summary statistics (e.g. mean and standard deviation) about all the features,
for (a) the full set of letters, (b) the full set of digits (c) the entire set of 168 items. Briefly
discuss the summary statistics, and whether they already suggest which features may be
useful for discriminating letters and digits. For features you feel may be interesting for
discrimination between groups, consider suitable visualisations (e.g. histogram of feature
values for the groups²).
4. Investigate the relationship between the “height” variable and the “verticalness” variable. Are
these variables linearly associated? Consider suitable visualisation. Describe and conduct a
suitable statistical test to measure the degree of linear association between these two
variables.
5. Are there features which are useful to discriminate between the set of digits and the set of
letters? (Consider a statistical test which tests for differences between two groups). List the
three most useful features. Consider suitable visualisation. Briefly interpret your findings (i.e.
why might these features be useful) and the validity of any assumptions.
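
The mechanics of some of these subtasks can be sketched on simulated data. Everything below is an assumption made for illustration: the data frame is invented, and the empirical proportion, Pearson correlation test and Wilcoxon rank-sum test are only examples of possible choices, not the expected answers. Choosing and justifying the methods remains part of the task.

```r
set.seed(42)

# Simulated stand-in for the real features file; all values are invented.
feat <- data.frame(
  group  = rep(c("digit", "letter"), each = 56),
  nr_pix = c(rpois(56, lambda = 25), rpois(56, lambda = 40)),
  height = round(runif(112, min = 8, max = 20))
)
feat$verticalness <- 0.05 * feat$height + rnorm(112, sd = 0.2)

# Subtask 2 (one possible approach): empirical proportion of digits with nr_pix > 20.
digits <- feat[feat$group == "digit", ]
p_over_20 <- mean(digits$nr_pix > 20)

# Subtask 4 (one possible approach): Pearson correlation test for linear association.
assoc <- cor.test(feat$height, feat$verticalness, method = "pearson")

# Subtask 5 (one possible approach): Wilcoxon rank-sum test for a two-group
# difference; exact = FALSE uses the normal approximation, since ties are expected.
group_test <- wilcox.test(nr_pix ~ group, data = feat, exact = FALSE)
```

The objects `assoc` and `group_test` carry the estimate and p-value that a report would then interpret, for example via `assoc$estimate` and `group_test$p.value`.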

For all questions above, you shall explain your reasoning, assumptions and steps of the procedure
(including the statistical analysis) when preparing the report. Use statistics to justify your reasoning.
If you are generating p-values for analysing the statistical significance of some features, make sure to
explain how they were obtained. It is your task to decide and justify what the most appropriate
inference to be performed in each case is, and to discuss the results you obtained.
Put code for this section in a folder called section3_code. Your code should use relative paths; i.e. it
should read the feature data from “../section2_features”.

Section 4 (25 marks): Regression and Machine Learning
1. Suppose that instead of calculating horizontalness in Section 2, you instead would like to
predict the horizontalness value from the feature variables 1-12. Fit a multiple regression
model to predict horizontalness as best you can from a subset of these variables (consider
approaches to feature selection). Give the results table of the regression model.
2. Using any 3 features that you think should be useful (justifying your choices, e.g. on the basis
of results and visualisations from Section 3 above), use logistic regression to build a classifier
that discriminates between the “letter” and “digit” classes. Use 5-fold cross-validation to
evaluate the accuracy of your fitted model. Briefly (1-2 sentences) interpret your results.
3. Does your model in subsection 4.2 distinguish between the “letter” and “digit” categories for
the 112 images significantly more accurately than a “random” model that just randomly
responds “letter” 50% of the time and “digit” 50% of the time? Perform a suitable
statistical analysis using the binomial distribution.
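
The workflow behind these three subtasks can be sketched on simulated data. This is an illustration under assumptions, not a model answer: the columns `f1` to `f3` are placeholder features, the data are invented, and stepwise selection with `step` is only one of several feature-selection approaches.

```r
set.seed(42)

# Simulated stand-in for the 112 letter/digit rows; f1..f3 are placeholder features.
n <- 112
dat <- data.frame(is_letter = rep(c(0, 1), each = n / 2))
dat$f1 <- rnorm(n, mean = 1.5 * dat$is_letter)
dat$f2 <- rnorm(n, mean = 0.5 * dat$is_letter)
dat$f3 <- rnorm(n)                                   # pure noise
dat$horizontalness <- 0.3 * dat$f1 + rnorm(n, sd = 0.1)

# Subtask 1 (sketch): multiple regression followed by AIC-based stepwise selection.
full <- lm(horizontalness ~ f1 + f2 + f3, data = dat)
reduced <- step(full, trace = 0)
coef_table <- summary(reduced)$coefficients          # the regression results table

# Subtask 2 (sketch): logistic regression evaluated with 5-fold cross-validation.
folds <- sample(rep(1:5, length.out = n))            # each row gets exactly one fold
correct <- logical(n)
for (k in 1:5) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  fit <- glm(is_letter ~ f1 + f2 + f3, data = train, family = binomial)
  prob <- predict(fit, newdata = test, type = "response")
  correct[folds == k] <- (prob > 0.5) == (test$is_letter == 1)
}
cv_accuracy <- mean(correct)

# Subtask 3 (sketch): one-sided exact binomial test against a coin-flip
# classifier, which is correct with probability 0.5 on each image.
vs_random <- binom.test(sum(correct), n, p = 0.5, alternative = "greater")
```

The p-value in `vs_random$p.value` is the probability of seeing at least this many correct predictions if the classifier were no better than random guessing.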
Put code for this section in a folder called section4_code. Your code should use relative paths; i.e. it
should read the feature data from “../section2_features”.

Assessment criteria and marking process
The most important criterion in marking is the completeness, accuracy, quality and clarity of your
report (approximately 75% weighting, across the full submission). In your report, you should clearly
demonstrate that you understand the methods used in each sub-task. Explain your chosen
approach, your reasoning, and the assumptions and steps of the procedures used. You should
explain and interpret your results, demonstrating understanding and independent thinking. What
are your results telling you? Are the results what you would expect? If you ran into difficulties,
explain what they were and the efforts you made to try to overcome them.

² A nice example: https://stackoverflow.com/questions/36049729/r-ggplot2-get-histogram-of-difference-between-two-groups

Code has a weighting in marking of approximately 15%. Your code should be clear and logically
organised, and do what is required, but code efficiency and code sophistication are not important (this
assignment does not require complex programming). However, you should use loops and variables
(rather than hard-coded values) where appropriate. If you use freely licensed code, packages, or
libraries (which is encouraged), these should be appropriately referenced (e.g. by citing a URL in a
comment). For example, using StackOverflow code snippets is fine, provided you acknowledge the use
and provide the URL to the code snippets in the comments, and follow the MIT licence. The code
must be easy to use and the comments must include information about the required steps to
replicate the results that you have obtained and are presenting in your report (transparency and
replicability are essential in data analysis).
Attention to detail and following the assignment instructions accurately will also be considered in
marking (approximately 10% weighting). Each sub-task has a precise specification. Make sure you
carefully follow the instructions, and use the features specified for each task, the specified
procedures (seed value, data file specifications and file names, directory structure and names, etc).
Make sure you upload your deliverable files precisely in the specified formats.
Your report should explain how you have performed the analysis, but do not explain details of code
implementation in your report - use the source code comments for that.
It should be straightforward for the assessor to rerun your code to produce the same results as
presented in your report. Ensure that the different subsections are clearly labelled in the source code,
and in the report. Each of these subtasks should be addressed in a separate subsection of your report,
following the report template.

Deliverables
You must submit your assignment online, using the module webpage, by 11pm Friday, 19th March
2021. The online uploaded file must be a ZIP file called assignment2_STUDENTNR.zip, containing
multiple files and directories. The contents of the zip file are specified below (bold text indicates
folder names):
• STUDENTNR_assignment2_report.pdf
• section1_code
    o [your code files; a single code file is preferred]
• section1_images
    o [168 .csv files with the following naming format: STUDENTNR_LABEL_INDEX.csv]
    o Optionally, also include the PGM files used to create the csv files, with the same
      naming format
• section2_code
    o [your code files; a single code file is preferred]
• section2_features
    o STUDENTNR_features.csv
• section3_code
    o [your code files; a single code file is preferred]
• section4_code
    o [your code files; a single code file is preferred]

Please use the provided report template for preparing your report (or create an equivalent LaTeX
format). Ensure that the header and footer information (student name, student number) is clearly
visible on the PDF. The word limit for the report is 4000 words (excluding tables and figures; you can
include as many tables and figures as you feel is appropriate).
A RAR file is not a ZIP file. A broken or corrupt ZIP file is not a ZIP file. Do not include .Rdata files
in your upload; these may make your zip file very large.
It is your responsibility to ensure the assignment is uploaded and double-checked in good time
before the deadline. Standard university penalties apply for late submission.
By submitting this assignment, you acknowledge that it is your own work and that you are aware of
university regulations regarding academic offences, including (but not restricted to) plagiarism and
collusion. Collusion/plagiarism will be manually and algorithmically checked for.
