COMP1433: Introduction to Data Analytics
COMP1003: Statistical Tools and Applications
Spring 2021
Answers Submission Due: 23:59, Apr 11, 2021.
Important Notes.
• This is an individual assessment, so no discussion (in any form) is allowed
among classmates.
• Use only the R language for the implementation.
• Please submit a compressed folder (zip or rar) containing the answers to all
questions. Please name the folder with your student ID, such as
“20123456D.zip” or “20123456D.rar”. For the i-th question below, you
may create a folder named “Qi” (e.g., Q1) that contains the code
answers for Qi and a readme.txt file explaining how to run the code.
• You are strongly encouraged to comment your code well so that the TA
can read it easily. If the implementation is imperfect (with bugs), the
comments help us award partial marks by checking whether your algorithm
is designed correctly.
• The compressed folder should be submitted on Blackboard. The full
mark is 100’ and the submission entry is Assessments/Assignment. For
Questions 1 and 3 below, we also provide the input data in the
compressed folder “Assignment Data”, available in the same entry. You will
find two folders there, named Q1 and Q3, containing the input for
Questions 1 and 3 respectively.
• No late submission is allowed. Remember to double-check that your
submission has been saved successfully before leaving.
• When handling paths for file loading and saving, please use relative
paths so that the TAs can easily run your code in a different environment.
• When visualizing data or results with graphs, label the axes, legends,
and titles (where applicable) properly for easy reading.
• Last but not least, best of luck with this assignment! :)
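As an illustration of the note on relative paths, here is a minimal sketch. It assumes that R's working directory is the folder containing the script; the file name `example.csv` is hypothetical and used only for this illustration.

```r
# Hypothetical file name, used only for illustration.
# The path is relative to the current working directory.
input_path <- file.path(".", "example.csv")

# Create a tiny file so that the sketch runs on its own.
write.csv(data.frame(x = 1:3, y = c(2, 4, 6)), input_path, row.names = FALSE)

# Read it back through the same relative path
# (no absolute prefix such as "C:/..." or "/home/...").
d <- read.csv(input_path)
print(d)

# Remove the illustrative file again.
unlink(input_path)
```

Because the path never mentions a machine-specific folder, the same script should run unchanged wherever the folder is unpacked.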
Question 1. We learned the K-means clustering method in class. Implement
the algorithm from scratch (without using external packages or the
kmeans() function in R) and group the points into 3 clusters. The input
points are stored in a file named “Points.csv”: each line holds a 2D vector (point)
whose first and second entries are separated by a comma “,”. When run,
your code must generate a figure visualizing the clustering results.
For the visualization, first draw a scatter plot of the input vectors (the x-axis
corresponding to the first entry and the y-axis to the second), and then color
each point red, green, or blue to indicate the cluster it belongs to.
Note. It is assumed that the “Points.csv” file is in the same folder as the
code (with the main function). The K-means implementation itself must be
written from scratch, while for visualization you may use any graphics
or visualization packages you want. For initialization, the representatives of the
three clusters are (40,40), (100,0), and (0,100), respectively. (20’)
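The assign-and-update loop of K-means can be sketched as follows. This is only a rough illustration under stated assumptions: the points are synthetic (they are not the contents of “Points.csv”), and the function name `simple_kmeans` and its `max_iter` parameter are hypothetical choices, not a prescribed design.

```r
# Synthetic 2D points for illustration only (NOT the assignment's Points.csv).
set.seed(1)
pts <- rbind(
  cbind(rnorm(30, 40, 8),  rnorm(30, 40, 8)),
  cbind(rnorm(30, 100, 8), rnorm(30, 0, 8)),
  cbind(rnorm(30, 0, 8),   rnorm(30, 100, 8))
)

# Initial representatives as given in the note above.
centers <- rbind(c(40, 40), c(100, 0), c(0, 100))

# Hypothetical helper: alternates assignment and center updates until stable.
simple_kmeans <- function(pts, centers, max_iter = 100) {
  k <- nrow(centers)
  cluster <- rep(0L, nrow(pts))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance of every point to every center (n x k matrix).
    d2 <- sapply(seq_len(k), function(j) {
      (pts[, 1] - centers[j, 1])^2 + (pts[, 2] - centers[j, 2])^2
    })
    new_cluster <- apply(d2, 1, which.min)   # index of the nearest center
    if (all(new_cluster == cluster)) break   # assignments unchanged: stop
    cluster <- new_cluster
    # Move each center to the mean of its points; keep it if the cluster is empty.
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(pts[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

fit <- simple_kmeans(pts, centers)

# Scatter plot colored by cluster, with labeled axes, title, and legend.
cols <- c("red", "green", "blue")
plot(pts[, 1], pts[, 2], col = cols[fit$cluster], pch = 19,
     xlab = "First entry", ylab = "Second entry",
     main = "K-means clustering (K = 3)")
legend("topright", legend = paste("Cluster", 1:3), col = cols, pch = 19)
```

Only base R functions are used in the clustering part; the stopping rule (no change in assignments) is one common choice among several.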
Question 2. Recall that in the test held on Mar 18, we discussed sampling
from a list Λ with replacement, where repeated letters may be sampled. In
the test, Λ had only two letters, i.e., A and B, while here we consider a Λ with
the 26 alphabetic letters from A to Z. As before, we repeatedly sample
letters at random from Λ and maintain another list λ to record the sampled letters.
For example, if the letters sequentially sampled from Λ are DTTA, then λ =
〈D,T,T,A〉. Once a letter is drawn from Λ in a sampling step, a copy is
inserted into λ, and in the next step the same letter may be drawn
again (in other words, λ may contain repeated letters).
In each sampling step, each letter has an equal chance, i.e., 1/26, of being drawn, and
the sampling process does not stop until A appears in λ. Let X denote the length of
λ (the number of letters therein) when the sampling process stops (X = 4 in the
example of λ = 〈D,T,T,A〉, where we observe A at the fourth sampling step and stop).
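As a reference point for the setup above, and assuming the draws are independent, X counts the draws up to and including the first A, so it follows a geometric distribution with success probability 1/26. A short derivation:

```latex
P(X = k) = \left(\frac{25}{26}\right)^{k-1} \cdot \frac{1}{26}, \qquad k = 1, 2, 3, \dots

E[X] = \sum_{k=1}^{\infty} k \left(\frac{25}{26}\right)^{k-1} \cdot \frac{1}{26}
     = \frac{1}{26} \cdot \frac{1}{\left(1 - \frac{25}{26}\right)^{2}}
     = 26
```

For the example λ = 〈D,T,T,A〉, this gives P(X = 4) = (25/26)^3 · (1/26).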
(1) Use sampling to estimate the distribution of X and visualize it with a
histogram. (20’)
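One possible way to simulate the process is sketched below. The helper name `draw_until_A`, the seed, and the number of runs are arbitrary, hypothetical choices made for illustration; they are not part of the question.

```r
set.seed(2021)

# Hypothetical helper: draw letters uniformly with replacement until "A"
# appears, and return the number of draws (the length of the recorded list).
draw_until_A <- function(letters_pool = LETTERS) {
  n_draws <- 0
  repeat {
    n_draws <- n_draws + 1
    if (sample(letters_pool, 1) == "A") return(n_draws)
  }
}

n_runs <- 10000                        # number of simulated runs (arbitrary)
x <- replicate(n_runs, draw_until_A()) # one value of X per run

# Histogram of the simulated values of X.
hist(x, breaks = 50,
     xlab = "X (length of the list when A first appears)",
     ylab = "Frequency",
     main = "Estimated distribution of X")

# Sample mean as a rough estimate of E[X].
print(mean(x))
```

More runs give a smoother histogram at the cost of a longer running time.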
(2) In the context of (1), there are 26 letters in Λ. Here we assume Λ contains
N letters, where N = 2,3,4,...,26. Use sampling to estimate E[X] (the expected
value of X) for each value of N. Visualize the relation between N and E[X]
with a graph (pick one that you think is suitable). (10’)
Question 3. We are interested in examining employees’ salaries by looking
into 35 samples, with their salaries, ages, and whether they hold an MBA degree.
(1) The sample data can be found in the attachment Q3/Salaries.csv, which contains 36
lines: the header is on the first line (the attribute names are “Salary”,
“Age”, and “MBA”) and the data are on the remaining lines (one record per line). Attributes
are separated by commas (,). To analyze the data, the first step is to load the data
into R and put it into a data frame for further analysis. Here we assume
that Salaries.csv is at the same level as the R script corresponding to Q3. (10’)
(2) Add a column named “ID”, with values from 1 to 35, to the data frame in (1). Write
the data frame (including the column names) to a file “newSalaries.csv” in
the same folder as the code and the input “Salaries.csv”. (10’)
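The load, add-column, and write steps can be sketched as follows. The records below are made up and written to a temporary folder so that the sketch runs on its own; in particular, the encoding of the MBA column as "Yes"/"No" is an assumption, since the actual file is not shown here.

```r
# Toy stand-in for Salaries.csv (made-up values, assumed "Yes"/"No" coding),
# written to a temporary folder so the sketch is self-contained.
work_dir <- tempdir()
in_path  <- file.path(work_dir, "Salaries.csv")
writeLines(c("Salary,Age,MBA",
             "50000,28,No",
             "72000,41,Yes",
             "65000,35,No"), in_path)

# Load into a data frame; header = TRUE takes the column names from line 1.
salaries <- read.csv(in_path, header = TRUE, sep = ",")

# Add an ID column running from 1 to the number of records.
salaries$ID <- seq_len(nrow(salaries))

# Write the result with column names but without R's automatic row names.
out_path <- file.path(work_dir, "newSalaries.csv")
write.csv(salaries, out_path, row.names = FALSE)

str(salaries)
```

Using `seq_len(nrow(...))` rather than a hard-coded range keeps the ID column correct if the number of records changes.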
(3) Calculate the mean, standard deviation, median, minimum,
and maximum of the salaries. Print the results on the screen using the following
template (6 lines altogether). Use the built-in functions in R
where possible. (10’)
The statistics for salaries are:
mean value=xxx;
standard deviation=xxx;
median value=xxx
minimum value=xxx
maximum value=xxx
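A minimal sketch of printing in this template with built-in functions is shown below. The vector `salary` holds made-up values for illustration, not the assignment data.

```r
# Made-up salaries, for illustration only.
salary <- c(50000, 72000, 65000, 58000, 91000)

# cat() prints without quotes; sep = "" avoids extra spaces around "=".
cat("The statistics for salaries are:\n")
cat("mean value=", mean(salary), ";\n", sep = "")
cat("standard deviation=", sd(salary), ";\n", sep = "")  # sd() divides by n - 1
cat("median value=", median(salary), "\n", sep = "")
cat("minimum value=", min(salary), "\n", sep = "")
cat("maximum value=", max(salary), "\n", sep = "")
```

Note that `sd()` computes the sample standard deviation (denominator n − 1).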
(4) Select a suitable graph to visualize the pairwise relation between salary
and age, and that between salary and MBA. Further, visualize the relations
among salary, age, and MBA in one figure. (10’)
(5) Use a bar plot to visualize the comparison between the median salaries
of employees with and without an MBA degree. Then, visualize the average
salaries, again with a bar plot, for employees in the age groups < 30, [30,40),
[40,50), and ≥ 50. (10’)
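Grouping by MBA status and by age bins can be sketched as follows. The data frame `d` contains made-up records, and the "Yes"/"No" coding of MBA is an assumption made for this illustration.

```r
# Made-up records, for illustration only (assumed "Yes"/"No" coding for MBA).
d <- data.frame(
  Salary = c(50000, 72000, 65000, 58000, 91000, 83000),
  Age    = c(25, 41, 35, 29, 52, 47),
  MBA    = c("No", "Yes", "No", "No", "Yes", "Yes")
)

# Median salary within each MBA group.
med_by_mba <- tapply(d$Salary, d$MBA, median)
barplot(med_by_mba, xlab = "MBA degree", ylab = "Median salary",
        main = "Median salary by MBA status")

# Left-closed, right-open age bins: <30, [30,40), [40,50), >=50.
# right = FALSE makes each interval include its lower bound only.
d$AgeGroup <- cut(d$Age, breaks = c(-Inf, 30, 40, 50, Inf), right = FALSE,
                  labels = c("<30", "[30,40)", "[40,50)", ">=50"))

# Average salary within each age group.
mean_by_age <- tapply(d$Salary, d$AgeGroup, mean)
barplot(mean_by_age, xlab = "Age group", ylab = "Average salary",
        main = "Average salary by age group")
```

With `right = FALSE`, an employee aged exactly 30 falls into [30,40), matching the interval notation in the question.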