COMP1433: Introduction to Data Analytics

COMP1003: Statistical Tools and Applications

Assignment

Spring 2021

Answers Submission Due: 23:59, Apr 11, 2021.

Important Notes.

• This is an individual assessment, so no discussion in any form is allowed among classmates.

• Use only the R language for the implementation.

• Please submit a compressed folder (zip or rar) containing all of your answers. Please name the folder with your student ID, e.g., “20123456D.zip” or “20123456D.rar”. For the i-th question below, you may create a subfolder named “Qi” (e.g., Q1) that contains the code for Qi and a readme.txt file explaining how to run it.

• You are strongly encouraged to comment your code well so that the TA can read it easily. If the implementation is imperfect (has bugs), the comments let us award partial credit by checking whether your algorithm is designed correctly.

• Submit the compressed folder to Blackboard.¹ The full mark is 100’ and the submission entry is Assessments/Assignment. For Questions 1 and 3 below, the input data is provided in the compressed folder “Assignment Data”, available in the same entry. It contains two folders, Q1 and Q3, holding the input for Questions 1 and 3 respectively.

• Late submissions are not accepted. Remember to double-check that your submission has been saved successfully before leaving.

¹ learn.polyu.edu.hk

• When handling paths for loading and saving files, use relative paths so that the TAs can run your code easily in a different environment.

• When visualizing data or results with graphs, label the axes, legends, and titles (where applicable) properly for easy reading.

• Last but not least, best of luck with this assignment! :)

Question 1. We learned the K-means clustering method in class. Implement the algorithm from scratch (without using external packages or R’s built-in kmeans() function) and group the points into 3 clusters. The input points are stored in a file named “Points.csv”: each line holds one 2D vector (point) whose first and second entries are separated by a comma “,”. When run, your code must generate a figure visualizing the clustering results.

For the visualization, draw a scatter plot of the input vectors (the x-axis corresponds to the first entry and the y-axis to the second), and color each point red, green, or blue to indicate the cluster it belongs to.

Note. The “Points.csv” file is assumed to be in the same folder as the code (the script containing the main function). The K-means implementation itself must be written from scratch, but for visualization you may use any graphics or visualization package you want. For initialization, the representatives of the three clusters are (40,40), (100,0), and (0,100), respectively. (20’)
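
The two alternating K-means steps described above (assign each point to its nearest representative, then move each representative to the mean of its points) can be sketched as follows. This is only an illustrative sketch, assuming squared Euclidean distance; the function name kmeans_sketch and the six toy points are hypothetical and are not taken from Points.csv.

```r
# Hypothetical helper: plain K-means on a numeric matrix `pts` (one 2D point
# per row), starting from the rows of `centers`. Base R only.
kmeans_sketch <- function(pts, centers, max_iter = 100) {
  k <- nrow(centers)
  cluster <- rep(0L, nrow(pts))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared distance from every point to every center,
    # stored as an (n points) x (k centers) matrix.
    d <- matrix(sapply(seq_len(k), function(j) {
      (pts[, 1] - centers[j, 1])^2 + (pts[, 2] - centers[j, 2])^2
    }), nrow = nrow(pts))
    new_cluster <- apply(d, 1, which.min)
    # Stop once no point changes its cluster.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Update step: move each center to the mean of its points
    # (a center with no points is left where it is).
    for (j in seq_len(k)) {
      members <- pts[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

# Toy points (made up for illustration) and the initial representatives
# given in the note above.
pts  <- rbind(c(38, 42), c(45, 35), c(95, 5), c(105, -2), c(3, 98), c(-4, 104))
init <- rbind(c(40, 40), c(100, 0), c(0, 100))
fit  <- kmeans_sketch(pts, init)

plot(pts[, 1], pts[, 2], col = c("red", "green", "blue")[fit$cluster], pch = 19,
     main = "K-means clusters (toy data)", xlab = "First entry", ylab = "Second entry")
```

For the real input, the matrix would instead come from something like as.matrix(read.csv("Points.csv", header = FALSE)), assuming the file has no header line.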

Question 2. Recall that in the test held on Mar 18, we discussed sampling from a list Λ with replacement, where the same letter can be sampled more than once. In the test, Λ had only two letters, A and B; here we consider a Λ with the 26 letters of the alphabet, from A to Z. As before, we repeatedly sample letters at random from Λ and maintain another list λ to record the sampled letters. For example, if the letters sampled from Λ, in order, are DTTA, then λ = 〈D,T,T,A〉. Once a letter is drawn from Λ in a sampling step, a copy is appended to λ, and the same letter may be drawn again in a later step (in other words, λ may contain repeated letters).

In each sampling step, every letter has an equal chance, i.e., 1/26, of being drawn, and the sampling process does not stop until A appears in λ. Let X denote the length of λ (the number of letters in it) when the sampling process stops (X = 4 in the example λ = 〈D,T,T,A〉, where we observe A at the fourth draw and stop).

(1) Use sampling to estimate the distribution of X and visualize it with a histogram. (20’)
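
One way to read part (1) is to simulate the stopping process many times and plot the observed values of X. The sketch below does that with base R; the helper name draw_until_A, the seed, and the number of trials (10000) are arbitrary assumptions, not part of the assignment.

```r
# Hypothetical helper: draw letters uniformly with replacement until "A"
# shows up, and return how many draws that took (the length of lambda).
draw_until_A <- function(letters_pool = LETTERS) {
  n_draws <- 0
  repeat {
    n_draws <- n_draws + 1
    if (sample(letters_pool, 1) == "A") return(n_draws)
  }
}

set.seed(1433)                          # arbitrary seed, for reproducibility only
x <- replicate(10000, draw_until_A())   # 10000 trials is an arbitrary choice

hist(x, breaks = 50, main = "Simulated distribution of X",
     xlab = "X (draws until the first A)", ylab = "Frequency")
mean(x)  # sample mean, an estimate of E[X]
```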

(2) In the context of (1), there are 26 letters in Λ. Now assume instead that Λ contains N letters, where N = 2, 3, 4, ..., 26. Use sampling to estimate E[X] (the expected value of X) for each value of N. Visualize the relation between N and E[X] with a graph of your choice (pick one that you think is suitable). (10’)
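
Because every draw is A with probability 1/N independently of the others, X follows a geometric distribution, so in theory E[X] = N. The sketch below checks this by simulation. Note that it swaps the letter-by-letter loop for R’s rgeom(), which is a shortcut and not necessarily what the assignment expects; the seed and the number of trials per N are arbitrary assumptions.

```r
set.seed(1003)      # arbitrary seed, for reproducibility only
n_values <- 2:26
n_trials <- 5000    # arbitrary number of simulated runs per N

# rgeom() counts the failures before the first success, so add 1 to include
# the successful draw of "A" itself.
est_mean <- sapply(n_values, function(n) mean(rgeom(n_trials, prob = 1 / n) + 1))

plot(n_values, est_mean, type = "b",
     main = "Estimated E[X] for alphabets of size N",
     xlab = "N (number of letters)", ylab = "Estimated E[X]")
abline(a = 0, b = 1, lty = 2)  # theoretical line E[X] = N
```

The estimated points should lie close to the dashed line, with more scatter for larger N because the variance of X grows with N.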

Question 3. We are interested in examining employees’ salaries by looking at 35 samples that record each employee’s salary, age, and whether they hold an MBA degree.

(1) The sample data can be found in the attachment Q3/Salaries.csv, which contains 36 lines: the first line is the header (the attribute names are “Salary”, “Age”, and “MBA”) and each remaining line is one record. Attributes are separated by commas (,). To analyze the data, first load it into R and store it in a data frame for further analysis. Assume that Salaries.csv is in the same folder as the R script for Q3. (10’)

(2) Add a column named “ID”, with values from 1 to 35, to the data frame from (1). Write the data frame (including the column names) to a file named “newSalaries.csv” in the same folder as the code and the input “Salaries.csv”. (10’)
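
Parts (1) and (2) amount to a read, add-column, write round trip. The sketch below shows that pattern on a made-up three-row stand-in file written to a temporary folder so that it runs anywhere; the rows and the “Yes”/“No” coding of MBA are assumptions, since the real file is not reproduced here.

```r
# Made-up stand-in for Salaries.csv (3 rows, not the real 35 records).
work_dir <- tempdir()
in_path  <- file.path(work_dir, "Salaries.csv")
writeLines(c("Salary,Age,MBA", "52000,28,No", "71000,41,Yes", "64000,35,No"), in_path)

salaries <- read.csv(in_path, header = TRUE)   # load into a data frame
salaries$ID <- seq_len(nrow(salaries))         # 1..n; this is 1..35 for the real file

out_path <- file.path(work_dir, "newSalaries.csv")
write.csv(salaries, out_path, row.names = FALSE)  # keep column names, drop row names
```

In an actual submission the paths would simply be the relative names "Salaries.csv" and "newSalaries.csv", as the important notes request.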

(3) Calculate the mean, standard deviation, median, minimum, and maximum of the salaries. Print the results on the screen using the following template (6 lines altogether). Use R’s built-in functions where possible. (10’)

The statistics for salaries are:
mean value=xxx;
standard deviation=xxx;
median value=xxx
minimum value=xxx
maximum value=xxx
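
A minimal sketch of filling in this template with built-in functions, using four made-up salary values in place of the real column. The vector name salary and its values are assumptions; note that sd() returns the sample standard deviation (dividing by n - 1).

```r
salary <- c(52000, 71000, 64000, 58000)  # made-up values, not the real data

# Build the six template lines, keeping the punctuation of the template as given.
stats_lines <- c(
  "The statistics for salaries are:",
  paste0("mean value=", mean(salary), ";"),
  paste0("standard deviation=", sd(salary), ";"),
  paste0("median value=", median(salary)),
  paste0("minimum value=", min(salary)),
  paste0("maximum value=", max(salary))
)
cat(stats_lines, sep = "\n")
```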

(4) Choose suitable graphs to visualize the pairwise relation between Salary and Age, and that between Salary and MBA. Then visualize the relations among Salary, Age, and MBA in a single figure. (10’)

(5) Use a bar plot to compare the median salaries of employees with and without an MBA degree. Then, again with a bar plot, visualize the average salaries of employees in the age groups < 30, [30, 40), [40, 50), and ≥ 50. (10’)
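
The grouping in part (5) can be sketched with tapply() and cut(). The data frame emp below is made up for illustration, and the “Yes”/“No” coding of MBA is an assumption about the real file.

```r
# Made-up records (not the real data); MBA is assumed to be coded "Yes"/"No".
emp <- data.frame(
  Salary = c(48000, 52000, 61000, 67000, 75000, 82000, 90000, 58000),
  Age    = c(24, 29, 33, 38, 44, 47, 55, 30),
  MBA    = c("No", "No", "Yes", "No", "Yes", "Yes", "Yes", "No")
)

# Median salary per MBA group.
median_by_mba <- tapply(emp$Salary, emp$MBA, median)
barplot(median_by_mba, main = "Median salary by MBA degree",
        xlab = "Holds an MBA degree", ylab = "Median salary")

# right = FALSE makes the intervals closed on the left: [30, 40), [40, 50), ...
age_group <- cut(emp$Age, breaks = c(-Inf, 30, 40, 50, Inf), right = FALSE,
                 labels = c("< 30", "[30, 40)", "[40, 50)", ">= 50"))
mean_by_age <- tapply(emp$Salary, age_group, mean)
barplot(mean_by_age, main = "Average salary by age group",
        xlab = "Age group", ylab = "Average salary")
```

Using right = FALSE matters at the boundaries: an employee aged exactly 30 falls into [30, 40) rather than the < 30 group.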
