Date: 2021-05-13
MATH2349 Semester 2, 2019
Assignment 3
Required packages
library(dplyr)
library(tidyr)
library(outliers)
Executive Summary
• Before starting any investigation, the data should be understood and preprocessed into the required form.
• I used two datasets, Worldgdp and Income Group, and renamed the columns in both.
• I merged the two datasets on country name.
• After merging, I converted the IncomeGroup variable from character to an ordered factor.
• I used mutate() to create a new variable, GDP_Per_Capita, from the existing variables GDP_BY_IMF and population.
• I scanned for missing and special (NaN) values.
• I scanned the numeric variables for outliers using box plots (Tukey's method) and z-scores. I replaced the outliers in GDP_Per_Capita with the capping method and applied a log transformation to make the GDP_Per_Capita distribution more symmetric.
Data
– Worldgdp
The first dataset contains GDP data for the countries of the world.
Source: http://worldpopulationreview.com/countries/countries-by-gdp/
Variable representation:
• rank - rank by latest GDP value
• country - name of the country
• imfGDP - GDP value reported by the International Monetary Fund (in US$)
• unGDP - GDP value reported by the United Nations (in US$)
• pop - population of the country (apparently in thousands, judging by values such as 329064.92 for the United States)
– Income Group
The second dataset contains the region and income classification of each country.
Source : https://www.kaggle.com/uddipta/world-bank-unemployment->data.csv)
Variable representation:
• Country Name - name of the country
• Region - region of the world
• IncomeGroup - income classification of the country
# Read the GDP data set
worldgdp <- read.csv("worldgdp.csv", stringsAsFactors = FALSE)
# Rename the columns (each new name is written on one line so that no line break ends up inside it)
colnames(worldgdp)[colnames(worldgdp) == "country"] <- "Country Name"
colnames(worldgdp)[colnames(worldgdp) == "imfGDP"] <- "GDP_BY_IMF (in US$)"
colnames(worldgdp)[colnames(worldgdp) == "unGDP"] <- "GDP_BY_UN (in US$)"
colnames(worldgdp)[colnames(worldgdp) == "pop"] <- "population"
head(worldgdp)
  rank Country Name   GDP_BY_IMF (in US$) GDP_BY_UN (in US$)
1    1 United States       21344700000000     18624500000000
2    2 China               14216500000000     11218300000000
3    3 Japan                5176210000000      4936210000000
4    4 Germany              3963880000000      3477800000000
5    5 India                2972000000000      2259640000000
6    6 United Kingdom       2829160000000      2647900000000
6 rows
# Read the income group data set
incomegroup <- read.csv("incomecluster.csv", stringsAsFactors = FALSE)
incomegroup <- incomegroup %>%
  select(Country.Name, Region, IncomeGroup)
colnames(incomegroup)[colnames(incomegroup) == "Country.Name"] <- "Country Name"
head(incomegroup)
  Country Name         Region                     IncomeGroup
1 Afghanistan          South Asia                 Low income
2 Angola               Sub-Saharan Africa         Lower middle income
3 Albania              Europe & Central Asia      Upper middle income
4 Arab World
5 United Arab Emirates Middle East & North Africa High income
6 Argentina            Latin America & Caribbean  Upper middle income
6 rows
# Merge the two data sets on country name
data_1 <- merge(x = worldgdp, y = incomegroup, by = "Country Name")
head(data_1)
  Country Name rank GDP_BY_IMF (in US$) GDP_BY_UN (in US$) population Region
1 Afghanistan   116         19990000000        20235063330  38041.754 South Asia
2 Albania       122         15960000000        11863866791   2880.917 Europe & Central Asia
3 Algeria        55        183687000000       159049000000  43053.054 Middle East & North Africa
4 Angola         65         92191000000       106918000000  31825.295 Sub-Saharan Africa
5 Argentina      27        477743000000       545866000000  44780.677 Latin America & Caribbean
6 Armenia       131         13105000000        10572299425   2957.731 Europe & Central Asia
6 rows | 1-7 of 7 columns
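merge() performs an inner join by default, so any row whose country name has no exact counterpart in the other data set is dropped without warning. A minimal base-R sketch of how the unmatched names could be listed before merging; the two small data frames and their values are made up for illustration and are not the assignment data:

```r
# Toy stand-ins for the two data sets (hypothetical values, illustration only)
gdp <- data.frame(`Country Name` = c("Japan", "India", "Russia"),
                  rank = c(3, 5, 11),
                  check.names = FALSE, stringsAsFactors = FALSE)
income <- data.frame(`Country Name` = c("Japan", "India",
                                        "Russian Federation", "Arab World"),
                     IncomeGroup = c("High income", "Lower middle income",
                                     "Upper middle income", ""),
                     check.names = FALSE, stringsAsFactors = FALSE)

# Names present in one data set but not in the other
unmatched_gdp <- setdiff(gdp$`Country Name`, income$`Country Name`)
unmatched_income <- setdiff(income$`Country Name`, gdp$`Country Name`)

# Inner join (the merge() default) keeps only names found in both data sets
merged <- merge(x = gdp, y = income, by = "Country Name")

print(unmatched_gdp)
print(unmatched_income)
print(nrow(merged))
```

Inspecting the two unmatched vectors shows whether rows are lost because of spelling differences or because they are aggregates rather than countries.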
# Rearrange the column order
data_1 <- data_1[, c(2, 1, 6, 7, 5, 3, 4)]
# Sort the data by rank
data_1 <- data_1[order(as.integer(data_1$rank), decreasing = FALSE), ]
# Reset the row names so they run from 1 to n after sorting
row.names(data_1) <- NULL
head(data_1)
  rank Country Name   Region                IncomeGroup         population
1    1 United States  North America         High income          329064.92
2    2 China          East Asia & Pacific   Upper middle income 1433783.69
3    3 Japan          East Asia & Pacific   High income          126860.30
4    4 Germany        Europe & Central Asia High income           83517.04
5    5 India          South Asia            Lower middle income 1366417.75
6    6 United Kingdom Europe & Central Asia High income           67530.17
6 rows | 1-7 of 7 columns
Understand
# Check the class of each variable
class(data_1$rank)
[1] "integer"
class(data_1$`Country Name`)
[1] "character"
class(data_1$Region)
[1] "character"
class(data_1$IncomeGroup)
[1] "character"
class(data_1$population)
[1] "numeric"
class(data_1$`GDP_BY_IMF (in US$)`)
[1] "numeric"
class(data_1$`GDP_BY_UN (in US$)`)
[1] "numeric"
# Convert the character IncomeGroup column to an ordered factor
data_1$IncomeGroup <- factor(data_1$IncomeGroup,
  levels = c("High income", "Upper middle income", "Lower middle income", "Low income"),
  ordered = TRUE)
class(data_1$IncomeGroup)
[1] "ordered" "factor"
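Because the factor is ordered, its values can be compared with < and >, and the comparison follows the order in which the levels were listed. With the order used here, "High income" is the lowest level and "Low income" the highest. A small sketch on a made-up vector (the name `income_levels` and the sample values are assumptions for illustration):

```r
income_levels <- c("High income", "Upper middle income",
                   "Lower middle income", "Low income")

# Made-up sample values, not taken from the data set
x <- factor(c("Low income", "High income", "Upper middle income"),
            levels = income_levels, ordered = TRUE)

# Comparisons follow the order of `levels`, not alphabetical order
is_below_upper_middle <- x < "Upper middle income"
print(is_below_upper_middle)

# Sorting an ordered factor also follows the level order
print(as.character(sort(x)))
```

If the opposite direction is wanted (low income as the lowest level), the levels vector would simply be listed in reverse.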
# Check the structure of the data set
str(data_1)
'data.frame': 159 obs. of 7 variables:
 $ rank               : int 1 2 3 4 5 6 7 8 9 10 ...
 $ Country Name       : chr "United States" "China" "Japan" "Germany" ...
 $ Region             : chr "North America" "East Asia & Pacific" "East Asia & Pacific" "Europe & Central Asia" ...
 $ IncomeGroup        : Ord.factor w/ 4 levels "High income"<..: 1 2 1 1 3 1 1 1 2 1 ...
 $ population         : num 329065 1433784 126860 83517 1366418 ...
 $ GDP_BY_IMF (in US$): num 21344700000000 14216500000000 5176210000000 3963880000000 2972000000000 ...
 $ GDP_BY_UN (in US$) : num 18624500000000 11218300000000 4936210000000 3477800000000 2259640000000 ...
head(data_1)
  rank Country Name   Region                IncomeGroup         population
1    1 United States  North America         High income          329064.92
2    2 China          East Asia & Pacific   Upper middle income 1433783.69
3    3 Japan          East Asia & Pacific   High income          126860.30
4    4 Germany        Europe & Central Asia High income           83517.04
5    5 India          South Asia            Lower middle income 1366417.75
6    6 United Kingdom Europe & Central Asia High income           67530.17
6 rows | 1-7 of 7 columns
Tidy & Manipulate Data I
According to Wickham and Grolemund (2016), tidy data follows three rules:
• Each variable must have its own column.
• Each observation must have its own row.
• Each value must have its own cell.
In data_1 each column holds one variable, each row holds one country, and each cell holds a single value, so the data set is already in tidy format.
Tidy & Manipulate Data II
# Create the GDP_Per_Capita variable.
# The population column appears to be in thousands (e.g. 329064.92 for the
# United States), so it is multiplied by 1000 to obtain US$ per person.
data_1 <- mutate(data_1,
  `GDP_Per_Capita (in US$)` = `GDP_BY_IMF (in US$)` / (population * 1000))
head(data_1)
  rank Country Name   Region                IncomeGroup         population
1    1 United States  North America         High income          329064.92
2    2 China          East Asia & Pacific   Upper middle income 1433783.69
3    3 Japan          East Asia & Pacific   High income          126860.30
4    4 Germany        Europe & Central Asia High income           83517.04
5    5 India          South Asia            Lower middle income 1366417.75
6    6 United Kingdom Europe & Central Asia High income           67530.17
6 rows | 1-7 of 8 columns
GDP per capita is the average value of goods and services produced per person. It is calculated by dividing a country's GDP by its population.
mutate() creates the GDP_Per_Capita (in US$) variable by dividing GDP_BY_IMF (in US$) by population, with population first converted from thousands to persons.
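As a quick sanity check of the calculation, the figures shown for the United States can be plugged in by hand. This sketch assumes the population column is recorded in thousands, which the value 329064.92 suggests but the source does not state; the variable names are illustrative:

```r
gdp_imf_usd <- 21344700000000      # GDP_BY_IMF (in US$) shown for the United States
population_thousands <- 329064.92  # population value shown for the United States

# Assumption: population is in thousands, so convert to persons first
gdp_per_capita <- gdp_imf_usd / (population_thousands * 1000)
print(round(gdp_per_capita))
```

The result is roughly 65,000 US$ per person. Dividing by the raw population value instead would give a figure 1,000 times larger.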
Scan I
# Scan for missing values
colSums(is.na(data_1))
                   rank            Country Name
                      0                       0
                 Region             IncomeGroup
                      0                       0
             population     GDP_BY_IMF (in US$)
                      0                       0
     GDP_BY_UN (in US$) GDP_Per_Capita (in US$)
                      0                       0
# Scan for special values (NaN)
sum(is.nan(data_1$`Country Name`))
[1] 0
sum(is.nan(data_1$Region))
[1] 0
sum(is.nan(data_1$IncomeGroup))
[1] 0
sum(is.nan(data_1$population))
[1] 0
sum(is.nan(data_1$`GDP_BY_IMF (in US$)`))
[1] 0
sum(is.nan(data_1$`GDP_BY_UN (in US$)`))
[1] 0
sum(is.nan(data_1$`GDP_Per_Capita (in US$)`))
[1] 0
No variable contains missing (NA) values, and every NaN check returns zero, so the data set has no missing or NaN values that need handling.
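is.nan() is only meaningful for numeric data, and a derived ratio such as GDP per capita can also produce Inf when the denominator is zero. A compact sketch that counts NA, NaN and infinite values for every numeric column at once; the helper name `count_special` and the toy data frame are assumptions made for illustration:

```r
# Count NA, NaN and infinite values in each numeric column of a data frame
count_special <- function(df) {
  numeric_cols <- df[sapply(df, is.numeric)]
  sapply(numeric_cols, function(col) {
    c(na = sum(is.na(col) & !is.nan(col)),  # is.na() is also TRUE for NaN
      nan = sum(is.nan(col)),
      inf = sum(is.infinite(col)))
  })
}

# Toy data (hypothetical values)
toy <- data.frame(name = c("a", "b", "c", "d"),
                  gdp = c(10, NA, 30, 40),
                  ratio = c(1, 0 / 0, 1 / 0, 2),
                  stringsAsFactors = FALSE)
special <- count_special(toy)
print(special)
```

Applied to data_1, a result of all zeros would confirm the checks above and also rule out infinite values.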
Scan II
# Detect outliers in the numeric variables with box plots
GDP_IMF <- boxplot(data_1$`GDP_BY_IMF (in US$)`, main = "GDP by IMF")
GDP_per_Cap <- boxplot(data_1$`GDP_Per_Capita (in US$)`, main = "Box plot of GDP Per Capita")
# Cross-check with z-scores; scores() comes from the outliers package
z_scores_GDP_per_Cap <- data_1$`GDP_Per_Capita (in US$)` %>%
  scores(type = "z")
z_scores_GDP_per_Cap %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.7074 -0.6152 -0.4367  0.0000  0.1349  4.7630
# Count the observations with |z| > 3
length(which(abs(z_scores_GDP_per_Cap) > 3))
[1] 4
# Cap outliers: values below the lower Tukey fence are set to the 5th percentile,
# values above the upper Tukey fence are set to the upper fence itself
cap <- function(x) {
  quantiles <- quantile(x, c(.05, 0.25, 0.75, .95))
  iqr <- IQR(x)  # computed once, before any value is replaced
  lower_fence <- quantiles[2] - 1.5 * iqr
  upper_fence <- quantiles[3] + 1.5 * iqr
  x[x < lower_fence] <- quantiles[1]
  x[x > upper_fence] <- upper_fence
  x
}
# Replace the GDP_Per_Capita column with its capped version
data_1_capped <- data_1 %>%
  mutate(`GDP_Per_Capita (in US$)` = cap(`GDP_Per_Capita (in US$)`))
boxplot(data_1_capped$`GDP_Per_Capita (in US$)`, main = "Box Plot for GDP_Per_Capita (in US$)")
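The box plot flags points using Tukey's rule: a value is treated as an outlier if it lies more than 1.5 × IQR below the first quartile or above the third quartile. A minimal sketch of that rule on a made-up vector; the function name `tukey_outliers` is hypothetical, and quantile() is used with its default method, which can differ slightly from the hinges that boxplot() computes:

```r
# Return the values of x that fall outside the Tukey fences
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower_fence <- q[1] - k * iqr
  upper_fence <- q[2] + k * iqr
  x[x < lower_fence | x > upper_fence]
}

# Made-up values with one obvious extreme
values <- c(1, 2, 3, 4, 5, 100)
out <- tukey_outliers(values)
print(out)
```

Counting the flagged values with length() gives a number that can be compared with the z-score count above; the two methods need not agree, because the z-score depends on the mean and standard deviation, which are themselves affected by outliers.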
Transform
hist(data_1$`GDP_Per_Capita (in US$)`, main = "GDP per Capita before transformation",
     xlab = "GDP per Capita", col = "lightblue", breaks = 10)
hist(log(data_1$`GDP_Per_Capita (in US$)`), main = "GDP per Capita after transformation",
     xlab = "log(GDP per Capita)", col = "orange", breaks = 10)
The GDP_Per_Capita distribution is right-skewed. To make it more symmetric, the log() function (natural logarithm) is applied, which reduces the right skew.
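One way to check that the log transformation actually reduces the skew is to compare a skewness coefficient before and after. The sketch below uses simulated right-skewed data rather than the assignment data, and the helper `skewness` is written out by hand as a simple moment-based estimate, so no extra package is assumed:

```r
# Moment-based sample skewness: near 0 for a symmetric distribution, > 0 for right skew
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
x <- rlnorm(500, meanlog = 9, sdlog = 1)  # simulated right-skewed values

skew_before <- skewness(x)
skew_after <- skewness(log(x))
print(c(before = skew_before, after = skew_after))
```

A value close to zero after the transformation supports what the second histogram shows visually.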