Date: 2024-03-09

MH3511 Data Analysis with Computer Group Project
MH3511 Group 11: Top Instagram Accounts Data
Name Matriculation Number
Lam Hon Wei U2040382A
Jee Li Xuan, Holly U1940353L
Jin Shinan U2040886E
Abstract:
With billions of registered users worldwide, Instagram is one of the most widely used social
media platforms for sharing photos and videos. Posting frequently is generally expected to
increase the average likes and engagement rating an account receives from its audience.
However, accounts of different genders may differ in how the number of posts they upload
translates into likes from their followers. The number of followers differs between the two
genders, and so does the number of posts. Moreover, the average likes or engagement rating
in categories such as Entertainment; Health, Fitness and Sports; and Fashion depends on the
number of interesting posts uploaded. Hence, we use statistical analysis to examine whether
there is any relationship between the gender of the account owner and the posts uploaded in
different categories.
Team Structure
Name Contribution
Lam Hon Wei Abstract, Introduction, Data Description, Exploratory Data Analysis,
Statistical Analysis : (Correlation between Numerical variables after
Transformation)
Jee Li Xuan, Holly Statistical Analysis : (Statistical Tests), Conclusion
Jin Shinan Data Cleaning, Conclusion
Table of Contents
1. Introduction
2. Data Description
3. Exploratory Data Analysis
3.1 Overview of Data Visualisation
3.2 Using Data Visualisation in Statistical Analysis
3.3 Correlations between Numerical Variables before Transformation
4. Description and Cleaning of Dataset
4.1 Summary Statistics for the Main Variable of Interest, Followers
4.2 Summary Statistics for Other Variables
4.2.1 The total number of posts on the account, Posts
4.2.2 The average number of likes that the account’s posts receive per post, Ave Likes
4.2.3 The account’s engagement rate, Eng.Rate
4.2.4 The brief description of the account, channel_Info
4.2.5 The account’s category based on its primary theme or subject matter, Category
4.3 Final Dataset for Analysis
5. Statistical Analysis
5.1 Correlations between Numerical variables after Transformation
5.2 Statistical Tests
5.2.1 Relation between number of Posts and the Type of Channel Info
5.2.2 Relation between the Average Likes and the Type of Channel Info
5.2.3 Relation between Channel Info and Category
5.2.4 Is there a particular Channel Info that influences the Category based on the engagement rating?
5.2.5 Relation between Number of Posts and Average Likes
6. Discussion and Conclusion
7. Appendices

1. Introduction
Instagram is a popular app for sharing photos and videos, owned by the American company Meta
Platforms. Users can upload photos, apply filters to them, and add hashtags and geographical tags.
Posts can be shared with followers or publicly to widen their reach and attract likes. Users can browse
content by tag or location to view popular content, and follow other users so that posts in their areas
of interest appear in their own feeds.

In our project, we use the Top Instagram Accounts dataset, which covers popular influencers and users
around the world. The dataset records, among other things, the number of posts uploaded by each
account and the number of people following it. Based on this dataset, we seek to answer the
following questions about Top Instagram Accounts:
1. Is there a relationship in the various Numerical Variables?
2. Does the number of Posts and Average Likes depend on the type of Channel Info?
3. Is there a relationship between Channel Info and the type of Categories?
4. Is there a particular Channel Info that influences the Category based on the engagement
rating?
5. Is there a relationship between number of posts and average likes?

This report covers the exploratory data analysis and the relevant results, all produced in R. For
each of our research objectives, we performed statistical analysis with explanations and elaborations
on the different numerical and categorical variables.

2. Data Description
The Top Instagram Accounts Dataset is a collection of 200 rows of data that provides valuable
insights into the most popular Instagram accounts across different categories. The dataset
contains several columns that provide comprehensive information on each account's
performance, engagement rate, and audience size.

1. The "rank" column ranks the accounts by their number of followers.
2. The "name" column gives the Instagram handle of the account.
3. The "channel_info" column summarises the type of content featured.
4. The "Category" column categorizes the account based on its primary theme or subject matter,
such as fashion, sports, entertainment, or food.
5. The "posts" column gives the total number of posts uploaded by the account.
6. The "followers" column gives the number of people who follow the account on Instagram.
7. The "avg likes" column gives the average number of likes per post uploaded by the account.
8. The "eng rate" column gives the account's engagement rate, calculated by dividing the total number
of likes and comments received by the total number of followers, expressed as a percentage.
3. Exploratory Data Analysis
In this section, bar plots and pie charts give an overview of the Top Instagram Accounts dataset
before the data cleaning in the next section and the statistical analysis that follows.
3.1 Overview of Data Visualisation
The bar chart in Appendix A shows the five accounts with the most followers, by Instagram handle.
The account with the most followers is Instagram, with 580.1 million followers, followed by
Cristiano with 519.9 million followers.
The bar chart in Appendix B displays the five accounts with the most posts, by Instagram handle.
The account with the most posts is saraalikhan95, with 889,000 posts, followed by roses_are_rosie
with 879,000 posts. In general, the post counts of the accounts shown in Appendix B are close to
one another.
3.2 Using Data Visualisation in Statistical Analysis
A pie chart in Appendix C displays the share of each Channel_Info type in the number of posts and
the number of followers. Channel_Info has four levels: Brand, Community, Females and Males. By
number of followers, Males account for the largest share at 45.04%, followed by Females at 40.49%.
By number of posts, Males again have the largest share, followed by Females, with very small shares
for Brand and Community. Based on these findings, the Males and Females groups are the most useful
for statistical analysis, since together they make up a large percentage of followers and posts.
The bar charts in Appendix D and Appendix E show that the Entertainment category has the greatest
number of followers and posts, about 10.86 billion and 14.43 million respectively. The second largest
is the Health, Sports, and Fitness category, with about 3.44 billion followers and 2.217 million posts.
Hence, our statistical analysis can make use of the Entertainment category and the Health, Sports, and
Fitness category, and perhaps the Fashion category, which has the third largest number of posts and
followers.
3.3 Correlations between Numerical Variables before Transformation
Figure 1: Correlation Matrix in Numerical Variables Instagram
As shown in Figure 1, the highest correlation is between Avg_Likes and Posts, with a value
of r close to 1. The remaining numerical variables are not highly correlated with each other.
We therefore clean the data and apply log or square-root transformations to reduce the
influence of outliers in our dataset, so that the statistical analysis is as accurate as
possible.
4. Description and Cleaning of Dataset
In this section, we shall look into the data in more detail by checking the summary statistics and
visualization of the distribution pattern as well as the missing values. Each variable is investigated
individually to look for possible outliers, and/or to perform a transformation to avoid highly skewed
data. Imputation is performed for variables with missingness.
4.1 Summary Statistics for the Main Variable of Interest, Followers

Figure 2: Distribution of variable Followers
The variable Followers is highly skewed, hence we apply a log-transformation (base e) to the
variable. The log-transformed data still has some outlying values at the right tail of the histogram
and some extremely large values in the boxplot. Upon further investigation, we notice that the
accounts with an extremely large number of followers are the official account of Instagram and some
top celebrities, such as Cristiano Ronaldo, Lionel Messi and Kylie Jenner. Their large followings are
attributable to their popularity and public profile rather than to the content of their Instagram
accounts. We also note that the number of followers is count data, which by its nature could be
modelled with a Poisson distribution. Since we would like to perform linear regression, we apply the
log-transformation to bring the variable closer to a normal distribution. The transformed variable,
however, will inevitably deviate from the normal distribution to some extent. Therefore, we also
remove the records of users with more than 300 million followers, approximately 4% of the data.
The histogram and boxplot of the log-transformed variable with the outliers removed are shown
below, together with summary statistics. The data are now more symmetric and have fewer outliers.
As discussed above, however, some deviation from the normal distribution is inevitable and
acceptable in this case. We will be cautious when interpreting the results of linear regression, which
relies on the assumption of normality.
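The trimming and transformation described above can be sketched in R as follows. This is only an illustration: apart from the two largest follower counts quoted in Section 3.1, the values are hypothetical, and the variable names (followers, keep, log_followers) are our own rather than those of the project script.

```r
# Follower counts: the first two are the figures quoted in Section 3.1,
# the rest are hypothetical values used only for illustration.
followers <- c(580.1e6, 519.9e6, 150e6, 80e6, 45e6, 35e6)

# Keep accounts at or below the 300 million cut-off used in the report.
keep <- followers <= 300e6

# Natural-log (base e) transformation of the remaining counts.
log_followers <- log(followers[keep])

# Summary statistics of the trimmed, log-transformed variable.
summary(log_followers)
```

Filtering on the original scale before taking logs keeps the cut-off easy to read; applying the same rule on the log scale (log_followers <= log(300e6)) would select the same rows.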

Figure 3: Distribution of variable Followers after log-transformation and trimming
Min     1st Quartile  Median  Mean   3rd Quartile  Max
17.37   17.55         17.77   17.92  18.08         19.48
Table 1: Summary statistics of variable Followers after log-transformation and trimming
We shall proceed to the next section with this trimmed data.
4.2 Summary Statistics for Other Variables
The histogram, the boxplot, the applied transformation and the outliers removed from the variables or
imputed missing values are tabulated in the following sub-section.
4.2.1 The total number of posts on the account, Posts
Min     1st Quartile  Median  Mean   3rd Quartile  Max
6.802   7.650         8.630   9.383  10.883        13.698
• The log-transformation (base e) is applied due to the original value of Posts being highly skewed.
• No outlying value is removed.
• Similar to the variable Followers, the log-transformed Posts still shows deviation from the normal
distribution on the right tail due to its nature as count data. This is acceptable in this case, but we
want to be cautious when interpreting the linear regression model.
4.2.2 The average number of likes that the account’s posts receive per post, Ave Likes
Min     1st Quartile  Median  Mean   3rd Quartile  Max
6.802   7.633         8.604   9.350  10.623        13.698
• The log-transformation (base e) is applied and the transformed variable is closer to the normal
distribution except in the right tail.
• No outlying value is removed.
4.2.3 The account’s engagement rate, Eng.Rate
Min      1st Quartile  Median  Mean    3rd Quartile  Max
-6.908   -5.162        -4.383  -4.430  -3.719        -1.324
• The sqrt-transformation is applied and the transformed variable is closer to the normal
distribution except in the right tail.
• No outlying value is removed.
4.2.4 The brief description of the account, channel_Info
female  male
70      95
• No outlying value of channel_Info is removed.
• Five missing values of channel_Info were imputed after inspection of the account.
• There are many more personal accounts (female and male) than organization accounts (brand and
community).
• In this report, female and male accounts are the main target groups for analysis. The remaining
types are filtered out.
4.2.5 The account’s category based on its primary theme or subject matter, Category
Beauty, Makeup & Fashion   Entertainment   Health, Sports & Fitness   Lifestyle and others
15                         127             38                         6
• No outlying value of Category is removed.
• Seven missing values of Category were imputed after inspection of the account.
• The levels of Category are quite sparse and some levels have only a few observations. Such levels
would cause estimation problems in the statistical analysis that follows, therefore we merged them
based on their content.
• In this report, data from the four categories shown in the table are the main target groups for
analysis. The remaining types are filtered out.
• There are far more accounts in the Entertainment category than in the others.

4.3 Final Dataset for Analysis
Based on the above analysis, the dataset is further reduced to 159 observations, with the suggested
transformations applied to the continuous variables: log-transformation (base e) is applied to
Followers, Posts and Ave_Likes, and sqrt-transformation is applied to Eng.Rate. For the categorical
variables, missing values are imputed for Category and channel_Info, and sparse levels in Category
are merged for statistical analysis purposes.
5. Statistical Analysis
5.1 Correlations between Numerical variables after Transformation
In this section, we will aim to address the first question of our research objective “Is there a
relationship in the various Numerical Variables?”
A correlation plot displays the strength of the relationship between pairs of variables. We use it to
identify the degree of association between multiple variables in this dataset.


Figure 3: Correlation Matrix in Numerical Variables Instagram

After the data cleaning, the log transformation of Followers, Posts and Average Likes, and the
square-root transformation of Engagement Rating, the three most correlated pairs are:
● log(Avg_Likes) vs log(Posts): weakly negatively correlated (r = -0.31)
● sqrt(Eng_Rate) vs log(Posts): moderately to strongly negatively correlated (r = -0.66)
● sqrt(Eng_Rate) vs log(Avg_Likes): weakly positively correlated (r = 0.27)
We perform the statistical analysis that follows based on these observations.
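A correlation matrix of this kind can be computed with cor() on the transformed columns. The sketch below uses simulated data, so its numbers are unrelated to the values reported above, and the data frame and column names (sim, transformed, corr_mat) are hypothetical stand-ins for the project's objects.

```r
set.seed(1)  # simulated data only; values are not from the project dataset
n <- 100
sim <- data.frame(
  Followers = rlnorm(n, meanlog = 18, sdlog = 0.5),
  Posts     = rlnorm(n, meanlog = 8,  sdlog = 1.5),
  Avg_Likes = rlnorm(n, meanlog = 9,  sdlog = 1.5),
  Eng_Rate  = runif(n, min = 0.001, max = 0.25)
)

# Apply the transformations named in Section 4.3.
transformed <- data.frame(
  log_Followers = log(sim$Followers),
  log_Posts     = log(sim$Posts),
  log_Avg_Likes = log(sim$Avg_Likes),
  sqrt_Eng_Rate = sqrt(sim$Eng_Rate)
)

# Pearson correlation matrix (the default method of cor()).
corr_mat <- cor(transformed)
round(corr_mat, 2)

# Optional plot, if the corrplot package is installed:
# corrplot::corrplot(corr_mat, method = "number")
```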
5.2 Statistical Tests
5.2.1 Relation between number of Posts and the Type of Channel Info
Knowing the relationship between the number of posts and the average number of likes on Instagram
can provide valuable insights for businesses and individuals who are looking to improve their social
media strategy. By understanding the relationship between these two variables, one can determine the
optimal frequency of posts to maximize engagement from their audience. Additionally, this
information can provide insights into the type of content that resonates best with their audience and
help guide content creation decisions.
In this section we discuss whether the number of posts depends on the type of Channel Info. We use
a variance test and a two-sample t-test to investigate whether there is a significant association
between the number of posts and the type of channel information, where the type of channel
information is treated as a categorical variable. As we are more interested in male and female
accounts, we filter Channel Info to these two groups. To visualize the distribution of log(Posts)
across the two channel types, we have created a box plot in the following figure.

Figure 4: Box Plot Distribution of log(Posts) against Channel Info

We aim to conduct a deeper analysis and examine the association between the male and female
categories of Channel Info and the logarithm of the number of posts log(Posts). We do a variance test
to check if the two samples have equal variances.

F test to compare two variances

data: unlist(channel_male_posts) and unlist(channel_female_posts)
F = 1.8406, num df = 56, denom df = 64, p-value = 0.0187
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.107866 3.085117
sample estimates:
ratio of variances
1.840606
The test resulted in an F statistic of 1.8406, with a p-value of 0.0187. This p-value is less than the
significance level of 0.05, thus we can reject the null hypothesis that the variances are equal. Hence
we can conclude that there is a significant difference in the variances of the number of posts between
the male and female groups in the Entertainment category of Instagram accounts.
Next, we shall do a Two Sample t-test assuming variances are not equal.
Welch Two Sample t-test
data: channel_male_posts and channel_female_posts
t = -2.4096, df = 101.84, p-value = 0.01776
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.2160388 -0.1179458
sample estimates:
mean of x mean of y
6.785378 7.452370
Based on the results, we can conclude that there is a significant difference in the mean log of the
number of posts between male and female Instagram accounts in the “Entertainment” category. The
p-value is 0.01776, which is lower than the significance level of 0.05, thus we reject the null
hypothesis of equal means. The sample mean of log(Posts) is higher for female accounts (7.452) than
for male accounts (6.785). Therefore, we can infer that gender has a significant impact on the number
of posts in this category of Instagram accounts.
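The two-step procedure used here (an F test for equal variances, followed by the matching form of the two-sample t-test) can be sketched as below. The samples are simulated and only mimic the group sizes in the output above; the object names (male_posts, female_posts, equal_var) are hypothetical and not those of the project script.

```r
set.seed(2)  # simulated log(Posts) values; not the project's data
male_posts   <- rnorm(57, mean = 6.8, sd = 1.7)
female_posts <- rnorm(65, mean = 7.5, sd = 1.25)

# Step 1: F test of the null hypothesis that the two variances are equal.
f_res <- var.test(male_posts, female_posts)

# Step 2: use the pooled t-test only if equal variances are not rejected
# at the 5% level; otherwise use the Welch t-test.
equal_var <- f_res$p.value >= 0.05
t_res <- t.test(male_posts, female_posts, var.equal = equal_var)

f_res$p.value
t_res$method
t_res$p.value
```

An alternative, under the same assumptions, is to use the Welch test throughout, since it does not rely on the outcome of a preliminary variance test.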
5.2.2 Relation between the Average Likes and the Type of Channel Info
Understanding the relationship between the average likes and the type of channel information in
Instagram accounts is crucial for content creators and marketers using Instagram as a platform. The
average amount of likes on social media platforms such as Instagram is a key metric for measuring the
success of marketing campaigns. By analyzing the relationship between the average likes and the type
of channel information, content creators are able to cater their content to the preferences of their target
audience.
In this section, we conduct an analysis similar to that for the number of posts and the type of
channel info: we use a variance test and a two-sample t-test to investigate whether there is a
significant association between the average likes and the type of channel information. We filter the
channel information to focus on the “male” and “female” groups.
Our objective is to investigate the relationship between the male and female types of channel
information and the logarithm of the average likes. To achieve this, we first conduct a variance test
to determine whether the two samples have equal variances.
Figure 5: Box Plot Distribution of log(Avg_Likes) against Channel Info
Conducting the variance test to check equality of variances, we have the following results:
F test to compare two variances
data: unlist(channel_male_likes) and unlist(channel_female_likes)
F = 1.3886, num df = 56, denom df = 64, p-value = 0.2038
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8357791 2.3274256
sample estimates:
ratio of variances
1.388561

Based on the above results from the F test, the F-statistic is 1.3886 and the p-value is 0.2038. Since
the p-value is greater than 0.05, we fail to reject the null hypothesis of equal variances. There is
therefore not enough evidence to suggest that the variance of likes differs significantly between
male and female accounts, and we proceed on the assumption of equal variances.

Next, we shall do a Two Sample t-test assuming variances are equal.

Two Sample t-test

data: channel_male_likes and channel_female_likes
t = 2.2461, df = 120, p-value = 0.02652
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1049197 1.6655616
sample estimates:
mean of x mean of y
9.825939 8.940698
Based on the results obtained from the 2 sample t-test conducted on the male and female categories of
Instagram Channel Info and their respective average likes, we can conclude that there is a significant
difference in the average likes between these two categories. The calculated P-value of 0.02652 is less
than the significance level of 0.05, indicating that we reject the null hypothesis of equal means. Thus,
we can infer that the gender of the Instagram account owner has a significant impact on the number of
average likes received.
5.2.3 Relation between Channel Info and Category
Investigating how Channel Info relates to Category can reveal valuable insights for businesses and
individuals. In this section, we discuss whether the type of Channel Info (male or female) is
associated with the type of category.
We use a contingency table to further analyze the relationship between gender and the type of
category. By presenting the number of males and females in each category, it allows for a comparison
of the distribution of genders across different categories.
                           female  male  Total
Beauty, Makeup & Fashion        5     3      8
Entertainment                  65    57    122
Health, Sports & Fitness        0    29     29
Lifestyle and others            0     2      2
Total                          70    91    161
Based on the contingency table, these few observations are found:
● There are more females than males in the “Beauty, Makeup & Fashion” category
● The “Entertainment” category has a more even distribution of males and females.
● There are no females in the “Health, Sports & Fitness” and “Lifestyle and others” categories.
This analysis provides insight into the gender distribution across different Instagram categories and
can be useful in identifying potential gender biases or patterns.
Using a proportion test to test the equality of proportions between “male” and “female” and category:
4-sample test for equality of proportions without continuity
correction
data: Cat_channelInfo_sub
X-squared = 29.792, df = 3, p-value = 0.000001526
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.6250000 0.5327869 0.0000000 0.0000000
Based on the proportion test, we can infer that the proportion of females in the “Beauty, Makeup &
Fashion” category (prop 1) is 62.5% while the proportion of females in the “Entertainment” category
is 53.3%. The “Health, Sports & Fitness” and “ Lifestyle and others” had no females.
The p-value of 0.000001526 suggests that we can reject our null hypothesis that the proportions are
equal across all four categories. Thus we can conclude that there is a significant difference in the
proportion of females among the different categories.
As both Category and Channel Info are categorical variables, a suitable test of the association
between them is the Chi-squared test of independence.
Pearson's Chi-squared test
data: Cat_channelInfo_sub
X-squared = 29.792, df = 3, p-value = 0.000001526
Based on the results of the Chi-squared test, the p-value is less than 0.05. We therefore reject the
null hypothesis of independence at the 5% significance level and conclude that there is a significant
association between Channel Info and Category. The test statistic is identical to that of the
proportion test above, since both tests are computed from the same contingency table.
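The test can be reproduced from the contingency table reported in this section. The sketch below rebuilds the counts by hand; the object names (tab, chi_res) are ours, and the project's own object Cat_channelInfo_sub is assumed to hold the same counts.

```r
# Counts from the contingency table in this section (columns: female, male).
tab <- matrix(
  c(5, 65, 0, 0,      # female counts, one per category
    3, 57, 29, 2),    # male counts, one per category
  ncol = 2,
  dimnames = list(
    Category = c("Beauty, Makeup & Fashion", "Entertainment",
                 "Health, Sports & Fitness", "Lifestyle and others"),
    channel_Info = c("female", "male")
  )
)

# Pearson chi-squared test of independence. R warns here because several
# expected counts are below 5; the warning is suppressed only to keep the
# output short.
chi_res <- suppressWarnings(chisq.test(tab))

chi_res$statistic   # test statistic
chi_res$parameter   # degrees of freedom
chi_res$expected    # expected counts under independence
```

Because some expected counts are small, the chi-squared approximation may be unreliable for this table; an exact test such as fisher.test(tab) is one possible cross-check.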
5.2.4 Is there a particular Channel Info that influences the Category based on the engagement
rating?
Investigating the relationship between Channel Info and Category based on engagement rating can
help identify which types of channels are most effective in engaging their audience, providing
valuable insights for businesses and individuals looking to build their social media presence.
In this section, we will investigate whether there is a significant association between the engagement
rating and the type of Channel Info and Category.
First, we will visualize the distribution of the engagement rating across different Channel Info and
Category using box plots.
Figure 6: Boxplot of Channel info and Category
We will use a two-way ANOVA to test the significance of the effects of Channel Info and Category on
the engagement rating. The interaction between the two factors will also be investigated.

Df Sum Sq Mean Sq F value Pr(>F)
channel_Info 1 0.0111 0.011141 1.362 0.2449
Category 4 0.0972 0.024298 2.971 0.0212 *
channel_Info:Category 2 0.0098 0.004920 0.602 0.5492
Residuals 157 1.2840 0.008178
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The results of the ANOVA indicate that there is no significant effect of channel info on the
engagement rating, with a p-value of 0.2449, which is greater than the standard significance level of
0.05. However, the analysis shows a significant effect of category on the engagement rating, with a
p-value of 0.0212, indicating that the category of the channel influences the engagement rate.
Furthermore, the p-value for the interaction between channel_Info and Category is 0.5492, which
suggests that there is no significant interaction effect between channel info and category on the
engagement rating. Hence, we can conclude that category has a significant effect on the engagement
rating, while channel info and the interaction between channel info and category do not have
significant effects.
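A two-way ANOVA with an interaction term of the kind reported above can be fitted with aov(). The sketch below uses a simulated, balanced design with two levels per factor, so its output will not match the table above; the data frame sim and the column sqrt_eng_rate are hypothetical names.

```r
set.seed(3)  # simulated data; not the project dataset

# Balanced 2 x 2 design with 20 hypothetical accounts per cell.
sim <- expand.grid(
  channel_Info = c("female", "male"),
  Category     = c("Beauty, Makeup & Fashion", "Entertainment"),
  rep          = 1:20
)

# Simulated engagement rates, square-root transformed as in Section 4.3.
sim$sqrt_eng_rate <- sqrt(runif(nrow(sim), min = 0.001, max = 0.1))

# Main effects of both factors plus their interaction.
fit <- aov(sqrt_eng_rate ~ channel_Info * Category, data = sim)
anova_tab <- summary(fit)[[1]]
anova_tab
```

In an unbalanced design such as the real dataset, the sequential sums of squares reported by aov() depend on the order of the terms in the formula, which is worth keeping in mind when reading the table.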
A line plot is created for further visual analysis and to see the interaction between Category and
Engagement Rating based on Channel Info.
Figure 7: Line plot to see the interaction between engagement rating and channel info
As there are no female accounts in the top 200 in the “Health, Sports & Fitness” and “Lifestyle and
others” categories, we focus on the “Beauty, Makeup & Fashion” and “Entertainment” categories.
Based on the above line plot, we can see that the highest engagement rating in both categories comes
from male Instagram accounts.
5.2.5 Relation between Number of Posts and Average Likes
Exploring the correlation between the number of posts and average likes can provide insights into the
content strategy of the channel and the audience's preferences, which can inform decisions on the
frequency and type of content to post for optimal engagement.
A linear regression model was used to examine the relationship between the logarithm of the number
of posts and the average likes on Instagram.
Call:
lm(formula = insta_processed$logPosts ~ insta_processed$Avg_Likes)
Residuals:
Min 1Q Median 3Q Max
-4.7377 -0.5355 0.1206 0.8571 3.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5406349206 0.1270611835 59.35 < 0.0000000000000002 ***
insta_processed$Avg_Likes -0.0000018963 0.0000004801 -3.95 0.000117 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.451 on 159 degrees of freedom
Multiple R-squared: 0.08934, Adjusted R-squared: 0.08361
F-statistic: 15.6 on 1 and 159 DF, p-value: 0.0001174

The intercept of 7.54 means that when the average number of likes is zero, the expected value of
log(Posts) is 7.54. The coefficient of -0.0000018963 for insta_processed$Avg_Likes means that for
every one-unit increase in Avg_Likes (the predictor in the fitted formula is untransformed), the
expected value of log(Posts) decreases by 0.0000018963 units.
The p-value of 0.000117 for the coefficient of insta_processed$Avg_Likes suggests that this
relationship is significant. The adjusted R-squared value of 0.08361 implies that only about 8.4% of
the variation in log(Posts) can be explained by Avg_Likes. Consistent with this, the linear correlation
between log(Posts) and log(Avg_Likes) is only -0.314.
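In a simple linear regression with one predictor, the multiple R-squared equals the squared Pearson correlation between the response and that predictor, so the R-squared of 0.08934 in the output above corresponds to a correlation of magnitude about 0.30 for the variables actually used in the fit. The sketch below illustrates the identity on simulated data; the variable names and coefficients are hypothetical.

```r
set.seed(4)  # simulated data; not the project dataset
log_avg_likes <- rnorm(161, mean = 9.4, sd = 1.6)
log_posts     <- 10 - 0.3 * log_avg_likes + rnorm(161, sd = 1.4)

# Simple linear regression of log(Posts) on log(Avg_Likes).
fit <- lm(log_posts ~ log_avg_likes)

# R-squared from the model and the Pearson correlation of the two variables.
r_squared <- summary(fit)$r.squared
r <- cor(log_posts, log_avg_likes)

c(r_squared = r_squared, r_squared_from_cor = r^2, r = r)
```

The sign of the correlation matches the sign of the slope, which R-squared alone does not show.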

6. Discussion and Conclusion
Instagram, owned by Meta Platforms, is a widely used app for sharing photos and videos. It allows
users to upload photos and apply filters to them, add hashtags and location tags, and share posts
with followers or publicly. Users can browse content by tag or location, follow users of interest, and
receive notifications in their feed. In our project, we analyzed a dataset of Top Instagram Accounts,
which includes popular influencers and users worldwide, to answer various questions about these
accounts.
We conclude that:
● Variables log(Avg_Likes) and log(Posts) are weakly negatively correlated, sqrt(Eng_Rate)
and log(Posts) are moderately to strongly negatively correlated, and sqrt(Eng_Rate) and
log(Avg_Likes) are weakly positively correlated.
● The number of Posts and Average Likes depend on the type of Channel Info.
● There is a relationship between Channel Info and the type of Category.
● Based on our available data, male accounts have a higher engagement rating.
● There is a relationship between the number of posts and average likes.

Furthermore, we found that engagement rate can be used to model Followers via a linear model. On
the other hand, posts and average likes do not exhibit the same level of significance.
Although an interesting conclusion has been drawn from the data published on Kaggle, potential bias
in the selection of top Instagram accounts in the dataset should be discussed. The dataset may only
conclude accounts that have large following and may not be representative of the entire Instagram
ecosystem, potentially leading to biased or incomplete findings. Additionally, validating the dataset
with additional sources and conducting data normalization can be valuable steps in ensuring the
reliability and accuracy of research findings.This could involve cross-referencing the dataset with
other reputable sources such as official Instagram accounts, industry reports or academic studies.
7. Appendices
Appendix A: Top 5 greatest number of followers in Instagram
Appendix B: Top 5 greatest number of posts in Instagram
Appendix C: Pie Chart based on the number of Followers and Posts in different Channel Info
Appendix D: Top 5 greatest number of Followers in different Categories
Appendix E: Top 5 greatest number of Posts in different Categories
library(ggplot2)
library(dplyr)
library(lessR)
library(corrplot)
# Exploratory Data Analysis
# Plotting Barcharts for most Followers
insta = read.csv(file = "data.csv"); insta
Followers = insta$Followers; Followers
Posts = insta$Posts; Posts
followers_filter = head(insta,5); followers_filter
followers_filter1 = data.frame(Followers = followers_filter$Followers, name =
followers_filter$name); followers_filter1
ggplot(followers_filter1, aes(y=Followers, x=name)) + geom_bar(stat = "identity", width = 0.5, fill
=c('steelblue')) + geom_text(aes(label=Followers), vjust=-0.3, size=3.5)
# Plotting Barcharts for most Posts
Most_Posts = insta[order(insta$Posts, decreasing=TRUE,na.last=NA),]; Most_Posts
Posts_filter = head(Most_Posts, 5); Posts_filter
Posts_filter1 = data.frame(Posts = Posts_filter$Posts, name = Posts_filter$name); Posts_filter1
ggplot(Posts_filter1, aes(y=Posts, x=name)) + geom_bar(stat = "identity", width = 0.5, fill
=c('steelblue')) + geom_text(aes(label=Posts), vjust=-0.3, size=3.5)
#Pie Chart for channel info
channel_pie <- insta %>% group_by(insta$channel_Info) %>% summarise(Total_Followers =
sum(Followers)) ;channel_pie
df = data.frame(channel_pie); df
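# Drop rows with a blank or missing channel label (na_if() on a whole data frame fails in newer dplyr)
df = df[!is.na(df$insta.channel_Info) & df$insta.channel_Info != "", ]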
pie(df$Total_Followers, labels = paste0(round(100 * df$Total_Followers/sum(df$Total_Followers),
2), "%"),
col=rainbow(length(df$insta.channel_Info)), radius = 2.5,
main="Pie Chart of Channel Info", xlab = "No.of Followers")
legend("left", legend = df$insta.channel_Info, fill = rainbow(length(df$insta.channel_Info)))
channel_pie2 <- insta %>% group_by(insta$channel_Info) %>% summarise(Total_Posts =
sum(Posts)) ;channel_pie2
df2 = data.frame(channel_pie2); df2
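# Drop rows with a blank or missing channel label (na_if() on a whole data frame fails in newer dplyr)
df2 = df2[!is.na(df2$insta.channel_Info) & df2$insta.channel_Info != "", ]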
pie(df2$Total_Posts, labels = paste0(round(100 * df2$Total_Posts/sum(df2$Total_Posts), 2), "%"),
col=rainbow(length(df2$insta.channel_Info)), radius = 3,
main="Pie Chart of Channel Info", xlab = "No.of Posts")
legend("topleft", legend = df2$insta.channel_Info, fill = rainbow(length(df2$insta.channel_Info)))

#Barplot for category based on Posts
bar1 = insta %>% group_by(insta$Category) %>% summarise(Total_Posts = sum(Posts)); bar1
df3 = data.frame(bar1); df3
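# Drop rows with a blank or missing category label (na_if() on a whole data frame fails in newer dplyr)
df3 = df3[!is.na(df3$insta.Category) & df3$insta.Category != "", ]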
df4 = data.frame(Category = df3$insta.Category, Posts = df3$Total_Posts)
Most_Posts_Cat = df4[order(df4$Posts, decreasing=TRUE,na.last=NA),]; Most_Posts_Cat
ggplot(head(Most_Posts_Cat,5), aes(x = Category, y= Posts)) + geom_bar(stat = "identity", width =
0.5, fill = "green") + geom_text(aes(label=Posts), vjust=-0.3, size=3.5)

#Barplot for category based on Followers
bar2 = insta %>% group_by(insta$Category) %>% summarise(Total_Followers = sum(Followers));
bar2
df5 = data.frame(bar2); df5
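# Drop rows with a blank or missing category label (na_if() on a whole data frame fails in newer dplyr)
df5 = df5[!is.na(df5$insta.Category) & df5$insta.Category != "", ]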
df6 = data.frame(Category = df5$insta.Category, Followers = df5$Total_Followers)
Most_Follow_Cat = df6[order(df6$Followers, decreasing=TRUE,na.last=NA),]; Most_Follow_Cat
ggplot(head(Most_Follow_Cat,5), aes(x = Category, y= Followers)) + geom_bar(stat = "identity",
width = 0.5, fill = "green") + geom_text(aes(label=Followers), vjust=-0.3, size=3.5)

# Performing Correlations with numerical variables
df7 = data.frame(insta$Followers, insta$Posts, insta$Avg_Likes, insta$Eng.Rate); df7
cor_insta = cor(df7, method = "pearson"); cor_insta
corrplot(cor_insta, type="lower",method="color",addCoef.col = "red",number.cex = 0.8)

###############################
#* data cleaning starts from here
###############################

###############################
#* check followers
###############################
#* summary statistics and distribution plot
summary(insta$Followers)
par(mfrow = c(1, 3))
hist(insta$Followers, breaks = 30,
main = "Histogram of Followers", xlab = "Followers")
hist(log(insta$Followers), breaks = 30,
main = "Histogram of log(Followers)", xlab = "log(Followers)")
boxplot(log(insta$Followers),
main = "Boxplot of log(Followers)")
#* apply trimming and log-transformation
insta_processed <- insta %>%
filter(Followers < 300000000) %>%
mutate(logFollowers = log(Followers))
par(mfrow = c(1, 2))
hist(insta_processed$logFollowers, breaks = 30,
main = "Histogram of trimmed log(Followers)", xlab = "log(Followers)")
boxplot(insta_processed$logFollowers,
main = "Boxplot of trimmed log(Followers)")
summary(insta_processed$logFollowers)
###############################
#* check Posts
###############################
summary(insta_processed$Posts)
par(mfrow = c(1, 3))
hist(insta_processed$Posts, breaks = 20,
main = "Histogram of Posts", xlab = "Posts")
hist(log(insta_processed$Posts), breaks = 20,
main = "Histogram of log(Posts)", xlab = "log(Posts)")
boxplot(log(insta_processed$Posts),
main = "Boxplot of log(Posts)")

#* apply log-transformation
insta_processed <- insta_processed %>%
mutate(logPosts = log(Posts))
summary(insta_processed$logPosts)

###############################
#* check Avg_Likes
###############################
summary(insta_processed$Avg_Likes)
par(mfrow = c(1, 3))
hist(insta_processed$Avg_Likes, breaks = 20,
main = "Histogram of Avg_Likes", xlab = "Avg_Likes")
hist(log(insta_processed$Avg_Likes), breaks = 20,
main = "Histogram of log(Avg_Likes)", xlab = "log(Avg_Likes)")
boxplot(log(insta_processed$Avg_Likes),
main = "Boxplot of log(Avg_Likes)")

#* apply log-transformation
insta_processed <- insta_processed %>%
mutate(logAvg_Likes = log(Avg_Likes))
summary(insta_processed$logAvg_Likes)

###############################
#* Eng.Rate
###############################
summary(insta_processed$Eng.Rate)
par(mfrow = c(1, 3))
hist(insta_processed$Eng.Rate, breaks = 20,
main = "Histogram of Eng.Rate", xlab = "Eng.Rate")
hist(log(insta_processed$Eng.Rate),
main = "Histogram of log(Eng.Rate)", xlab = "log(Eng.Rate)")
boxplot(log(insta_processed$Eng.Rate),
main = "Boxplot of log(Eng.Rate)")
#* apply sqrt-transformation
insta_processed <- insta_processed %>%
mutate(sqrtEng.Rate = sqrt(Eng.Rate))
summary(insta_processed$sqrtEng.Rate)

###############################
#* check channel_Info
##############################
table(insta_processed$channel_Info)
#* inspect accounts with missing channel info and fill in the values
insta_processed$channel_Info[insta_processed$channel_Info == ""] <- c("male", "male", "male",
"male","male")
#* filter and plot channel info distribution
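#* filter the processed data (not the raw insta) so the trimmed rows and transformed columns are kept
insta_processed <- insta_processed %>%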
filter(channel_Info %in% c("female","male"))
par(mfrow = c(1, 1))
barplot(table(insta_processed$channel_Info), main = "Channel Information")
table(insta_processed$channel_Info)
###############################
#* check Category
###############################
table(insta_processed$Category)
#* inspect accounts with missing category
insta_processed[insta_processed$Category == "",]
par(mfrow = c(1, 2), mar=c(12,4,2,1))
barplot(table(insta_processed$Category), main = "Account category", las=2)
#* fill in missing categories
insta_processed$Category[insta_processed$Category == ""] <- c("Health, Sports & Fitness",
"fashion", "fashion",
"photography",
"entertainment",
"entertainment",
"entertainment")
#* merge sparse categories
insta_processed$Category[insta_processed$Category %in% c("fashion", "Beauty & Makeup")] <-
"Beauty, Makeup & Fashion"
insta_processed$Category[insta_processed$Category %in% c("Lifestyle", "food", "photography",
"Craft/DIY")] <- "Lifestyle and others"
insta_processed$Category[insta_processed$Category %in% c("News & Politics", "technology",
"Finance")] <- "Others"
insta_processed$Category[insta_processed$Category == "entertainment"] <- "Entertainment"
#* filter and plot processed category
insta_processed <- insta_processed %>%
filter(Category %in% c("Beauty, Makeup & Fashion", "Lifestyle and others",
"Entertainment", "Health, Sports & Fitness"))
barplot(table(insta_processed$Category), main = "Account category (levels merged)", las=2)
table(insta_processed$Category)
############################################
#* final dataset to be analyzed
############################################
head(insta_processed)
############################################
#Start of Statistical Analysis
############################################
############################################
#Relation between number of Posts and the Type of Channel Info
############################################
#Box plot to show the distribution of log(Posts) per Channel Info
options(scipen = 999)
boxplot(logPosts ~ channel_Info, data = insta_processed,
main = "Post Count by Channel Info",
xlab = "Channel Info", ylab = "log(Posts)")
#Relation between male and female in entertainment
channel = insta_processed$channel_Info; channel
cat_ent = insta_processed[insta_processed$Category == 'Entertainment',]; cat_ent
channel_male = cat_ent[cat_ent$channel_Info == "male",]; channel_male
channel_female = cat_ent[cat_ent$channel_Info == "female",]; channel_female
channel_male_posts = channel_male$logPosts; channel_male_posts
channel_female_posts = channel_female$logPosts; channel_female_posts
var.test(channel_male_posts, channel_female_posts)
t.test(channel_male_posts, channel_female_posts, var.equal = FALSE)

############################################
#Relation between Average Likes and the Type of Channel Info
############################################

#Box plot to show Avg Likes per Channel Info
options(scipen = 999)
boxplot(logAvg_Likes ~ channel_Info, data = insta_processed,
main = "Average Likes by Channel Info",
xlab = "Channel Info", ylab = "log(Avg_Likes)")

#Relation between males and females in entertainment
channel = insta_processed$channel_Info; channel
cat_ent = insta_processed[insta_processed$Category == 'Entertainment',]; cat_ent
channel_male = cat_ent[cat_ent$channel_Info == "male",]; channel_male
channel_female = cat_ent[cat_ent$channel_Info == "female",]; channel_female
channel_male_likes = channel_male$logAvg_Likes; channel_male_likes
channel_female_likes = channel_female$logAvg_Likes; channel_female_likes
var.test(channel_male_likes, channel_female_likes)
t.test(channel_male_likes, channel_female_likes, var.equal = TRUE)

############################################
#Channel Info that influences the Category based on the average likes and engagement rating
############################################

# Create a boxplot to visualize the distribution of Engagement Rate across different channel types
par(mar = c(15, 4, 4, 2) + 0.1,mgp = c(13, 1, 0))
par(cex.axis = 1)
boxplot(sqrtEng.Rate ~ channel_Info * Category, data = insta_processed,
xlab = "Channel Info and Category", ylab = "sqrt(Engagement Rate)", las = 2)
# Two-way ANOVA to test significance of Channel Info and Category on engagement rating
two_way_anova <- aov(sqrtEng.Rate ~ channel_Info * Category, data = insta_processed)
# Summary of the two-way ANOVA
summary(two_way_anova)
ggplot(insta_processed, aes(x = Category, y = sqrtEng.Rate, color = channel_Info)) +
stat_summary(fun = "mean", geom = "point", size = 3) +
stat_summary(fun = "mean", geom = "line", aes(group = channel_Info), size = 1) +
labs(title = "Line Plot for Channel Info and Category",
x = "Category",
y = "Mean sqrt(Engagement Rate)") +
theme_bw()

############################################
#Relation between Category and Channel Info
############################################
# Create a contingency table of channel_Info and Category
# insta_processed_sub is never defined in this script; the cleaned insta_processed is used instead
Cat_channelInfo <- table(insta_processed$Category,
insta_processed$channel_Info); Cat_channelInfo

#Perform a proportion test
prop.test(Cat_channelInfo)
# perform chi-square test
chisq.test(Cat_channelInfo)

############################################
#Linear Regression
############################################
# Fit the linear regression model
lm_likes = lm(logPosts ~ logAvg_Likes, data = insta_processed); lm_likes
summary(lm_likes)
###########################################
#Multiple Linear Regression
###########################################
#Fit the MLR model
mlr_model <- lm(logFollowers ~ logPosts + sqrtEng.Rate + logAvg_Likes, data = insta_processed)
step(mlr_model, direction="backward")