Date: 2023-03-06

COMM5000
Lecturer in Charge: Dr. Rachida Ouysse
Lecturer: Rachel Erde
Assessments
Task | Weighting | Length | Due Date
Case study insight development | 20% | TBA | Week 4
Case study project proposal | 20% | TBA | Week 6
Smarthinking feedback | 0 (formative hurdle task) | N/A | Week 9
Case study business report | 60% | TBA | Week 10
Other optional (non-graded) learning tasks
Diagnostic lessons
• 3 lessons available in Week 0 on Moodle
• Gauge your level of mathematics competency in necessary skills
Self-study quizzes
• Four quizzes in Weeks 1, 2, 5, 6 available on Moodle
• Instant feedback to test your understanding of the material
Gmetrix
• Excel training platform with option for certification exam
• Enrolment instructions available on Moodle
Week 1:
Exploring data through graphical and
numerical summaries
Week 1 Goals
• Represent data graphically in the form of bar charts, pie charts
and scatterplots
• Construct frequency distributions and histograms of data
• Calculate and interpret measures of central tendency for
numerical data
• Calculate and interpret measures of variation
• Calculate and interpret the covariance and the coefficient of
correlation for bivariate data
Vocabulary
Datum: a single measurement of something on a scale that is
understandable to both the recorder and the reader
Data: the plural of datum; a collection of such measurements
Statistics: the science of exploring, analysing and summarising
the data; designing appropriate ways of collecting data and
extracting meaningful information and insights from them; and
communicating and translating these insights into business
solutions, policy evaluations and decision-making
Common Data Issues
Missing data
• Imputation – filling a missing value with a reasonable approximation given
the other data (mean/median/etc.)
• Deletion
  • Listwise – removing an entire row, for example, of an individual with missing data
  • Deletion of a characteristic with many missing values (e.g. many respondents skipped a certain question)
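The options above can be sketched in a few lines of Python. This is a minimal illustration only: the survey rows, column names and helper functions are hypothetical, and `None` is assumed to mark a missing answer.

```python
from statistics import mean

# Hypothetical survey rows; None marks a missing answer.
rows = [
    {"id": 1, "age": 23, "salary": 50000},
    {"id": 2, "age": None, "salary": 62000},
    {"id": 3, "age": 31, "salary": None},
    {"id": 4, "age": 27, "salary": 58000},
]

def impute(rows, column, summary=mean):
    """Fill missing values in `column` with a summary (mean by default) of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = summary(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def drop_incomplete_rows(rows):
    """Delete every row (individual) that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_column(rows, column):
    """Delete a characteristic (column) entirely, e.g. a question many respondents skipped."""
    return [{k: v for k, v in r.items() if k != column} for r in rows]

imputed = impute(rows, "age")            # row 2 gets the mean of the observed ages
complete = drop_incomplete_rows(rows)    # only rows 1 and 4 remain
without_salary = drop_column(rows, "salary")
```

Imputation keeps the sample size but invents a value; deletion keeps only observed values but shrinks the data set. Which trade-off is acceptable depends on how much is missing and why.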
Wrong values
• Garbage in, garbage out
Population vs. Sample
Population: the entire set of objects or events under study
• Difficult to gather data due to time, accessibility, cost
• Values summarising characteristics are called population parameters
• Uses Greek and capital letters (μ, σ, N)
Sample: a representative subset of the objects or events under study.
• Needed because it's impossible or impractical to obtain or compute with
population data
• Values summarising characteristics are called sample statistics
• Uses regular and lower case letters (x̄, s, n)
Sampling
Representative sample: A subset of the population from which data are
collected that accurately reflects the population
Bias: Systematic favouring of certain outcomes
Sampling bias: Systematic favouring of certain outcomes due to the methods
employed to obtain the sample
Ex: Is this class a good sample of the university?
Age/Commute time/Salary/Area of study
If not, what’s a better way to get a representative sample of UNSW students?
Types of Data
Cross-sectional: a sample of observations on individual units taken at a
single point in time or over a single period of time
• No natural ordering of observations
Time-series: a sample of observations on one or more variables over
successive periods or intervals of time
• Observations are ordered chronologically, collected at certain
frequencies
• Pooled cross-section time series: two or more different samples of
cross-sectional observations from the same population taken at two or more
points in time
• Panel/longitudinal data: two or more sets of observations on the same
sample of cross-sectional members at two or more points in time
Frequency Distributions
Frequency: how often something happened
Frequency distributions: visual displays that organise and present
frequency counts so that the information can be interpreted more
easily
• Categories must be mutually exclusive (each observation must fall into
exactly 1 category) and collectively exhaustive (the categories must represent
all possible outcomes, summing to 100%)
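As a rough sketch of a frequency distribution for categorical data, the snippet below counts each category and converts the counts to relative frequencies. The commute data are made up for illustration.

```python
from collections import Counter

# Hypothetical categorical data: how each student commutes.
commute = ["Bus", "Walk", "Car", "Walk", "Train", "Walk", "Car", "Bus"]

counts = Counter(commute)   # absolute frequencies
n = len(commute)
relative = {category: count / n for category, count in counts.items()}

# Each observation falls into exactly one category (mutually exclusive),
# and the categories cover every observation (collectively exhaustive),
# so the relative frequencies sum to 1, i.e. 100%.
for category, count in counts.most_common():
    print(f"{category:<6}{count:>3}{relative[category]:>8.1%}")
```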
Bar Chart
Bar charts are useful for displaying categorical data, either in absolute
or relative terms.
Pie Chart
Pie charts can be used for the same data types as bar charts. They
emphasise the relative frequency of the categories.
Histogram
A histogram is a bar graph-like representation of data that buckets a
range of outcomes into columns along the x-axis.
Histograms are appropriate for quantitative data.
While bar charts have clear categories, the ‘bins’ for histograms may
not be immediately obvious and will depend on the data context and
distribution. Choosing the number and width of the bins is important, but
there are no set rules; it is a matter of judgement. However, bins must be
of equal width.
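A minimal sketch of equal-width binning follows. The scores, the starting point of 40 and the bin width of 10 are assumptions chosen for illustration; different choices would give a differently shaped histogram from the same data.

```python
def histogram_counts(values, lower, width, n_bins):
    """Count values in n_bins equal-width bins starting at `lower`.

    Bin k covers [lower + k*width, lower + (k+1)*width); the last bin
    also includes its upper edge so the maximum is not dropped.
    """
    counts = [0] * n_bins
    upper = lower + width * n_bins
    for v in values:
        if not lower <= v <= upper:
            raise ValueError(f"{v} lies outside the range covered by the bins")
        k = min(int((v - lower) // width), n_bins - 1)
        counts[k] += 1
    return counts

# Hypothetical exam scores.
scores = [42, 55, 58, 61, 63, 67, 70, 71, 74, 78, 85, 100]
counts = histogram_counts(scores, lower=40, width=10, n_bins=6)

# Text histogram: one row per bin, one star per observation.
for k, count in enumerate(counts):
    start = 40 + 10 * k
    print(f"{start:>3}-{start + 10:<3} {'*' * count}")
```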
Describing the Distribution
Symmetry
• Does the left side mirror the right?
Skewness
• Is there a long tail on one side?
• Right tail – positively skewed
• Left tail – negatively skewed
Modality
• The mode is the class with the highest frequency
• How many modes are there?

Contingency Table
Contingency tables display the
frequency distribution of
categorical variables
Summarises 3 probability
distributions:
• Joint
• Marginal
• Conditional
Joint Probability
What’s the probability that a student is both
Male and a Resident?
P(Male and Resident) = 12/83 = 14%
Marginal Probability
What’s the probability that a student takes the
Bus?
P(Bus) = 8/83 = 10%
Conditional Probability
What’s the probability that a student Walks to
campus, given that they are Female?
P(Walk given Female) = 13/39 = 33%
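The three kinds of probability can be read off a table of counts as sketched below. The slide's original table is not reproduced in these notes, so the counts here are hypothetical: they were chosen only so that the total (83), the Bus total (8), the Female total (39) and the Female walkers (13) match the figures quoted above. The "Car" column and all the Male counts are assumptions.

```python
# Hypothetical gender-by-transport counts (not the original slide's table).
table = {
    "Female": {"Bus": 3, "Walk": 13, "Car": 23},
    "Male":   {"Bus": 5, "Walk": 15, "Car": 24},
}
total = sum(sum(row.values()) for row in table.values())

def joint(row, col):
    """P(row and col): one cell divided by the grand total."""
    return table[row][col] / total

def marginal_col(col):
    """P(col): a column total divided by the grand total."""
    return sum(r[col] for r in table.values()) / total

def marginal_row(row):
    """P(row): a row total divided by the grand total."""
    return sum(table[row].values()) / total

def conditional(col, given_row):
    """P(col given row): one cell divided by its row total."""
    return table[given_row][col] / sum(table[given_row].values())

print(f"P(Bus) = {marginal_col('Bus'):.0%}")
print(f"P(Walk given Female) = {conditional('Walk', 'Female'):.0%}")
print(f"P(Female and Walk) = {joint('Female', 'Walk'):.0%}")
```

Note the change of denominator: joint and marginal probabilities divide by the grand total, while a conditional probability divides by the total of the group being conditioned on.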
Scatterplot
Scatterplots can be used to display
the potential correlations
between quantitative variables.
Look out for the direction of the
relationship and the strength of
the relationship.
Numerical summaries
Measures of central tendency: the extent to which the data values are
grouped around a central value
• Mean, median, mode, percentile
Measures of variation: the spread, scattering or dispersion of data
values
• Range, IQR, standard deviation, variance, CV, z-score
Measures of shape: the pattern of the distribution of data values from
the lowest to the highest
Mean
Arithmetic mean: calculated by adding all the values of the variable
and dividing the sum by the number of observations in the data set.
i | x_i
1 | 5
2 | 9
3 | 14
4 | 3
5 | 2
6 | 1
7 | 70
Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
Sample mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{5 + 9 + 14 + 3 + 2 + 1 + 70}{7} = 14.86$$
Note: the mean of these values is higher than all but one of them. The
mean weighs each value equally, so it can be strongly influenced by
outliers (here, 70).
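The effect of the outlier can be checked directly. The sketch below uses the seven values from the example; the helper name `arithmetic_mean` is just for illustration.

```python
x = [5, 9, 14, 3, 2, 1, 70]

def arithmetic_mean(values):
    """Sum of the values divided by the number of observations."""
    return sum(values) / len(values)

mean_all = arithmetic_mean(x)
mean_without_outlier = arithmetic_mean([v for v in x if v != 70])

print(f"Mean with the outlier:    {mean_all:.2f}")
print(f"Mean without the outlier: {mean_without_outlier:.2f}")
```

Dropping the single value 70 moves the mean from about 14.86 to about 5.67, which is much closer to where most of the data sit.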
Median
Median: the middle value in a set of data that has been ordered from
lowest to highest value
Original order:
i | x_i
1 | 5
2 | 9
3 | 14
4 | 3
5 | 2
6 | 1
7 | 70
Sorted from lowest to highest:
i | x_i
1 | 1
2 | 2
3 | 3
4 | 5
5 | 9
6 | 14
7 | 70
Note: unlike the mean, the median is not influenced by extreme values.
The median is often preferred when the distribution is skewed. (Think
income.)
i | x_i
1 | 1
2 | 2
3 | 3
4 | 5
5 | 9
6 | 14
If there is an even number of observations, take the average of the
middle two:
(3 + 5)/2 = 4
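Both cases (odd and even number of observations) can be sketched as one small function. This is a plain illustration of the rule above, not a replacement for a library routine.

```python
def median(values):
    """Middle value of the sorted data; average of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([5, 9, 14, 3, 2, 1, 70])   # seven values: the 4th sorted value
even_case = median([5, 9, 14, 3, 2, 1])      # six values: average of 3 and 5

print(odd_case, even_case)
```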
Mean vs. Median
In skewed data, the mean will be
pulled towards the tail.
• If right/positive skew,
mean > median
• If left/negative skew,
mean < median
• If distribution is symmetric,
mean = median
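Comparing the mean with the median gives a quick, informal hint about the direction of skew. The sketch below is only that rule of thumb, not a formal skewness measure, and the second and third data sets are hypothetical.

```python
from statistics import mean, median

def skew_hint(values):
    """Return 'right', 'left' or 'none' by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"   # mean pulled towards a long right tail
    if m < med:
        return "left"    # mean pulled towards a long left tail
    return "none"        # mean equals median, consistent with symmetry

right_tail = [1, 2, 3, 5, 9, 14, 70]         # values from the mean example
left_tail = [30, 86, 91, 95, 97, 98, 99]     # hypothetical exam scores
symmetric = [1, 2, 3, 4, 5]                  # hypothetical

print(skew_hint(right_tail), skew_hint(left_tail), skew_hint(symmetric))
```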
[Figure: Exam Score Distribution. x-axis: Score; y-axis: Frequency (0 to 30)]
Mode
Mode: the most frequent value in a dataset.
• If all values occur only once, there will be no mode
• Bell-shaped curves are unimodal
• Distributions can also be bimodal or multimodal (think heights for men vs
women)
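A minimal sketch of finding the mode or modes follows. It adopts the convention stated above that data in which every value occurs once have no mode; the example lists are hypothetical.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []   # every value occurs only once: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 3, 3, 5, 7]))    # one mode (unimodal)
print(modes([1, 1, 4, 6, 6]))    # two modes (bimodal)
print(modes([5, 9, 14]))         # no mode
```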
[Figure: Exam Score Distribution. x-axis: Score (5% to 100% in steps of 5%); y-axis: Frequency (0 to 30)]
Quartiles and Percentiles
Quartiles: divide the ordered data into four equal parts, with cut
points at the 25th, 50th (the median) and 75th percentiles; the 100th
percentile is the maximum
• This can be displayed using a box
and whiskers plot
Percentile: a generalisation of the
concept of a quartile to any
fraction of the population
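Quartiles and percentiles can be computed with Python's `statistics.quantiles`. Software packages use slightly different interpolation conventions, so results can differ between tools; the sketch below assumes the "inclusive" method, which treats the data as including the minimum and maximum. The data are the seven values from the mean example.

```python
from statistics import quantiles

x = [5, 9, 14, 3, 2, 1, 70]

# n=4 returns the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(x, n=4, method="inclusive")

# n=100 returns the 99 percentile cut points; index 89 is the 90th percentile.
p90 = quantiles(x, n=100, method="inclusive")[89]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, 90th percentile = {p90}")
```

The second quartile is the median, so it matches the middle value of the sorted data.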
Variation
Measures of central tendency tell us where the ‘middle’ of our
distribution is. We also want to know where the rest of the data fall,
particularly in relation to the mean.
Range and IQR
Range: the difference between the largest and smallest value
Range = 100 – 5 = 95%. The range is heavily influenced by the extremes.
IQR: The interquartile range is defined as the difference between
the upper and lower quartile values in a set of data
IQR = 69 - 38 = 31%
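The exam scores behind the figures above are not reproduced in these notes, so the sketch below uses the seven values from the mean example instead. The quartiles rely on Python's "inclusive" interpolation method, which is an assumption; other conventions give slightly different values.

```python
from statistics import quantiles

def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def iqr(values):
    """Upper quartile minus lower quartile (inclusive method assumed)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

x = [5, 9, 14, 3, 2, 1, 70]
x_without_extreme = [v for v in x if v != 70]

print(data_range(x), iqr(x))
print(data_range(x_without_extreme), iqr(x_without_extreme))
```

Removing the single extreme value cuts the range from 69 to 13, while the IQR changes far less (from 9 to 5.75), which illustrates why the range is described as heavily influenced by the extremes.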
Variance and Standard Deviation
Variance and standard deviation are measures of dispersion, how
spread out the values are
Variance: the population variance sums the squared deviations from
the population mean and divides by the population size N
• For samples, the denominator is n-1
Standard deviation: the square root of the variance
• Easier to interpret than variance, as it is expressed in the same units of
measurement as the original data
• Can be understood as the ‘average’ distance from the mean of the data points
Variance and Standard Deviation Calculation
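The calculation can be sketched step by step as follows, using the seven values from the mean example. The function name and the `sample` flag are illustrative choices; the only difference between the two versions is the denominator (N for a population, n - 1 for a sample).

```python
from math import sqrt

x = [5, 9, 14, 3, 2, 1, 70]

def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample) or N (population)."""
    n = len(values)
    m = sum(values) / n
    squared_deviations = sum((v - m) ** 2 for v in values)
    return squared_deviations / (n - 1 if sample else n)

pop_var = variance(x, sample=False)
samp_var = variance(x)
samp_sd = sqrt(samp_var)   # same units as the original data

print(f"Population variance: {pop_var:.2f}")
print(f"Sample variance:     {samp_var:.2f}")
print(f"Sample std. dev.:    {samp_sd:.2f}")
```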
Coefficient of Variation (CV) and Z-score
Coefficient of Variation: a statistical measure of the relative dispersion
of data points in a data series around the mean
$$CV = \frac{s}{\bar{x}}$$
Z-score: represents the difference between a given observation and the
mean expressed in standard deviations
$$z = \frac{x - \bar{x}}{s}$$
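Both measures follow directly from the sample mean and sample standard deviation. The sketch below assumes the sample versions (x̄ and s) and reuses the seven values from the mean example.

```python
from statistics import mean, stdev

x = [5, 9, 14, 3, 2, 1, 70]

x_bar = mean(x)
s = stdev(x)        # sample standard deviation (n - 1 denominator)

cv = s / x_bar      # unit-free; often reported as a percentage
z_scores = [(v - x_bar) / s for v in x]

print(f"CV = {cv:.2%}")
for v, z in zip(x, z_scores):
    print(f"x = {v:>2}  z = {z:+.2f}")
```

The value 70 sits a little over two standard deviations above the mean, while every other value has a negative z-score.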

Covariance and Correlation
Covariance: a numerical measure of the linear association between
two variables
• Unit: x unit * y unit
Correlation coefficient: a statistical measure of the strength of the
linear relationship between two variables, standardises the covariance
by the variables’ standard deviations
• Unit: unit free
• Bounds: -1 to 1
• -1 means perfect negative correlation, 1 means perfect positive correlation,
close to 0 means weak correlation
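Population Covariance:
$$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$
Sample Covariance:
$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
Population Correlation:
$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
Sample Correlation:
$$r = \frac{s_{xy}}{s_x s_y}$$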
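A minimal sketch of the sample versions follows: the covariance uses the n - 1 denominator, and the correlation divides it by the two sample standard deviations. The study-hours and score data are hypothetical.

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Covariance standardised by both sample standard deviations; lies in [-1, 1]."""
    s_x = sqrt(sample_covariance(xs, xs))   # covariance of x with itself is its variance
    s_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical bivariate data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 70, 74]

cov = sample_covariance(hours, score)   # units: hours * score points
r = sample_correlation(hours, score)    # unit free

print(f"Covariance = {cov:.1f}, correlation = {r:.3f}")
```

The covariance changes if either variable is rescaled (for example, hours to minutes), whereas the correlation does not, which is why the correlation is easier to compare across data sets.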