Date: 2022-11-21
HSUHK 1 AMS2640 Study Notes
The Hang Seng University of Hong Kong
Department of Mathematics, Statistics and Insurance
AMS2640 Statistical Computing in Practice
Study Notes
Lesson 5 – SAS Graphics & Testing
5.1 Plot Basic SAS Graphs
5.2 Point and Interval Estimations
5.3 Hypothesis Testing
5.1 Plot Basic SAS Graphs
SAS can create bar charts, histograms, box plots, series plots, step plots, needle plots,
scatter plots, and confidence limits. SAS also allows you to overlay multiple graphs of the same
or different types as long as combining them makes sense. Additional statements and options
allow you to control the axes and legends and add reference lines.
5.1.1 Bar Charts
Bar charts are designed to show the distribution of categorical data.
To plot a bar chart with vertical or horizontal bars
o Use the SGPLOT procedure
(1) For horizontal bars, specify the keyword HBAR instead of VBAR
PROC SGPLOT DATA=data-set;
VBAR variable-name / options;
(2) Selected options are listed below:
RESPONSE=variable-name  specifies a numeric variable to be summarized
STAT=statistic          specifies a statistic, either FREQ, MEAN, or SUM.
                        FREQ is the default if there is no response
                        variable. SUM is the default when you specify a
                        response variable
GROUP=variable-name     specifies a variable used to group the data in a
                        stacked bar chart
BARWIDTH=n              sets the width of bars. Values range from 0.1 to 1
                        with a default of 0.8
TRANSPARENCY=n          sets the transparency of graph features such as
                        bars, lines, or markers. Values range from 0 to 1
                        with a default of 0
SAS Code Example 5.1:
A chocolate manufacturer is considering whether to add four new varieties of chocolate (80%
cacao, Earl Grey, ginger, and pear) to its line of products. Each person rated the uniqueness of
their favorite flavor (from 1 for not unique to 5 for very unique) and gave the price they would be
willing to pay. The data contain each person's age and age group, followed by their favorite flavor,
its uniqueness score, and the price they are willing to pay for the chocolate bar. Notice that each
line of data contains three responses. Plot a bar chart for variable Flavor.
32 A Pear 4 12 54 A 80%Cacao 1 20 66 A EarlGrey 4 19
4 C 80%Cacao 2 15 45 A Ginger 5 17 11 C Pear 4 25
8 C 80%Cacao 3 14 7 C Pear 2 19 9 C Pear 2 18
40 A EarlGrey 3 28 23 A 80%Cacao 1 21 4 C 80%Cacao 2 13
43 A Ginger 5 17 38 A Pear 3 23 5 C EarlGrey 3 31
6 C 80%Cacao 1 14 51 A 80%Cacao 2 23 22 A EarlGrey 5 16
49 A 80%Cacao 2 23 11 C Pear 1 21 7 C Pear 3 22
29 A 80%Cacao 2 20 10 C Pear 3 17 12 C 80%Cacao 3 28
12 C Ginger 4 10 18 A Pear 2 15 9 C EarlGrey 2 25
13 C 80%Cacao 2 18 37 A Ginger 3 11 14 C Pear 3 13
* Read three observations per data line (the trailing @@ holds the line);
DATA choco5_1;
INFILE 'C:\Users\elainemo\chocolate.dat';
INPUT Age AgeGroup $ Flavor $ Unique Buy @@;
RUN;
PROC PRINT DATA = choco5_1;
RUN;
* Bar charts for favorite flavor;
PROC SGPLOT DATA = choco5_1;
VBAR Flavor / GROUP = AgeGroup;
TITLE 'Example 5.1a Chocolate Flavors by Age Group';
RUN;
PROC SGPLOT DATA = choco5_1;
VBAR Flavor / RESPONSE = Unique STAT = MEAN;
TITLE 'Example 5.1b Unique Ratings for Flavors';
RUN;
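The HBAR keyword and the BARWIDTH= and TRANSPARENCY= options listed above are not exercised in Example 5.1. A minimal sketch, assuming the choco5_1 data set from Example 5.1 already exists, might look like the following. The title text and the option values 0.6 and 0.3 are illustrative choices, not part of the original notes.

```sas
* Horizontal bars: mean price willing to pay for each flavor;
PROC SGPLOT DATA = choco5_1;
   HBAR Flavor / RESPONSE = Buy STAT = MEAN
                 BARWIDTH = 0.6 TRANSPARENCY = 0.3;
   TITLE 'Sketch: Mean Price by Flavor (horizontal bars)';
RUN;
```

Because a response variable is given, STAT=MEAN must be stated explicitly; otherwise the bars would show the SUM of Buy.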
5.1.2 Histograms and Box Plots
For continuous data, we can use two kinds of distribution plots: histograms and
box-and-whisker plots.
To plot a histogram
o Use the SGPLOT procedure
(1) You can overlay a density curve on a histogram using the DENSITY statement,
where the TYPE= option is NORMAL by default
PROC SGPLOT DATA=data-set;
HISTOGRAM variable-name / options;
DENSITY variable-name / TYPE=distribution-type;
(2) Selected options are listed below:
SCALE=scaling-type  specifies the scale for the vertical axis, either
                    PERCENT (the default), COUNT, or PROPORTION
SHOWBINS            places tick marks at the midpoints of the bins. By
                    default, tick marks are placed at regular intervals
                    based on minimum and maximum values
To plot vertical or horizontal box-and-whisker plots
o Use the SGPLOT procedure
(1) For horizontal boxes, specify the keyword HBOX instead of VBOX
PROC SGPLOT DATA=data-set;
VBOX variable-name / options;
(2) Selected options are listed below:
CATEGORY=variable  specifies a categorical variable. One box plot will
                   be created for each value of the categorical
                   variable
SAS Code Example 5.2 (same dataset as example 5.1):
Use the previous example chocolate.dat with variable Age, AgeGroup, Flavor, Unique, and
Buy. Plot a histogram for variable Unique and a boxplot for variable Unique by Flavor.
DATA choco5_2;
SET choco5_1;
RUN;
* Create histogram;
PROC SGPLOT DATA = choco5_2;
HISTOGRAM Unique / SHOWBINS;
DENSITY Unique;
TITLE 'Example 5.2a Histogram';
RUN;
* Create box plot;
PROC SGPLOT DATA = choco5_2;
VBOX Unique / CATEGORY = Flavor;
TITLE 'Example 5.2b Box Plot';
RUN;
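The syntax above also allows a non-normal density overlay and horizontal boxes, which Example 5.2 does not show. As a hedged sketch, assuming choco5_1 from Example 5.1 is available, the following plots the continuous variable Buy instead of the discrete score Unique. The choice of Buy, AgeGroup, and the titles is illustrative only.

```sas
* Histogram of Buy with a kernel density estimate overlaid;
PROC SGPLOT DATA = choco5_1;
   HISTOGRAM Buy;
   DENSITY Buy / TYPE = KERNEL;
   TITLE 'Sketch: Histogram of Buy with Kernel Density';
RUN;
* Horizontal box plots of Buy, one box per age group;
PROC SGPLOT DATA = choco5_1;
   HBOX Buy / CATEGORY = AgeGroup;
   TITLE 'Sketch: Box Plot of Buy by Age Group';
RUN;
```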
5.1.3 Scatter Plots
Scatter plots are an effective way to show the relationship between two continuous
variables.
To plot a scatter plot
o Use the SGPLOT procedure
(1) The general SCATTER statement is
PROC SGPLOT DATA=data-set;
SCATTER X=horizontal-var Y=vertical-var / options;
(2) Selected options are listed below:
GROUP=variable  specifies a third variable to be used for grouping the
                data
SAS Code Example 5.3 (same dataset as example 5.1):
Use the previous example chocolate.dat with variable Age, AgeGroup, Flavor, Unique, and
Buy. Plot a scatter plot of Buy against Age.
DATA choco5_3;
SET choco5_1;
RUN;
* Scatter plot of Buy against Age;
PROC SGPLOT DATA = choco5_3;
SCATTER X=Age Y=Buy;
LABEL Age = 'Age'
Buy = 'Price willing to buy';
TITLE 'Example 5.3 Scatter plot';
RUN;
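The GROUP= option listed above can be added to the same plot. A minimal sketch, assuming the choco5_3 data set from Example 5.3 (the title is an illustrative choice):

```sas
* Scatter plot of Buy against Age, with markers distinguished by age group;
PROC SGPLOT DATA = choco5_3;
   SCATTER X = Age Y = Buy / GROUP = AgeGroup;
   TITLE 'Sketch: Scatter Plot Grouped by Age Group';
RUN;
```

SAS adds a legend for the grouping variable automatically.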
5.1.4 Series Plots
A series plot is appropriate for data that must be displayed in a specific order. Dates and
times of any kind are good candidates for series plots. In a series plot, data points are
connected by a line.
To plot a series plot
o Use the SGPLOT procedure
(1) In order to have the points connected properly, your data must be sorted by
your horizontal variable using PROC SORT
PROC SGPLOT DATA=data-set;
SERIES X=horizontal-var Y=vertical-var / options;
(2) Selected options are listed below:
GROUP=variable  specifies a third variable to be used for grouping the
                data. A separate line is created for each unique
                value of the grouping variable
MARKERS         adds data point markers to the lines
SAS Code Example 5.4:
The following data tempeture.dat compares average high temperatures in three cities:
International Falls, Minnesota; Raleigh, North Carolina; and Yuma, Arizona. The variables are
month and high temperature in International Falls, Raleigh, and Yuma. Temperatures are in
Fahrenheit, and each line of data contains data for two months.
1 12.2 50.7 68.5 2 20.1 54.5 74.1
3 32.4 63.7 79.0 4 49.6 72.7 86.7
5 64.4 79.7 94.1 6 73.0 85.8 103.1
7 78.1 88.7 106.9 8 75.6 87.4 105.6
9 64.0 82.6 101.5 10 52.2 72.9 91.0
11 32.5 63.9 77.5 12 17.8 54.1 68.9
DATA temp5_4;
INFILE 'C:\Users\elainemo\tempeture.dat';
INPUT Month MN NC AZ @@;
RUN;
* Plot average high temperatures by city;
PROC SGPLOT DATA = temp5_4;
SERIES X = Month Y = MN;
SERIES X = Month Y = NC;
SERIES X = Month Y = AZ;
TITLE 'Example 5.4 Series plot';
RUN;
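The data in Example 5.4 happen to be in month order already, so the sorting step described above is not visible there. A sketch of the general pattern, assuming the temp5_4 data set from Example 5.4, is given below. The output data set name temp_sorted is a hypothetical name introduced for illustration.

```sas
* Sort by the horizontal variable so points are connected in month order;
PROC SORT DATA = temp5_4 OUT = temp_sorted;
   BY Month;
RUN;
* Series plot with a marker at each data point;
PROC SGPLOT DATA = temp_sorted;
   SERIES X = Month Y = MN / MARKERS;
   SERIES X = Month Y = NC / MARKERS;
   SERIES X = Month Y = AZ / MARKERS;
   TITLE 'Sketch: Series Plot with Markers';
RUN;
```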
5.1.5 Options available for plot
When plotting the graphs above, it is often useful to add reference lines, which help to
show which points are above or below important levels, and to specify options for the
axes.
To add horizontal or vertical reference lines
o Use the REFLINE statement
(1) The values can be specified either as a list, 0 5 10 15 20, or a range, 0 TO 20 BY 5
REFLINE value-list / options;
(2) PROC SGPLOT draws plot elements in the order they are specified. Usually we
add the REFLINE statement after the plot statements, so that the reference line is drawn on top
(3) Options that you may want to add to a REFLINE statement include
AXIS = axis         specifies the axis that contains the reference line
                    values. The default is Y which produces horizontal
                    lines. For vertical reference lines, specify AXIS=X
LABEL=(label-list)  specifies one or more text strings (each enclosed in
                    quotes and separated by spaces) to be used as
                    labels for the reference lines
TRANSPARENCY = n    sets the transparency of the line. Values range
                    from 0 to 1 with a default of 0
To specify options for the axes
o Use the XAXIS and YAXIS statements
(1) You can control the features of the axis (for example, the axis label, grid lines, and
minor tick marks). You can also control the structure of the axis (for example, the
data range, data type, and tick mark values)
XAXIS options;
YAXIS options;
(2) Options that you may want to add to an XAXIS or YAXIS statement include
LABEL = 'label'      specifies a text string enclosed in quotes to be used
                     as the label for an axis in place of the variable name
                     or variable label. The default is the variable label,
                     or if there is no variable label, then the variable
                     name
TYPE = axis-type     specifies the type of axis. DISCRETE is the default
                     for character variables. LINEAR is the default for
                     numeric variables. TIME is the default for variables
                     that have date, time, or datetime formats associated
                     with them. LOG specifies a logarithmic scale and is
                     not a default
VALUES=(value-list)  specifies values for tick marks on axes. Values must
                     be enclosed in parentheses and can be specified
                     either as a list (0 5 10 15 20) or a range (0 TO 20
                     BY 5)
SAS Code Example 5.5 (same datasets as examples 5.1 and 5.4):
Use the previous examples chocolate.dat and tempeture.dat. Here we repeat examples 5.1, 5.3,
and 5.4 with added options.
* Example 5.1;
PROC SGPLOT DATA = choco5_1;
VBAR Flavor / GROUP = AgeGroup;
LABEL Flavor = 'Flavor of Chocolate';
TITLE 'Example 5.5a Chocolate Flavors by Age Group';
RUN;
* Example 5.3;
PROC SGPLOT DATA = choco5_3;
SCATTER X=Age Y=Buy;
XAXIS LABEL = 'Customer age' VALUES = (0 TO 70 BY 10);
YAXIS LABEL = 'Price willing to buy';
TITLE 'Example 5.5b Scatter plot';
RUN;
* Example 5.4;
PROC SGPLOT DATA = temp5_4;
SERIES X = Month Y = MN;
SERIES X = Month Y = NC;
SERIES X = Month Y = AZ;
REFLINE 32 75 / TRANSPARENCY = 0.5
LABEL = ('32 degrees' '75 degrees');
XAXIS TYPE = DISCRETE;
TITLE 'Example 5.5c Series plot';
RUN;
5.2 Point and Interval Estimations
5.2.1 Point Estimation and Normality Test
Before you carry out statistical analysis such as hypothesis testing, it is a good idea to pause
and do a little exploration. Apart from basic statistics such as the mean and standard deviation,
we will explore the coefficient of variation, skewness, kurtosis, extreme values, different
quantiles, and confidence intervals to discover the distribution of the data and gain some
insight before we conduct any statistical tests.
Statistical methods are based on various underlying assumptions. One common assumption
is that a random variable is normally distributed. Normality of data is assumed in many
statistical methods. When this assumption is violated, interpretation and inference may not
be reliable or valid, so we need to conduct the normality test.
To calculate statistics and graph the distribution of a single variable
o Use the UNIVARIATE procedure
(1) Specify one or more numeric variables in a VAR statement. Without a VAR
statement, SAS will calculate statistics for all numeric variables in your data set
PROC UNIVARIATE DATA=data-set;
VAR variable-list;
plot-request variable-list / options;
To test normality of a single variable
o Use the UNIVARIATE procedure
(1) The UNIVARIATE procedure can produce several graphs that are useful for data
exploration. All variables specified in the variable list for the plot must also appear
in the variable list on the VAR statement. Selected plot requests are listed below:
CDFPLOT requests a cumulative distribution function plot
HISTOGRAM requests a histogram
PPPLOT requests a probability-probability plot
PROBPLOT requests a probability plot
QQPLOT requests a quantile-quantile plot
(2) All plots use the normal distribution as the default. Other distributions can be
specified with options including BETA, EXPONENTIAL, GAMMA, LOGNORMAL,
and WEIBULL
SAS Code Example 5.6:
Use tempeture.dat with variables Month, MN (Minnesota), NC (North Carolina), and AZ (Arizona).
Observe the statistics and distribution plots, then test normality for
variable NC.
DATA temp5_6;
SET temp5_4;
RUN;
PROC UNIVARIATE DATA = temp5_6;
VAR NC;
TITLE 'Example 5.6a';
RUN;
/* Test normality with histogram and probability plot */
PROC UNIVARIATE DATA = temp5_6;
VAR NC;
HISTOGRAM NC/NORMAL;
PROBPLOT NC;
TITLE 'Example 5.6b';
RUN;
The output starts with basic information about your distribution: number of observations (N),
mean, and standard deviation. Skewness indicates how asymmetrical the distribution is
(whether it is more spread out on one side) while kurtosis indicates how flat or peaked the
distribution is. The normal distribution has values of 0 for both skewness and kurtosis. Other
sections of the output contain three measures of central tendency: mean, median, and mode;
tests of the hypothesis that the population mean is 0; quantiles; and extreme observations (in
case you have outliers).
Part b creates a histogram of the NC variable with the normal distribution overlaid and a
probability plot using the normal distribution. In the HISTOGRAM statement, the NORMAL option
fits a normal density and reports goodness-of-fit tests for it. Separately, the NORMAL option in
the PROC UNIVARIATE statement performs normality tests, and the PLOTS option draws a box
plot and a normal probability plot.
For the fitted normal distribution, SAS provides three different statistics for testing normality.
The Kolmogorov-Smirnov statistic of 0.1433 does not reject the null hypothesis that the variable
is normally distributed at the 0.05 level (p > 0.15). Similarly, the Cramer-von Mises and
Anderson-Darling tests do not reject the null hypothesis. If the assumption of normality is not
reasonable, you should analyze the data with a nonparametric method, such as the Wilcoxon
rank sum test in PROC NPAR1WAY for comparing two groups.
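The two alternatives mentioned in this discussion can be sketched as follows, assuming the temp5_6 data set from Example 5.6 and the choco5_1 data set from Example 5.1. Using AgeGroup and Buy for the Wilcoxon test is an illustrative choice, since the temperature data have no grouping variable. The titles are also illustrative.

```sas
* NORMAL in the PROC statement prints the Tests for Normality table;
PROC UNIVARIATE DATA = temp5_6 NORMAL;
   VAR NC;
   QQPLOT NC / NORMAL(MU = EST SIGMA = EST);
   TITLE 'Sketch: Normality tests and Q-Q plot for NC';
RUN;
* Nonparametric comparison of two groups (Wilcoxon rank sum test);
PROC NPAR1WAY DATA = choco5_1 WILCOXON;
   CLASS AgeGroup;
   VAR Buy;
   TITLE 'Sketch: Wilcoxon rank sum test of Buy by AgeGroup';
RUN;
```

MU=EST and SIGMA=EST ask SAS to draw the reference line using the sample mean and standard deviation.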
5.2.2 Confidence Interval
A confidence interval gives an estimated range of values which is likely to include an
unknown population parameter, the estimated range being calculated from a given set of
sample data. Apart from the UNIVARIATE procedure, we can also obtain
confidence intervals using the MEANS procedure.
To create a confidence interval for a single variable
o Use the MEANS procedure
(1) The default is ALPHA=0.05, which gives 95% confidence limits. If you want a
different confidence level, then request it with the ALPHA= option
PROC MEANS DATA=data-set ALPHA=value CLM;
SAS Code Example 5.7:
Use tempeture.dat with variables Month, MN (Minnesota), NC (North Carolina), and AZ (Arizona).
Calculate a 90% confidence interval for the mean of variable NC.
DATA temp5_7;
SET temp5_4;
RUN;
/* Calculate confidence interval */
PROC MEANS DATA=temp5_7 N MEAN CLM ALPHA=.10;
VAR NC;
TITLE 'Example 5.7';
RUN;
The confidence limits tell us that we are 90% confident that the true population mean falls between
64.25 and 78.54 degrees Fahrenheit.
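The limits reported by CLM can be reproduced by hand with the usual t-based interval for a mean. As a sketch, the sample mean, standard deviation, and t quantile below were computed from the twelve NC values listed in Example 5.4 and rounded. They are not copied from the SAS output.

```latex
\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}
  = 71.392 \pm 1.796 \times \frac{13.784}{\sqrt{12}}
  \approx 71.392 \pm 7.146
  \approx (64.25,\ 78.54)
```

Here n = 12 and alpha = 0.10, so the quantile is taken from the t distribution with 11 degrees of freedom.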
5.2.3 Correlation
A correlation coefficient measures the strength of the linear relationship between two variables,
or how correlated they are. If two variables were completely unrelated, they would have a
correlation of 0. If two variables were perfectly correlated, they would have a correlation
of 1 or –1.
To compute correlations between all the numeric variables
o Use the CORR procedure
(1) By default, PROC CORR computes Pearson product-moment correlation
coefficients. You can add the SPEARMAN option to request Spearman's rank
nonparametric correlation coefficients
PROC CORR DATA=data-set PLOTS = (plot-list);
VAR variable-list;
WITH variable-list;
(2) Variables listed in the VAR statement appear across the top of the table of
correlations, while variables listed in the WITH statement appear down the side of
the table. If you use a VAR statement but no WITH statement, then the variables
appear both across the top and down the side
(3) The CORR procedure can produce two types of plots
SCATTER  creates scatter plots for pairs of variables. Prediction or confidence
         ellipses are overlaid on the plot. By default, the scatter plots have
         prediction ellipses. For confidence ellipses, specify the
         ELLIPSE=CONFIDENCE option; to omit ellipses, specify ELLIPSE=NONE.
MATRIX   creates a matrix of scatter plots for all variables. If you do not have
         a WITH statement, then matrix plots will show a symmetrical plot
         with all variable combinations appearing twice. By default the
         diagonal cells in the matrix will be empty. If you use the
         HISTOGRAM option for the matrix plot, then a histogram will be
         produced for each variable and displayed along the diagonal.
SAS Code Example 5.8 (same dataset as example 5.1):
Use the previous example chocolate.dat with variable Age, AgeGroup, Flavor, Unique, and
Buy. Calculate the Pearson correlation for variable age and unique with price willing to buy.
DATA choco5_8;
SET choco5_1;
RUN;
* Calculate correlation;
PROC CORR DATA = choco5_8;
VAR Age Unique;
WITH Buy;
TITLE 'Example 5.8a';
RUN;
* Calculate correlation without a WITH statement;
PROC CORR DATA = choco5_8 PLOTS=MATRIX(HISTOGRAM);
VAR Age Unique Buy;
TITLE 'Example 5.8b';
RUN;
This report starts with descriptive statistics for each variable and then lists the correlation
matrix. In each cell, the first row is the Pearson correlation coefficient and the second row is the
probability of getting a larger absolute value for that correlation, assuming the population
correlation is zero.
In this example, age and price willing to pay are less strongly correlated than uniqueness and
price willing to pay; the first pair is positively correlated and the second is negatively correlated.
This means that older customers are willing to pay a slightly higher price, while the more unique
the chocolate tastes, the lower the price people are willing to pay. Both correlations are weak.
Part b is shown below, where the variables appear both across the top and down the side of the
correlation matrix. Variables age and unique show a positive correlation of 0.23541. Variables
unique and price willing to pay show a negative correlation of -0.13040, while there is almost no
correlation (0.08687) between age and price willing to pay.
Notice that the PLOTS=MATRIX(HISTOGRAM) option plots the graph below. Because unique is
a discrete score from 1 to 5, the scatter plots do not show much information in this case.
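Since Unique is an ordinal score, rank-based correlation may be a reasonable complement to the Pearson coefficients. A minimal sketch, assuming the choco5_8 data set from Example 5.8, requests both kinds of coefficient together with the SCATTER plot type and confidence ellipses described earlier. The title is an illustrative choice.

```sas
* Pearson and Spearman correlations of Age and Unique with Buy,
  with scatter plots that show confidence ellipses;
PROC CORR DATA = choco5_8 PEARSON SPEARMAN
          PLOTS = SCATTER(ELLIPSE = CONFIDENCE);
   VAR Age Unique;
   WITH Buy;
   TITLE 'Sketch: Pearson and Spearman correlations with Buy';
RUN;
```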
5.3 Hypothesis Testing
Hypothesis testing is a procedure in statistics whereby an analyst tests an assumption regarding
a population parameter. Here we will introduce hypothesis testing in SAS with one- and two-
sample tests for means and proportions, analysis of variance (ANOVA), and the chi-square test
of independence.
5.3.1 One & Two Sample T-test for Mean
The SAS t-test uses the t statistic, the t distribution, and degrees of freedom to determine the
probability of a difference between populations. A one-sample t-test can be used to compare
a sample mean to a given value. The test has several assumptions.
o The dependent variable must be continuous (interval/ratio)
o The observations are independent of one another
o The dependent variable should be approximately normally distributed
o The dependent variable should not contain any outliers
The two-sample t-test can be used to test whether two population means are different. In addition
to the assumptions of the one-sample t-test, the following need to be satisfied
o The variances of the two populations are equal (otherwise the unequal-variance test
should be considered)
o The two samples are independent (otherwise the paired t-test should be considered)
To conduct a one- or two-sample t-test for continuous variables
o Use the TTEST procedure
(1) Carry out a t-test on a single variable or a pair of variables. The default hypothesis for
a single variable is H0 : µ1 = 0 vs. H1 : µ1 ≠ 0.
PROC TTEST DATA=data-set options;
CLASS variable;
PAIRED variable-list;
VAR variable-list;
(2) Selected options can include:
ALPHA = p  specifies the significance level; confidence intervals have level 100(1 - p)%. By default p = 0.05
SIDES = k  specifies the alternative: 2 for two-sided (the default), U for upper one-sided, or L for lower one-sided
H0 = m     requests tests against m instead of the default H0 : µ1 = 0
COCHRAN    requests the Cochran approximation for the two-sample unequal-variance test
(3) The typical hypotheses for a two-sample t-test are: H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2.
The CLASS statement defines the grouping variable and the VAR statement indicates
which variable's mean will be compared.
(4) Assumptions of the two-sample t-test are that the random samples are independent
and that the populations are normally distributed with equal variances.
(5) When data are collected twice on the same subjects (or on matched subjects), as in a
before-after measure, the proper analysis is a paired t-test. Include the PAIRED statement,
such as PAIRED A*B C*D for A-B and C-D paired comparisons.
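None of the examples in these notes uses the PAIRED statement. As a hedged sketch, the temperature data from Example 5.4 can serve as matched data, because each month is measured in every city. Treating the months as the matched subjects is an assumption made here purely for illustration, and it presumes the temp5_4 data set already exists.

```sas
* Paired t-test: each month is measured in both cities;
PROC TTEST DATA = temp5_4;
   PAIRED AZ*NC;   * tests H0: mean of (AZ - NC) = 0;
   TITLE 'Sketch: Paired t-test of AZ versus NC by month';
RUN;
```

No CLASS or VAR statement is used with PAIRED; the pairs themselves define the differences to be tested.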
SAS Code Example 5.9 (same dataset as example 5.1):
Use the previous example chocolate.dat with variable Age, AgeGroup, Flavor, Unique, and
Buy. Conduct a one-sample t-test of whether the mean price willing to pay is different from $15,
and of whether it is greater than $15.
DATA choco5_9;
SET choco5_1;
RUN;
* One sample 2 sided t-test;
PROC TTEST DATA = choco5_9 H0=15 PLOTS(SHOWH0);
VAR Buy;
TITLE 'Example 5.9a';
RUN;
* One sample 1-sided t-test;
PROC TTEST DATA = choco5_9 H0=15 SIDES=U PLOTS(SHOWH0);
VAR Buy;
TITLE 'Example 5.9b';
RUN;
In parts a and b, variable Buy is assumed to be normally distributed. The 95% confidence
intervals for the mean and standard deviation are computed by default. The limits for the standard
deviation are the equal-tailed variety, per the default CI=EQUAL option.
At the bottom of the output are the degrees of freedom, t statistic value, and p-value for the
t-test. A Q-Q plot is also graphed to check general normality. The linear shape of the Q-Q plot
suggests the data are approximately normal. You could use the UNIVARIATE procedure with
the NORMAL option to numerically check the normality assumption.
In part a, at the 5% α level, this test indicates that the mean price willing to pay is significantly
different from $15, as the t statistic = 4.06 and p-value = 0.0003, which is less than 0.05. In part b,
the SIDES=U option reflects the focus of the research question, namely whether the mean is
greater than $15, rather than different from $15. The interval for the mean is an upper one-
sided interval with a finite lower bound of $17.2879. Again, the p-value indicates that the price
willing to pay is significantly greater than $15.
The summary panel above shows histograms with normal and kernel densities, and
box plots, for the distribution of prices willing to pay, together with the 95% confidence interval.
The panel on the left is two-sided and the one on the right is upper one-sided. The PLOTS(SHOWH0)
option requests that the null value be displayed on the relevant graphs. The Q-Q plots below assess
the normality assumption for price willing to pay.
SAS Code Example 5.10 (same dataset as example 5.1):
Use the previous example chocolate.dat with variable Age, AgeGroup, Flavor, Unique, and
Buy. Conduct a two-sample t-test for variable buy by factor age group.
DATA choco5_10;
SET choco5_1;
RUN;
* Two sample t-test;
PROC TTEST DATA = choco5_10;
CLASS AgeGroup;
VAR Buy;
TITLE 'Example 5.10';
RUN;
For the mean differences, both pooled (assuming equal variances for age group Adult and Child)
and Satterthwaite (assuming unequal variances) 95% intervals are shown. The confidence
limits for the standard deviations are of the equal-tailed variety. The test statistics, associated
degrees of freedom, and p-values are displayed.
The pooled test assumes that the two populations have equal variances and uses degrees of
freedom n1+n2-2, where n1 and n2 are the sample sizes for the two populations. The
Satterthwaite test does not assume that the populations have equal variances; it uses
the Satterthwaite approximation for degrees of freedom. Both tests give non-significant
p-values of 0.9964, indicating no significant difference between adults and children in price
willing to pay. The Equality of Variances test reveals insufficient evidence of unequal
variances (the Folded F statistic F' = 1.65, with p-value = 0.3706).
The summary panel below shows comparative histograms, normal and kernel densities, and
box plots, comparing the distribution of prices willing to buy between age group. The Q-Q
plots assess the normality assumption for each age group. The plots for both Adult and Child
show no obvious deviations from normality.
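The COCHRAN and ALPHA= options from the option list are not used in Example 5.10. A minimal sketch of how they might be added, assuming the choco5_10 data set from Example 5.10 (the 0.10 level and the title are illustrative choices):

```sas
* Two-sample t-test with the Cochran approximation added
  and 90% confidence limits instead of the default 95%;
PROC TTEST DATA = choco5_10 COCHRAN ALPHA = 0.10;
   CLASS AgeGroup;
   VAR Buy;
   TITLE 'Sketch: Two-sample t-test with COCHRAN option';
RUN;
```

With COCHRAN, the output lists a third test alongside the pooled and Satterthwaite results.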
5.3.2 Analysis of Variance
Previously, the two-sample t-test looked at quantitative outcomes with a categorical explanatory
variable that has only two levels of treatment. The one-way ANOVA can be used for the
case of a quantitative outcome with a categorical explanatory variable that has two or more
levels of treatment. If there are only two treatment levels, one-way ANOVA is equivalent
to a t-test comparing two group means.
The term one-way, or one-factor, indicates that there is a single explanatory variable,
that is, one independent variable. The independent variable is a categorical (discrete)
variable used to form the groupings of observations. In one-way ANOVA, there is only one
dependent variable, and we test hypotheses about the group means of the dependent variable.
We use the term two-way or two-factor ANOVA when the levels of two different
explanatory variables or two independent variables are being assigned, and each subject is
assigned to one level of each factor. However, we will not discuss this advanced topic in
this class.
To conduct one-way ANOVA
o Use the ANOVA procedure
(1) PROC ANOVA can fit one-way or two-way ANOVA models. For one-way
ANOVA with k population means, H0 : µ1 = · · · = µk vs. H1 : not all µi are equal.
PROC ANOVA DATA=data-set;
CLASS variable-list;
MODEL dependent-variable=independent-variable;
(2) The independent variable is specified in the CLASS statement
SAS Code Example 5.11 (same dataset as example 5.1):
Use the previous example chocolate.dat with variable Age, AgeGroup, Flavor, Unique, and
Buy. Conduct one-way ANOVA for price willing to buy as dependent variable and chocolate
flavor as independent variable.
DATA choco5_11;
SET choco5_1;
RUN;
* One-way ANOVA;
PROC ANOVA DATA=choco5_11;
CLASS Flavor;
MODEL Buy=Flavor;
TITLE 'Example 5.11a';
RUN;
* PROC ANOVA is interactive: request Tukey comparisons without refitting;
MEANS Flavor / TUKEY;
RUN;
QUIT;
The degrees of freedom (DF) column should be used to check the analysis results. The model
degrees of freedom for a one-way analysis of variance are the number of levels minus 1. In this
case we have 4 chocolate flavors, so 4 – 1 = 3. The Corrected Total degrees of freedom are
always the total number of observations minus one; in this case 30 – 1 = 29. The sum of Model
and Error degrees of freedom equals the Corrected Total.
The overall F test is significant (F = 3.36, p-value = 0.034), indicating that the model as a
whole accounts for a significant portion of the variability in the dependent variable. The F test
for Flavor is significant, indicating that some contrast between the means for the different
chocolate flavors is different from zero. Notice that the Model and Flavor F tests are identical,
since Flavor is the only term in the model. The ANOVA procedure output includes a box plot of
the dependent variable values within each classification level of the independent variable.
The significant overall F test suggests that there are differences
among the chocolate flavors, but it does not reveal any information about the nature of the
differences.
Mean comparison methods can be used to gather further information. After you specify a model
with a MODEL statement and execute the ANOVA procedure with a RUN statement, you can
execute a variety of statements (such as MEANS, MANOVA, TEST, and REPEATED) without
PROC ANOVA recalculating the model sum of squares. The MEANS statement with the TUKEY
option in Example 5.11 requests means of the Flavor levels with Tukey's studentized range procedure.
Examples of implications of the multiple comparisons results: the price willing to pay for Earl
Grey flavor is significantly higher than for ginger flavor. While the price willing to pay for
80%Cacao is lower than for Earl Grey flavor, the difference is not statistically significant.
5.3.3 One Sample Z-test for Proportion
The One-Sample Proportion Test is used to assess whether a population proportion is
significantly different from a hypothesized value. The hypotheses may be stated in terms
of the proportions, their difference, their ratio, or their odds ratio, but all four hypotheses
result in the same test statistics.
To conduct a one-sample z-test for proportions
o Use the FREQ procedure
(1) PROC FREQ can test a one-sample proportion, by default H0 : p1 = 0.5 vs.
H1 : p1 ≠ 0.5
PROC FREQ DATA=data-set;
TABLES variable-list / BINOMIAL(options);
(2) Selected options can include:
P=p0          requests tests against p0 instead of the default H0 : p1 = 0.5
LEVEL='name'  selects the level of the variable for which the proportion is
              tested, where name is the category value. By default the first
              level is used
SAS Code Example 5.12 (same dataset as example 5.1):
Use the previous example chocolate.dat with variable Age, AgeGroup, Flavor, Unique, and
Buy. Conduct a Z-test for variable age group.
DATA choco5_12;
SET choco5_1;
RUN;
* One sample proportion test for agegroup;
PROC FREQ DATA = choco5_12;
TABLES AgeGroup / BINOMIAL;
TITLE 'Example 5.12';
RUN;
The z-test, with test statistic -0.3651 and two-sided p-value 0.7150, does not provide
evidence that the proportion is significantly different from 0.5.
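The test statistic can be checked by hand. As a sketch, counting the records in the Example 5.1 data listing gives 14 adults (AgeGroup = A) among the 30 respondents. That count was tallied from the listing rather than read from the SAS output.

```latex
\hat{p} = \frac{14}{30} \approx 0.4667, \qquad
z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0)/n}}
  = \frac{0.4667 - 0.5}{\sqrt{0.5 \times 0.5 / 30}}
  \approx -0.3651
```

The proportion tested is that of level A because, without the LEVEL= option, PROC FREQ uses the first level of the variable.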
5.3.4 Two Sample Proportion and Odds
Unlike a proportion, the odds of an event are the probability of the event occurring
divided by the probability of the event not occurring. If two groups have event
probabilities p1 and p2, then the corresponding odds
are p1/(1-p1) and p2/(1-p2). We also define the following measures:
o Risk Difference (RD) = p1 - p2
o Relative Risk or Risk Ratio (RR) = p1 / p2
o Odds Ratio (OR) = odds1/ odds2 = [p1/(1-p1)] / [p2/(1-p2)]
Relative risk is a better measure of association than the difference in proportions when cell
probabilities are close to 0 or 1. For a two-sample z-test, specifying the CHISQ and RISKDIFF
options in the TABLES statement of PROC FREQ is equivalent to the well-known Z test for
comparing two independent proportions.
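To make the three measures concrete, here is a worked sketch using counts tallied by hand from the Example 5.1 data listing: 9 of the 14 adults and 10 of the 16 children chose a flavor other than 80%Cacao. Taking p1 as the adult proportion and p2 as the child proportion is an assumption made for this illustration.

```latex
p_1 = \frac{9}{14} \approx 0.6429, \qquad p_2 = \frac{10}{16} = 0.6250 \\
\mathrm{RD} = p_1 - p_2 \approx 0.0179 \\
\mathrm{RR} = \frac{p_1}{p_2} \approx 1.029 \\
\mathrm{OR} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{9/5}{10/6} = 1.08
```

All three measures are close to their "no difference" values (0 for RD, 1 for RR and OR).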
The most widely used analysis of categorical data is chi-square. It is used to test the
hypothesis of no association between the variables. Another use is to compute measures of
association, which indicate the strength of the relationship between the variables.
To conduct a two-sample z-test for proportions
o Use the FREQ procedure
(1) Specify the CHISQ and RISKDIFF options in the TABLES statement
PROC FREQ DATA=data-set;
TABLES row-variable * column-variable / CHISQ RISKDIFF;
(2) In the TABLES statement, the first variable defines the rows of the table and the second
variable defines the columns
To create statistics and graph the distribution of a categorical variable
o Use the FREQ procedure
(1) PROC FREQ can test categorical data where H0 : There is no
association between the two variables vs. H1 : There is an association between the
two variables.
PROC FREQ DATA=data-set;
TABLES variable-list / options;
(2) Selected options are listed below:
CHISQ     requests chi-square tests of homogeneity and measures of
          association
CL        requests confidence limits for measures of association
EXACT     requests Fisher's exact test for tables larger than 2x2
MEASURES  requests measures of association including Pearson and Spearman
          correlation coefficients, gamma, Kendall's tau-b, Stuart's tau-c,
          Somers' D, lambda, odds ratios, risk ratios, and confidence
          intervals
PLCORR    requests polychoric correlation coefficient
RELRISK   requests relative risk measures for 2x2 tables
RISKDIFF  requests estimates of risks (binomial proportions) and risk
          differences for 2x2 tables
SAS Code Example 5.13 (same dataset as example 5.1):
Use the previous example chocolate.dat with variable Age, AgeGroup, Flavor, Unique, and
Buy. Recode variable Flavor into Cacao or not. Conduct a chi-square test for variable age group
and flavor.
DATA choco5_13;
SET choco5_1;
IF Flavor = '80%Cacao' THEN Cacao = 1;
ELSE Cacao = 0;
RUN;
* Chi-square test for agegroup by Cacao;
PROC FREQ DATA = choco5_13;
TABLES AgeGroup * Cacao / CHISQ RISKDIFF;
TITLE 'Example 5.13a';
RUN;
* Chi-square test for agegroup by flavor;
PROC FREQ DATA = choco5_13;
TABLES AgeGroup * Flavor / CHISQ;
TITLE 'Example 5.13b';
RUN;
For the chi-square test to be valid, the cell counts must not be too small. The usual rule of
thumb is that all expected cell counts should be at least 5. When some cell counts are too small,
you can use Fisher's exact test, which the CHISQ option also provides for 2x2 tables. The Fisher
test, while more conservative, also does not show any significant difference between the
proportions, with p-value = 1.0000.
For a 2x2 table, chi-square = Z², so the p-value for the two-sided Z test is the same as for the
chi-square test. Because the normal distribution is symmetric, the two-sided p-value is double the
one-sided p-value when the observed difference lies in the direction of the alternative. If your
alternative hypothesis was that the proportion of adult non-80%Cacao is greater than the
proportion of child non-80%Cacao (the direction observed in this sample), then the
p-value for this one-sided test would be 0.9193/2 = 0.45965.
If you are interested in the difference in the probability of 80%Cacao flavor between adults
and children, the RISKDIFF option provides an estimate as well as a confidence interval. Since
non-80%Cacao flavor (Cacao = 0) is in Column 1, the "Column 1 Risk Estimates" table provides the
desired estimates. The estimated difference in probabilities (Adult in Row 1 - Child in Row 2)
is 0.0179 with 95% confidence limits (-0.3275, 0.3632). Since the interval includes zero as a
plausible value of the difference in proportions, the difference is not significant at the 0.05 level.
In the part b results below, a table of statistics includes the Pearson chi-square test labelled
“Chi-Square”. The large p-value = 0.4285 for the test indicates that the null hypothesis of no
association cannot be rejected; the data are consistent with equal proportions across age groups.
Assuming age group and chocolate flavor are independent, the probability of obtaining a
chi-square this large or larger by chance alone is 0.4285, so the data do not support the idea that
there is an association between age group and chocolate flavor. Note that our
sample size was kept small for ease of presentation, so the test may not give a reliable conclusion.
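Given the small sample, it may help to inspect the expected cell counts and to request an exact test for the 2 x 4 table. A minimal sketch, assuming the choco5_13 data set from Example 5.13 (the title is an illustrative choice):

```sas
* Show expected cell counts and request Fisher's exact test
  for the table of age group by flavor;
PROC FREQ DATA = choco5_13;
   TABLES AgeGroup * Flavor / CHISQ EXPECTED EXACT;
   TITLE 'Sketch: Expected counts and exact test';
RUN;
```

EXPECTED prints the expected frequency in each cell, so the rule of thumb about small counts can be checked directly against the table.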