Electrical and Electronic Engineering assignment writing - ELEC2103

Date: 2022-08-31

ELEC2103 Tutorial 5 – Probability, Statistics, and Data Analysis
This tutorial presents some of the MATLAB functions used in probability calculations, basic statistics,
and data analysis. The discrete Markov chain model is introduced and its stationary points identified using
eigenanalysis, linking linear algebra and this class of stochastic model. Based on this, the PageRank algorithm
is discussed, which underpins Google’s websearch engine. Importantly, reading data from and writing data to
a file is demonstrated. Curve fitting is introduced, as is the correlation coefficient.
For a more detailed treatment of statistics and machine learning topics, consult The Elements of
Statistical Learning: Data Mining, Inference, and Prediction, by Hastie, Tibshirani and Friedman (2009, 2nd
Ed, Springer Series in Statistics):
http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
5.1 Combinatorics and Counting
Probability calculations often involve counting combinations of objects, and the study of combinations of
objects is the realm of combinatorics. Many formulas in combinatorics are derived simply by counting the
number of objects in two different ways and setting the results equal to each other. Often the resulting
proof is obtained by a “proof by words.” Formulas are not necessarily derived from manipulations of factorial
functions as some students might think. Some of MATLAB’s combinatorial functions are illustrated in this
section, which are likely familiar to you.
5.1.1 Permutations and combinations
A permutation is an ordered arrangement of the objects in set S. If |S| = n, then there are:
$$n(n-1)(n-2)\cdots(2)(1) = n!$$
different permutations of these n objects. An example is the number of different orders in which a group
of n people could appear in a queue, or in which n labeled balls could be drawn from a bucket.
For these types of counting problems, MATLAB provides a factorial function. However, it is generally
preferable (because it is quicker and can be used for much larger numbers), to use the gamma function to
calculate factorials:
$$\Gamma(n) = \int_0^{\infty} x^{n-1} e^{-x}\,dx$$
and use the fact that Γ (n+ 1) = n! for positive integer values of n. For example,
factorial(5)
gamma(6)
ans =
120
ans =
120
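As a hedged illustration of the remark about large arguments: factorial and gamma both return double-precision values, so both overflow to Inf once the result exceeds the largest representable double (just beyond 170!). The sketch below assumes that working with the logarithm of the factorial is acceptable and uses MATLAB's gammaln function for that; the variable names are illustrative only.

```matlab
% Compare factorial, gamma and gammaln (illustrative sketch).
n = 5;
f1 = factorial(n);        % direct factorial
f2 = gamma(n + 1);        % same value via Gamma(n+1) = n!

% Both functions overflow in double precision for large arguments.
big = gamma(172);         % 171! exceeds realmax, so this is Inf

% gammaln returns log(Gamma(x)) without forming the huge number itself.
logBig = gammaln(172);    % log(171!)

fprintf('5! = %g (factorial), %g (gamma)\n', f1, f2);
fprintf('gamma(172) = %g, gammaln(172) = %.4f\n', big, logBig);
```

Ratios of large factorials, such as those appearing in combinations, can then be formed by subtracting logarithms and exponentiating at the end.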
More generally, a k-permutation arises when we choose k objects in order from a set of n distinct objects.
The total number of different permutations of size k of these n objects is:
$$n(n-1)(n-2)\cdots(n-k+1) = \frac{n!}{(n-k)!}$$
Tutorial 4 ELEC2103/ELEC9103
where dividing by (n − k)!, the number of permutations of the remaining items, truncates the factorial product at the appropriate
point in the series.
Alternatively, we may not care about the order in which the objects are selected. There are two cases.
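First, if we sample with replacement, each of the k draws has n possible outcomes, giving $n^k$ possible sequences of draws (this count still distinguishes order; the number of unordered selections with replacement is $\binom{n+k-1}{k}$).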
The second is referred to as sampling without replacement (i.e. as if we are picking balls from a bucket
and not returning them). In this case, we consider a combination of k objects drawn from n, which is written
$\binom{n}{k}$. We can figure out the formula for $\binom{n}{k}$ just by counting.
First, we know there are n! permutations of all of the objects; set this to the LHS of our expression. Next,
let us construct the RHS by counting the same permutations in a second way. Divide the
n objects into a group of k selected objects and the remaining (n − k) objects; by definition, this can be done in $\binom{n}{k}$ ways. We know that there is a total
of k! permutations of the k selected objects, and likewise, a total of (n − k)! permutations of the (n − k)
remaining objects. Therefore, the total number of permutations of the n objects is:
$$n! = \binom{n}{k}\, k!\,(n-k)!$$
Rearranging this, we have:
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
different combinations of size k of these n objects. Note that we have derived this formula simply by
counting, not by expanding factorial functions. However, now we know the answer, we can also see that the
same expression can be found from the size of the set of k-permutations of n objects, divided by the number
of permutations of the drawn objects themselves, because we don’t care about their order:
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!} = \frac{n(n-1)(n-2)\cdots(n-k+1)}{k(k-1)(k-2)\cdots(2)(1)}$$
MATLAB provides the nchoosek function for calculating combinations, for example, for n = 5, k = 3:
nchoosek(5,3)
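The counting formulas above can be checked numerically. The following is a minimal sketch for n = 5 and k = 3 only; the variable names are illustrative and not part of the tutorial.

```matlab
% Count k-permutations and combinations of n objects (illustrative values).
n = 5;
k = 3;

% k-permutations: n!/(n-k)!
nPerm = factorial(n) / factorial(n - k);

% Combinations: n!/(k!(n-k)!), from factorials and from nchoosek.
nCombFact = factorial(n) / (factorial(k) * factorial(n - k));
nCombBuiltin = nchoosek(n, k);

% Given a vector, nchoosek enumerates the combinations themselves.
subsets = nchoosek(1:n, k);   % one combination per row

fprintf('%d k-permutations, %d combinations\n', nPerm, nCombBuiltin);
```

Dividing the number of k-permutations by k! gives the number of combinations, which matches the number of rows in subsets.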
Exercise 1.
a) Verify the formula above using MATLAB’s factorial and nchoosek functions for a few example numbers, e.g.
(n=12,k=7) and (n=24,k=6).
b) Test the equation below using MATLAB’s nchoosek function and a few example numbers: (n=10, k=4),
(n=20, k=8).
$$\binom{n}{k-1} + \binom{n}{k} = \binom{n+1}{k}$$
If you want to know how to prove the relationship above, read the following, otherwise just skip it.
Proof: By definition, $\binom{n+1}{k}$ is the number of ways to choose k items out of n+1 items. We can break
the items into one group, Group A, with n items and another group, Group B, with only 1 item, and consider
the number of ways to choose items from these two groups. Suppose we pick the 1 item from Group B; we
then have to choose k − 1 items from Group A. The number of ways we can do this is clearly $\binom{n}{k-1}$. We
could also choose not to pick the 1 item in Group B, so that we have to select all k items from Group A. The
number of ways we can do this is $\binom{n}{k}$. We must conclude then that the total number of ways to choose k
items out of n+1 items is the sum of these two cases, given by:
$$\binom{n}{k-1} + \binom{n}{k} = \binom{n+1}{k}.$$
5.2 Stochastic matrices and Markov chains
A Markov chain is a stochastic, discrete-time, discrete-state transition system. Markov chains are used extensively
to model systems with predictable, stochastic state transitions, including activity models, target tracking,
regression models with mode- or regime-switching, simulations of wind and weather patterns, and finance.
They are also used extensively as a computational routine for implementing Bayesian models, which require
efficient resampling methods, an approach called the Markov chain Monte Carlo (MCMC) method.
A three-state Markov chain system is illustrated below:
[Figure: three-state Markov chain with states 1, 2 and 3 drawn as circles]
Here, the three states are illustrated as circles, with chance transitions between them, as indicated by edges.
Each edge has associated with it a probability, such that at each time step, the probability of moving from
state s to state s′ is given by P (s′|s), and edges indicate transitions that have strictly positive probabilities.
Note that, in this example:
• not all states are directly linked with all others, but
• each state can be reached from all others by some path, and
• no two states have a periodic cycle between them.
When the second and third conditions above hold, we say that the Markov chain is ergodic. We are going to
mix an application of probability and eigendecomposition to analyse the steady state behaviour of an ergodic
Markov chain.
Let the state-transition probabilities for the diagram above be given by a matrix:
$$P = \begin{pmatrix} 0.5 & 0.5 & 0 \\ 0.2 & 0.5 & 0.3 \\ 0 & 0.5 & 0.5 \end{pmatrix}$$

The entries in P are read as the probability that, starting in an initial state corresponding to a given row,
the system moves to the state indicated by the column. Importantly, note that all rows sum to 1; this is
the definition of a stochastic matrix. Indeed, you can think of each row of a stochastic matrix as a valid
probability distribution, specifically a categorical distribution. (In the lectures, categorical distributions over
k possible outcomes were mentioned as the basic probability model underpinning multinomial distributions,
in the same way that a Bernoulli distribution over binary outcomes underpins the binomial distribution. What
we see here is a much more general model that includes a notion of state.)
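To make the row convention concrete, here is a minimal sketch on a hypothetical two-state chain Q (deliberately not the three-state matrix P above, which is left for the exercise). It checks the stochastic-matrix property and propagates a starting distribution as a row vector; the tolerance and variable names are assumptions.

```matlab
% Hypothetical two-state chain (not the three-state example in the text).
Q = [0.9 0.1;
     0.5 0.5];

% A stochastic matrix has non-negative entries and rows that sum to 1.
tol = 1e-12;
isStochastic = all(Q(:) >= 0) && all(abs(sum(Q, 2) - 1) < tol);

% Start in state 1 and propagate the distribution as a row vector.
x0 = [1 0];
x1 = x0 * Q;        % distribution after one time step
x2 = x0 * Q^2;      % distribution after two time steps

disp(isStochastic);
disp(x1);
disp(x2);
```

Each propagated vector is again a probability distribution, because multiplying by a stochastic matrix preserves the sum of the entries.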
Exercise 2.
a) Write a script encoding P as a MATLAB matrix, and use a logical test to check that it indeed is a
stochastic matrix.
b) An initial state x0 can be encoded as a vector, with a value of 1 indicating the initial state and zeros
everywhere else. Write a row vector x encoding an initial state of 1.
c) Compute the distribution of states of the Markov chain 2 time steps after starting in state 2 by multiplying
the initial state vector by P, and then multiplying again by P. Repeat the multiplication for 4, 10, 20 and 30 time steps. What is happening to
the resulting distribution of states?
5.2.1 Stationary distribution of a Markov chain
A stationary distribution of a Markov chain is a probability distribution that remains unchanged in the Markov
chain as time progresses. Typically, it is represented as a row vector whose entries are probabilities summing
to 1, and given transition matrix P, it satisfies:
xP = x
In other words, x is invariant under multiplication by the matrix P.
An important result in matrix theory known as the Perron–Frobenius theorem applies to stochastic matrices.
It concludes that a nonzero solution of the equation above exists and is unique to within a scaling factor. If
this scaling factor is chosen so that:
$$\sum_i x_i = 1,$$
then x is a state vector of the Markov chain, and is itself a stationary distribution over states.
In the exercise above, we saw an iterative approach to computing the stationary distribution of a Markov
chain. This process is called the power method, for a reason that should be obvious.
However, note that the equation for the stationary distribution looks very similar to the column vector
equation $Pq = \lambda q$ for eigenvalues and eigenvectors, with $\lambda = 1$. Every stochastic matrix has an eigenvalue of 1,
because each of its rows sums to 1, and you can check that this particular P has three distinct eigenvalues, so P has an eigendecomposition. For a
technical reason, we transpose the matrices to get them into an appropriate form for eigenanalysis. This
allows us to find the stationary distribution as a left eigenvector (as opposed to the usual right eigenvectors)
of the transition matrix. The operations are as follows:
$$(xP)^T = x^T = P^T x^T$$
In other words, the transposed transition matrix $P^T$ has eigenvectors with eigenvalue 1 that, when normalized
to sum to 1, are stationary distributions expressed as column vectors. Therefore, if the eigenvectors of $P^T$ are
known, then so are the stationary distributions of the Markov chain with transition matrix P. In an ergodic
Markov chain, this stationary distribution is unique.
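The left-eigenvector idea can be sketched on a hypothetical two-state chain Q (chosen for illustration; it is not the matrix P of the exercises). The code takes the eigenvector of the transposed matrix whose eigenvalue is closest to 1, rescales it to sum to 1, and compares it with repeated multiplication. The number of iterations is an arbitrary assumption.

```matlab
% Hypothetical two-state chain used only for illustration.
Q = [0.9 0.1;
     0.5 0.5];

% Right eigenvectors of Q.' are left eigenvectors of Q.
[V, D] = eig(Q.');
lambda = diag(D);

% Pick the eigenvalue closest to 1; rescale its eigenvector to sum to 1.
[~, idx] = min(abs(lambda - 1));
piStat = V(:, idx).' / sum(V(:, idx));

% Compare with the power method: repeated multiplication by Q.
x = [1 0];
for step = 1:100
    x = x * Q;
end

disp(piStat);
disp(x);
```

Dividing by the sum also removes the arbitrary sign that eig may attach to the eigenvector, so the result has non-negative entries.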
Exercise 3.
a) Write a script to: (i) compute the left eigenvectors of P , (ii) check that there is at least one eigenvalue
of 1, and (iii) display the associated stationary distribution of the Markov chain P .
b) Compare the answers you got in a) to those from Exercise 2 c).
When there are multiple eigenvectors associated to an eigenvalue of 1, each such eigenvector gives rise to
an associated stationary distribution. However, this can only occur when the Markov chain is reducible (i.e.
can be broken into smaller independent chains).
5.2.2 PageRank algorithm (Google)
The PageRank algorithm was developed by Google’s founders, Larry Page and Sergey Brin. PageRank is
determined entirely by the link structure of the World Wide Web. In the past, it was recomputed about once
a month and did not involve the actual content of any Web pages or individual queries. Google has continued
to improve on PageRank since, but their underlying methods are very much based on the algorithm that is
about to be described. For any particular query, Google finds the pages on the Web that match that query
and lists those pages in the order of their PageRank. The algorithm works as follows.
Imagine going from page to page by randomly choosing an outgoing link from one page to get to the next.
In this way, webpages can be thought of as states on a Markov chain, and web links out of a webpage each
have equal positive probability. However, the random walk can lead to dead ends at pages with no outgoing
links, or cycles around cliques of interconnected pages. If this can happen, the Markov chain is not ergodic.
So, a certain fraction of the time, the algorithm simply chooses a random page from the web.
Formally, let G be an n-by-n connectivity matrix, with gij indicating a web link from page i to page j by
a one, and zeros everywhere else. We are going to fill in the values of P , the n-by-n ergodic Markov chain
transition matrix associated with G for the PageRank algorithm.
Define:
$$r_i = \sum_j g_{ij}, \quad i = 1, \ldots, n$$
to be the row sum of G, or page i’s outdegree. If a state’s outdegree is 0, then it is a dead-end.
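Next, for those states that are not dead-ends, let p be the probability that the random walk follows one of the outgoing links of the current page.
This means that with probability 1 − p an arbitrary page is chosen instead, and we denote:
$$\delta = (1-p)/n$$
as the probability that one particular page is chosen in this way. Assuming a uniform distribution over the links followed, the
probability of moving from page i to page j is given by:
$$p_{ij} = \frac{p\, g_{ij}}{r_i} + \delta \quad \text{if } r_i > 0,$$
where the factor $g_{ij}$ restricts the link-following term to pages that page i actually links to, so that each row of P sums to $p + n\delta = 1$.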
For a dead end, the algorithm picks a new state at random with uniform probability, so that:
$$p_{ij} = \frac{1}{n} \quad \text{if } r_i = 0.$$
These probabilities together make up all the elements of P , the transition matrix of PageRank.
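As a sketch of how these rules fill in P, consider a hypothetical three-page web (not the graph used in the exercise below). The code assumes that the link-following term p/r_i applies only to pages that page i actually links to, i.e. that it is weighted by g_ij; this is an interpretation of the formula, stated here as an assumption, and it is what makes each row sum to 1.

```matlab
% Hypothetical 3-page web: page 1 links to pages 2 and 3, page 2 links to
% page 3, and page 3 has no outgoing links (a dead end).
G = [0 1 1;
     0 0 1;
     0 0 0];
n = size(G, 1);
p = 0.85;                 % probability of following a link
delta = (1 - p) / n;      % probability of jumping to one particular page

r = sum(G, 2);            % outdegree of each page (row sums of G)
P = zeros(n);
for i = 1:n
    if r(i) > 0
        % Assumption: p/r(i) is applied only where a link exists.
        P(i, :) = p * G(i, :) / r(i) + delta;
    else
        P(i, :) = 1 / n;  % dead end: jump to any page uniformly
    end
end

disp(P);
disp(sum(P, 2).');        % each row should sum to 1
```

Every entry of P is strictly positive, so every page can be reached from every other page in a single step.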
Given P, the PageRank of a page is the limiting probability that an infinitely procrastinating visitor lands on
that page. It is the corresponding entry of x∗, which satisfies the stationarity condition x∗P = x∗. Thus, a page has higher PageRank if
other pages with high rank link to it.
Exercise 4. Set p = 0.85, and let the start and end points of edges be given in the following two vectors:
s = (2 6 3 4 4 5 6 1 1)
and
t = (1 1 2 2 3 3 3 4 6)
a) Use the sparse() function to build a sparse matrix G with this data. Consult the MATLAB documentation
on how to use sparse with edge index vectors. Then convert G to a full matrix using full and inspect
it.
b) Calculate P using a for loop over the rows of G. Explain why a for loop is needed here.
c) Redo steps b) and c) from Exercise 2 on P. Discuss what you find. (Hint: the PageRanks for this small
section of the WWW are:
(0.2635 0.2150 0.1580 0.1352 0.0233 0.1352 0.0233 0.0233 0.0233).)
5.3 Reading and Writing Data
Data analysis relies on being able to read and write data to a file. Three formats for reading and writing data
to a file are described in this lab: (i) ASCII, (ii) binary, and (iii) MATLAB binary format (.mat file). Hints
on other approaches to getting data into MATLAB are at the bottom of this section. Note: this material is
essential for your assignment.
5.3.1 ASCII data files
The following MATLAB code shows how to save data in ASCII format (mydata is a 10x10 multiplication table whose entries run
up to 100):
mydata = [1:10];
for ii = 2:10
    mydata = [mydata; ii*mydata(1,:)];
end
save -ascii myfile mydata
Note that the filename is specified first and is then followed by the MATLAB data variable. The data is
saved in 8 digit ASCII format. The data can also be saved in 16 digit ASCII format:
save -ascii -double myfile mydata
Sometimes it is convenient to specify the delimiter that is used to separate the different numbers in one
row of text. The command dlmwrite is used to set the delimiter. For example, to save the data in the variable
mydata in ASCII format using a tab as a delimiter, one would use:
dlmwrite('myfile', mydata, '\t')
If one wanted to use a comma, ',', as a delimiter, one would use the following command:
dlmwrite('myfile', mydata, ',')
ASCII data can be read from a file using two commands: (i) load or (ii) dlmread. If the data is separated
using spaces or tabs, then the command load can be used. For example,
load myfile
In this case, the name of the MATLAB variable in the workspace that contains the data is taken from the
filename.
If the delimiter that has been used to separate the numbers in one row is known, then one can use the
command dlmread to read in the data. For example, the following code can be used to read in data that has
been written using a semicolon ';' as a delimiter.
data = dlmread('myfile', ';')
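A write-then-read round trip is a quick way to confirm that the delimiter used for reading matches the one used for writing. The sketch below uses a small hypothetical matrix and a temporary file name from tempname, both chosen only for illustration. In recent MATLAB releases, writematrix and readmatrix are the newer alternatives to dlmwrite and dlmread.

```matlab
% Round trip through a delimited ASCII file (temporary, illustrative name).
A = (1:3)' * (1:4);              % small 3-by-4 multiplication table
fname = [tempname '.txt'];

dlmwrite(fname, A, ';');         % write with a semicolon delimiter
B = dlmread(fname, ';');         % read back with the same delimiter

disp(isequal(A, B));
delete(fname);                   % remove the temporary file
```

The comparison succeeds here because the values are small integers; dlmwrite writes a limited number of significant digits by default, so non-integer data may need an explicit precision setting.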
Exercise 5.
a) Use the following commands to create the matrix mydata:
mydata = [1:10];
for ii = 2:10
    mydata = [mydata; ii*mydata(1,:)];
end
Write the data in mydata in ASCII format to a file called mydatafile.dat. Use a text editor to have a
look at the file. What is the exact format of the number in the fifth row and second column?
b) Clear the MATLAB workspace (clear) and try to read in the data using the load command:
load mydatafile.dat
What is the name of the variable that is loaded into the workspace?
c) Create the mydata matrix again and use the command dlmwrite to write it to a file called mydatafile.dat
in ASCII format using a colon, ’:’, as the delimiter. Use a text editor to have a look at the file. What
is the exact format of the number in the fourth row and third column?
d) Clear the MATLAB workspace and try to read in the data using the dlmread command. What is the
exact command to read in the data?
5.3.2 MATLAB binary data files
Data saved in ASCII format requires substantially more space on the hard drive and memory than data saved in
binary format. There are several types of binary formats available such as: int16, uint16, float32 and float64.
Type help fread to see the different binary data formats that are available. Type help fopen to see the
format used for opening a file.
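For a custom binary file, the usual sequence is fopen, fwrite or fread, then fclose. The following minimal sketch writes a small hypothetical matrix as int16 values and reads it back; the file name comes from tempname and the choice of int16 is an assumption made for the example.

```matlab
% Write a matrix as raw int16 values and read it back (illustrative sketch).
A = (1:3)' * (1:4);                  % values fit comfortably in int16
fname = [tempname '.bin'];

fid = fopen(fname, 'w');             % open for writing
fwrite(fid, A, 'int16');             % data are written column by column
fclose(fid);

fid = fopen(fname, 'r');             % open for reading
B = fread(fid, size(A), 'int16');    % the reader must supply the shape
fclose(fid);

info = dir(fname);                   % 12 values at 2 bytes each
disp(info.bytes);
delete(fname);                       % remove the temporary file
```

A raw binary file stores no shape or type information, which is why the size and precision have to be passed to fread.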
While writing and reading custom binary files is sometimes necessary, especially when working with data produced by
another program or data collection device, MATLAB has fortunately simplified the procedure when working
within MATLAB. MATLAB binary files are written as follows:
mydata = [1:10];
for ii = 2:10
    mydata = [mydata; ii*mydata(1,:)];
end
% write the data
save myfile mydata % save the data in MATLAB's proprietary binary format
The binary data file is read as follows:
load myfile % this loads the variable mydata into the MATLAB workspace
If one wants to save more than one variable to a MATLAB binary file, it is just as easy:
save myfile mydata1 mydata2 mydata3 % etc
The three MATLAB variables are read in just as before, using load myfile.
Moreover, one can save the entire workspace:
save myfile % This command saves all variables in the workspace
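The same commands are also available in function form, which is convenient inside scripts where the file name is held in a variable. This is a minimal sketch with hypothetical variable contents and a temporary file name; loading into a struct is a design choice that avoids overwriting workspace variables.

```matlab
% Save two variables to a MAT-file and load them into a struct
% (variable contents and file name are illustrative).
mydata1 = magic(4);
mydata2 = 'some text';
fname = [tempname '.mat'];

save(fname, 'mydata1', 'mydata2');   % function form of: save file var1 var2

S = load(fname);                     % fields of S are the saved variables
disp(fieldnames(S));

delete(fname);                       % remove the temporary file
```

After loading, the data are accessed as S.mydata1 and S.mydata2.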
Exercise 6. Create the following data:
mydata = [1:10];
for ii = 2:10
    mydata = [mydata; ii*mydata(1,:)];
end
What are the commands to save this data in MATLAB’s binary format and then load the data back into the
workspace?
In addition to the data types addressed above, MATLAB has a host of features for connecting to data sources:
• For small files, you can use MATLAB’s Data Import tool. From the Home tab, click on the Import Data
command. Choose the file you wish to import, and the Data Import tools will arrange the information
for you. Similarly, you can right-click directly on a file appearing in your Current Folder, and select to
open the Data Import tools from the pop-up menu.
• For MS Excel files (e.g. xls, xlsx) there are xlsread and xlswrite.
• Larger or inconsistently formatted files might require the use of textscan; it is worth looking at the
demonstration of this function by typing open textscanDemo
• MATLAB can connect to databases through the Database Explorer App (e.g. using a visual query
builder), which works for databases that use ODBC or JDBC drivers.
• Automated web-scraping and URL link downloads, unzipping and other data munging features are also available.
Exercise 7. Go to the course page on Canvas and download the SARS data for the number of cases of
severe acute respiratory syndrome (this data was obtained from the World Health Organization website:
http://www.who.int/csr/sarscountry/2003_04_04/en/). The data is stored in two forms: (1) as an ASCII file with
a text header, sars_data.txt; (2) as an Excel spreadsheet data file, sars_data.xls. Download both forms of
the data.
Reading in the text file: Clear all data from the workspace and try to read in the text file. From the
Home tab, click on the Import Data command. In the file selection window, choose sars_data.txt. A preview
window should then pop up showing the data. It will all be highlighted, but you could choose to select a
subset of the data. Click on Import Selection.
Reading in the Excel file: Clear all data from the workspace and try to read in the Excel data file. In
the Current Folder window, choose sars_data.xls. Right click and select Import Data. A preview window
should then pop up showing the data. Click on Import Selection. Now compare what happens if you double
click the files or select Open after right-clicking in the Current Folder window. Why might MATLAB treat
.txt and .xls files differently?
Type whos at the command prompt. You should now have the following variables in your workspace: Day,
Month, HongKong, Singapore, Canada, Australia, China, Worldwide. Each variable contains the data in the
corresponding column. If one must do a substantial amount of linking of data between Excel and MATLAB,
then one should really use the Excel link that MATLAB provides (or use a database). The Excel link is not
described here.
5.4 Fitting a Curve to Data (SARS Epidemic)
Frequently it is important to estimate the functional form of data by fitting a curve through the data points.
This is known as regression analysis. MATLAB provides a simple GUI interface via the plot window to perform
this task. Start with plotting the data (we are using the SARS data that was loaded into the workspace in
the exercise above).
Exercise 8. We would like to examine the growth in the number of SARS cases as a function of time. The
first two columns of the data variable contain the Day and Month. We have data for the months of March
2003 and April 2003. We can convert the Day and Month data to a serial timeline as follows:
time = (Month-3)*31 + Day;
Explain in your own words the calculation contained in the above command line.
The best data available are those for Hong Kong, Singapore, and Canada. Plot the data for ’HongKong’:
plot(time, HongKong, 'o')
Within the plot window, click on the Tools menu and select Basic Fitting. This opens a curve fitting window.
Click on the arrow at the bottom right of this window and then once again to expand the window to its
full extent. Check the method linear. The numerical results of the curve fit are shown in the middle window
pane and the curve fitted to the data is shown in the plot window. Check the box marked Show equations.
This shows the functional form of the curve in the plot window. Change Bar plot (under the plot residuals
box) to Scatter plot. Check the box marked plot residuals. The residual error in the curve approximation is
shown in the plot window.
Exercise 9. The curve fitted to the data is derived using a minimum least-squares error approximation. For
example, suppose there are only three data points: $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$. Let:
$$e_i = y_i - m x_i - b$$
be called the regression's residuals, that is, the part of the data's variation that is not explained by the
model. The sum of squared errors (SSE) in the linear fit, y = mx + b, is given by:
$$SSE = (y_1 - m x_1 - b)^2 + (y_2 - m x_2 - b)^2 + (y_3 - m x_3 - b)^2.$$
The squared terms are the squared residuals, $e_i^2$, and the fitting problem is to minimise the sum of these values
(this is what least squares refers to). That is, the fitted curve minimises the sum of squared errors.
Think about how this is related to the definition of variance, and how it is related to the QR decomposition
for an overdetermined system (i.e. with more equations than variables).
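The three-point case can be worked numerically. The sketch below uses hypothetical data points (not the SARS data) and fits the line in two ways: with polyfit, and by writing the fit as an overdetermined linear system solved with the backslash operator, which returns the least-squares solution when the system has more equations than unknowns.

```matlab
% Straight-line least-squares fit to three hypothetical points.
x = [1; 2; 3];
y = [2; 4; 5];

% Fit y = m*x + b with polyfit (degree 1).
c = polyfit(x, y, 1);
m = c(1);
b = c(2);

% Residuals and their sum of squares.
e = y - (m*x + b);
SSE = sum(e.^2);

% The same fit as an overdetermined system A*[m; b] = y; for a
% non-square A, backslash returns the least-squares solution.
A = [x ones(size(x))];
coef = A \ y;

fprintf('m = %.4f, b = %.4f, SSE = %.4f\n', m, b, SSE);
```

Both routes give the same slope and intercept; any other choice of m and b produces a larger SSE for these points.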
5.4.1 Function Fitting
Suppose we would like to know whether the growth in the number of SARS cases is following a linear, quadratic
(power of 2), or exponential growth law. We can apply the following information:
1. Linear functional dependence (y = mx + b) is shown by obtaining a straight line when plotting the
linear data, y versus x.
2. Power law functional dependence ($y = b x^m$) is shown by obtaining a straight line when plotting log(y)
versus log(x).
3. Exponential functional dependence ($y = b e^{mx}$) is shown by obtaining a straight line when plotting
log(y) versus x.
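The three statements above can be tried on synthetic data before applying them to real measurements. This sketch generates noise-free exponential data with hypothetical parameters (b = 2, m = 0.3), recovers them from a straight-line fit of log(y) against x, and shows that the power-law transformation does not straighten the same data.

```matlab
% Synthetic, noise-free data following y = b*exp(m*x) (hypothetical values).
bTrue = 2;
mTrue = 0.3;
x = (1:10)';
y = bTrue * exp(mTrue * x);

% Exponential law: log(y) = log(b) + m*x is a straight line in x.
cExp = polyfit(x, log(y), 1);
mExp = cExp(1);
bExp = exp(cExp(2));

% Power-law check on the same data: log(y) versus log(x) is not a
% straight line here, so a linear fit leaves larger residuals.
cPow = polyfit(log(x), log(y), 1);
resExp = log(y) - polyval(cExp, x);
resPow = log(y) - polyval(cPow, log(x));

fprintf('m = %.4f, b = %.4f\n', mExp, bExp);
fprintf('max residual: exp %.2e, power %.2e\n', ...
        max(abs(resExp)), max(abs(resPow)));
```

The slope of the straight line gives m, and the intercept gives log(b), so b is recovered by exponentiating the intercept.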
Exercise 10.
a) Explain in your own words what the statements 1–3 above mean.
b) Check if it is possible that there is a power law growth in the number of SARS cases for Hong Kong;
plot the data as follows:
figure(2)
plot(log(time), log(HongKong), 'o');
Does the plot look linear? Perform a linear curve fit as before. What power of x does the plot predict?
c) Check if it is possible that there is an exponential growth in the number of SARS cases for Hong Kong.
Plot the data as follows:
plot(time, log(HongKong), 'o');
Does the plot look linear? Perform a linear curve fit as previously. What exponential power does the
plot predict?
5.4.2 Goodness-of-fit
The quality of a curve fit is characterized by its coefficient of determination, aka r-squared or $r^2$ value. To
understand this, it’s useful to consider the following expression, which relates the total variation of y around
its mean $\bar{y}$ to the variation explained by the model and the variation in the residuals:
$$\sum_{i=1}^{N} (\bar{y} - y_i)^2 = \sum_{i=1}^{N} (f(x_i) - \bar{y})^2 + \sum_{i=1}^{N} (y_i - f(x_i))^2$$
where $\bar{y}$ is the mean of the y-values and there are N data points. The term on the left is called the total
sum of squares (TSS), interpreted as the total squared variation. The first term on the right is the sum of squares of
the regression (SSR), and the final term is the sum of squares of the errors (SSE), or sometimes the residual
sum of squares (you saw the SSE term in Exercise 9). Given these terms, we can rewrite the expression above
simply as:
$$TSS = SSR + SSE$$
from which we see the total sum of squares (TSS) is divided among the variation explained by the model
(SSR) and that which is left to the residuals (SSE).
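With this intuition in mind, the r-squared value is defined equivalently as follows:
$$r^2 = 1 - \frac{\sum_{i=1}^{N} (f(x_i) - y_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} = 1 - \frac{SSE}{TSS} = \frac{\sum_{i=1}^{N} (f(x_i) - \bar{y})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} = \frac{SSR}{TSS}.$$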
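The decomposition and both forms of the r-squared value can be checked on a small example. The sketch below uses three hypothetical data points (not the SARS data) and a straight-line fit from polyfit; the variable names are illustrative.

```matlab
% r-squared for a straight-line fit to three hypothetical points.
x = [1; 2; 3];
y = [2; 4; 5];

c = polyfit(x, y, 1);
f = polyval(c, x);          % fitted values f(x_i)
ybar = mean(y);

TSS = sum((y - ybar).^2);   % total sum of squares
SSR = sum((f - ybar).^2);   % variation explained by the regression
SSE = sum((y - f).^2);      % variation left in the residuals

r2FromSSE = 1 - SSE / TSS;
r2FromSSR = SSR / TSS;

fprintf('TSS = %.4f, SSR = %.4f, SSE = %.4f\n', TSS, SSR, SSE);
fprintf('r^2 = %.4f (or %.4f)\n', r2FromSSE, r2FromSSR);
```

The two forms agree here because the line is a least-squares fit that includes an intercept; for a curve fitted in some other way, TSS = SSR + SSE need not hold and the two expressions can differ.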
Exercise 11. Calculate the r-squared value for the three different curve fits of the Hong Kong SARS data.
You will need to write a MATLAB script or function file to perform the calculations. Which curve appears to
fit the best?
5.5 Correlation Analysis
With the curve fitting analysis performed in the previous section, one of the variables, time, was deterministic.
We can measure it basically without error. Such an analysis with an independent, deterministic variable is
referred to as regression analysis. Sometimes, however, both the x and y variables are random. In this case,
it is customary to perform what is known as correlation analysis.
Load in MATLAB’s demo data file count.dat:
load count.dat
size(count)
ans =
24 3
These values tell us that in this example data, there are 24 observations of 3 variables. The count data refers
to the number of lines, words, and byte count of a document. As such, we expect that the data are strongly
correlated. Let us see if we can show it.
Correlation data analysis techniques often require the data to be normalized using a z-score method. The
z-score method produces data with a mean of zero and a standard deviation equal to one. It is calculated for
each data point by subtracting the mean, $\bar{x}$, and dividing by the standard deviation, $s_X$:
$$z_i = \frac{x_i - \bar{x}}{s_X}$$
The z-score normalization can be performed without a for loop using outer products and built-in functions
mean and std:
e = ones(size(count,1),1);
countz = count - e*mean(count);
countz = countz ./ (e*std(countz));
Exercise 12. This exercise is for you to work through, and is not assessed. Type the following commands at
the MATLAB prompt and explain what they do and how they work in your own words. For z-score normalized
data, the correlation coefficient (aka Pearson’s r) is calculated as:
countz'*countz/(24-1) % 24 is the number of data points
ans =
1.0000 0.9331 0.9599
0.9331 1.0000 0.9553
0.9599 0.9553 1.0000
For unnormalized data, the function corrcoef performs the same calculation:
corrcoef(count)
ans =
1.0000 0.9331 0.9599
0.9331 1.0000 0.9553
0.9599 0.9553 1.0000
What does this correlation matrix mean and represent? First of all, correlation coefficients range between
-1.0 and 1.0. A value of 1.0 means perfect positive correlation and a value of -1.0 means perfect inverse
correlation. A value of 0.0 indicates that no correlation exists.
Consider now the matrix of correlation values. The (1,1)-element of the matrix is the correlation of the
first column of data (of the count matrix) with itself. This correlation is 1.0 (perfect correlation, as one would
expect). The (2,2)-element and the (3,3)-element of the matrix are the correlation of the second column of
data with itself and the correlation of the third column of data with itself, respectively. These are also 1.0 as
expected.
The (1,2) and (2,1) elements of the matrix are the same and give the correlation of the first column of
data with the second column of data. The (1,3) and (3,1) elements of the matrix are the same and give the
correlation of the first column of data with the third column of data. Similarly, the (2,3) and (3,2) elements
of the matrix are the correlation of the second column of data with the third column of data. The correlation
coefficient tests whether there is a strong linear dependence among the three columns of data. As the count
data refers to the number of lines, words, and byte count of a document, the data are strongly correlated.
This can be seen in the correlation coefficients which are all close to 1.
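The equivalence between the z-score calculation and corrcoef can be confirmed on any data set. The sketch below uses a small hypothetical matrix rather than count.dat, and it assumes a MATLAB release that supports implicit expansion (R2016b or later), so the outer product with a vector of ones is not needed.

```matlab
% Correlation matrix from z-scores, on a small hypothetical data set.
X = [1 2.0;
     2 4.5;
     3 5.0;
     4 8.2];
N = size(X, 1);

% z-score each column; the row vectors mean(X) and std(X) are expanded
% across the rows of X (implicit expansion).
Z = (X - mean(X)) ./ std(X);

R1 = Z' * Z / (N - 1);   % correlation from the normalised data
R2 = corrcoef(X);        % built-in function applied to the raw data

disp(R1);
disp(R2);
```

The division by N − 1 matches the normalisation that std uses by default, which is why the diagonal entries come out as exactly 1.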
As a final comment, regarding your assignment, note that there is a group of functions that provide
basic data analysis capabilities: max, min, mean, median, std, sort, sum, prod, cumsum, histogram and more.
These functions are column-oriented. That is to say, they operate on columns of the matrix. Use MATLAB’s
help function to learn more about these functions and their usage. Also read the text (Moore, Ch 13.1-13.3)
to learn more about how to use MATLAB’s curve fitting tools, and explore the classification learner and
distribution fitting apps.