FIT3154 Studio 12
Sample Exam Questions
Daniel F. Schmidt
October 19, 2021
Contents
1 Short Answer Questions
2 Statistical Decision Theory
3 Bayesian Inference
4 Shrinkage and Regression
5 Function Approximation and Nonlinear Models
6 Neural Networks
7 Big Data and Optimisation
8 Appendix: Reference Sheet
1 Short Answer Questions
Please provide a short (2-3 sentences) description of the following terms:
A: General comment. When answering short-answer questions of this form (in general), it is a good
idea to use the following basic structure: your first sentence should describe what the object/item of
interest is. The second and third sentences (or fourth, the 2−3 is a guide and not a strict requirement!)
should describe one or two properties of the object. This allows a marker to clearly see that you can
(i) identify the object of interest, and (ii) you know something about the object of interest. All the
answers below follow this basic structure.
1. Posterior distribution
A: The posterior distribution is the central quantity in Bayesian inference. It is a probability
distribution over the various values of the population parameter that takes into account the
information in the data sample and the information encoded in our prior distribution.
2. Shrinkage estimator
A: A shrinkage estimator is a statistical estimation technique that “shrinks” or regularises the
maximum likelihood/least-squares estimates. A shrinkage estimator usually works by reducing
the size of the parameters or coefficients and pushing them towards zero. This is done to reduce
variance at the expense of increased bias, and reduce overall estimation error.
3. Weakly informative prior
A: A weakly informative prior is a type of prior distribution used in Bayesian inference. It is
a prior distribution that allows us to encode prior beliefs about the value of the population
parameter, but that has limited impact on the posterior distribution if the information in the
sample is very strongly at odds with our prior beliefs. A good example of a weakly informative
prior is the Cauchy distribution.
4. Minimax estimator
A: A minimax estimator is an estimator that, among all possible estimators, has the smallest
possible worst-case estimation risk for a particular problem. A minimax estimator solves the
problem
$$\hat{\theta}_{\mathrm{MM}} = \arg\min_{\hat{\theta}} \left\{ \max_{\theta} R(\theta, \hat{\theta}) \right\}$$
5. Lasso regression
A: Lasso regression is a type of penalized linear regression estimator. It works by minimising the
residual sum-of-squared errors plus the sum of the absolutes of the regression coefficients times
a regularisation hyperparameter. The bigger the hyperparameter, the more shrinkage is applied.
The lasso can estimate coefficients to be exactly equal to zero.
6. Curse of dimensionality
A: The curse of dimensionality refers to the collection of problems that arise when trying to
perform statistical inference in high dimensions. It often refers specifically to the fact that the
complexity of many methods increases exponentially as the number of predictors p increases.
7. Dominance and admissability
A: Dominance: an estimator $\hat{\theta}_A$ dominates an estimator $\hat{\theta}_B$ if its risk function is never greater than that of $\hat{\theta}_B$, and is strictly lower for at least one value of the population parameter.
Admissibility: an estimator is said to be admissible if there exists no other estimator that dominates it.
2 Statistical Decision Theory
Consider obtaining a sample of n observations y = (y1, . . . , yn), and assume that yi ∼ N(µ, σ²). Let us imagine we estimate the population mean µ using
$$\hat{\mu}_c(y) = c\,\bar{y}$$
where $\bar{y} = (1/n)\sum_{i=1}^{n} y_i$ is the sample mean, and c > 0 is a positive scaling factor.
1. Derive the squared error risk (expected squared error loss) for the estimator µˆc(y) for arbitrary
c.
A: The bias of the estimator is:
$$\mathrm{bias} = \mathbb{E}[c\bar{y} - \mu] = c\,\mathbb{E}[\bar{y}] - \mu = (c-1)\mu$$
and the variance is
$$\mathbb{V}[c\bar{y}] = c^2\,\mathbb{V}[\bar{y}] = \frac{c^2\sigma^2}{n}$$
so the squared-error risk is
$$\mathbb{E}\left[(\mu - c\bar{y})^2\right] = (c-1)^2\mu^2 + \frac{c^2\sigma^2}{n}$$
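This risk formula can also be checked numerically. Below is a minimal Monte Carlo sketch, assuming Python with NumPy; the particular values of µ, σ, n and c are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, c = 1.5, 2.0, 25, 0.8   # arbitrary illustrative values

# Simulate many samples, apply the estimator c * ybar to each sample,
# and average the squared errors to approximate the risk.
ybar = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)
empirical_risk = np.mean((mu - c * ybar) ** 2)

theoretical_risk = (c - 1) ** 2 * mu ** 2 + c ** 2 * sigma ** 2 / n
print(empirical_risk, theoretical_risk)  # the two values should agree closely
```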
2. Does the estimator with c = 0.5 dominate the estimator when c = 1, or not? You must justify
your answer.
A: When c = 0.5 we have
$$\mathbb{E}\left[(\mu - (1/2)\bar{y})^2\right] = \frac{\mu^2}{4} + \frac{\sigma^2}{4n} = \frac{1}{4}\left(\mu^2 + \sigma^2/n\right)$$
and when c = 1 we have $\mathbb{E}[(\mu - \bar{y})^2] = \sigma^2/n$. Therefore, when µ = 0 the risk for c = 0.5 is σ²/(4n), which is smaller than σ²/n (for c = 1); this tells us that c = 1 does not dominate c = 0.5. To show that c = 0.5 does not dominate c = 1, we just note that the risk for c = 0.5 is unbounded as |µ| → ∞, while the risk for c = 1 is constant in µ. Therefore, neither estimator dominates the other.
3. If c ≡ cn is allowed to depend on n (i.e., c is a function of the sample size and is not a constant function such as cn = 1), choose a function such that the squared error risk of $\hat{\mu}_{c_n}(y)$ goes to zero as n → ∞.
A: The variance goes to zero as n → ∞, so we only need the bias to disappear. The bias term of the squared error risk is
$$(c-1)^2\mu^2$$
so any function of n that goes to one as n → ∞ will do; for example, cn = n/(n + 1).
4. A random variable Y ∈ {0, 1, 2, . . . } is said to follow a geometric distribution with mean µ if
$$\mathbb{P}(Y = y \,|\, \mu) = \mu^y (\mu+1)^{-y-1}.$$
If a random variable Y follows a geometric distribution with mean µ then E [Y ] = µ. Derive the
Fisher information Jn(µ) for µ for the geometric distribution. Remember that observations from
a geometric distribution are independent and identically distributed.
A: Begin by taking the negative log-likelihood of the geometric pmf:
$$-\log p(y \,|\, \mu) = -y\log\mu + (y+1)\log(\mu+1)$$
Now we need to differentiate this twice with respect to µ:
$$\begin{aligned} \frac{d^2}{d\mu^2}\left\{-\log p(y\,|\,\mu)\right\} &= \frac{d^2}{d\mu^2}\left\{-y\log\mu\right\} + \frac{d^2}{d\mu^2}\left\{(y+1)\log(\mu+1)\right\} \\ &= \frac{d}{d\mu}\left\{-\frac{y}{\mu}\right\} + \frac{d}{d\mu}\left\{\frac{y+1}{\mu+1}\right\} \\ &= \frac{y}{\mu^2} - \frac{y+1}{(\mu+1)^2} \end{aligned}$$
Finally we take expectations and simplify:
$$\begin{aligned} J_1(\mu) &= \mathbb{E}\left[\frac{y}{\mu^2} - \frac{y+1}{(\mu+1)^2}\right] \\ &= \frac{\mathbb{E}[y]}{\mu^2} - \frac{\mathbb{E}[y]+1}{(\mu+1)^2} \\ &= \frac{\mu}{\mu^2} - \frac{\mu+1}{(\mu+1)^2} \\ &= \frac{1}{\mu} - \frac{1}{\mu+1} \\ &= \frac{\mu+1}{\mu(\mu+1)} - \frac{\mu}{\mu(\mu+1)} \\ &= \frac{1}{\mu(\mu+1)} \end{aligned}$$
and the Fisher information for n data points is just nJ1(µ):
$$J_n(\mu) = \frac{n}{\mu(\mu+1)}$$
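As a sanity check, the expectation above can be approximated by simulation. Here is a minimal sketch, assuming Python with NumPy; note that NumPy's geometric generator counts trials starting from 1, so we subtract 1 to match the support {0, 1, 2, . . . } used here:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0                       # arbitrary illustrative value of the mean
p = 1.0 / (mu + 1.0)           # success probability for this parametrisation

# NumPy's geometric has support {1, 2, ...}; subtracting 1 gives {0, 1, 2, ...}
y = rng.geometric(p, size=1_000_000) - 1

# Average the second derivative of the negative log-likelihood over the sample
J1_empirical = np.mean(y / mu**2 - (y + 1) / (mu + 1)**2)
print(J1_empirical, 1.0 / (mu * (mu + 1)))  # the two should agree closely
```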
3 Bayesian Inference
Consider the beta-Bernoulli Bayesian hierarchy for estimating the probability of success of a series of
binary trials:
$$y_1, \ldots, y_n \,|\, \theta \sim \mathrm{Be}(\theta)$$
$$\theta \,|\, a, b \sim \mathrm{Beta}(a, b)$$
where a and b are hyperparameters. It is known that the posterior distribution of θ, after observing a sample y, is
$$\theta \,|\, y \sim \mathrm{Beta}(a+k,\ b+n-k)$$
where $k = \sum_{i=1}^{n} y_i$. To answer this question the following facts regarding a beta distribution will be useful:
Fact (I) if X ∼ Beta(α, β), then
$$\mathbb{E}[X\,|\,\alpha,\beta] = \frac{\alpha}{\alpha+\beta}, \qquad \mathbb{V}[X\,|\,\alpha,\beta] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$
Fact (II) if X ∼ Beta(α, β) and α > 2, β > 2 then X approximately follows
$$X \sim N(\mathbb{E}[X\,|\,\alpha,\beta],\ \mathbb{V}[X\,|\,\alpha,\beta])$$
Using these facts, please answer the following questions:
1. In Bayesian inference, what is a hyperparameter? How is it different from a model parameter?
A: A hyperparameter is a parameter of the prior distributions that allows us to set or control
our prior beliefs about the model parameters, before we see the data. A model parameter is a
parameter of the distribution we are using to model the data, i.e., it is the thing we are trying
to estimate/infer from the data.
2. How can we interpret the hyperparameters a and b in the above hierarchy?
A: We can interpret the hyperparameter a as the number of fake heads we have observed before
seeing our data.
We can interpret the hyperparameter b as the number of fake tails we have observed before seeing
our data.
3. What is the posterior mean of θ?
A: Using Fact (I) and setting α = a+ k and β = b+ n− k we have
$$\mathbb{E}[\theta\,|\,y] = \frac{a+k}{a+k+b+n-k} = \frac{a+k}{n+a+b}$$
4. What happens to the posterior mean as k →∞ and n→∞, with k/n fixed to a constant value
θ0?
A: If the above occurs, then we have
$$\lim_{k,n\to\infty}\left\{\frac{a+k}{n+a+b}\right\} = \theta_0$$
i.e., the posterior mean becomes equal to the sample mean k/n.
Imagine that we work for a company that manufactures components. The company collects information on failures of the component in a three-month period from several consumers, and finds that out of 16 components, 2 failed. You are asked to use a Bernoulli distribution to model the failure rate,
θ, of this part, and choose to use the beta-Bernoulli Bayesian hierarchy described on the previous page
to analyse the data.
5. After consulting with the engineers that designed the component, you are told that during
development and testing, 2 out of 10 of the prototype components tested failed during use. Use
this information to choose your prior distribution and find the posterior distribution for θ given
the data collected from the three month testing period described above.
A: First set prior hyperparameters, a = 2 and b = 10 − 2 = 8, as this captures the number of
prototype components that failed and the number of prototype components that did not fail,
during the testing phase. The data we have collected since had k = 2 out of n = 16 components
fail. So the posterior distribution is
$$\theta \,|\, y \sim \mathrm{Beta}(4, 22)$$
6. What is the posterior mean estimate of θ for this experiment?
A: The posterior mean is
$$\mathbb{E}[\theta\,|\,y] = \frac{2+2}{16+2+8} = \frac{4}{26} = \frac{2}{13} \approx 0.1538$$
7. Provide an (approximate) 95% credible interval for θ.
A: Here we use Fact (II). We already calculated the posterior mean; from Fact (I) the posterior
variance is
$$\mathbb{V}[\theta\,|\,y] = \frac{4\times 22}{(4+22)^2(4+22+1)} \approx 0.004821$$
so using the “plus-minus 1.96” rule for 95% intervals of normal distributions, i.e., that if X ∼
N(µ, σ2), then
$$\mathbb{P}(X \in (\mu - 1.96\sigma,\ \mu + 1.96\sigma)) \approx 0.95$$
and we have
$$\mathrm{CI} \approx \left(0.1538 - 1.96\sqrt{0.004821},\ 0.1538 + 1.96\sqrt{0.004821}\right) \approx (0.0177,\ 0.290)$$
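These numbers are straightforward to reproduce. Below is a minimal sketch, assuming Python with SciPy, that computes the posterior mean, the normal-approximation interval used above and, for comparison, the exact equal-tailed interval from the beta quantiles:

```python
from scipy import stats

a, b, k, n = 2, 8, 2, 16             # prior pseudo-counts and observed data
post = stats.beta(a + k, b + n - k)  # posterior is Beta(4, 22)

mean, var = post.mean(), post.var()
print(mean, var)                     # ~0.1538 and ~0.004821

# Approximate 95% credible interval via Fact (II) (normal approximation)
sd = var ** 0.5
print(mean - 1.96 * sd, mean + 1.96 * sd)

# Exact 95% equal-tailed credible interval from the beta quantiles
print(post.ppf(0.025), post.ppf(0.975))
```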
4 Shrinkage and Regression
[Figure: three panels, (a) Estimator A, (b) Estimator B and (c) Estimator C; each plots the shrinkage estimator together with the "Least squares" line, over the range −3 to 3 on both axes.]
Figure 1: Three shrinkage estimators of µ as functions of the observation yj .
1. Imagine we observe an observation yj ∼ N(µ, 1). The least squares estimate of µ using this
sample is µˆ(yj) = yj . Figure 1 shows the shrinkage profile of three different estimators of µ that
we could construct using the single sample yj . For each subfigure, identify the method (ridge,
lasso or thresholding), and briefly explain how the shrinkage estimator works in comparison to
the least-squares estimate.
(a) Figure 1(a)
A: Lasso estimator. It works by translating the least-squares estimate towards the origin by a fixed amount, and setting the result to zero if it changes sign.
(b) Figure 1(b)
A: This is the thresholding estimator. It works by either leaving the least-squares estimates
untouched if they exceed a threshold, or setting them to be equal to zero if they are below
the threshold.
(c) Figure 1(c)
A: Ridge estimator. This works by scaling the least-squares estimates by a constant value
less than one.
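The three shrinkage profiles in Figure 1 can be written down directly. Below is a minimal sketch, assuming Python with NumPy; the penalty lam and threshold t are illustrative hyperparameter values:

```python
import numpy as np

def lasso_shrink(y, lam):
    # Soft thresholding: translate towards the origin by lam, and set
    # the result to zero if it would change sign (Estimator A)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def hard_threshold(y, t):
    # Thresholding: leave the least-squares estimate untouched if it
    # exceeds the threshold, otherwise set it to zero (Estimator B)
    return np.where(np.abs(y) > t, y, 0.0)

def ridge_shrink(y, lam):
    # Ridge: scale the least-squares estimate by a constant less than one
    # (Estimator C)
    return y / (1.0 + lam)

y = np.linspace(-3, 3, 7)  # grid of least-squares estimates
print(lasso_shrink(y, 1.0))
print(hard_threshold(y, 1.0))
print(ridge_shrink(y, 1.0))
```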
2. In Bayesian analysis of regression models we need to put prior distributions on the regression coefficients. Explain several properties that a good prior distribution for regression coefficients should have.
A: There are several properties a good prior distribution should have:
(a) It should be centered at β = 0, so that the expected prior guess is “no association”
(b) It should be symmetric, so that there is no a priori preference for negative or positive
coefficients.
(c) The prior probability should tail away as |β| → ∞, to express a preference for smaller
coefficients (shrinkage)
(d) The scale of the prior should be proportional to the standard deviation of the noise σ, to ensure that the resulting inferences are unaffected by changes of scale of our target variable.
3. Using a Laplace prior distribution for the regression coefficients is equivalent to what non-
Bayesian regression procedure?
A: This is equivalent to the lasso estimator.
4. Imagine we are trying to fit a high dimensional linear regression model. Why is the co-ordinate-wise descent algorithm particularly appropriate for fitting regression models with a very large number of predictors?
A: The co-ordinate-wise descent algorithm is appropriate for high dimensional regression prob-
lems because it breaks the problem up into iteratively solving very simple optimisation problems.
It works by iteratively optimising each coordinate of the coefficient vector, holding all the other
coordinates fixed. This reduces the problem to solving a series of one-dimensional optimisation
problems, which is potentially a lot easier than trying to adjust all p coefficients simultaneously.
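To make this concrete, below is a minimal toy sketch of coordinate-wise descent for the lasso objective (1/2)||y − Xβ||² + λ||β||₁, assuming Python with NumPy; the data, penalty and iteration count are illustrative, and this is a sketch rather than a production implementation:

```python
import numpy as np

def soft(u, lam):
    # Soft-thresholding operator used by each one-dimensional update
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_cd(X, y, lam, n_iters=100):
    """Coordinate-wise descent for (1/2)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()                  # residual for the initial beta = 0
    for _ in range(n_iters):
        for j in range(p):
            zj = X[:, j] @ X[:, j]
            # Partial residual: add back coordinate j's own contribution
            rho = X[:, j] @ resid + zj * beta[j]
            new_bj = soft(rho, lam) / zj
            resid += X[:, j] * (beta[j] - new_bj)  # keep residual current
            beta[j] = new_bj
    return beta

# Toy example with a sparse truth: only the first two coefficients non-zero
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ np.array([2.0, -1.0] + [0.0] * 8) + rng.normal(size=200)
print(lasso_cd(X, y, lam=20.0).round(2))
```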
                  β0     SNP1    SNP2    SNP3    SNP4    SNP5
Posterior mean    0.28   0.27   -0.36    0.47   -0.02   -0.90
Posterior s.d.    0.08   0.05    0.05    0.19    0.04    0.19
Bayesian t-stat.   -     5.42   -7.21    2.47   -0.05   -4.73
CI 12.50%         0.18   0.21   -0.41    0.25   -0.07   -1.11
CI 87.50%         0.36   0.33   -0.30    0.68    0.02   -0.67

Table 1: Logistic regression model estimated using Bayesian inference.
Imagine that we have gathered data on corn plants; we have measured their genetic mutations at
five places along their genome (“SNPs”) and the time they took to grow to maturity. We are interested
in finding out which mutations appear to influence the probability that a corn plant will grow rapidly
to maturity, or not. Imagine we use a Bayesian logistic regression to analyse the data. The output of
our Stan run is summarised in Table 1.
5. Which SNPs appear unlikely to be associated with rapid growth to maturity, and why?
A: The 75% credible interval for SNP4 contains "0" (i.e., its two endpoints have different signs), which suggests that this SNP is unlikely to be associated with rapid growth to maturity.
6. Which SNP appears to have the strongest effect on increasing the probability of rapid growth to
maturity, and why?
A: SNP3 has the largest positive posterior mean coefficient (0.47), and its 75% credible interval excludes zero, so it appears to have the strongest effect on increasing the probability of rapid growth. (Note that SNP2 has the largest t-statistic in absolute value, but its coefficient is negative, so it acts to decrease this probability.)
7. Write down the regression equation of the model estimated by our Bayesian regression procedure.
A: The regression equation estimated by our Bayesian procedure (using posterior means as our
best guesses of the coefficients) is
log.odds(rapid.growth) = 0.28 + 0.27 SNP1 − 0.36 SNP2 + 0.47 SNP3 − 0.02 SNP4 − 0.90 SNP5
8. Consider a corn plant with no mutations at SNP1, SNP4 and SNP5, one mutation at SNP2 and two
mutations at SNP3. What are the odds, and what is the probability, of this corn plant growing
rapidly to maturity as predicted by our Bayesian logistic regression model?
A: Plugging in the numbers, we have
$$\text{log.odds} = 0.28 - 0.36 + 2 \times 0.47 = 0.86$$
which means the odds are $e^{0.86} \approx 2.363$, and the probability is
$$\mathbb{P}(\text{rapid.growth}) = \frac{1}{1 + e^{-0.86}} \approx 0.702$$
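A quick sketch of this calculation, assuming Python:

```python
import math

# Posterior-mean coefficients from Table 1
b0, b1, b2, b3, b4, b5 = 0.28, 0.27, -0.36, 0.47, -0.02, -0.90

# Corn plant with one mutation at SNP2, two at SNP3 and none elsewhere
log_odds = b0 + b1 * 0 + b2 * 1 + b3 * 2 + b4 * 0 + b5 * 0
odds = math.exp(log_odds)
prob = 1.0 / (1.0 + math.exp(-log_odds))
print(log_odds, odds, prob)   # 0.86, ~2.363, ~0.702
```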
5 Function Approximation and Nonlinear Models
The following questions relate to function approximation and non-linear modelling.
1. Many methods, such as polynomials and decision trees, are universal approximators based on adding together a number, say K, of basic elements/building blocks.
(a) What does it mean if one universal approximator has a better rate of convergence as K → ∞ than another universal approximator?
A: If a universal approximator has a better rate of convergence as K → ∞, then as the level of approximation error we want to achieve decreases, the number of elements required will grow more slowly than for a method with a worse rate of convergence.
(b) Why is a fast rate of convergence as K →∞ a good property to have? Justify your answer
in terms of bias and variance.
A: If a method has a fast rate of convergence as K grows, it means that, roughly speaking, it will require fewer elements to achieve a prescribed level of approximation accuracy (bias). Every extra element/basis we include brings with it extra parameters to estimate, and every extra parameter we estimate increases our variance. So if method A has a faster rate of convergence than method B, method A will incur less estimation error due to variance than method B when both achieve the same level of bias; therefore the overall error of method A will be lower.
2. Spatial inhomogeneity is known to make a function difficult to approximate; what does it mean
if a function is spatially inhomogeneous?
A: A function f(x) is spatially inhomogeneous if its degree of roughness or smoothness varies with x. Essentially, it means that in some parts of the input space the function is rougher or more wiggly than in other parts, where it may be flatter and less varying.
This can cause problems for methods like polynomials which cannot vary their wiggliness across
the input space.
3. The smoothness of a function is another attribute that can affect how easy or difficult it is to learn the function. Please explain, very generally, in what sense a function might be smooth, or non-smooth, and why this can be a problem.
A: Smoothness of a function refers to how rapidly the function changes or varies. For example, a very smooth function will have bounded derivatives, so that the degree of variation is small, while a non-smooth function might have unboundedly large derivatives, non-differentiable points (like the absolute value function), or even discontinuities (like the step function).
This can cause problems for methods like polynomial approximations as they require many
polynomials to adequately approximate non-smooth functions.
10
4. Signal-to-noise ratio is a concept frequently used when the target variable is numeric to describe
how much information is present, relative to the random noise, in a problem. It can also be
extended to classification problems. For a binary classification problem, describe what a high
and low signal-to-noise ratio would imply about the problem.
A: In a binary classification problem, the signal-to-noise ratio refers to how well the best classifier could perform if we knew the truth.
A high signal-to-noise ratio problem would be one like hand-writing recognition; a human could identify the various letters/numbers with almost perfect accuracy, so the best classification algorithm would work extremely well at separating the various classes.
A low signal-to-noise ratio problem is one where even the best possible classifier (based on the features we have) does not achieve a particularly high rate of correct classifications; that is, no decision surface exists that cleanly separates individuals into their particular classes. An example would be breast cancer classification from genomic information, where even given a person's genome the current classification accuracy into people who will and won't develop breast cancer is around 60%.
5. Additive models are powerful tools for regression.
(a) What is the key difference between an additive model and a linear model?
A: A linear regression models the expected value of an individual’s target as a weighted
linear combination of their features, plus an intercept.
An additive model models the expected value of the individual’s target as a sum of smooth,
univariate non-linear functions of the individual’s features, plus an intercept.
Therefore, an additive model can capture non-linear relationships between predictors and
targets, and the linear model cannot.
(b) What is the key assumption underlying an additive model?
A: The key assumption is that the effects of the features on the target are independent of each other, in the sense that the change in E[y] given a change of feature xj from xj = x to xj = x + δ is the same regardless of the values of the other features and depends only on the values x and x + δ; i.e., the additive model assumes there are no interactions between the predictors.
(c) Write down the general model equation for an additive model relating p predictors x1, . . . , xp
to a target y. Assume y is real-valued and follows a Gaussian distribution with variance σ2.
A: The general model equation is
$$y = \beta_0 + \sum_{j=1}^{p} f_j(x_j) + \varepsilon$$
where ε ∼ N(0, σ2) and fj(·) are potentially non-linear functions of the predictors.
Predicting with an additive model then simply involves evaluating each function fj(·) at the appropriate value of xj and summing these together, plus the intercept (a short code sketch illustrating this is given at the end of this question).
(d) If our target y was a non-negative integer (i.e., a count), how could we modify our additive
model so that we could model our target using a Poisson regression?
A: We can use the standard technique in Poisson regression, and say that
$$\log \lambda = \beta_0 + \sum_{j=1}^{p} f_j(x_j) \qquad \text{and} \qquad y \,|\, \lambda \sim \mathrm{Poi}(\lambda)$$
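As a concrete illustration of the additive model equation in part (c), one simple way to realise an additive model is to represent each fj with a small basis expansion and fit the whole model by least squares. Below is a minimal sketch, assuming Python with NumPy; the cubic polynomial bases and the simulated data are purely illustrative (real additive model software typically uses penalised splines instead):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.uniform(-2, 2, size=(n, p))
# Simulated data from an additive truth: f1 = sin, f2 = quadratic, f3 = 0
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, size=n)

def design(X):
    # Intercept column plus a cubic polynomial basis (x, x^2, x^3) for
    # each feature; each block plays the role of one f_j
    cols = [np.ones((X.shape[0], 1))]
    for j in range(X.shape[1]):
        xj = X[:, j:j + 1]
        cols.append(np.hstack([xj, xj ** 2, xj ** 3]))
    return np.hstack(cols)

coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)

# Predict by evaluating each fitted f_j and summing, plus the intercept
y_hat = design(X) @ coef
print(np.mean((y - y_hat) ** 2))  # in-sample MSE of the additive fit
```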
6 Neural Networks
The following questions involve neural networks.
1. Polynomial regressions and neural networks both work by adding together a linear combination of basis functions (polynomial terms for polynomial regression, neurons for neural networks). In what way do the basis functions of a neural network differ from those used in a polynomial regression, and what advantage does this difference confer on the neural network?
A: In a polynomial regression, the basis functions are specified as-is without any extra, tunable parameters, i.e., x², x³, x⁴, etc.
In a neural network, the basis functions (neurons) have tunable parameters that can be used to
change their behaviour (i.e., their position/width, etc.)
The advantage this confers is that a smaller number of basis functions is generally needed to approximate a target function well if they are tunable than if they are not; this means fewer parameters and smaller variance for the same bias.
2. In 1992, A. Barron proved the following result regarding the squared-error risk (MSE) obtained
when using a single layer neural network with M neurons to approximate a true function f∗:
$$\mathrm{MSE}(f^*, \hat{f}_M) = O\left(\frac{1}{M}\right) + O\left(\frac{pM}{n}\right)\log n$$
where p is the number of predictors, fˆM denotes a fitted neural network and n is the size of the
data sample we used to train the neural network.
(a) What happens to the MSE as the sample size increases, if all other variables are held
constant?
A: As n → ∞, if all other parameters are held constant, the second term vanishes but the first term (the bias) remains constant.
(b) What happens to the MSE as the number of predictors increases, if all other variables are
held constant?
A: As p→∞, if all other parameters are held constant, the first term is constant and the
second term (the variance) tends to infinity. Therefore, the MSE tends to infinity.
(c) Imagine we let M be a function of the sample size, i.e., we use more neurons as our sample
size increases. Can we choose M to grow in such a way that ensures that the MSE goes to
zero as n→∞?
A: Yes; if we choose Mn to be a function of n that increases with n, but more slowly than linearly in n (for example, Mn = log n), then as n → ∞ both the first term and the second term will tend to zero.
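To illustrate part (c) numerically, we can evaluate the two terms of the bound, dropping the unknown constants, with the choice Mn = log n as n grows. A minimal sketch assuming Python with NumPy (p is held at an arbitrary illustrative value):

```python
import numpy as np

p = 10  # number of predictors, held constant
for n in [10**3, 10**4, 10**5, 10**6]:
    M = np.log(n)                       # M_n grows more slowly than n
    bias_term = 1.0 / M                 # the O(1/M) term
    var_term = (p * M / n) * np.log(n)  # the O(pM/n) log n term
    print(n, bias_term, var_term)       # both terms shrink towards zero
```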
7 Big Data and Optimisation
1. Big p learning refers to the problem of statistical/machine learning when the number of predictors
is large. Briefly describe below the three main problems that can occur when the number of predictors p gets very large:
(a) Problem 1:
A: Boundary effects. Points in predictor space that lie on the “boundary” of our set of
samples are more difficult to deal with, as we have less information about the neighbourhood
in which the points are located. As p→∞, all points in our sample lie on the boundary of
our sample space, which can make learning in high dimensions more challenging.
(b) Problem 2:
A: Exponential increase in basis functions. For many basic standard non-linear approximation methods, such as polynomial regressions, the number of terms required to achieve a good degree of approximation accuracy of an arbitrary non-linear p-dimensional function grows exponentially with the dimension p. This means that for even moderate dimensions there are potentially millions of possible basis functions we need to construct, which becomes impossible to work with.
(c) Problem 3:
A: Discovering true effects becomes much more difficult. As the number of potentially associated predictors increases, it becomes increasingly difficult to identify which predictors are important and which ones are not. Predictors with effect sizes that are below
the threshold of identifiability are very likely to be lost amongst the large numbers of noisy,
unassociated predictors.
2. The gradient descent algorithm is a simple, general and powerful tool for finding values of a
parameter vector θ that minimise a function g(θ).
(a) Describe the gradient descent algorithm.
Tip: When asked to “describe” an algorithm, it is generally a good idea to both summarise
the algorithm in pseudo-code, as well as describe its motivation/behaviour/properties.
A: The gradient descent algorithm is a simple and general algorithm for finding the value of a parameter vector θ that minimises a function, say g(θ). It works by starting with some initial guess θ̂, and iteratively updating the current best guess by moving in the negative direction of the gradient of g(·) evaluated at θ̂, multiplied by some step-size κ. The idea is that if the step-size is small enough, then moving in the opposite direction of the gradient will decrease the value of g(·) at each step. The gradient descent algorithm is very general and widely applicable; it is appropriate for big p problems and is often applied to models like neural networks.
More formally, the algorithm is: given a step-size κ > 0 and a tolerance ε > 0,
i. Initialise θ̂ with some starting guess.
ii. θ′ ← θ̂
iii. Update guess: θ̂ ← θ̂ − κ g
iv. Check for convergence; if ||θ′ − θ̂|| > ε, go to Step ii
where
$$g_j = \left.\frac{\partial g(\theta)}{\partial \theta_j}\right|_{\theta=\hat{\theta}}$$
are the gradients. (A code sketch of this procedure is given at the end of this question.)
(b) What is one limitation of the gradient descent algorithm?
A: There are several reasonable answers here:
• The algorithm is only guaranteed to converge to a local minimum (i.e., it may not find the overall best value of θ that minimises g(θ)).
• The algorithm is very sensitive to the choice of κ, and may not converge if κ is too large.
• The algorithm does not work if the data sample is too large to be loaded into memory, and needs to be modified (e.g., stochastic gradient descent).
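A minimal sketch of the gradient descent algorithm in Python, assuming NumPy; the quadratic objective, step-size and tolerance are illustrative choices:

```python
import numpy as np

def gradient_descent(grad, theta0, kappa=0.1, eps=1e-8, max_iters=10_000):
    """Minimise a function given its gradient, using a fixed step-size kappa."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        theta_old = theta.copy()
        theta = theta - kappa * grad(theta)        # move against the gradient
        if np.linalg.norm(theta_old - theta) <= eps:
            break                                  # converged
    return theta

# Illustrative objective g(theta) = ||theta - a||^2, with gradient 2(theta - a)
a = np.array([1.0, -2.0, 3.0])
print(gradient_descent(lambda th: 2.0 * (th - a), np.zeros(3)))  # approx a
```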
8 Appendix: Reference Sheet
|z| P(Z < −|z|) P(Z < |z|) |z| P(Z < −|z|) P(Z < |z|)
0.000 0.500000 0.500000 2.047 0.020353 0.979647
0.093 0.462943 0.537057 2.140 0.016196 0.983804
0.186 0.426204 0.573796 2.233 0.012789 0.987211
0.279 0.390096 0.609904 2.326 0.010020 0.989980
0.372 0.354912 0.645088 2.419 0.007790 0.992210
0.465 0.320924 0.679076 2.512 0.006009 0.993991
0.558 0.288375 0.711625 2.605 0.004598 0.995402
0.651 0.257471 0.742529 2.698 0.003491 0.996509
0.744 0.228382 0.771618 2.791 0.002630 0.997370
0.837 0.201237 0.798763 2.884 0.001965 0.998035
0.930 0.176125 0.823875 2.977 0.001457 0.998543
1.023 0.153093 0.846907 3.070 0.001071 0.998929
1.116 0.132151 0.867849 3.163 0.000781 0.999219
1.209 0.113273 0.886727 3.256 0.000565 0.999435
1.302 0.096403 0.903597 3.349 0.000406 0.999594
1.395 0.081455 0.918545 3.442 0.000289 0.999711
1.488 0.068326 0.931674 3.535 0.000204 0.999796
1.581 0.056894 0.943106 3.628 0.000143 0.999857
1.674 0.047024 0.952976 3.721 0.000099 0.999901
1.767 0.038577 0.961423 3.814 0.000068 0.999932
1.860 0.031410 0.968590 3.907 0.000047 0.999953
1.953 0.025381 0.974619 > 4.000 < 0.000032 > 0.999968
Table 2: Cumulative Distribution Function for the Standard Normal Distribution Z ∼ N(0, 1)
Differentiation
$$\frac{d}{dx}\{a\,f(x)\} = a\,\frac{d}{dx}\{f(x)\}, \qquad \frac{d}{dx}\left\{x^k\right\} = kx^{k-1}$$
$$\frac{d}{dx}\{\log x\} = \frac{1}{x}, \qquad \text{Chain rule:}\quad \frac{d}{dx}\{f(g(x))\} = \frac{d}{d\,g(x)}\{f(g(x))\} \cdot \frac{d}{dx}\{g(x)\}$$