
Project A
So Far:
Estimation and Detection: based on a given model, combine knowledge from different sources
Learning models from data: parameter estimation, ML, and divergence minimization
Choice of criteria and parameterization
Neural networks as one way to do learning and inference together
Specific non-linear units
Limited set of forms for the feature functions
Choice of loss metric for training and for testing
Last Layer for Classification
Dense(input_dim = k, output_dim = C_y, activation = softmax)
Notation: C_y = |Y| = M, with Y = \{1, 2, \ldots, M\}.
Softmax output (Picture):
Output(i) = Q_{y|x}(i \mid x) = \frac{\exp\left(\sum_{j=1}^{k} W_{ij} f_j(x) + b_i\right)}{\sum_{i' \in Y} \exp\left(\sum_{j=1}^{k} W_{i'j} f_j(x) + b_{i'}\right)}, which we want to match P_{y|x}(i \mid x).
Denominator is a normalization: the sum over i' \in Y.
Looks like an exponential family:
Ignoring the normalization, for fixed y = i, Q_{x|y}(\cdot \mid i) \propto Q_x(\cdot) \cdot \exp\left(\sum_{j=1}^{k} W_{ij} f_j(\cdot)\right)
For fixed x, Q_{y|x}(\cdot \mid x) \propto \exp\left(\sum_{j=1}^{k} f_j(x) \cdot W_j(\cdot)\right)
Not only looking for features of x, but also features of y, except the features are represented differently.
Recall that exponential families are particularly easy for ML estimation.
Cross Entropy Loss
Find the ML fit of the discriminative part of the model:
(W, f)^* = \arg\min_{W, f} \sum_x \hat{P}_x(x) \, D\left(\hat{P}_{y|x}(\cdot \mid x) \,\middle\|\, Q^{(W,f)}_{y|x}(\cdot \mid x)\right)
         = \arg\max_{W, f} \sum_x \hat{P}_x(x) \sum_y \hat{P}_{y|x}(y \mid x) \log Q^{(W,f)}_{y|x}(y \mid x)
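As a sanity check, here is a minimal numpy sketch (the helper names and toy dimensions are illustrative, not part of the project code) of the softmax output and of the cross entropy loss as the empirical negative log-likelihood that the arg max above maximizes:

import numpy as np

def softmax_output(W, b, f_x):
    # Q_{y|x}(. | x): softmax over the scores W f(x) + b
    scores = W @ f_x + b
    scores = scores - scores.max()          # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def cross_entropy(W, b, F, Y):
    # average of -log Q_{y|x}(y|x): the empirical negative log-likelihood
    losses = [-np.log(softmax_output(W, b, f_x)[y]) for f_x, y in zip(F, Y)]
    return np.mean(losses)

# toy dimensions: k = 3 features, M = 4 classes, 5 samples
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
F, Y = rng.normal(size=(5, 3)), rng.integers(0, 4, size=5)
print(cross_entropy(W, b, F, Y))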
Our Task
import numpy as np

def generateSamples(nSamples, nDim, nClasses, nClusters):
    '''
    Generate samples as Normal from (nClasses * nClusters) clusters;
    assume each cluster has the same variance for now.

    return X : np.array of size [nSamples, nDim]
               each sample generated as an nDim Normal, belonging to a cluster
           Y : np.array of size [nSamples]
               only gives the class label for each sample.
    '''
    # class label for each sample, then a cluster index within that class
    Y = np.random.choice(nClasses, [nSamples])
    Z = np.random.choice(nClusters, [nSamples]) + Y * nClusters

    # one random center per cluster, then look up each sample's center
    cluster_centers = np.random.normal(0, 10, [nClusters * nClasses, nDim])
    centers = cluster_centers[Z, :]

    # N(0, 1) noise around the chosen cluster center
    X = np.random.normal(0, 1, [nSamples, nDim]) + centers

    return X, Y
Technical part: use matrix operations (a quick usage sketch follows this list).
Mixture of Gaussians: each class has multiple clusters.
Change the number of classes and clusters to change the complexity.
More clusters per class makes the problem more non-linear.
We might want to change the shape of the clusters (N(0, 1) in the example); any known shape can be included.
All the parameters can be known, partially known, or not known at all, changing the solutions.
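A quick usage sketch of generateSamples (the sample counts and the matplotlib calls are just for illustration); in two dimensions the class and cluster structure can be seen directly:

import matplotlib.pyplot as plt

# 2-D example so the clusters can be plotted directly
X, Y = generateSamples(nSamples=2000, nDim=2, nClasses=3, nClusters=4)

plt.scatter(X[:, 0], X[:, 1], c=Y, s=5, cmap='tab10')
plt.title('3 classes, 4 clusters per class')
plt.show()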
Solve This Numerically
Keep in mind that NNs are just one type of learning solution; the general solution should have a structure for parameterization, criteria for training and testing, and numerical procedures. Every element should be tailored to the specific problem.
NNs are particularly designed for
a very large number of parameters
a relatively abundant supply of samples
pretty much no structural knowledge
settings where we don't mind rather inefficient computation
NNs are made for Gradient Descent
Convex Optimization - Boyd and Vandenberghe
A MOOC on convex optimization, CVX101, was run from 1/21/14 to
3/14/14. If you register for it, you can access all the course
materials. More material can be found at the web sites for EE364A
http://web.stanford.edu/~boyd/cvxbook/
Stochastic Gradient Descent and Back Propagation
Example:
Want to solve \arg\min_\theta f(\theta), where f(\theta) = a (\theta - \theta_0)^2 + b.
If we can see the formula: \theta^* = \theta_0.
If not:
1. Randomly pick a \theta.
2. Evaluate f(\theta) and f(\theta + \Delta):
f(\theta + \Delta) < f(\theta): \theta \leftarrow \theta + \Delta
f(\theta + \Delta) > f(\theta): \theta \leftarrow \theta - \Delta
f(\theta + \Delta) = f(\theta): keep \theta; stop?
Need to evaluate f(\cdot) cheaply.
In neural networks, f(\cdot) is the loss function in training (the cross entropy loss L(x, y, \theta = w) for a fixed input x and label y); a forward calculation evaluates it.
How to choose the step size \Delta?
Too small: slow convergence, many iterations.
Too large: zigzag around the optimum.
Solution: use the gradient, so steps are bigger when far away from the optimum:
\Delta \propto -\frac{d f(\theta)}{d\theta}, \qquad \Delta = -\delta \cdot \frac{d f(\theta)}{d\theta}
How do we choose \delta? Adaptive learning rate (a small sketch of the effect of \delta follows below).
What if f is not quadratic?
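A minimal sketch of the update \Delta = -\delta \cdot df(\theta)/d\theta on the quadratic example above; the constants are arbitrary, chosen only to show the too-small / well-chosen / too-large behavior of \delta:

a, theta0, b = 1.0, 3.0, 2.0
f  = lambda th: a * (th - theta0) ** 2 + b   # f(theta) = a (theta - theta0)^2 + b, for reference
df = lambda th: 2 * a * (th - theta0)        # exact derivative of the quadratic

def descend(delta, steps=20, theta=0.0):
    for _ in range(steps):
        theta = theta - delta * df(theta)    # theta <- theta - delta * df/dtheta
    return theta

print(descend(delta=0.01))   # too small: still far from theta0 = 3 after 20 steps
print(descend(delta=0.5))    # well chosen: lands on theta0
print(descend(delta=1.05))   # too large: overshoots and the zigzag grows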
Example: Multiple Parameters
f(\theta_1, \theta_2) = a_1 (\theta_1 - c_1)^2 + a_2 (\theta_2 - c_2)^2
To minimize f(\theta_1, \theta_2), we can separately optimize the individual parameters.
Plot this with matplotlib: contour, surface.
Not always separable:
Make them separable (Newton's method),
or not.
Idea: fix one parameter, try to update the other (a numerical sketch follows below).
Fix \theta_2 = \check{\theta}_2, solve \min_{\theta_1} f(\theta_1, \check{\theta}_2).
Update: \theta_1 \leftarrow \theta_1 - \delta \cdot \frac{\partial}{\partial \theta_1} f(\theta_1, \check{\theta}_2)
Then reverse direction: fix \check{\theta}_1, update \theta_2 \leftarrow \theta_2 - \delta \cdot \frac{\partial}{\partial \theta_2} f(\check{\theta}_1, \theta_2)
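A sketch of the alternating updates on the separable quadratic; the constants a1, c1, a2, c2 and the starting point are made-up values, and the last lines follow the matplotlib contour suggestion above:

import numpy as np
import matplotlib.pyplot as plt

a1, c1, a2, c2 = 1.0, 2.0, 4.0, -1.0
f = lambda t1, t2: a1 * (t1 - c1) ** 2 + a2 * (t2 - c2) ** 2

delta = 0.1
t1, t2 = 5.0, 5.0                              # arbitrary starting point
for _ in range(50):
    t1 = t1 - delta * 2 * a1 * (t1 - c1)       # fix theta_2, gradient step on theta_1
    t2 = t2 - delta * 2 * a2 * (t2 - c2)       # fix theta_1, gradient step on theta_2
print(t1, t2)                                  # converges to (c1, c2) = (2, -1)

# contour plot of f(theta_1, theta_2)
g1, g2 = np.meshgrid(np.linspace(-2, 6, 100), np.linspace(-5, 7, 100))
plt.contour(g1, g2, f(g1, g2), levels=20)
plt.show()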
Gradient:
\nabla_\theta f = \left[ \frac{\partial}{\partial \theta_1} f(\theta_1, \theta_2), \; \frac{\partial}{\partial \theta_2} f(\theta_1, \theta_2) \right]^T
Algorithm
1. Randomly pick \theta = [\theta_1, \theta_2]^T
2. Update \theta \leftarrow \theta - \delta \cdot \nabla_\theta f
Discussions: Is This a Good Idea?
It does stop at the optimum.
The learning rate needs to be chosen carefully.
It hopes everything is roughly quadratic; it cannot avoid local optima.
Key advantage: generality
Doesn't have to be the precise gradient
Doesn't have to use the perfect step size
Operations in Neural Networks
Classification: y \in \{1, \ldots, |Y|\}.
Last layer input: f_1(x), \ldots, f_k(x).
Last layer weights: g_i(y), \; i = 1, \ldots, k; \; y \in Y.
Softmax activation:
\hat{P}^{(f,g)}_{y|x}(y \mid x) = \frac{\exp\left[\sum_{i=1}^{k} f_i(x) \cdot g_i(y) + b(y)\right]}{\sum_{y'} \exp\left[\sum_{i=1}^{k} f_i(x) \cdot g_i(y') + b(y')\right]}, \quad y \in Y
Cross-Entropy Loss, ML for the discriminative model:
(f, g)^* = \arg\min_{f, g} D\left(\check{P}_x \cdot \check{P}_{y|x} \,\middle\|\, \check{P}_x \cdot \hat{P}^{(f,g)}_{y|x}\right)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

model = Sequential()
model.add(...)  # earlier layers that compute the features f_1(x), ..., f_k(x)
model.add(Dense(yCard, activation='softmax', input_dim=k))
# lr = 4 and the decay argument follow the older Keras SGD signature
sgd = SGD(4, decay=1e-2, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)
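Note that categorical_crossentropy expects one-hot labels. One possible way to train on the generated data, once the elided layers above are filled in (to_categorical, the sample counts, and the fit arguments are illustrative choices, not requirements):

from tensorflow.keras.utils import to_categorical

X, Y = generateSamples(nSamples=5000, nDim=2, nClasses=3, nClusters=4)
Y_onehot = to_categorical(Y, num_classes=3)     # one-hot labels for categorical_crossentropy

model.fit(X, Y_onehot, epochs=20, batch_size=32, validation_split=0.2)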
Back Propagation SGD in Neural Networks
Objective function:
\min_\theta \hat{E}\left[ -\log Q_{y|x}(y \mid x; \theta) \right]
Need to compute the gradient:
\nabla_\theta \hat{E}\left[ -\log Q_{y|x}(y \mid x; \theta) \right]
Back Propagation (need an animation): the chain rule, e.g. for z = g(y), y = f(x),
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} = g'(f(x)) \cdot f'(x)
Expectation over P_{xy}, \hat{P}_{xy}, or mini-batches: Stochastic Gradient Descent.
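A tiny numerical check of the chain rule that back propagation applies layer by layer; the choice of f and g here is arbitrary (tanh followed by a square), just to verify dz/dx = g'(f(x)) * f'(x):

import numpy as np

f  = np.tanh                          # y = f(x)
df = lambda x: 1.0 - np.tanh(x) ** 2  # f'(x)
g  = lambda y: y ** 2                 # z = g(y)
dg = lambda y: 2.0 * y                # g'(y)

x = 0.7
analytic = dg(f(x)) * df(x)           # chain rule: dz/dx = g'(f(x)) * f'(x)
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)   # central-difference estimate
print(analytic, numeric)              # the two values agree to high precision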
Beyond Neural Networks
Need to understand what is good about neural networks
Requires very little knowledge/structure
Applicable to very broad range of problems
Repetition of simple units
Scalable
Best effort based
Behind the calculations
Feature functions in learning and inference
Distribution matching
Numerical optimization
Other problems
Where we know some structures
Where we can afford specialized computation modules
Where we need guarantees in performance and in resources
Where we need insights
Your project:
Known form of feature functions: f_i(x; \theta_i) = b(x; \theta_i), with \theta_i the center.
Linear combination: F(x; (w, \theta)) = \sum_i w_i \cdot f_i(x; \theta_i)
Softmax: Q_{y|x}(y \mid x) = \frac{e^{\pm F(x)}}{e^{F(x)} + e^{-F(x)}}
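One possible reading of these components in code; the Gaussian-bump form of b(x; theta), the radius r, and the ±1 label convention are assumptions for illustration, not a specification:

import numpy as np

def b(x, center, r=1.0):
    # radius-based basis function, large when x is near its center
    return np.exp(-np.sum((x - center) ** 2) / (2 * r ** 2))

def F(x, w, centers, r=1.0):
    # linear combination F(x; (w, theta)) = sum_i w_i * f_i(x; theta_i)
    return sum(w_i * b(x, c_i, r) for w_i, c_i in zip(w, centers))

def Q_y_given_x(y, x, w, centers):
    # binary softmax: Q(y|x) = exp(y F(x)) / (exp(F(x)) + exp(-F(x))), y in {-1, +1}
    Fx = F(x, w, centers)
    return np.exp(y * Fx) / (np.exp(Fx) + np.exp(-Fx))

# toy check in 2-D with two centers
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
w = np.array([1.0, -1.0])
x = np.array([0.2, -0.1])
print(Q_y_given_x(+1, x, w, centers) + Q_y_given_x(-1, x, w, centers))  # sums to 1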
Difference: given a current estimate of F(\cdot; (w, \theta)), how do we update each basis function and the corresponding weights?
Variations
clusters with different radii
two groups, each with a known radius, and a known total number of clusters
each cluster chooses its radius i.i.d. with a certain probability
clusters with different shapes
regularity in clusters
Each one would result in a different way of updating the feature functions.
Some Specific Components
Base functions:
radius based
can change later (square, grid, ...)
Measure how well a sample fits a base function:
\log F(x; w, \theta) = \log \sum_i w_i \cdot b(x; \theta_i)
Optimize this
Gradient descent?
Isolate individual parameters to optimize
Other optimization techniques (search, Newton...)
Overall: approximate!