COMS 4774 Spring 2021
Scribe: Sebastian Salazar (ss5971)
February 23, 2021
Editors: Qingyang Wu (qw2345), Michael Fagan (mef2224), Jay Pakala (jp4102)

High-dimensional statistics - bounds on random vectors

1 Introduction

So far in the course, tools for dealing with random variables over the real numbers have been studied extensively. However, in many real-world applications, data points are seldom one-dimensional. It is therefore essential to develop tools for dealing with high-dimensional random variables, i.e., random vectors.

2 Random Vectors

Let $\{X_1, X_2, \ldots, X_n\} \subseteq \mathbb{R}^d$ be an i.i.d. sample of random vectors from some distribution with mean $\mathbb{E}[X_1] = \mu \in \mathbb{R}^d$. If we want to estimate $\mu$ using $\{X_1, X_2, \ldots, X_n\}$, it is natural to introduce the unbiased estimator for the mean
\[ \hat\mu = \frac{1}{n} \sum_{i=1}^{n} X_i. \]
Once we have computed $\hat\mu$, it is natural to ask how "close" $\mu$ and $\hat\mu$ are. In one dimension the notion of closeness is well defined (just look at the quantity $|\mu - \hat\mu|$); however, we are dealing with random vectors now, so deciding what it means for two vectors to be "close" to each other is not a straightforward endeavor. Do we consider each of the components individually? Or perhaps we could think of our sample as living in some metric subspace $(\mathcal{X}, d_{\mathcal{X}})$ of $\mathbb{R}^d$. Both are valid ways of thinking about the "closeness" of vectors in $\mathbb{R}^d$. In this section of the notes, however, we will only develop tools for mean estimation in $\mathbb{R}^d$ under the standard Euclidean norm; in other words, we will examine the statistical properties of $\|\mu - \hat\mu\|_2^2$.

2.1 Mean of a Gaussian Random Vector

First, consider the case when $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2 I_d)$ are i.i.d. spherical Gaussians. In this case we know the distribution of sums of these random vectors exactly. In particular, note that
\[ \sum_i X_i \sim \mathcal{N}(n\mu,\, n\sigma^2 I_d), \]
\[ \hat\mu = \frac1n \sum_i X_i \sim \mathcal{N}\Big(\mu,\, \frac{\sigma^2}{n} I_d\Big), \]
\[ \hat\mu - \mu \sim \mathcal{N}\Big(0,\, \frac{\sigma^2}{n} I_d\Big), \]
\[ \frac{\sqrt n}{\sigma}(\hat\mu - \mu) \sim \mathcal{N}(0, I_d). \]
Since each component of $\frac{\sqrt n}{\sigma}(\hat\mu - \mu)$ is an independent standard Gaussian, it follows that
\[ \Big\| \frac{\sqrt n}{\sigma}(\hat\mu - \mu) \Big\|_2^2 \sim \chi^2(d). \]
This is great, since we know Chernoff-type bounds for the $\chi^2(d)$ distribution: for $y \sim \chi^2(d)$,
\[ \mathbb{P}\Big( y \ge d + 2\sqrt{d \log(1/\delta)} + 2\log(1/\delta) \Big) \le \delta, \quad (1) \]
\[ \mathbb{P}\Big( y \le d - 2\sqrt{d \log(1/\delta)} \Big) \le \delta. \quad (2) \]
Taking the union bound of (1) and (2), we obtain (with the definite abuse of notation that $x \in a \pm b$ means $x \in (a - b,\, a + b)$)
\[ \mathbb{P}\Big( \Big\| \frac{\sqrt n}{\sigma}(\hat\mu - \mu) \Big\|_2^2 \in d \pm 2\big(\sqrt{d\log(1/\delta)} + \log(1/\delta)\big) \Big) \ge 1 - 2\delta, \]
\[ \mathbb{P}\Big( \|\hat\mu - \mu\|_2^2 \in \frac{\sigma^2}{n}\Big( d \pm 2\big(\sqrt{d\log(1/\delta)} + \log(1/\delta)\big) \Big) \Big) \ge 1 - 2\delta. \]
That is, with high probability,
\[ \|\hat\mu - \mu\|_2^2 \approx \frac{\sigma^2}{n}\big(d \pm O(\sqrt d)\big), \qquad \|\hat\mu - \mu\|_2 \approx \frac{\sigma}{\sqrt n}\sqrt{d \pm O(\sqrt d)}. \quad (3) \]
For a general, non-spherical Gaussian with covariance $\Sigma$, these bounds apply with the slight modification $\sigma^2 d \to \mathrm{Tr}(\Sigma)$.
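As a sanity check on the bounds above, the following simulation (a minimal sketch, not part of the original notes; the values of d, n, sigma, and delta are arbitrary illustrative choices) draws spherical Gaussian samples and verifies that the rescaled error $\frac{n}{\sigma^2}\|\hat\mu - \mu\|_2^2$ lands in the interval $d \pm 2(\sqrt{d\log(1/\delta)} + \log(1/\delta))$ at least as often as (1) and (2) promise.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, delta = 50, 1000, 2.0, 0.01   # arbitrary illustrative choices
mu = rng.normal(size=d)                    # a fixed "true" mean

trials = 2000
errors = np.empty(trials)
for t in range(trials):
    X = mu + sigma * rng.normal(size=(n, d))              # n i.i.d. N(mu, sigma^2 I_d) samples
    mu_hat = X.mean(axis=0)
    errors[t] = n / sigma**2 * np.sum((mu_hat - mu) ** 2)  # distributed as chi^2(d)

# Interval d +/- 2(sqrt(d log 1/delta) + log 1/delta) from the union bound of (1) and (2)
slack = 2 * (np.sqrt(d * np.log(1 / delta)) + np.log(1 / delta))
inside = np.mean((errors > d - slack) & (errors < d + slack))
print(f"fraction inside interval: {inside:.4f} (guaranteed >= {1 - 2 * delta})")
```

On typical runs the empirical coverage is essentially 1, comfortably above the guaranteed $1 - 2\delta$, reflecting the slack in Chernoff-style bounds.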
2.2 sub-Gaussian random variables

Note: WLOG, all random vectors in this and the subsequent parts of Section 2 are understood to have zero mean. (There is no loss of generality here, since we can always subtract off the mean of a random vector; that is, take $X \to X - \mu$.)

Note that, to characterize the behavior of a random vector, we can always look at its behavior in a particular direction: given some direction $u \in S^{d-1}$, how does the random variable $u^\top X$ behave? To answer such questions, we need to revisit and generalize the notion of sub-Gaussian random variables and introduce sub-Gaussian random vectors. As a starting point, consider the following definitions from [1]:

Definition 1. A random variable $X$ in $\mathbb{R}$ is said to be sub-Gaussian with variance proxy $\sigma^2$ if $\mathbb{E}(X) = 0$ and its moment generating function, denoted $M_X(t)$, satisfies
\[ M_X(t) \le \exp\Big( \frac{\sigma^2 t^2}{2} \Big) \quad \forall t \in \mathbb{R}. \quad (4) \]
This is often written as $X \sim \mathrm{subG}(\sigma^2)$.

Definition 1 is easily generalized to the case where $X$ is a random vector in $\mathbb{R}^d$.

Definition 2. A random vector $X \in \mathbb{R}^d$ is said to be sub-Gaussian with variance proxy $\sigma^2$ if $\mathbb{E}(X) = 0$ and, for any $u \in S^{d-1}$, the random variable $u^\top X$ is sub-Gaussian with variance proxy $\sigma^2$. This is often written as $X \sim \mathrm{subG}_d(\sigma^2)$.

From an intuitive standpoint, sub-Gaussian random variables behave very similarly to Gaussian random variables in the sense that they have very similar concentration properties. To begin formalizing this intuition, consider the following lemma from [1]:

Lemma 1. If $X \sim \mathrm{subG}(\sigma^2)$, then for any $t > 0$,
\[ \mathbb{P}(X > t) \le \exp\Big( -\frac{t^2}{2\sigma^2} \Big), \qquad \mathbb{P}(X < -t) \le \exp\Big( -\frac{t^2}{2\sigma^2} \Big). \]

This indeed confirms our intuition that, at least from a concentration standpoint, sub-Gaussian random variables behave just like Gaussians. In particular, setting the RHS of the inequalities in Lemma 1 equal to some $\delta \in [0, 1]$, we immediately get the following useful corollary:

Corollary 1. If $X \sim \mathrm{subG}(\sigma^2)$, then with probability at least $1 - 2\delta$,
\[ X \in \big[ -\sigma\sqrt{2\log(1/\delta)},\ \sigma\sqrt{2\log(1/\delta)} \big]. \]
The proof is left as an exercise to the reader.

For the more general case of a sum of independent sub-Gaussian random variables, Lemma 1 generalizes in the following way:

Lemma 2. Let $X_1, \ldots, X_n$ be independent random variables with $X_i \sim \mathrm{subG}(\sigma^2)$. Then for any $t > 0$ and any vector $a = (a_1, \ldots, a_n) \in \mathbb{R}^n$,
\[ \mathbb{P}\Big( \sum_{i=1}^n a_i X_i > t \Big) \le \exp\Big( -\frac{t^2}{2\sigma^2 \|a\|_2^2} \Big), \qquad \mathbb{P}\Big( \sum_{i=1}^n a_i X_i < -t \Big) \le \exp\Big( -\frac{t^2}{2\sigma^2 \|a\|_2^2} \Big). \]
For a proof of this result, the reader is referred to [1]. In a similar fashion, the generalization of Corollary 1 is given below:

Corollary 2. In the setting of Lemma 2, with probability at least $1 - 2\delta$,
\[ \sum_{i=1}^n a_i X_i \in \big[ -\sigma\|a\|_2\sqrt{2\log(1/\delta)},\ \sigma\|a\|_2\sqrt{2\log(1/\delta)} \big]. \]
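To make Lemma 2 concrete, here is a minimal sketch (my own, not from the original notes; the weight vector and thresholds are arbitrary) that checks the upper-tail bound for a weighted sum of Rademacher variables. Rademacher variables are sub-Gaussian with variance proxy $\sigma^2 = 1$, since $\mathbb{E}(e^{tX}) = \cosh(t) \le e^{t^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 200, 50_000
a = rng.normal(size=n)             # an arbitrary fixed weight vector
norm_a = np.linalg.norm(a)

# Rademacher variables are subG(1): E e^{tX} = cosh(t) <= e^{t^2/2}
X = rng.choice([-1.0, 1.0], size=(trials, n))
S = X @ a                          # draws of sum_i a_i X_i

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(S > t * norm_a)
    bound = np.exp(-t**2 / 2)      # Lemma 2 with sigma = 1 at threshold t*||a||_2
    print(f"P(sum > {t}*||a||_2) = {empirical:.5f} <= {bound:.5f}")
```

The empirical tail probabilities sit well below the bound, as expected: Lemma 2 holds for every sub-Gaussian distribution simultaneously, so it cannot be tight for any particular one.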
2.3 sub-Gaussian random vectors

Let $u \in S^{d-1}$ and let $X$ be a sub-Gaussian random vector with variance proxy $\sigma^2$. Then, by Definition 2, $u^\top X$ is sub-Gaussian with variance proxy $\sigma^2$. In particular, consider a collection of sub-Gaussian random vectors $\{X_i\}_{i=1}^n$ with variance proxy $\sigma^2$; then $\{u^\top X_i\}_{i=1}^n$ is a collection of sub-Gaussian random variables with variance proxy $\sigma^2$. Thus, from Corollary 2 (just take $a = \frac1n(1, \ldots, 1)^\top$, so that $\|a\|_2 = 1/\sqrt n$), with probability at least $1 - 2\delta$,
\[ \frac1n \sum_{i=1}^n u^\top X_i \in \Big[ -\frac{\sigma}{\sqrt n}\sqrt{2\log(1/\delta)},\ \frac{\sigma}{\sqrt n}\sqrt{2\log(1/\delta)} \Big], \]
and this inequality holds for any fixed $u \in S^{d-1}$. However, we want to impose a bound on
\[ \sup_{u \in S^{d-1}} \Big| \frac1n \sum_{i=1}^n u^\top X_i \Big| = \sup_{u \in S^{d-1}} |u^\top \hat\mu|, \quad \text{where } \hat\mu = \frac1n \sum_{i=1}^n X_i. \quad (5) \]
To impose a bound on (5), the following lemma will be particularly useful.

Lemma 3. Let $\{X_i\}_{i=1}^n$ be a collection of $n$ i.i.d. random vectors in $\mathbb{R}^d$ such that $X_i \sim \mathrm{subG}_d(\sigma^2)$ for every $i \in \{1, \ldots, n\}$. Then the empirical mean $\hat\mu = \frac1n \sum_{i=1}^n X_i$ is sub-Gaussian with variance proxy $\frac{\sigma^2}{n}$.

Proof. Let $s \in S^{d-1}$ and note that the MGF of $s^\top \hat\mu$ is given by
\[ M_{s^\top \hat\mu}(t) = \mathbb{E}\Big( e^{\frac tn \sum_{i=1}^n s^\top X_i} \Big) = \prod_{i=1}^n \mathbb{E}\Big( e^{\frac tn s^\top X_i} \Big) = \Big( M_{s^\top X_i}\big(\tfrac tn\big) \Big)^n, \quad (6) \]
where the second equality uses the independence of the $X_i$. Since $X_i \sim \mathrm{subG}_d(\sigma^2)$, it follows that $s^\top X_i \sim \mathrm{subG}(\sigma^2)$, so we can apply (4) to (6) and obtain
\[ \Big( M_{s^\top X_i}\big(\tfrac tn\big) \Big)^n \le \exp\Big( \frac{\sigma^2 t^2}{2n^2} \Big)^n = \exp\Big( \frac{\sigma^2 t^2}{2n} \Big) = \exp\Big( \frac{\sigma^2}{n} \cdot \frac{t^2}{2} \Big), \]
which completes the proof.
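The MGF calculation in Lemma 3 can be checked numerically. The sketch below (mine, not from the notes; parameters arbitrary) uses vectors with i.i.d. Rademacher coordinates, which satisfy $X_i \sim \mathrm{subG}_d(1)$, and compares a Monte Carlo estimate of $M_{s^\top \hat\mu}(t)$ with the exact product-of-cosh MGF and with the Lemma 3 bound $\exp(t^2/(2n))$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, trials = 10, 50, 20_000

s = rng.normal(size=d)
s /= np.linalg.norm(s)             # a fixed unit direction

# Vectors with i.i.d. Rademacher coordinates satisfy X_i ~ subG_d(1)
X = rng.choice([-1.0, 1.0], size=(trials, n, d))
mu_hat = X.mean(axis=1)            # draws of the empirical mean
proj = mu_hat @ s                  # draws of s^T mu_hat

for t in [1.0, 3.0, 5.0]:
    mc = np.mean(np.exp(t * proj))              # Monte Carlo MGF estimate
    exact = np.prod(np.cosh(t * s / n) ** n)    # exact MGF for this distribution
    bound = np.exp(t**2 / (2 * n))              # Lemma 3 with sigma^2 = 1
    print(f"t={t}: MC ~= {mc:.4f}, exact = {exact:.4f} <= bound = {bound:.4f}")
```

The exact MGF sits just below the bound (the Monte Carlo estimate fluctuates around it), illustrating that the variance proxy $\sigma^2/n$ is essentially sharp here.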
We now show that a bound on (5) can be obtained using Lemma 3. The bound is stated in the following theorem [1]:

Theorem 1. Let $X \sim \mathrm{subG}_d(\sigma^2)$ be a sub-Gaussian random vector with variance proxy $\sigma^2$ and let $B_2$ be the unit $\ell_2$ ball in $\mathbb{R}^d$. Then for any $\delta \in (0, 1)$, the following event holds with probability at least $1 - \delta$:
\[ \sup_{u \in S^{d-1}} u^\top X \le \sup_{u \in B_2} u^\top X = \sup_{u \in B_2} |u^\top X| \le 4\sigma\sqrt d + 2\sigma\sqrt{2\log(1/\delta)}. \]

To prove this theorem we need the following two lemmas, which we state without proof (see [1]):

Lemma 4. The unit ball $B_2$ in $\mathbb{R}^d$ has an $\varepsilon$-net $\mathcal{N}$ of cardinality $|\mathcal{N}| \le (3/\varepsilon)^d$.

Lemma 5. Let $\mathcal{N} = \{r_1, \ldots, r_N\}$ be a finite set of points in $\mathbb{R}^d$ and let $X \in \mathbb{R}^d$ be a random vector such that each $r_i^\top X$ is sub-Gaussian with variance proxy $\sigma^2$. Then the following two inequalities hold (this is a special case of Theorem 1.16 from [1]):
\[ \mathbb{P}\Big( \sup_{r \in \mathcal{N}} r^\top X > t \Big) \le |\mathcal{N}|\, e^{-\frac{t^2}{2\sigma^2}}, \qquad \mathbb{P}\Big( \sup_{r \in \mathcal{N}} |r^\top X| > t \Big) \le 2|\mathcal{N}|\, e^{-\frac{t^2}{2\sigma^2}}. \]

Now we present a proof of Theorem 1 from [1].

Proof. Let $\mathcal{N}$ be a $1/2$-net of $B_2$; by Lemma 4 we can choose $\mathcal{N}$ such that $|\mathcal{N}| \le 6^d$. Since $\mathcal{N}$ is a $1/2$-net of $B_2$, for any $u \in B_2$ there exists an $r \in \mathcal{N}$ such that $\|u - r\| \le 1/2$. Define $x = u - r$ and note that, since $\|x\| \le 1/2$, we have $x \in \frac12 B_2$. Then
\[ \sup_{u \in B_2} u^\top X \le \sup_{r \in \mathcal{N},\, x \in \frac12 B_2} (r^\top X + x^\top X) \le \sup_{r \in \mathcal{N}} r^\top X + \sup_{x \in \frac12 B_2} x^\top X = \sup_{r \in \mathcal{N}} r^\top X + \frac12 \sup_{u \in B_2} u^\top X, \quad (7) \]
where the last equality follows since
\[ \sup_{x \in \frac12 B_2} x^\top X = \sup_{x \in \frac12 B_2} \tfrac12 (2x)^\top X = \tfrac12 \sup_{v \in B_2} v^\top X = \tfrac12 \sup_{u \in B_2} u^\top X, \]
taking $v = 2x \in B_2$ and then renaming the dummy variable. Now, from (7), it follows that
\[ \sup_{u \in B_2} u^\top X \le 2 \sup_{r \in \mathcal{N}} r^\top X. \]
Therefore, for any $t > 0$,
\[ \mathbb{P}\Big( \sup_{u \in B_2} u^\top X > t \Big) \le \mathbb{P}\Big( 2 \sup_{r \in \mathcal{N}} r^\top X > t \Big) \le |\mathcal{N}|\, e^{-\frac{t^2}{8\sigma^2}} \le 6^d e^{-\frac{t^2}{8\sigma^2}}, \quad (8) \]
where the second inequality follows from Lemma 5. Setting the RHS of (8) to be at most $\delta$, we obtain the minimal value of $t$:
\[ 6^d e^{-\frac{t^2}{8\sigma^2}} \le \delta \iff d\log 6 + \log(1/\delta) \le \frac{t^2}{8\sigma^2} \iff t \ge 2\sigma\sqrt{2\log(6)\, d + 2\log(1/\delta)}. \]
Now note that
\[ 4\sigma\sqrt d + 2\sigma\sqrt{2\log(1/\delta)} \ge 2\sigma\big( \sqrt{2\log(6)\, d} + \sqrt{2\log(1/\delta)} \big) \ge 2\sigma\sqrt{2\log(6)\, d + 2\log(1/\delta)}. \]
Therefore, it is sufficient to take $t = 4\sigma\sqrt d + 2\sigma\sqrt{2\log(1/\delta)}$. Thus,
\[ \mathbb{P}\Big( \sup_{u \in S^{d-1}} u^\top X \le \sup_{u \in B_2} u^\top X \le 4\sigma\sqrt d + 2\sigma\sqrt{2\log(1/\delta)} \Big) \ge 1 - \delta. \quad (9) \]

Now a straightforward application of Lemma 3 and Theorem 1 reveals that
\[ \mathbb{P}\Big( \sup_{u \in S^{d-1}} u^\top \hat\mu \le \sup_{u \in B_2} u^\top \hat\mu \le 4\frac{\sigma}{\sqrt n}\sqrt d + 2\frac{\sigma}{\sqrt n}\sqrt{2\log(1/\delta)} \Big) \ge 1 - \delta, \quad (10) \]
which is consistent with the results from lecture.
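Since $\sup_{u \in S^{d-1}} u^\top X$ is exactly $\|X\|_2$, Theorem 1 is easy to test empirically. The sketch below (my own; parameter values arbitrary) uses Gaussian vectors $X \sim \mathcal{N}(0, \sigma^2 I_d)$, which are $\mathrm{subG}_d(\sigma^2)$, and checks that $\|X\|_2$ stays below the bound far more often than the guaranteed $1 - \delta$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, sigma, delta, trials = 100, 1.5, 0.05, 50_000

# N(0, sigma^2 I_d) vectors are subG_d(sigma^2), and
# the sup over the unit sphere of u^T X is exactly ||X||_2.
X = sigma * rng.normal(size=(trials, d))
sup_vals = np.linalg.norm(X, axis=1)

bound = 4 * sigma * np.sqrt(d) + 2 * sigma * np.sqrt(2 * np.log(1 / delta))
print(f"P(sup u^T X <= bound) = {np.mean(sup_vals <= bound):.4f} "
      f"(Theorem 1 guarantees >= {1 - delta})")
print(f"median ||X||_2 = {np.median(sup_vals):.2f} vs bound = {bound:.2f}")
```

The median of $\|X\|_2$ is close to $\sigma\sqrt d$, so the leading constant 4 in Theorem 1, which comes from the crude $1/2$-net argument, is loose by roughly a factor of 4.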
2.4 Non-homogeneous random vectors

Now fix a vector $a \in \mathbb{R}^d$ and let $\{X_i\}_{i=1}^n \subseteq \mathbb{R}^d$ be i.i.d. random vectors with the two-point distribution
\[ \mathbb{P}(X_i = a) = \mathbb{P}(X_i = -a) = \frac12. \]
These are sub-Gaussian with variance proxy $\sigma^2 = \|a\|_2^2$. Now, if we apply (9) to $\hat\mu = \frac1n \sum_{i=1}^n X_i \sim \mathrm{subG}_d(\sigma^2/n)$, we obtain the approximate bound
\[ \sup_{u \in B_2} u^\top \hat\mu \approx O\Big( \sigma\sqrt{\frac dn} \Big). \quad (11) \]
However, each $X_i$ varies only along the single direction $a/\|a\|_2$, and along that direction $\hat\mu$ concentrates at scale $\|a\|_2/\sqrt n = \sigma/\sqrt n$, so the bound (11) is loose by a factor of $\sqrt d$! To get around this, we introduce the following lemma by Yurinsky:

Lemma 6. Suppose $\|X\| \le b$ almost surely and $\mathbb{E}\|X - \mu\|_2^2 \le \sigma^2$. Then for any $\delta \in (0, 1)$, the following event holds with probability at least $1 - \delta$:
\[ \|\hat\mu - \mu\|_2 \le \mathbb{E}\|\hat\mu - \mu\|_2 + \sigma\sqrt{\frac{8\log(1/\delta)}{n}} + \frac{2b\log(1/\delta)}{3n} \le \frac{\sigma}{\sqrt n} + \sigma\sqrt{\frac{8\log(1/\delta)}{n}} + \frac{2b\log(1/\delta)}{3n}. \]

Applying the above lemma to the example given (with $b = \sigma = \|a\|_2$), we obtain that with high probability
\[ \|\hat\mu - \mu\|_2 \approx O\Big( \frac{\|a\|_2}{\sqrt n} \Big), \]
which is dimension-free and much better than what we had previously.
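A quick simulation (a sketch of mine; parameters arbitrary) illustrates the gap for this two-point distribution: $\|\hat\mu\|_2$ scales like $\|a\|_2/\sqrt n$, as Lemma 6 predicts, far below the $O(\sigma\sqrt{d/n})$ rate of (11).

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, trials = 400, 100, 5_000

a = rng.normal(size=d)
norm_a = np.linalg.norm(a)           # variance proxy sigma = ||a||_2

# X_i = +/- a with probability 1/2 each, so mu = 0 and
# ||mu_hat||_2 = |average of the signs| * ||a||_2
signs = rng.choice([-1.0, 1.0], size=(trials, n))
mu_hat_norms = np.abs(signs.mean(axis=1)) * norm_a

print(f"typical ||mu_hat||_2        : {np.median(mu_hat_norms):.3f}")
print(f"Yurinsky scale ||a||/sqrt(n): {norm_a / np.sqrt(n):.3f}")
print(f"loose bound sigma*sqrt(d/n) : {norm_a * np.sqrt(d / n):.3f}")
```

With $d = 400$ the sup-based bound overshoots by a factor of $\sqrt d = 20$, exactly the looseness the lemma is designed to remove.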
3 A brief introduction to covariance estimation

Let $\{X_i\}_{i=1}^n \subseteq \mathbb{R}^d$ be i.i.d. random vectors with $\mathbb{E}(X_i) = \mu$ such that $X_i - \mu$ is sub-Gaussian. Now, to begin our study of random matrices, we compare the empirical covariance matrix
\[ \hat\Sigma = \frac1n \sum_{i=1}^n (X_i - \hat\mu)(X_i - \hat\mu)^\top = \frac1n \sum_{i=1}^n (X_i - \mu)(X_i - \mu)^\top - (\hat\mu - \mu)(\hat\mu - \mu)^\top \]
with the actual covariance $\Sigma$ under the Frobenius norm. In particular, by the triangle inequality,
\[ \big\| \hat\Sigma - \Sigma \big\|_F \le \Big\| \frac1n \sum_{i=1}^n (X_i - \mu)(X_i - \mu)^\top - \Sigma \Big\|_F + \big\| (\hat\mu - \mu)(\hat\mu - \mu)^\top \big\|_F. \quad (12) \]
What is important to note is that both $X_i - \mu$ and $\hat\mu - \mu$ are sub-Gaussian random vectors. Therefore, all the tools from Section 2 carry over by treating each matrix in (12) as a vector in $\mathbb{R}^{d^2}$. Moreover, under the Frobenius norm, $\|(\hat\mu - \mu)(\hat\mu - \mu)^\top\|_F$ reduces to the familiar quantity $\|\hat\mu - \mu\|_2^2$, which we already know how to deal with.
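The decomposition (12) is easy to inspect numerically. The following sketch (not from the notes; the covariance construction is an arbitrary choice) computes both sides for Gaussian data and shows that the rank-one correction term $\|\hat\mu - \mu\|_2^2$ is of lower order than the first term.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 20, 5_000

# Ground truth: a random covariance Sigma = A A^T / d and a random mean mu
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d
mu = rng.normal(size=d)

X = rng.multivariate_normal(mu, Sigma, size=n)
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / n

centered = X - mu
lhs = np.linalg.norm(Sigma_hat - Sigma)                    # ||Sigma_hat - Sigma||_F
term1 = np.linalg.norm(centered.T @ centered / n - Sigma)  # first term of (12)
term2 = np.sum((mu_hat - mu) ** 2)                         # second term = ||mu_hat - mu||_2^2
print(f"{lhs:.4f} <= {term1:.4f} + {term2:.6f}")
```

Consistent with Section 2, the first term decays like $1/\sqrt n$ while $\|\hat\mu - \mu\|_2^2$ decays like $1/n$.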
Finally, we conclude our introduction to random matrices with a proof of the following theorem:

Theorem 2. Let the $X_i$ be symmetric, positive semi-definite random matrices, and let $S_n = \sum_{i=1}^n X_i$. (Recall that a sum of PSD matrices is PSD.) Define $\lambda_1(S_n)$ to be the largest eigenvalue of the matrix $S_n$. Then for any $a > 0$, the following inequality holds:
\[ \mathbb{P}\big( \lambda_1(S_n) \ge a \big) \le \frac{\mathbb{E}\big( \mathrm{Tr}(S_n) \big)}{a}. \]

Proof. Since $S_n$ is PSD, all of its eigenvalues are nonnegative, so
\[ a \cdot \mathbb{1}\big( \lambda_1(S_n) \ge a \big) \le \lambda_1(S_n) \le \sum_{i=1}^d \lambda_i(S_n) = \mathrm{Tr}(S_n). \quad (13) \]
From the above we conclude that
\[ \mathbb{1}\big( \lambda_1(S_n) \ge a \big) \le \frac{\mathrm{Tr}(S_n)}{a}. \]
Taking the expectation of both sides yields the desired result:
\[ \mathbb{E}\Big( \mathbb{1}\big( \lambda_1(S_n) \ge a \big) \Big) = \mathbb{P}\big( \lambda_1(S_n) \ge a \big) \le \frac{\mathbb{E}\big( \mathrm{Tr}(S_n) \big)}{a}. \]
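As a final sanity check, here is a minimal sketch (mine; the rank-one Wishart-style construction and the threshold are arbitrary choices) comparing both sides of Theorem 2 for sums of random PSD matrices.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, trials = 10, 30, 20_000

# S_n = sum_i g_i g_i^T with g_i ~ N(0, I_d): a sum of n rank-one PSD matrices
G = rng.normal(size=(trials, n, d))
S = np.einsum('tnd,tne->tde', G, G)     # a batch of `trials` realizations of S_n
lam1 = np.linalg.eigvalsh(S)[:, -1]     # largest eigenvalue of each realization
tr = np.trace(S, axis1=1, axis2=2)      # Tr(S_n); its expectation is n*d

a = 400.0                               # arbitrary threshold
print(f"P(lambda_1 >= a) = {np.mean(lam1 >= a):.4f} "
      f"<= E[Tr(S_n)]/a = {tr.mean() / a:.4f}")
```

As with scalar Markov, the bound is valid but loose: here $\lambda_1(S_n)$ concentrates near $(\sqrt n + \sqrt d)^2 \approx 75$, well below the expected trace $nd = 300$.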
Bibliography

[1] Philippe Rigollet. High Dimensional Statistics. Lecture notes.