LM2019 is a nice survey of recent results in robust mean estimation and regression. I often learn best by writing out results, so here I will document some of the key proofs from the survey.
Set-up
Say we observe $n$ iid samples $X_1,\dots,X_n$ with mean $\mathbb{E}[X_1]=\mu$. The goal is to develop estimators $\hat{\mu}_n$ which are close to $\mu$ with high probability in a non-asymptotic sense. That is, we want to find the smallest possible value of $\epsilon=\epsilon(n,\delta)$ such that:
$$P\left(|\hat{\mu}_n-\mu|>\epsilon\right)\leq\delta$$
Sub-Gaussian RVs
Here is a summary of sub-Gaussian RVs. A random variable $X$ with $\mathbb{E}[X]=\mu$ is sub-Gaussian with parameter $s$ if for all $\lambda\in\mathbb{R}$:
$$\mathbb{E}\left[e^{\lambda(X-\mu)}\right]\leq e^{s^2\lambda^2/2}$$
Plugging this into a Chernoff bound gives the following concentration inequality for all $t>0$:
$$P\left(|X-\mu|\geq t\right)\leq 2e^{-t^2/(2s^2)}$$
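To spell out the Chernoff step: for any $\lambda>0$,
$$P(X-\mu\geq t)\leq e^{-\lambda t}\,\mathbb{E}\left[e^{\lambda(X-\mu)}\right]\leq e^{-\lambda t+s^2\lambda^2/2}$$
Optimizing over $\lambda$ (the minimum is at $\lambda=t/s^2$) gives $e^{-t^2/(2s^2)}$; the same argument applied to $-(X-\mu)$ plus a union bound produces the factor of $2$.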
We know $s=1$ for $X\sim N(0,1)$. Any sub-Gaussian RV also obeys, for all $p\geq 2$:
$$\left(\mathbb{E}|X-\mu|^p\right)^{1/p}\leq s'\sqrt{p}\tag{1}$$
for some absolute constants $c_1$ and $c_2$ such that $c_1 s\leq s'\leq c_2 s$. Note that this implies the existence of all higher-order moments.
Empirical Mean + CLT
The most natural estimator is the empirical mean:
$$\bar{\mu}_n=\frac{1}{n}\sum_{i=1}^n X_i$$
By the CLT, we have:
$$\lim_{n\to\infty}P\left\{|\bar{\mu}_n-\mu|>\frac{\sigma\,\Phi^{-1}(1-\delta/2)}{\sqrt{n}}\right\}\leq\delta$$
Taking $t=\Phi^{-1}(1-\delta/2)$ and plugging into the upper deviation inequality for standard normals, we get $\Phi^{-1}(1-\delta/2)\leq\sqrt{2\log(2/\delta)}$. It follows:
$$\lim_{n\to\infty}P\left\{|\bar{\mu}_n-\mu|>\frac{\sigma\sqrt{2\log(2/\delta)}}{\sqrt{n}}\right\}\leq\delta$$
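As a quick numerical sanity check on that quantile bound (my own check using scipy, not part of the survey):

```python
# Verify numerically that Phi^{-1}(1 - delta/2) <= sqrt(2 log(2/delta)).
import numpy as np
from scipy.stats import norm

for delta in [0.1, 0.01, 1e-4, 1e-8]:
    quantile = norm.ppf(1 - delta / 2)      # Phi^{-1}(1 - delta/2)
    bound = np.sqrt(2 * np.log(2 / delta))  # sqrt(2 log(2/delta))
    print(f"delta={delta:g}: quantile={quantile:.3f} <= bound={bound:.3f}")
```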
Non-Asymptotic Sub-Gaussian Estimators
We are interested in developing non-asymptotic bounds with rates similar to those given by the CLT. We will call an estimator $\hat{\mu}_n$ sub-Gaussian with parameter $L$ if, with probability $1-\delta$:
$$|\hat{\mu}_n-\mu|\leq\frac{L\sigma\sqrt{\log(2/\delta)}}{\sqrt{n}}\tag{2}$$
Alternatively, setting the RHS to t, we can say:
$$P\left(|\hat{\mu}_n-\mu|\geq t\right)\leq 2\exp\left(-\frac{t^2 n}{L^2\sigma^2}\right)$$
Minimax Error Rate
Such an error rate is minimax even for a fixed confidence level $\delta$.
Theorem: Let $\mu\in\mathbb{R}$, $\sigma>0$ and $\delta\in(2e^{-n/4},1/2)$. For any estimator $\hat{\mu}_n$, there exists a distribution with mean $\mu$ and variance $\sigma^2$ such that:
$$P\left\{|\hat{\mu}_n-\mu|>\frac{\sigma\sqrt{\log(1/\delta)}}{\sqrt{n}}\right\}\geq\delta$$
Proof: Consider two distributions $P_+$ and $P_-$ concentrated on two points:
$$P_+(\{0\})=P_-(\{0\})=1-p,\qquad P_+(\{c\})=P_-(\{-c\})=p$$
where $p\in[0,1]$ and $c>0$. We have $\mu_{P_+}=pc$, $\mu_{P_-}=-pc$ and $\sigma^2_{P_-}=\sigma^2_{P_+}=c^2p(1-p)$. Now consider $n$ independent $(X_i,Y_i)$ pairs such that:
$$P\{X_i=Y_i=0\}=1-p,\qquad P\{X_i=c,\,Y_i=-c\}=p$$
Note that $X_i\sim P_+$ and $Y_i\sim P_-$. Let $\delta\in(0,1/2)$. If $\delta\geq 2e^{-n/4}$ and $p=\frac{1}{2n}\log(2/\delta)$, then using $1-p\geq\exp(-p/(1-p))$:
$$P\{X_1^n=Y_1^n\}=(1-p)^n\geq 2\delta$$
Let $\hat{\mu}_n$ be any mean estimator, possibly depending on $\delta$. On the event $\{X_1^n=Y_1^n\}$ the estimator outputs the same value whether the sample came from $P_+$ or $P_-$, while the two means are $2pc$ apart, so the estimate must be at distance at least $pc$ from one of them. Hence:
$$P_{P_+}\left\{|\hat{\mu}_n-\mu_{P_+}|\geq pc\right\}+P_{P_-}\left\{|\hat{\mu}_n-\mu_{P_-}|\geq pc\right\}\geq P\{X_1^n=Y_1^n\}\geq 2\delta$$
so under at least one of $P_+$ or $P_-$ the estimator errs by at least $pc=\sigma\sqrt{p/(1-p)}\geq\sigma\sqrt{\log(2/\delta)/(2n)}$ with probability at least $\delta$. Hence, for the choices of $\delta$ above, the best one can do uniformly over both $P_+$ and $P_-$ is a sub-Gaussian error rate.
Empirical Mean + Sub-Gaussian RVs
For any sub-Gaussian RV, a Chernoff bound shows that the empirical mean $\bar{\mu}_n$ is sub-Gaussian for some $L$. However, as noted around Equation 1, sub-Gaussianity is a strong assumption: it forces rapid tail decay and the existence of all moments. If you only assume finite variance, then the standard tool is Chebyshev's inequality: with probability $1-\delta$,
$$|\bar{\mu}_n-\mu|\leq\sigma\sqrt{\frac{1}{n\delta}}$$
Although this gets the correct $O(n^{-1/2})$ rate, compared to Equation 2 the dependence on $\delta$ is exponentially worse ($1/\sqrt{\delta}$ versus $\sqrt{\log(2/\delta)}$). This matters in settings where we want to estimate many means simultaneously and have to apply a union bound.
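To make the gap concrete, here is a small comparison of the two widths (my own illustration; the choices $n=10{,}000$, $\sigma=1$ and $L=\sqrt{2}$ are arbitrary):

```python
import numpy as np

n, sigma, L = 10_000, 1.0, np.sqrt(2)  # illustrative values, not from the survey
for delta in [1e-2, 1e-4, 1e-8]:
    chebyshev = sigma * np.sqrt(1 / (n * delta))               # finite-variance bound
    sub_gaussian = L * sigma * np.sqrt(np.log(2 / delta) / n)  # Equation 2
    print(f"delta={delta:g}: Chebyshev={chebyshev:.3f}, sub-Gaussian={sub_gaussian:.4f}")
```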
Median of Means
Using only the assumption of finite variance, the first estimator which achieves sub-Gaussian rates is the median of means (MOM). The core idea is to split the samples into $k$ approximately equal blocks, compute the mean of each block, and then take the median of the computed values. Formally, let $1\leq k\leq n$ and partition $[n]=\{1,\dots,n\}$ into $k$ blocks $B_1,\dots,B_k$ of size $|B_j|\geq\lfloor n/k\rfloor\geq 2$. For each $j\in[k]$, compute:
$$Z_j=\frac{1}{|B_j|}\sum_{i\in B_j}X_i$$
and define the MOM estimator $\hat{\mu}_n=M(Z_1,\dots,Z_k)$, where $M$ denotes the median operator. Each block mean is unbiased for $\mu$ with standard deviation of order $\sigma/\sqrt{n/k}$, so the median of the distribution of the block means lies within $O(\sigma/\sqrt{n/k})$ of $\mu$. Finally, the empirical median concentrates around this population median.
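Here is a minimal sketch of the estimator in Python (my own illustration, not code from the survey; dropping leftover samples when $k$ does not divide $n$ is just one convenient convention):

```python
import numpy as np

def median_of_means(x, k):
    """Median-of-means estimate of the mean of x using k blocks."""
    x = np.asarray(x, dtype=float)
    m = len(x) // k                      # block size; leftover samples are dropped
    blocks = x[: m * k].reshape(k, m)    # k blocks of m samples each
    return np.median(blocks.mean(axis=1))

# Example on heavy-tailed data with mean 0 (Student-t, finite variance).
rng = np.random.default_rng(0)
data = rng.standard_t(df=2.5, size=10_000)
k = int(np.ceil(8 * np.log(1 / 0.01)))   # k = ceil(8 log(1/delta)) for delta = 0.01, as in the theorem below
print(median_of_means(data, k), data.mean())
```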
Theorem: Let $X_1,\dots,X_n$ be $n$ iid samples with $\mathbb{E}[X_i]=\mu$ and $\mathrm{Var}(X_i)=\sigma^2$. Assume $n=mk$ for positive integers $m$ and $k$. Then:
$$P\left\{|\hat{\mu}_n-\mu|>\sigma\sqrt{4/m}\right\}\leq e^{-k/8}$$
For any $\delta\in(0,1)$, if $k=\lceil 8\log(1/\delta)\rceil$, then with probability $1-\delta$:
$$|\hat{\mu}_n-\mu|\leq\sigma\sqrt{\frac{32\log(1/\delta)}{n}}$$
Proof: Fix a block $j$. By Chebyshev's inequality, with probability at least $3/4$:
$$|Z_j-\mu|\leq\sigma\sqrt{\frac{4}{m}}$$
Then, $|\hat{\mu}_n-\mu|>\sigma\sqrt{4/m}$ implies that at least $k/2$ of the block means satisfy $|Z_j-\mu|>\sigma\sqrt{4/m}$. Hence:
$$P\left\{|\hat{\mu}_n-\mu|>\sigma\sqrt{4/m}\right\}\leq P\left\{\sum_{j=1}^k\mathbb{I}\left(|Z_j-\mu|>\sigma\sqrt{4/m}\right)\geq\frac{k}{2}\right\}\leq P\left\{\mathrm{Bin}(k,1/4)\geq\frac{k}{2}\right\}\leq e^{-k/8}$$
Here the last inequality follows from Hoeffding's inequality. This theorem proves that the MOM estimator is sub-Gaussian with parameter $L=\sqrt{32}$.
Note that different confidence levels lead to different estimators because $k$ depends on $\delta$. However, if you only assume finite variance, then there do not exist sub-Gaussian estimators that work independently of $\delta$ (more on this in the last section). An analogous result holds even for distributions with infinite variance, as long as $\mathbb{E}[|X-\mu|^{1+\alpha}]<\infty$ for some $\alpha\in(0,1]$.
The dependence of $k$ on $\delta$ is disappointing. However, as long as $X$ has a finite $(2+\alpha)$-th moment for some $\alpha>0$, we get sub-Gaussian performance for a much wider range of $k$. Consider the case $\alpha=1$.
Theorem: Let $X_1,\dots,X_n$ be $n$ iid samples with $\mathbb{E}[X_i]=\mu$, $\mathrm{Var}(X_i)=\sigma^2$, and third central moment $\rho=\mathbb{E}[|X-\mu|^3]$. Let $m,k$ be positive integers such that $n=mk$. Assume that:
$$\sqrt{\frac{\log(2/\delta)}{2k}}+\frac{\rho}{2\sigma^3\sqrt{m}}\leq\frac{1}{4}\tag{3}$$
Then, the MOM estimator $\hat{\mu}_n$ with $k$ blocks satisfies, with probability $1-\delta$:
$$|\hat{\mu}_n-\mu|\leq\frac{1}{c}\left(\sigma\sqrt{\frac{\log(2/\delta)}{2n}}+\frac{\rho k}{2\sigma^2 n}\right)$$
Here $c=\phi(\Phi^{-1}(3/4))$ is a constant, where $\phi$ and $\Phi$ denote the standard normal density and distribution functions.
In this case, the following choice of $k$ gives sub-Gaussian performance:
$$k\leq\frac{\sigma^3\sqrt{2n}}{\rho}$$
Note that this is independent of $\delta$, so the bound holds simultaneously for all values of $\delta$ permitted by Equation 3. Here $k$ can be a constant fraction of $\sqrt{n}$, much larger than the $O(\log(1/\delta))$ of the previous case. Then, the result holds simultaneously for all $\delta\geq e^{-c_0\sqrt{n}}$.
Proof: Fix $a>0$ and note that $\hat{\mu}_n\leq\mu+a$ whenever $\frac{1}{k}\sum_{i=1}^k\mathbb{I}(Z_i-\mu\leq a)\geq\frac{1}{2}$. We will compare $P\{Z_i-\mu\leq a\}$ with $P\left\{G\frac{\sigma}{\sqrt{m}}\leq a\right\}$, where $G\sim N(0,1)$. By Hoeffding's inequality, we have with probability $1-\delta/2$:
$$\frac{1}{k}\sum_{i=1}^k\left\{\mathbb{I}(Z_i-\mu\leq a)-P\{Z_i-\mu\leq a\}\right\}\geq-\sqrt{\frac{\log(2/\delta)}{2k}}$$
The Berry-Esseen theorem is a refinement of the CLT which bounds the maximal approximation error between the distribution of a sample average and the normal distribution to which it converges. It implies:
$$P\{Z_i-\mu\leq a\}\geq P\left\{G\frac{\sigma}{\sqrt{m}}\leq a\right\}-\frac{\rho}{2\sigma^3\sqrt{m}}$$
So, $\frac{1}{k}\sum_{i=1}^k\mathbb{I}(Z_i-\mu\leq a)\geq\frac{1}{2}$ with probability $1-\delta/2$ whenever we choose $a$ such that:
$$P\left\{G\frac{\sigma}{\sqrt{m}}\leq a\right\}\geq\frac{1}{2}+\frac{\rho}{2\sigma^3\sqrt{m}}+\sqrt{\frac{\log(2/\delta)}{2k}}$$
Now assume Equation 3, that is, $\varepsilon:=\frac{\rho}{2\sigma^3\sqrt{m}}+\sqrt{\frac{\log(2/\delta)}{2k}}\leq\frac{1}{4}$, and take $a=\frac{\sigma\varepsilon}{c\sqrt{m}}$ where $c=\phi(\Phi^{-1}(3/4))$. If $a\sqrt{m}/\sigma\geq\Phi^{-1}(3/4)$, the inequality above holds trivially, since its right-hand side is at most $3/4$. Otherwise, note that $\Phi(t)\geq\frac{1}{2}+ct$ for all $0\leq t\leq\Phi^{-1}(3/4)$, because the standard normal density is at least $c$ on that interval; applying this with $t=a\sqrt{m}/\sigma=\varepsilon/c$ again gives the inequality above. Hence:
$$a=\frac{\sigma\varepsilon}{c\sqrt{m}}=\frac{1}{c}\left(\frac{\rho k}{2\sigma^2 n}+\sigma\sqrt{\frac{\log(2/\delta)}{2n}}\right)$$
ensures $\hat{\mu}_n\leq\mu+a$ with probability at least $1-\delta/2$. A symmetric argument for the lower tail ensures $\hat{\mu}_n\in[\mu-a,\mu+a]$ with probability $1-\delta$. The theorem follows.
Catoni’s Estimator
Catoni’s estimator is an M-estimator developed in C2021. The idea is that the empirical mean y=μˉn solves:
i=1∑n(Xi−y)=0
One can replace the left-hand side with a strictly decreasing function of y:
Rn,α(y)=i=1∑nψ(α(Xi−y))
where $\alpha\in\mathbb{R}$ is a parameter and $\psi:\mathbb{R}\to\mathbb{R}$ is an antisymmetric increasing function. If $\psi(x)$ grows much more slowly than $x$, then the effect of outliers is diminished. This idea is similar to the Huber loss, often used as a replacement for the squared loss in regression. One choice of $\psi$ is as follows:
$$\psi(x)=\begin{cases}\log(1+x+x^2/2) & \text{if } x\geq 0,\\ -\log(1-x+x^2/2) & \text{if } x<0\end{cases}$$
Catoni’s estimator μ^ is just defined as the value of y such that Rn,α(y)=0. Since ϕ(x)≤x for all x∈R, assuming that Var(X)=σ2, we have:
The last line follows from 1+x≤exp(x). Then, by Markov’s inequality:
$$P\left\{R_{n,\alpha}(y)\geq n\alpha(\mu-y)+\frac{n\alpha^2\left(\sigma^2+(\mu-y)^2\right)}{2}+\log(1/\delta)\right\}\leq\delta$$
Define:
$$h(y)=n\alpha(\mu-y)+\frac{n\alpha^2\left(\sigma^2+(\mu-y)^2\right)}{2}+\log(1/\delta)$$
This is a quadratic polynomial in $y$ and has real roots whenever $\alpha$ satisfies $\alpha^2\sigma^2+\frac{2\log(1/\delta)}{n}\leq 1$. Let $y_+=\mu+g(\alpha,n,\delta,\sigma^2)$ denote its smaller root. Taking $y=y_+$, we have $R_{n,\alpha}(y_+)<0$ with probability $1-\delta$; since $R_{n,\alpha}(y)$ is strictly decreasing in $y$, this gives $\hat{\mu}_{n,\alpha}<y_+$. A symmetric argument, based on the lower bound $\psi(x)\geq-\log(1-x+x^2/2)$, yields $\hat{\mu}_{n,\alpha}>y_-=\mu-g(\alpha,n,\delta,\sigma^2)$ with probability $1-\delta$. Combining the two inequalities, with probability $1-2\delta$ we have:
$$|\hat{\mu}_{n,\alpha}-\mu|\leq g(\alpha,n,\delta,\sigma^2)$$
Finally, we optimize $g(\alpha,n,\delta,\sigma^2)$ with respect to $\alpha$.
Theorem: Let $X_1,\dots,X_n$ be iid samples with $\mathbb{E}[X]=\mu$ and $\mathrm{Var}(X)=\sigma^2$, and let $\delta\in(0,1)$ be such that $n>2\log(1/\delta)$. Set:
$$\alpha=\sqrt{\frac{2\log(1/\delta)}{n\sigma^2\left(1+\frac{2\log(1/\delta)}{n-2\log(1/\delta)}\right)}}$$
Then, Catoni’s estimator μ^n,α satisfies with probability 1−2δ:
∣μ^n,α−μ∣≤n−2log(1/δ)2σ2log(1/δ)
Hence we get a sub-Gaussian estimator; in fact, the constant $\sqrt{2}$ is optimal. However, unlike MOM, it relies on knowledge of $\sigma^2$ to choose $\alpha$. We can replace $\sigma^2$ with an upper bound $\sigma^2\leq v$. If no knowledge of $\sigma^2$ is available, one can use Lepski's method to adaptively select $\alpha$ from the data. Again, the estimator depends on $\delta$. It can be made independent of $\delta$ by setting $\alpha=\sqrt{2/(n\sigma^2)}$, but the estimator then becomes sub-exponential instead of sub-Gaussian.
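Here is a minimal sketch of Catoni's estimator in Python, assuming $\sigma^2$ is known; the $\alpha$ is the choice from the theorem above, while the root-bracketing interval around the empirical mean is an ad hoc choice of mine:

```python
import numpy as np
from scipy.optimize import brentq

def psi(x):
    """Catoni's influence function: antisymmetric, with logarithmic growth."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, np.log1p(x + x**2 / 2), -np.log1p(-x + x**2 / 2))

def catoni(x, sigma2, delta):
    """Catoni's mean estimator, assuming the variance sigma2 is known."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    two_log = 2 * np.log(1 / delta)
    alpha = np.sqrt(two_log / (n * sigma2 * (1 + two_log / (n - two_log))))

    def r(y):
        # R_{n,alpha}(y): strictly decreasing in y, so it has a unique root.
        return np.sum(psi(alpha * (x - y)))

    # Bracket the root around the empirical mean; widen this if the signs do not differ.
    half_width = 10 * np.sqrt(sigma2)
    return brentq(r, x.mean() - half_width, x.mean() + half_width)

rng = np.random.default_rng(0)
data = rng.standard_t(df=3, size=5_000)   # heavy-tailed, mean 0, variance 3
print(catoni(data, sigma2=3.0, delta=0.01))
```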
Trimmed Mean
The trimmed mean is the most intuitive estimator which achieves sub-Gaussian rates. Consider a simple variant. Split your data into two halves: say you have $2n$ samples, $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$. Define the truncation function:
$$\phi_{\alpha,\beta}(x)=\begin{cases}\beta & \text{if } x>\beta,\\ x & \text{if } x\in[\alpha,\beta],\\ \alpha & \text{if } x<\alpha\end{cases}$$
For $x_1,\dots,x_m\in\mathbb{R}$, we denote the sorted order as $x_1^*\leq x_2^*\leq\cdots\leq x_m^*$. Given a confidence level $\delta\geq 8e^{-3n/16}$, set:
$$\epsilon=\frac{16\log(8/\delta)}{3n}$$
Let $\alpha=Y_{\epsilon n}^*$ and $\beta=Y_{(1-\epsilon)n}^*$, and define the estimator:
$$\hat{\mu}_{2n}=\frac{1}{n}\sum_{i=1}^n\phi_{\alpha,\beta}(X_i)$$
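A minimal sketch of this two-sample trimmed mean in Python (again my own illustration; rounding $\epsilon n$ to integer indices for the order statistics is one of several reasonable conventions):

```python
import numpy as np

def trimmed_mean(x, y, delta):
    """Two-sample trimmed mean: truncation levels come from y, averaging is over x.

    Assumes the two halves x and y have the same length n.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    eps = 16 * np.log(8 / delta) / (3 * n)
    y_sorted = np.sort(y)
    alpha = y_sorted[int(np.floor(eps * n))]          # roughly the (eps*n)-th order statistic
    beta = y_sorted[int(np.ceil((1 - eps) * n)) - 1]  # roughly the ((1-eps)*n)-th order statistic
    return np.mean(np.clip(x, alpha, beta))

rng = np.random.default_rng(0)
sample = rng.standard_t(df=2.5, size=10_000)
print(trimmed_mean(sample[:5_000], sample[5_000:], delta=0.01))
```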
Theorem: Let $X_1,\dots,X_n,Y_1,\dots,Y_n$ be iid copies of $X$ with $\mathbb{E}[X]=\mu$ and $\mathrm{Var}(X)=\sigma^2$. Let $\delta\in(0,1)$ be such that $n>(16/3)\log(8/\delta)$. Then, with probability $1-\delta$:
$$|\hat{\mu}_{2n}-\mu|\leq 9\sigma\sqrt{\frac{\log(8/\delta)}{n}}$$
Proof: The proof starts by showing that the truncation levels are close to the true quantiles. For $p\in(0,1)$, define the quantiles:
$$Q_p=\sup\{M\in\mathbb{R}\mid P\{X\geq M\}\geq 1-p\}$$
Assume that $X$ has a non-atomic distribution, so that $P\{X>Q_p\}=P\{X\geq Q_p\}=1-p$. Applying Bernstein's inequality, with probability at least $1-2\exp(-(3/16)\epsilon n)$ we have:
$$|\{i\in[n]:Y_i\geq Q_{1-2\epsilon}\}|\geq\epsilon n\quad\text{and}\quad|\{i\in[n]:Y_i\leq Q_{1-\epsilon/2}\}|\geq(1-\epsilon)n$$
So, with probability $1-2\exp(-(3/16)\epsilon n)$, we have:
$$Q_{1-2\epsilon}\leq Y_{(1-\epsilon)n}^*\leq Q_{1-\epsilon/2}\tag{4}$$
A symmetric argument yields:
$$Q_{\epsilon/2}\leq Y_{\epsilon n}^*\leq Q_{2\epsilon}\tag{5}$$
Next, we show that $|\mathbb{E}\phi_{\alpha,\beta}(X)-\mu|$ is small and that the estimator concentrates around its conditional mean. Consider the event $E$ that both Equations 4 and 5 hold; we have $P(E)\geq 1-4\exp(-(3/16)\epsilon n)$. Conditional on $E$, the truncation levels are sandwiched between population quantiles, which yields a bound of order $\sigma\sqrt{\epsilon}$ on the bias $|\mathbb{E}[\phi_{\alpha,\beta}(X_1)\mid Y_1^n]-\mu|$ (Equation 6). Moreover, conditional on $E$ (which depends only on $Y_1,\dots,Y_n$), the deviation $Z=\frac{1}{n}\sum_{i=1}^n\left(\phi_{\alpha,\beta}(X_i)-\mathbb{E}[\phi_{\alpha,\beta}(X_1)\mid Y_1^n]\right)$ is the average of centered random variables that are bounded pointwise by:
$$M=\max\left\{|Q_{\epsilon/2}-\mu|,\,|Q_{1-\epsilon/2}-\mu|\right\}\leq\sigma\sqrt{2/\epsilon}$$
Hence, with probability at least $1-\delta/2$, Bernstein's inequality implies:
$$Z\leq\sigma\sqrt{\frac{2\log(2/\delta)}{n}}+\frac{\log(2/\delta)\,\sigma\sqrt{2/\epsilon}}{n}\leq 3\sigma\sqrt{\frac{\log(2/\delta)}{n}}\tag{7}$$
Putting together Equation 6 and Equation 7, the result follows.
A unique property of the trimmed mean is that it is also robust to adversarial contamination.
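As a rough illustration of this robustness (my own experiment, reusing the `trimmed_mean` sketch above; the contamination scheme and numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5_000)
y = rng.normal(size=5_000)
x[:50] = 1e6   # adversarially corrupt 1% of the X half with huge outliers

print("empirical mean:", x.mean())                        # ruined by the outliers
print("trimmed mean:  ", trimmed_mean(x, y, delta=0.01))  # stays near the true mean 0
```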
Impossibility of δ-Independent Estimators
DLLL2020 prove that δ-independent estimators are impossible without assumptions beyond finite variance. A simple version of the result is as follows.
Theorem: For all $L\geq 0$ and every sample size $n$, no estimator can be simultaneously $L$-sub-Gaussian for both $\delta_1=\frac{1}{2e\sqrt{L^3+1}}$ and $\delta_2=2e^{-L^4/4}$ for all distributions with finite second moments.
Proof: The proof follows by considering the restricted class of Poisson distributions. Assume, by way of contradiction, that there exists an estimator $\hat{\mu}_n$ that is $L$-sub-Gaussian for both $\delta_1$ and $\delta_2$ for all Poisson distributions. Let $X_1,\dots,X_n$ be iid Poisson with parameter $1/n$ and let $Y_1,\dots,Y_n$ be iid Poisson with parameter $c/n$, where we set $c=L^3+1$. For simplicity, assume $c$ is an integer. By sub-Gaussianity:
Here the last line follows from the fact that $\sum_i Y_i$ is Poisson with parameter $c$, together with Stirling's formula. Note that the conditional joint distribution of $n$ independent $\mathrm{Poisson}(\lambda)$ random variables, conditioned on the event that their sum equals $c$, depends only on $c$ and not on $\lambda$. So we have $1/2\leq e\,c!\,\delta_2=2e\,c!\,e^{-L^4/4}$. But explicitly computing the right-hand side shows that it is less than $1/2$ for $L\geq 50$. Hence we have a contradiction.
A caveat to the above analysis is that δ-independent estimation becomes possible with slightly more knowledge. Specifically, Lepski's method can be used if one has access to non-trivial upper and lower bounds on the variance. Also, as shown in a previous section, the existence of higher moments suffices as well.