LM2019 - Robust Mean Estimation

LM2019 have a nice survey of recent results in robust mean estimation and regression. I often learn best by writing out results. Here I will document some of the key proofs in the survey.

Set-up

Say we observe $n$ iid samples $X_1, \dots, X_n$ with mean $\mathbb{E}[X_1] = \mu$. The goal is to develop estimators $\hat{\mu}_{n}$ which are close to $\mu$ with high probability in a non-asymptotic sense. That is, we want to find the smallest possible value of $\epsilon = \epsilon(n, \delta)$ such that:

$$ \mathbb{P}(\lvert \hat{\mu}_{n} - \mu \rvert > \epsilon) \leq \delta $$

Sub-Gaussian RVs

Here is a summary of sub-Gaussian RVs. A random variable $X$ with $\mathbb{E}[X] = \mu$ is sub-Gaussian with parameter $s$ if for all $\lambda \in \mathbb{R}$:

$$ \mathbb{E}\left[e^{\lambda(X - \mu)}\right] \leq e^{s^{2}\lambda^{2}/2} $$

Plugging this into a Chernoff bound gives us the following concentration inequality for all $t > 0$:

$$ \mathbb{P}\left(\lvert X - \mu \rvert \geq t \right) \leq 2e^{-t^{2}/(2s^{2})} $$
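For completeness, the Chernoff step behind this bound is: for any $\lambda > 0$,

$$ \mathbb{P}(X - \mu \geq t) \leq e^{-\lambda t}\, \mathbb{E}\left[e^{\lambda(X - \mu)}\right] \leq e^{-\lambda t + s^{2}\lambda^{2}/2} $$

Choosing $\lambda = t/s^{2}$ gives $e^{-t^{2}/(2s^{2})}$; applying the same argument to $-(X - \mu)$ and taking a union bound gives the factor of $2$.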

We know $s = 1$ for $X \sim N(0, 1)$. Any sub-Gaussian RV also obeys, for all $p \geq 2$:

$$ \left(\mathbb{E} \lvert X - \mu \rvert^{p} \right)^{1/p} \leq s' \sqrt{p} \left( \mathbb{E} \lvert X - \mu \rvert ^{2} \right)^{1/2} \tag{1} $$

for some absolute constants $c_1$ and $c_2$ such that $c_1 s \leq s' \leq c_2 s$. Note that this implies the existence of all higher-order moments.

Empirical Mean + CLT

The most natural estimator is the empirical mean:

$$ \bar{\mu}_n = \frac{1}{n} \sum_{i = 1}^n X_i $$

By the CLT, writing $\sigma^2 = \text{Var}(X_1)$, we have:

$$ \lim_{n \rightarrow \infty} \mathbb{P}\left\{\lvert \bar{\mu}_n - \mu \rvert > \frac{\sigma \Phi^{-1}(1 - \delta/2)}{\sqrt{n}} \right\} \leq \delta $$

Plugging $t = \sqrt{2 \log(2/\delta)}$ into the upper deviation inequality for standard normals, $\mathbb{P}\{G \geq t\} \leq e^{-t^{2}/2}$, shows $\Phi(t) \geq 1 - \delta/2$ and hence $\Phi^{-1}(1 - \delta/2) \leq \sqrt{2 \log(2/\delta)}$. It follows that:

$$ \lim_{n \rightarrow \infty} \mathbb{P}\left\{\lvert \bar{\mu}_n - \mu \rvert > \frac{\sigma \sqrt{2 \log(2/\delta)}}{\sqrt{n}} \right\} \leq \delta $$

Non-Asymptotic Sub-Gaussian Estimators

We are interested in developing non-asymptotic bounds with rates similar to those given by the CLT. We will call an estimator $\hat{\mu}_n$ sub-Gaussian with parameter $L$ if, with probability $1 - \delta$:

$$ \lvert \hat{\mu}_n - \mu \rvert \leq \frac{L\sigma \sqrt{\log(2/\delta)}}{\sqrt{n}} \tag{2} $$

Alternatively, setting the RHS to $t$, we can say:

$$ \mathbb{P}\left(\lvert \hat{\mu}_n - \mu \rvert \geq t\right) \leq 2 \exp\left(- \frac{t^2 n}{L^2 \sigma^2}\right) $$

Minimax Error Rate

Such an error rate is minimax, even for a fixed confidence level $\delta$.

Theorem: Let $\mu \in \mathbb{R}$, $\sigma > 0$ and $\delta \in (2e^{-n/4}, 1/2)$. For any estimator $\hat{\mu}_n$, there exists a distribution with mean $\mu$ and variance $\sigma^2$ such that:

$$ \mathbb{P}\left\{ \lvert\hat{\mu}_n - \mu \rvert > \frac{\sigma \sqrt{\log(1/\delta)}}{\sqrt{n}}\right\} \geq \delta $$

Proof: Consider two distributions $P_{+}$ and $P_{-}$, each concentrated on two points:

$$ P_{+}(\{0\}) = P_{-}(\{0\}) = 1 - p \qquad P_{+}(\{c\}) = P_{-}(\{-c\}) = p $$

where $p \in [0, 1]$ and $c > 0$. We have $\mu_{P_{+}} = pc$, $\mu_{P_{-}} = -pc$ and $\sigma_{P_{-}}^{2} = \sigma_{P_{+}}^{2} = c^{2}p(1 - p)$. Now consider $n$ independent $(X_i, Y_i)$ pairs such that:

$$ P\{X_i = Y_i = 0\} = 1 - p \qquad P\left\{X_i = c, Y_i = -c\right\} = p $$

Note that $X_i \sim P_{+}$ and $Y_i \sim P_{-}$. Let $\delta \in (0, 1/2)$. If $\delta \geq 2e^{-n/4}$ and $p = (1/(2n))\log(2/\delta)$, then using $1 - p \geq \exp(-p/(1 - p))$:

$$ P\{X_{1}^n = Y_{1}^n\} = (1 - p)^n \geq 2\delta $$

Let $\hat{\mu}_n$ be any mean estimator, possibly depending on $\delta$. Then:

$$
\begin{aligned}
\max&\left(\mathbb{P}\left\{\lvert \hat{\mu}(X_1^n) - \mu_{P_{+}} \rvert > cp\right\}, \mathbb{P}\left\{\lvert \hat{\mu}(Y_1^n) - \mu_{P_{-}} \rvert > cp\right\}\right)\\
& \geq \frac{1}{2} \mathbb{P}\left\{ \lvert \hat{\mu}(X_1^n) - \mu_{P_{+}} \rvert > cp \quad \text{or} \quad \lvert \hat{\mu}(Y_1^n) - \mu_{P_{-}} \rvert > cp\right\}\\
& \geq \frac{1}{2} \mathbb{P}\left\{ \hat{\mu}_n(X_1^n) = \hat{\mu}_n(Y_1^n) \right\}\\
& \geq \frac{1}{2} \mathbb{P}\left\{X_{1}^n = Y_{1}^n\right\}\\
&\geq \delta
\end{aligned}
$$

From $\sigma^2 = c^2 p(1 - p)$ and $p \leq 1/2$, we have that $cp \geq \sigma \sqrt{p/2}$, so:

$$ \max\left(\mathbb{P}\left\{\lvert \hat{\mu}(X_1^n) - \mu_{P_{+}} \rvert > \sigma \sqrt{\frac{\log \frac{2}{\delta}}{n}}\right\}, \mathbb{P}\left\{\lvert \hat{\mu}(Y_1^n) - \mu_{P_{-}} \rvert > \sigma \sqrt{\frac{\log \frac{2}{\delta}}{n}}\right\}\right) \geq \delta $$

Hence, for the choices of $\delta$ above, the best one can do simultaneously for both $P_{+}$ and $P_{-}$ is a sub-Gaussian error rate.

Empirical Mean + Sub-Gaussian RVs

For sub-Gaussian RVs, a Chernoff bound shows that the empirical mean $\bar{\mu}_n$ is sub-Gaussian for some $L$. However, as noted in Equation 1, sub-Gaussian RVs exhibit strong decay in tail probabilities. If you only assume finite variance, then the standard result is Chebyshev's inequality: with probability $1 - \delta$:

$$ \lvert \bar{\mu}_n - \mu \rvert \leq \sigma \sqrt{\frac{1}{n \delta}} $$

Although this achieves the correct $O(n^{-1/2})$ rate, compared to Equation 2 the dependence on $\delta$ is exponentially worse. This matters in settings where we want to estimate many means simultaneously and have to apply a union bound.
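To make the gap concrete, here is a small numerical comparison of the two half-widths (a minimal sketch; the function names and the choices $L = 1$, $\sigma = 1$ are mine, purely for illustration):

```python
import numpy as np

def chebyshev_width(n, delta, sigma=1.0):
    # Chebyshev: |mean - mu| <= sigma * sqrt(1 / (n * delta)) w.p. 1 - delta
    return sigma * np.sqrt(1.0 / (n * delta))

def subgaussian_width(n, delta, sigma=1.0, L=1.0):
    # Sub-Gaussian rate (Equation 2): L * sigma * sqrt(log(2 / delta) / n)
    return L * sigma * np.sqrt(np.log(2.0 / delta) / n)

n = 10_000
for delta in [1e-2, 1e-4, 1e-6]:
    print(delta, chebyshev_width(n, delta), subgaussian_width(n, delta))
# The Chebyshev width blows up like delta**(-1/2), while the
# sub-Gaussian width grows only like sqrt(log(1/delta)).
```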

Median of Means

Using only the assumption of finite variance, the first estimator which achieves sub-Gaussian rates is the median of means (MOM). The core idea is to split the samples into $k$ approximately equal blocks, compute the mean of each block, and then take the median of the computed values. Formally, let $1 \leq k \leq n$ and partition $[n] = \{1, \dots, n\}$ into $k$ blocks $B_1, \dots, B_k$ of size $\lvert B_i \rvert \geq \lfloor n/k \rfloor \geq 2$. For each $i \in [k]$, compute:

$$ Z_{i} = \frac{1}{\lvert B_i \rvert} \sum_{j \in B_i} X_j $$

and define the MOM estimator $\hat{\mu} = M(Z_{1}, \dots, Z_{k})$, where $M$ denotes the median operator. Each block mean is unbiased for $\mu$ with standard deviation controlled by $\sigma/\sqrt{n/k}$. Hence the median of the distribution of the blockwise means lies within a deviation of order $\sigma/\sqrt{n/k}$ of $\mu$. Finally, the empirical median concentrates around this median.
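A minimal Python sketch of the estimator (the random partition, the Pareto example and the helper name `median_of_means` are my own illustrative choices; the setting $k = \lceil 8 \log(1/\delta) \rceil$ comes from the theorem below):

```python
import numpy as np

def median_of_means(x, k):
    """Median-of-means estimate of the mean of x using k blocks."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(0)
    x = rng.permutation(x)            # random partition into blocks
    blocks = np.array_split(x, k)     # blocks of approximately equal size
    return np.median([b.mean() for b in blocks])

# Heavy-tailed but finite-variance data (Pareto tail index 2.5).
rng = np.random.default_rng(1)
samples = rng.pareto(2.5, size=10_000)
delta = 1e-3
k = int(np.ceil(8 * np.log(1 / delta)))  # choice from the theorem below
print(median_of_means(samples, k), samples.mean())
```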

Theorem: Let $X_1, \dots, X_n$ be $n$ iid samples with $\mathbb{E}[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2$. Assume $n = mk$ for positive integers $m$ and $k$. Then:

$$ \mathbb{P}\left\{\lvert \hat{\mu}_n - \mu \rvert > \sigma \sqrt{4/m} \right\} \leq e^{-k/8} $$

For any $\delta \in (0, 1)$, if $k = \lceil 8 \log(1/\delta) \rceil$, then with probability $1 - \delta$:

$$ \lvert \hat{\mu}_n - \mu \rvert \leq \sigma \sqrt{\frac{32 \log(1/\delta)}{n}} $$

Proof: By Chebyshev's inequality, each block mean satisfies, with probability at least $3/4$:

$$ \lvert Z_j - \mu \rvert \leq \sigma \sqrt{\frac{4}{m}} $$

Then, $\lvert \hat{\mu}_n - \mu \rvert > \sigma \sqrt{4/m}$ implies that at least $k/2$ of the block means satisfy $\lvert Z_j - \mu \rvert > \sigma \sqrt{4/m}$. Hence:

$$
\begin{aligned}
\mathbb{P}\left\{\lvert \hat{\mu}_n - \mu \rvert > \sigma \sqrt{\frac{4}{m}}\right\} &\leq \mathbb{P}\left\{\text{Bin}(k, 1/4) \geq \frac{k}{2}\right\}\\
&= \mathbb{P}\left\{\text{Bin}(k, 1/4) - \mathbb{E}[\text{Bin}(k, 1/4)] \geq \frac{k}{4}\right\}\\
&\leq e^{-k/8}
\end{aligned}
$$

Here the last line follows from Hoeffding's inequality. This theorem proves that the MOM estimator is sub-Gaussian with parameter $L = \sqrt{32}$.
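For reference, the Hoeffding step is the standard binomial tail bound: for any $t > 0$,

$$ \mathbb{P}\left\{\text{Bin}(k, p) - kp \geq kt\right\} \leq e^{-2kt^{2}} $$

applied with $p = 1/4$ and $t = 1/4$, which gives $e^{-2k/16} = e^{-k/8}$.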

Note that different confidence levels lead to different estimators because $k$ depends on $\delta$. However, if you only assume finite variance, then there do not exist sub-Gaussian estimators that work independently of $\delta$. An analogous result holds even for distributions with infinite variance, as long as $\mathbb{E}[\lvert X - \mu \rvert^{1 + \alpha}] < \infty$ for some $\alpha \in (0, 1]$.

The dependence of $k$ on $\delta$ is disappointing. As long as $X$ has a finite $2 + \alpha$ moment for some $\alpha > 0$, we get sub-Gaussian performance for more values of $k$. Consider the case where $\alpha = 1$.

Theorem: Let $X_1, \dots, X_n$ be $n$ iid samples with $\mathbb{E}[X_i] = \mu$, $\text{Var}(X_i) = \sigma^2$ and third central moment $\rho = \mathbb{E}[\lvert X - \mu \rvert^{3}]$. Let $m, k$ be positive integers such that $n = mk$. Assume that:

$$ \sqrt{\frac{\log(2/\delta)}{2k}} + \frac{\rho}{2\sigma^3 \sqrt{m}} \leq \frac{1}{4} \tag{3} $$

Then, the MOM estimator $\hat{\mu}_n$ with $k$ blocks satisfies, with probability $1 - \delta$:

$$ \lvert \hat{\mu}_n - \mu \rvert \leq \frac{1}{c}\left(\sigma\sqrt{\frac{\log\left(2 / \delta \right)}{2n}} + \frac{\rho k}{2\sigma^2 n} \right) $$

Here $c = \phi(\Phi^{-1}(3/4))$ is a constant, where $\phi$ and $\Phi$ denote the standard normal density and distribution functions.

In this case, the following choices of $k$ achieve sub-Gaussian performance:

$$ k \leq \frac{2\sigma^{3}}{\rho} \sqrt{n} $$

Note that this is independent of $\delta$, and so the bound holds simultaneously for all values of $\delta$ permitted by Equation 3. Here $k$ can be a constant fraction of $\sqrt{n}$, much larger than in the previous case. The result then holds simultaneously for all $\delta \geq e^{-c_0 \sqrt{n}}$.

Proof: Note that $\hat{\mu}_n \in [\mu - a, \mu + a]$ for $a > 0$ if:

$$ \frac{1}{k} \sum_{i = 1}^{k} \mathbb{I}(Z_{i} - \mu \geq -a) \geq \frac{1}{2} \quad \text{and} \quad \frac{1}{k} \sum_{i = 1}^{k} \mathbb{I}(Z_{i} - \mu \leq a) \geq \frac{1}{2} $$

We have:

$$
\begin{aligned}
\frac{1}{k} \sum_{i = 1}^{k} \mathbb{I}(Z_{i} - \mu \leq a) = &\frac{1}{k} \sum_{i = 1}^{k} \left\{\mathbb{I}(Z_{i} - \mu \leq a) - \mathbb{P}\{Z_i - \mu \leq a\}\right\}\\
&+ \mathbb{P}\left\{Z_1 - \mu \leq a\right\} - \mathbb{P}\left\{G\frac{\sigma}{\sqrt{m}} \leq a\right\} + \mathbb{P}\left\{G\frac{\sigma}{\sqrt{m}} \leq a\right\}
\end{aligned}
$$

Here $G \sim N(0, 1)$. By Hoeffding's inequality, we have with probability $1 - \delta/2$:

$$ \frac{1}{k} \sum_{i = 1}^{k} \left\{\mathbb{I}(Z_{i} - \mu \leq a) - \mathbb{P}\{Z_i - \mu \leq a\}\right\} \geq - \sqrt{\frac{\log(2/\delta)}{2k}} $$

The Berry-Esseen theorem is a refinement of the CLT which bounds the maximal difference between the distribution function of a standardized sample average and that of the standard normal to which it converges. It implies:

$$ \mathbb{P}\left\{Z_1 - \mu \leq a\right\} - \mathbb{P}\left\{G\frac{\sigma}{\sqrt{m}} \leq a\right\} \geq - \frac{\rho}{2\sigma^3\sqrt{m}} $$

Hence, with probability $1 - \delta/2$, we have:

$$ \frac{1}{k} \sum_{i = 1}^{k} \mathbb{I}(Z_{i} - \mu \leq a) \geq - \frac{\rho}{2\sigma^3\sqrt{m}} -\sqrt{\frac{\log(2/\delta)}{2k}} + \mathbb{P}\left\{G\frac{\sigma}{\sqrt{m}} \leq a\right\} $$

So the LHS is $\geq 1/2$ with probability $1 - \delta/2$ whenever $a$ is such that:

$$ \mathbb{P}\left\{G\frac{\sigma}{\sqrt{m}} \leq a\right\} \geq \frac{1}{2} + \frac{\rho}{2\sigma^3\sqrt{m}} + \sqrt{\frac{\log(2/\delta)}{2k}} $$

Take $\frac{\rho}{2\sigma^3\sqrt{m}} + \sqrt{\frac{\log(2/\delta)}{2k}} \leq 1/4$, which is Equation 3. Then the required inequality holds trivially for $a\sqrt{m}/\sigma \geq \Phi^{-1}(3/4)$. For $a\sqrt{m}/\sigma \leq \Phi^{-1}(3/4)$, concavity of $\Phi$ on $[0, \infty)$ and the fact that $c\Phi^{-1}(3/4) \leq 1/4$, where $c = \phi(\Phi^{-1}(3/4))$, give:

$$ \mathbb{P}\left\{G \leq a\frac{\sqrt{m}}{\sigma}\right\} \geq \frac{1}{2} + c\frac{a\sqrt{m}}{\sigma} $$

So, we take:

$$ a = \frac{\sigma}{c\sqrt{m}}\left(\frac{\rho}{2\sigma^3\sqrt{m}} + \sqrt{\frac{\log(2/\delta)}{2k}}\right) = \frac{1}{c}\left(\sigma\sqrt{\frac{\log\left(2 / \delta \right)}{2n}} + \frac{\rho k}{2\sigma^2 n} \right) $$

A symmetric argument handles the other condition ensuring $\hat{\mu}_n \in [\mu - a, \mu + a]$. The theorem follows.

Catoni’s Estimator

Catoni's estimator is an M-estimator developed in C2021. The idea is that the empirical mean $y = \bar{\mu}_n$ solves:

$$ \sum_{i = 1}^n (X_i - y) = 0 $$

One can replace the left-hand side with a strictly decreasing function of $y$:

$$ R_{n, \alpha}(y) = \sum_{i = 1}^n \psi(\alpha(X_i - y)) $$

where $\alpha > 0$ is a parameter and $\psi: \mathbb{R} \rightarrow \mathbb{R}$ is an antisymmetric increasing function. If $\psi(x)$ grows much more slowly than $x$, then the effect of outliers is diminished. This idea is similar to the Huber loss, often used as a replacement for squared loss in regression. One choice of $\psi$ is as follows:

$$ \psi(x) = \begin{cases} \log (1 + x + x^2/2) & \text{ if } x \geq 0,\\ -\log (1 - x + x^2/2) & \text{ if } x < 0 \end{cases} $$

Catoni's estimator $\hat{\mu}$ is defined as the value of $y$ such that $R_{n, \alpha}(y) = 0$. Since $\psi(x) \leq \log(1 + x + x^2/2)$ for all $x \in \mathbb{R}$, assuming that $\text{Var}(X) = \sigma^2$, we have:

$$
\begin{aligned}
\mathbb{E}\left[e^{R_{n, \alpha}(y)} \right] &\leq \left(\mathbb{E}\left[1 + \alpha(X - y) + \frac{\alpha^2 (X - y)^2}{2}\right]\right)^{n}\\
&= \left(1 + \alpha(\mu - y) + \frac{\alpha^2 (\sigma^2 + (\mu - y)^2)}{2}\right)^n\\
&\leq \exp\left(n\alpha(\mu - y) + \frac{n\alpha^2 (\sigma^2 + (\mu - y)^2)}{2}\right)
\end{aligned}
$$

The last line follows from $1 + x \leq \exp(x)$. Then, by Markov's inequality:

$$ \mathbb{P}\left\{R_{n, \alpha}(y) \geq n\alpha(\mu - y) + \frac{n\alpha^2 (\sigma^2 + (\mu - y)^2)}{2} + \log(1/\delta) \right\} \leq \delta $$

Define:

$$ h(y) = n\alpha(\mu - y) + \frac{n\alpha^2 (\sigma^2 + (\mu - y)^2)}{2} + \log(1/\delta) $$

This is a quadratic polynomial in $y$ and has at least one root whenever $\alpha$ is such that $\alpha^2\sigma^2 + \frac{2\log(1/\delta)}{n} \leq 1$. Let $y_{+} = \mu + g(\alpha, n, \delta, \sigma^2)$ be the smaller root of $h$. Taking $y = y_{+}$, we have $R_{n, \alpha}(y_{+}) < 0$ with probability $1 - \delta$; since $R_{n, \alpha}(y)$ is strictly decreasing, this gives $\hat{\mu} < y_{+}$. Applying the same argument to the lower tail of $R_{n, \alpha}$ gives $\hat{\mu} > y_{-} = \mu - g(\alpha, n, \delta, \sigma^2)$. Combining these two inequalities implies that, with probability $1 - 2\delta$, we have:

$$ \lvert \hat{\mu}_{n, \alpha} - \mu \rvert \leq g(\alpha, n, \delta, \sigma^2) $$

Finally, we optimize $g(\alpha, n, \delta, \sigma^2)$ with respect to $\alpha$.

Theorem: Let $X_{1}, \dots, X_{n}$ be iid random samples with $\mathbb{E}[X] = \mu$ and $\text{Var}(X) = \sigma^2$. Let $\delta \in (0, 1)$ be such that $n \geq 2\log(1/\delta)$, and set:

$$ \alpha = \sqrt{\frac{2\log(1/\delta)}{n\sigma^2\left(1 + \frac{2\log(1/\delta)}{n - 2\log(1/\delta)}\right)}} $$

Then Catoni's estimator $\hat{\mu}_{n, \alpha}$ satisfies, with probability $1 - 2\delta$:

$$ \lvert \hat{\mu}_{n, \alpha} - \mu \rvert \leq \sqrt{\frac{2\sigma^2 \log(1/\delta)}{n - 2\log(1/\delta)}} $$

Hence we get a sub-Gaussian estimator; in fact, the constant $\sqrt{2}$ is optimal. However, unlike MOM, it relies on knowledge of $\sigma^2$ to choose $\alpha$. We can replace $\sigma^2$ with an upper bound $\sigma^2 \leq v$. If no knowledge of $\sigma^2$ is available, one can use Lepski's method to adaptively select $\alpha$ from the data. Again, the estimator depends on $\delta$. It can be made independent of $\delta$ by setting $\alpha = \sqrt{2/(n\sigma^2)}$, but the estimator then becomes sub-exponential instead of sub-Gaussian.
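A minimal Python sketch of the estimator, assuming the variance (or an upper bound $v$) is known. Since $R_{n, \alpha}$ is strictly decreasing in $y$, the root can be found by bisection; the function names and the Student-$t$ example are my own illustrative choices:

```python
import numpy as np

def psi(x):
    # Catoni's influence function: antisymmetric, grows only logarithmically.
    return np.where(x >= 0,
                    np.log1p(x + x**2 / 2),
                    -np.log1p(-x + x**2 / 2))

def catoni(x, var, delta):
    """Catoni's mean estimator with alpha chosen as in the theorem above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    log_term = np.log(1 / delta)
    alpha = np.sqrt(2 * log_term /
                    (n * var * (1 + 2 * log_term / (n - 2 * log_term))))

    def R(y):
        return np.sum(psi(alpha * (x - y)))

    # R is strictly decreasing in y, so bisection on [min(x), max(x)] works.
    lo, hi = x.min(), x.max()
    for _ in range(100):
        mid = (lo + hi) / 2
        if R(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

rng = np.random.default_rng(0)
samples = rng.standard_t(df=3, size=5_000)   # heavy tails, finite variance
print(catoni(samples, var=3.0, delta=1e-3))  # Var of t_3 is 3/(3 - 2) = 3
```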

Trimmed Mean

The trimmed mean is the most intuitive estimator which achieves sub-Gaussian rates. Consider a simple variant which splits the data into two halves. Say you have $2n$ samples: $X_{1}, \dots, X_{n}$ and $Y_{1}, \dots, Y_{n}$. Define the truncation function:

$$ \phi_{\alpha, \beta}(x) = \begin{cases} \beta &\text{ if } x > \beta,\\ x &\text{ if } x \in [\alpha, \beta],\\ \alpha &\text{ if } x < \alpha \end{cases} $$

For $x_{1}, \dots, x_{m} \in \mathbb{R}$, we denote the sorted order as $x_{1}^{\ast} \leq x_{2}^{\ast} \leq \dots \leq x_{m}^{\ast}$. Given a confidence level $\delta \geq 8e^{-3n/16}$, set:

$$ \epsilon = \frac{16\log(8/\delta)}{3n} $$

Let $\alpha = Y_{\epsilon n}^{\ast}$ and $\beta = Y_{(1 - \epsilon)n}^{\ast}$; the estimator is:

$$ \hat{\mu}_{2n} = \frac{1}{n}\sum_{i=1}^n \phi_{\alpha, \beta}(X_i) $$
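A minimal Python sketch of this two-sample trimmed mean (the rounding of $\epsilon n$ to integer order-statistic indices is an implementation choice not pinned down above, and the function name and example data are mine):

```python
import numpy as np

def trimmed_mean(x, y, delta):
    """Two-sample trimmed mean: truncation levels from y, averaging over x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    eps = 16 * np.log(8 / delta) / (3 * n)
    y_sorted = np.sort(y)
    # Empirical eps and (1 - eps) quantiles of the second half of the data.
    lo = y_sorted[int(np.floor(eps * n))]
    hi = y_sorted[int(np.ceil((1 - eps) * n)) - 1]
    return np.clip(x, lo, hi).mean()

rng = np.random.default_rng(0)
data = rng.standard_t(df=3, size=20_000)   # heavy tails, finite variance
x, y = data[:10_000], data[10_000:]
print(trimmed_mean(x, y, delta=1e-3))
```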

Theorem: Let $X_{1}, \dots, X_{n}, Y_{1}, \dots, Y_{n}$ be iid copies of $X$ with $\mathbb{E}[X] = \mu$ and $\text{Var}(X) = \sigma^2$. Let $\delta \in (0, 1)$ be such that $n > (16/3)\log(8/\delta)$. Then, with probability $1 - \delta$:

$$ \lvert \hat{\mu}_{2n} - \mu \rvert \leq 9\sigma \sqrt{\frac{\log(8/\delta)}{n}} $$

Proof: The proof starts by showing the truncation levels are close to the true quantiles. For $p \in (0, 1)$, define the quantiles:

$$ Q_{p} = \sup\{M \in \mathbb{R} \mid \mathbb{P}\{X \geq M\} \geq 1 - p\} $$

Assume that $X$ has a non-atomic distribution. Then $\mathbb{P}\{X > Q_p\} = \mathbb{P}\{X \geq Q_p\} = 1 - p$. Applying Bernstein's inequality, with probability at least $1 - 2\exp(-(3/16)\epsilon n)$ we have:

$$ \lvert \{i \in [n] : Y_{i} \geq Q_{1 - 2\epsilon} \} \rvert \geq \epsilon n \quad \text{ and } \quad \lvert \{i \in [n] : Y_{i} \leq Q_{1 - \epsilon/2} \} \rvert \geq (1-\epsilon) n $$

So, with probability $1 - 2\exp(-(3/16)\epsilon n)$, we have:

$$ Q_{1 - 2\epsilon} \leq Y^{\ast}_{(1 - \epsilon)n} \leq Q_{1 - \epsilon/2} \tag{4} $$

A symmetric argument yields:

$$ Q_{\epsilon/2} \leq Y^{\ast}_{\epsilon n} \leq Q_{2 \epsilon} \tag{5} $$

Next, we show that $\lvert \mathbb{E}\, \phi_{\alpha, \beta}(X) - \mu \rvert$ is small and that the estimator concentrates around its mean. Consider the event $E$ that both Equations 4 and 5 hold. We have $\mathbb{P}(E) \geq 1 - 4\exp(-(3/16)\epsilon n)$. Conditional on $E$:

$$
\begin{aligned}
\lvert \mathbb{E}&\left[\phi_{\alpha, \beta}(X) \mid Y_{1}, \dots, Y_{n}\right] - \mu \rvert\\
&\leq \lvert \mathbb{E}\left[(X - \alpha) \mathbb{I}(X \leq \alpha) \mid Y_{1}, \dots, Y_{n}\right] \rvert + \lvert \mathbb{E}\left[(X - \beta) \mathbb{I}(X \geq \beta) \mid Y_{1}, \dots, Y_{n}\right] \rvert\\
&\leq \lvert \mathbb{E}\left[(X - Q_{2\epsilon }) \mathbb{I}(X \leq Q_{2 \epsilon}) \right] \rvert + \lvert \mathbb{E}\left[(X - Q_{1 - 2\epsilon}) \mathbb{I}(X \geq Q_{1 - 2\epsilon})\right] \rvert
\end{aligned}
$$

By Chebyshev’s inequality:

$$ 2 \epsilon = \mathbb{P}\{X \geq Q_{1 - 2\epsilon} \} \leq \frac{\sigma^{2}}{(Q_{1 - 2\epsilon} - \mu)^{2}} \Longrightarrow Q_{1 - 2\epsilon} \leq \mu + \frac{\sigma}{\sqrt{2 \epsilon}} $$

Now, by Cauchy-Schwarz:

$$
\begin{aligned}
\lvert \mathbb{E}(X - Q_{1 - 2\epsilon}) \mathbb{I}(X \geq Q_{1 - 2 \epsilon})\rvert &= \lvert \mathbb{E}\left[((X - \mu) - (Q_{1 - 2\epsilon} - \mu)) \mathbb{I}(X \geq Q_{1 - 2 \epsilon})\right]\rvert\\
&\leq \mathbb{E}\lvert (X - \mu)\mathbb{I}(X \geq Q_{1 - 2 \epsilon}) \rvert + (Q_{1 - 2 \epsilon} - \mu)\,\mathbb{P}\{X \geq Q_{1 - 2\epsilon}\}\\
&\leq \sigma \sqrt{\mathbb{P}\{X \geq Q_{1 - 2\epsilon}\}} + 2\epsilon (Q_{1 - 2 \epsilon} - \mu)\\
&\leq \sigma \sqrt{8\epsilon}
\end{aligned}
$$

Similarly, we have $\lvert \mathbb{E}\left[(X - Q_{2\epsilon}) \mathbb{I}(X \leq Q_{2\epsilon})\right] \rvert \leq \sigma \sqrt{8\epsilon}$. So, conditional on $E$, we have:

$$ \lvert \mathbb{E}\left[\phi_{\alpha, \beta}(X) \mid Y_{1}, \dots, Y_{n}\right] - \mu \rvert \leq \sigma \sqrt{32 \epsilon} \leq 6 \sigma \sqrt{\frac{\log(8 / \delta)}{n}} \tag{6} $$

where the last inequality comes from the choice of $\epsilon$. Next, note:

$$
\begin{aligned}
Z &= \frac{1}{n} \sum_{i = 1}^n \phi_{\alpha, \beta}(X_i) - \mathbb{E}\left[\phi_{\alpha, \beta}(X) \mid Y_{1}, \dots, Y_{n}\right]\\
&= \frac{1}{n} \sum_{i = 1}^n \phi_{\alpha - \mu, \beta - \mu}(X_i - \mu) - \mathbb{E}\left[\phi_{\alpha - \mu, \beta - \mu}(X - \mu) \mid Y_{1}, \dots, Y_{n}\right]
\end{aligned}
$$

Hence, conditional on $E$ (which depends only on $Y_{1}, \dots, Y_{n}$), $Z$ is the average of centered random variables that are bounded pointwise by:

$$ M = \max\{\lvert Q_{\epsilon/2} - \mu \rvert, \lvert Q_{1 - \epsilon/2} - \mu \rvert\} \leq \sigma \sqrt{2/\epsilon} $$

Hence, with probability at least $1 - \delta/2$, Bernstein's inequality implies:

$$ Z \leq \sigma \sqrt{\frac{2 \log(2/\delta)}{n}} + \frac{\log(2/\delta)\,\sigma \sqrt{2/\epsilon}}{n} \leq 3\sigma \sqrt{\frac{\log(2 / \delta)}{n}} \tag{7} $$

Putting together Equation 6 and Equation 7, the result follows.

A unique property of the trimmed mean is that it is also robust to adversarial contamination.

Impossibility of δ-Independent Estimators

DLLL2020 prove that $\delta$-independent estimators are impossible without assumptions beyond finite variance. A simple result is as follows.

Theorem: For all $L \geq 0$ and every sample size $n$, no estimator can be simultaneously $L$-sub-Gaussian for both $\delta_1 = 1/(2e\sqrt{L^3 + 1})$ and $\delta_2 = 2e^{-L^4/4}$ for all distributions with finite second moments.

Proof: The proof follows by considering the restricted class of Poisson distributions. Assume, by way of contradiction, that there exists an estimator $\hat{\mu}$ that is $L$-sub-Gaussian for both $\delta_1$ and $\delta_2$ for all Poisson distributions. Let $X_{1}, \dots, X_{n}$ be iid Poisson with parameter $1/n$ and let $Y_{1}, \dots, Y_{n}$ be iid Poisson with parameter $c/n$, where we set $c = L^3 + 1$. For simplicity, assume $c$ is an integer. By sub-Gaussianity:

$$ \mathbb{P}\left\{\hat{\mu}_n(Y_1, \dots, Y_n) < \frac{c}{n} - \frac{L}{n} \sqrt{c \log\frac{1}{\delta_1}}\right\} \leq \delta_1 \tag{8} $$

Now, we can lower bound the LHS by:

$$
\begin{aligned}
\mathbb{P}&\left\{\hat{\mu}_n(Y_1, \dots, Y_n) < \frac{c}{n} - \frac{L}{n} \sqrt{c \log\frac{1}{\delta_{1}}}\right\}\\
&\geq \mathbb{P}\left\{\hat{\mu}_n(Y_1, \dots, Y_n) < \frac{c}{n} - \frac{L}{n} \sqrt{c \log\frac{1}{\delta_{1}}}, \sum_{i=1}^n Y_i = c\right\}\\
&\geq \frac{1}{e\sqrt{c}} \mathbb{P}\left\{\hat{\mu}_n(Y_1, \dots, Y_n) < \frac{c}{n} - \frac{L}{n} \sqrt{c \log\frac{1}{\delta_{1}}} \mid \sum_{i=1}^n Y_i = c\right\}
\end{aligned}
$$

Here the last line follows from the fact that $\sum_{i}Y_i$ is Poisson with parameter $c$, together with Stirling's formula. Note that the joint distribution of $n$ independent $\text{Poiss}(\lambda)$ random variables, conditioned on the event that their sum equals $c$, depends only on $c$ and not on $\lambda$. So:

$$
\begin{aligned}
\mathbb{P}&\left\{\hat{\mu}_n(Y_1, \dots, Y_n) < \frac{c}{n} - \frac{L}{n} \sqrt{c \log\frac{1}{\delta_{1}}} \mid \sum_{i=1}^n Y_i = c\right\}\\
&=\mathbb{P}\left\{\hat{\mu}_n(X_1, \dots, X_n) < \frac{c}{n} - \frac{L}{n} \sqrt{c \log\frac{1}{\delta_{1}}} \mid \sum_{i=1}^n X_i = c\right\}
\end{aligned}
$$

Combined with Equation 8 and $\delta_1 = 1/(2e\sqrt{c})$, we have:

$$
\begin{aligned}
\frac{1}{2} &= 1 - e\sqrt{c}\,\delta_1\\
&\leq \frac{\mathbb{P}\left\{\hat{\mu}_n(X_1, \dots, X_n) \geq \frac{c}{n} - \frac{L}{n} \sqrt{c \log\frac{1}{\delta_1}}, \sum_{i=1}^n X_i = c\right\}}{\mathbb{P}\left\{\sum_{i = 1}^n X_i =c\right\}}\\
&\leq ec!\,\mathbb{P}\left\{\hat{\mu}_n(X_1, \dots, X_n) \geq \frac{c}{n} - \frac{L}{n} \sqrt{c \log\frac{1}{\delta_1}}\right\}\\
&\leq ec!\,\mathbb{P}\left\{\hat{\mu}_n(X_1, \dots, X_n) \geq \frac{1}{n} + \frac{c-1}{n} - \frac{L}{n} \sqrt{c \log\frac{1}{\delta_1}}\right\}\\
&\leq ec!\,\mathbb{P}\left\{\hat{\mu}_n(X_1, \dots, X_n) \geq \frac{1}{n} + \frac{c-1}{2n}\right\}
\end{aligned}
$$

The third inequality uses that, for our choice of $\delta_1$, whenever $L \geq 10$:

$$ \frac{L}{n} \sqrt{c \log\frac{1}{\delta_1}} \leq \frac{c - 1}{n} $$

By sub-Gaussianity, we also have, for $\delta_2 = 2e^{-L^4/4}$:

$$ \mathbb{P}\left\{\hat{\mu}_n(X_1, \dots, X_n) \geq \frac{1}{n} + \frac{c-1}{2n}\right\} = \mathbb{P}\left\{\hat{\mu}_n(X_1, \dots, X_n) \geq \frac{1}{n} + \frac{L}{n}\sqrt{\log (2/\delta_2)}\right\} \leq \delta_2 $$

So we have $1/2 \leq ec!\,\delta_2 = 2ec!\,e^{-L^4/4}$. But explicitly computing the RHS shows that it is less than $1/2$ for $L \geq 50$. Hence we have a contradiction.

A caveat to the above analysis is that Lepski's method can be used if one knows more than just that the variance is finite; specifically, if one has access to non-trivial upper and lower bounds on the variance. Also, as shown in a previous section, the existence of higher moments suffices for such $\delta$-independent estimators.