LM2019 is a nice survey of recent results in robust mean estimation and regression. I often learn best by writing out results, so here I will document some of the key proofs from the survey.
Set-up
Say we observe $n$ iid samples $X_1,\dots,X_n$ with mean $\mathbb{E}[X_1]=\mu$. The goal is to develop estimators $\hat{\mu}_n$ which are close to $\mu$ with high probability in a non-asymptotic sense. That is, we want to find the smallest possible value of $\epsilon=\epsilon(n,\delta)$ such that:
$$P\left(|\hat{\mu}_n-\mu|>\epsilon\right)\leq\delta$$
Sub-Gaussian RVs
Here is a summary of sub-Gaussian RVs. A random variable $X$ with $\mathbb{E}[X]=\mu$ is sub-Gaussian with parameter $s$ if for all $\lambda\in\mathbb{R}$:
$$\mathbb{E}\left[e^{\lambda(X-\mu)}\right]\leq e^{s^2\lambda^2/2}$$
Plugging this into a Chernoff bound gives the following concentration inequality for all $t>0$:
$$P\left(|X-\mu|\geq t\right)\leq 2e^{-t^2/(2s^2)}$$
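To spell out the Chernoff step: for any $\lambda>0$,
$$P(X-\mu\geq t)\leq e^{-\lambda t}\,\mathbb{E}\left[e^{\lambda(X-\mu)}\right]\leq e^{-\lambda t+s^2\lambda^2/2}$$
Optimizing over $\lambda$ (the minimum is at $\lambda=t/s^2$) gives $e^{-t^2/(2s^2)}$; the same argument applied to $-(X-\mu)$ plus a union bound produces the factor of $2$.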
We know $s=1$ for $X\sim N(0,1)$. Any sub-Gaussian RV also obeys, for all $p\geq 2$:
$$\left(\mathbb{E}|X-\mu|^p\right)^{1/p}\leq s'\sqrt{p}\tag{1}$$
for some absolute constants $c_1$ and $c_2$ such that $c_1 s\leq s'\leq c_2 s$. Note that this implies the existence of all higher-order moments.
Empirical Mean + CLT
The most natural estimator is the empirical mean:
$$\bar{\mu}_n=\frac{1}{n}\sum_{i=1}^n X_i$$
By the CLT, we have:
$$\lim_{n\to\infty}P\left\{|\bar{\mu}_n-\mu|>\frac{\sigma\,\Phi^{-1}(1-\delta/2)}{\sqrt{n}}\right\}\leq\delta$$
Taking $t=\Phi^{-1}(1-\delta/2)$ and plugging into the upper deviation inequality for standard normals, we get $\Phi^{-1}(1-\delta/2)\leq\sqrt{2\log(2/\delta)}$. It follows:
$$\lim_{n\to\infty}P\left\{|\bar{\mu}_n-\mu|>\frac{\sigma\sqrt{2\log(2/\delta)}}{\sqrt{n}}\right\}\leq\delta$$
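As a quick numerical sanity check on that quantile bound (my own check using scipy, not part of the survey):

```python
# Verify numerically that Phi^{-1}(1 - delta/2) <= sqrt(2 log(2/delta)).
import numpy as np
from scipy.stats import norm

for delta in [0.1, 0.01, 1e-4, 1e-8]:
    quantile = norm.ppf(1 - delta / 2)      # Phi^{-1}(1 - delta/2)
    bound = np.sqrt(2 * np.log(2 / delta))  # sqrt(2 log(2/delta))
    print(f"delta={delta:g}: quantile={quantile:.3f} <= bound={bound:.3f}")
```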
Non-Asymptotic Sub-Gaussian Estimators
We are interested in developing non-asymptotic bounds with rates similar to those given by the CLT. We will call an estimator $\hat{\mu}_n$ sub-Gaussian with parameter $L$ if, with probability $1-\delta$:
$$|\hat{\mu}_n-\mu|\leq\frac{L\sigma\sqrt{\log(2/\delta)}}{\sqrt{n}}\tag{2}$$
Alternatively, setting the RHS to t, we can say:
$$P\left(|\hat{\mu}_n-\mu|\geq t\right)\leq 2\exp\left(-\frac{t^2 n}{L^2\sigma^2}\right)$$
Minimax Error Rate
Such an error rate is minimax even for a fixed confidence level $\delta$.
Theorem: Let $\mu\in\mathbb{R}$, $\sigma>0$ and $\delta\in(2e^{-n/4},1/2)$. For any estimator $\hat{\mu}_n$, there exists a distribution with mean $\mu$ and variance $\sigma^2$ such that:
$$P\left\{|\hat{\mu}_n-\mu|>\frac{\sigma\sqrt{\log(1/\delta)}}{\sqrt{n}}\right\}\geq\delta$$
Proof: Consider two distributions $P_+$ and $P_-$ concentrated on two points:
$$P_+(\{0\})=P_-(\{0\})=1-p,\qquad P_+(\{c\})=P_-(\{-c\})=p$$
where $p\in[0,1]$ and $c>0$. We have $\mu_{P_+}=pc$, $\mu_{P_-}=-pc$ and $\sigma^2_{P_-}=\sigma^2_{P_+}=c^2p(1-p)$. Now consider $n$ independent $(X_i,Y_i)$ pairs such that:
$$P\{X_i=Y_i=0\}=1-p,\qquad P\{X_i=c,\,Y_i=-c\}=p$$
Note that $X_i\sim P_+$ and $Y_i\sim P_-$. Let $\delta\in(0,1/2)$. If $\delta\geq 2e^{-n/4}$ and $p=\frac{1}{2n}\log(2/\delta)$, then using $1-p\geq\exp(-p/(1-p))$:
$$P\{X_1^n=Y_1^n\}=(1-p)^n\geq 2\delta$$
Let $\hat{\mu}_n$ be any mean estimator, possibly depending on $\delta$. On the event $\{X_1^n=Y_1^n\}$ the estimator outputs the same value whether the sample came from $P_+$ or $P_-$, while the two means are $2pc$ apart, so the estimate must be at distance at least $pc$ from one of them. Hence:
$$P_{P_+}\left\{|\hat{\mu}_n-\mu_{P_+}|\geq pc\right\}+P_{P_-}\left\{|\hat{\mu}_n-\mu_{P_-}|\geq pc\right\}\geq P\{X_1^n=Y_1^n\}\geq 2\delta$$
so under at least one of $P_+$ or $P_-$ the estimator errs by at least $pc=\sigma\sqrt{p/(1-p)}\geq\sigma\sqrt{\log(2/\delta)/(2n)}$ with probability at least $\delta$. Hence, for the choices of $\delta$ above, the best one can do uniformly over both $P_+$ and $P_-$ is a sub-Gaussian error rate.
Empirical Mean + Sub-Gaussian RVs
For any sub-Gaussian RV, a Chernoff bound shows that the empirical mean $\bar{\mu}_n$ is sub-Gaussian for some $L$. However, as noted around Equation 1, sub-Gaussianity is a strong assumption: it forces rapid tail decay and the existence of all moments. If you only assume finite variance, then the standard tool is Chebyshev's inequality: with probability $1-\delta$,
$$|\bar{\mu}_n-\mu|\leq\sigma\sqrt{\frac{1}{n\delta}}$$
Although this gets the correct $O(n^{-1/2})$ rate, compared to Equation 2 the dependence on $\delta$ is exponentially worse ($1/\sqrt{\delta}$ versus $\sqrt{\log(2/\delta)}$). This matters in settings where we want to estimate many means simultaneously and have to apply a union bound.
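To make the gap concrete, here is a small comparison of the two widths (my own illustration; the choices $n=10{,}000$, $\sigma=1$ and $L=\sqrt{2}$ are arbitrary):

```python
import numpy as np

n, sigma, L = 10_000, 1.0, np.sqrt(2)  # illustrative values, not from the survey
for delta in [1e-2, 1e-4, 1e-8]:
    chebyshev = sigma * np.sqrt(1 / (n * delta))               # finite-variance bound
    sub_gaussian = L * sigma * np.sqrt(np.log(2 / delta) / n)  # Equation 2
    print(f"delta={delta:g}: Chebyshev={chebyshev:.3f}, sub-Gaussian={sub_gaussian:.4f}")
```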
Median of Means
Using only the assumption of finite variance, the first estimator which achieves sub-Gaussian rates is the median of means (MOM). The core idea is to split the samples into $k$ approximately equal blocks, compute the mean of each block, and then take the median of the computed values. Formally, let $1\leq k\leq n$ and partition $[n]=\{1,\dots,n\}$ into $k$ blocks $B_1,\dots,B_k$ of size $|B_j|\geq\lfloor n/k\rfloor\geq 2$. For each $j\in[k]$, compute:
$$Z_j=\frac{1}{|B_j|}\sum_{i\in B_j}X_i$$
and define the MOM estimator $\hat{\mu}_n=M(Z_1,\dots,Z_k)$, where $M$ denotes the median operator. Each block mean is unbiased for $\mu$ with standard deviation of order $\sigma/\sqrt{n/k}$, so the median of the distribution of the block means lies within $O(\sigma/\sqrt{n/k})$ of $\mu$. Finally, the empirical median concentrates around this population median.
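Here is a minimal sketch of the estimator in Python (my own illustration, not code from the survey; dropping leftover samples when $k$ does not divide $n$ is just one convenient convention):

```python
import numpy as np

def median_of_means(x, k):
    """Median-of-means estimate of the mean of x using k blocks."""
    x = np.asarray(x, dtype=float)
    m = len(x) // k                      # block size; leftover samples are dropped
    blocks = x[: m * k].reshape(k, m)    # k blocks of m samples each
    return np.median(blocks.mean(axis=1))

# Example on heavy-tailed data with mean 0 (Student-t, finite variance).
rng = np.random.default_rng(0)
data = rng.standard_t(df=2.5, size=10_000)
k = int(np.ceil(8 * np.log(1 / 0.01)))   # k = ceil(8 log(1/delta)) for delta = 0.01, as in the theorem below
print(median_of_means(data, k), data.mean())
```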
Theorem: Let $X_1,\dots,X_n$ be $n$ iid samples with $\mathbb{E}[X_i]=\mu$ and $\mathrm{Var}(X_i)=\sigma^2$. Assume $n=mk$ for positive integers $m$ and $k$. Then:
$$P\left\{|\hat{\mu}_n-\mu|>\sigma\sqrt{4/m}\right\}\leq e^{-k/8}$$
For any $\delta\in(0,1)$, if $k=\lceil 8\log(1/\delta)\rceil$, then with probability $1-\delta$:
$$|\hat{\mu}_n-\mu|\leq\sigma\sqrt{\frac{32\log(1/\delta)}{n}}$$
Proof: Fix a block $j$. By Chebyshev's inequality, with probability at least $3/4$:
$$|Z_j-\mu|\leq\sigma\sqrt{\frac{4}{m}}$$
Then, $|\hat{\mu}_n-\mu|>\sigma\sqrt{4/m}$ implies that at least $k/2$ of the block means satisfy $|Z_j-\mu|>\sigma\sqrt{4/m}$. Hence:
$$P\left\{|\hat{\mu}_n-\mu|>\sigma\sqrt{4/m}\right\}\leq P\left\{\sum_{j=1}^k\mathbb{I}\left(|Z_j-\mu|>\sigma\sqrt{4/m}\right)\geq\frac{k}{2}\right\}\leq P\left\{\mathrm{Bin}(k,1/4)\geq\frac{k}{2}\right\}\leq e^{-k/8}$$
Here the last inequality follows from Hoeffding's inequality. This theorem proves that the MOM estimator is sub-Gaussian with parameter $L=\sqrt{32}$.
Note that different confidence levels lead to different estimators because $k$ depends on $\delta$. However, if you only assume finite variance, then there do not exist sub-Gaussian estimators that work independently of $\delta$ (more on this in the last section). An analogous result holds even for distributions with infinite variance, as long as $\mathbb{E}[|X-\mu|^{1+\alpha}]<\infty$ for some $\alpha\in(0,1]$.
The dependence of $k$ on $\delta$ is disappointing. However, as long as $X$ has a finite $(2+\alpha)$-th moment for some $\alpha>0$, we get sub-Gaussian performance for a much wider range of $k$. Consider the case $\alpha=1$.
Theorem: Let $X_1,\dots,X_n$ be $n$ iid samples with $\mathbb{E}[X_i]=\mu$, $\mathrm{Var}(X_i)=\sigma^2$, and third central moment $\rho=\mathbb{E}[|X-\mu|^3]$. Let $m,k$ be positive integers such that $n=mk$. Assume that:
$$\sqrt{\frac{\log(2/\delta)}{2k}}+\frac{\rho}{2\sigma^3\sqrt{m}}\leq\frac{1}{4}\tag{3}$$
Then, the MOM estimator $\hat{\mu}_n$ with $k$ blocks satisfies, with probability $1-\delta$:
$$|\hat{\mu}_n-\mu|\leq\frac{1}{c}\left(\sigma\sqrt{\frac{\log(2/\delta)}{2n}}+\frac{\rho k}{2\sigma^2 n}\right)$$
Here $c=\phi(\Phi^{-1}(3/4))$ is a constant, where $\phi$ and $\Phi$ denote the standard normal density and distribution functions.
In this case, the following choice of $k$ gives sub-Gaussian performance:
$$k\leq\frac{\sigma^3\sqrt{2n}}{\rho}$$
Note that this is independent of $\delta$, so the bound holds simultaneously for all values of $\delta$ permitted by Equation 3. Here $k$ can be a constant fraction of $\sqrt{n}$, much larger than the $O(\log(1/\delta))$ of the previous case. Then, the result holds simultaneously for all $\delta\geq e^{-c_0\sqrt{n}}$.
Proof: Fix $a>0$ and note that $\hat{\mu}_n\leq\mu+a$ whenever $\frac{1}{k}\sum_{i=1}^k\mathbb{I}(Z_i-\mu\leq a)\geq\frac{1}{2}$. We will compare $P\{Z_i-\mu\leq a\}$ with $P\left\{G\frac{\sigma}{\sqrt{m}}\leq a\right\}$, where $G\sim N(0,1)$. By Hoeffding's inequality, we have with probability $1-\delta/2$:
$$\frac{1}{k}\sum_{i=1}^k\left\{\mathbb{I}(Z_i-\mu\leq a)-P\{Z_i-\mu\leq a\}\right\}\geq-\sqrt{\frac{\log(2/\delta)}{2k}}$$
The Berry-Esseen theorem is a refinement of the CLT which bounds the maximal approximation error between the distribution of a sample average and the normal distribution to which it converges. It implies:
$$P\{Z_i-\mu\leq a\}\geq P\left\{G\frac{\sigma}{\sqrt{m}}\leq a\right\}-\frac{\rho}{2\sigma^3\sqrt{m}}$$
So, $\frac{1}{k}\sum_{i=1}^k\mathbb{I}(Z_i-\mu\leq a)\geq\frac{1}{2}$ with probability $1-\delta/2$ whenever we choose $a$ such that:
$$P\left\{G\frac{\sigma}{\sqrt{m}}\leq a\right\}\geq\frac{1}{2}+\frac{\rho}{2\sigma^3\sqrt{m}}+\sqrt{\frac{\log(2/\delta)}{2k}}$$
Now assume Equation 3, that is, $\varepsilon:=\frac{\rho}{2\sigma^3\sqrt{m}}+\sqrt{\frac{\log(2/\delta)}{2k}}\leq\frac{1}{4}$, and take $a=\frac{\sigma\varepsilon}{c\sqrt{m}}$ where $c=\phi(\Phi^{-1}(3/4))$. If $a\sqrt{m}/\sigma\geq\Phi^{-1}(3/4)$, the inequality above holds trivially, since its right-hand side is at most $3/4$. Otherwise, note that $\Phi(t)\geq\frac{1}{2}+ct$ for all $0\leq t\leq\Phi^{-1}(3/4)$, because the standard normal density is at least $c$ on that interval; applying this with $t=a\sqrt{m}/\sigma=\varepsilon/c$ again gives the inequality above. Hence:
$$a=\frac{\sigma\varepsilon}{c\sqrt{m}}=\frac{1}{c}\left(\frac{\rho k}{2\sigma^2 n}+\sigma\sqrt{\frac{\log(2/\delta)}{2n}}\right)$$
ensures $\hat{\mu}_n\leq\mu+a$ with probability at least $1-\delta/2$. A symmetric argument for the lower tail ensures $\hat{\mu}_n\in[\mu-a,\mu+a]$ with probability $1-\delta$. The theorem follows.
Catoni’s Estimator
Catoni’s estimator is an M-estimator developed in C2021. The idea is that the empirical mean y=μˉn solves:
i=1∑n(Xi−y)=0
One can replace the left-hand side with a strictly decreasing function of y:
Rn,α(y)=i=1∑nψ(α(Xi−y))
where $\alpha\in\mathbb{R}$ is a parameter and $\psi:\mathbb{R}\to\mathbb{R}$ is an antisymmetric increasing function. If $\psi(x)$ grows much more slowly than $x$, then the effect of outliers is diminished. This idea is similar to the Huber loss, often used as a replacement for the squared loss in regression. One choice of $\psi$ is as follows:
$$\psi(x)=\begin{cases}\log(1+x+x^2/2) & \text{if } x\geq 0,\\ -\log(1-x+x^2/2) & \text{if } x<0\end{cases}$$
Catoni’s estimator μ^ is just defined as the value of y such that Rn,α(y)=0. Since ϕ(x)≤x for all x∈R, assuming that Var(X)=σ2, we have:
The last line follows from 1+x≤exp(x). Then, by Markov’s inequality:
$$P\left\{R_{n,\alpha}(y)\geq n\alpha(\mu-y)+\frac{n\alpha^2\left(\sigma^2+(\mu-y)^2\right)}{2}+\log(1/\delta)\right\}\leq\delta$$
Define:
$$h(y)=n\alpha(\mu-y)+\frac{n\alpha^2\left(\sigma^2+(\mu-y)^2\right)}{2}+\log(1/\delta)$$
This is a quadratic polynomial in $y$ and has real roots whenever $\alpha$ satisfies $\alpha^2\sigma^2+\frac{2\log(1/\delta)}{n}\leq 1$. Let $y_+=\mu+g(\alpha,n,\delta,\sigma^2)$ denote its smaller root. Taking $y=y_+$, we have $R_{n,\alpha}(y_+)<0$ with probability $1-\delta$; since $R_{n,\alpha}(y)$ is strictly decreasing in $y$, this gives $\hat{\mu}_{n,\alpha}<y_+$. A symmetric argument, based on the lower bound $\psi(x)\geq-\log(1-x+x^2/2)$, yields $\hat{\mu}_{n,\alpha}>y_-=\mu-g(\alpha,n,\delta,\sigma^2)$ with probability $1-\delta$. Combining the two inequalities, with probability $1-2\delta$ we have:
$$|\hat{\mu}_{n,\alpha}-\mu|\leq g(\alpha,n,\delta,\sigma^2)$$
Finally, we optimize $g(\alpha,n,\delta,\sigma^2)$ with respect to $\alpha$.
Theorem: Let $X_1,\dots,X_n$ be iid samples with $\mathbb{E}[X]=\mu$ and $\mathrm{Var}(X)=\sigma^2$, and let $\delta\in(0,1)$ be such that $n>2\log(1/\delta)$. Set:
$$\alpha=\sqrt{\frac{2\log(1/\delta)}{n\sigma^2\left(1+\frac{2\log(1/\delta)}{n-2\log(1/\delta)}\right)}}$$
Then, Catoni’s estimator μ^n,α satisfies with probability 1−2δ:
∣μ^n,α−μ∣≤n−2log(1/δ)2σ2log(1/δ)
Hence we get a sub-Gaussian estimator; in fact, the constant $\sqrt{2}$ is optimal. However, unlike MOM, it relies on knowledge of $\sigma^2$ to choose $\alpha$. We can replace $\sigma^2$ with an upper bound $\sigma^2\leq v$. If no knowledge of $\sigma^2$ is available, one can use Lepski's method to adaptively select $\alpha$ from the data. Again, the estimator depends on $\delta$. It can be made independent of $\delta$ by setting $\alpha=\sqrt{2/(n\sigma^2)}$, but the estimator then becomes sub-exponential instead of sub-Gaussian.
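Here is a minimal sketch of Catoni's estimator in Python, assuming $\sigma^2$ is known; the $\alpha$ is the choice from the theorem above, while the root-bracketing interval around the empirical mean is an ad hoc choice of mine:

```python
import numpy as np
from scipy.optimize import brentq

def psi(x):
    """Catoni's influence function: antisymmetric, with logarithmic growth."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, np.log1p(x + x**2 / 2), -np.log1p(-x + x**2 / 2))

def catoni(x, sigma2, delta):
    """Catoni's mean estimator, assuming the variance sigma2 is known."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    two_log = 2 * np.log(1 / delta)
    alpha = np.sqrt(two_log / (n * sigma2 * (1 + two_log / (n - two_log))))

    def r(y):
        # R_{n,alpha}(y): strictly decreasing in y, so it has a unique root.
        return np.sum(psi(alpha * (x - y)))

    # Bracket the root around the empirical mean; widen this if the signs do not differ.
    half_width = 10 * np.sqrt(sigma2)
    return brentq(r, x.mean() - half_width, x.mean() + half_width)

rng = np.random.default_rng(0)
data = rng.standard_t(df=3, size=5_000)   # heavy-tailed, mean 0, variance 3
print(catoni(data, sigma2=3.0, delta=0.01))
```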
Trimmed Mean
The trimmed mean is the most intuitive estimator which achieves sub-Gaussian rates. Consider a simple variant. Split your data into two halves: say you have $2n$ samples, $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$. Define the truncation function:
$$\phi_{\alpha,\beta}(x)=\begin{cases}\beta & \text{if } x>\beta,\\ x & \text{if } x\in[\alpha,\beta],\\ \alpha & \text{if } x<\alpha\end{cases}$$
For $x_1,\dots,x_m\in\mathbb{R}$, we denote the sorted order as $x_1^*\leq x_2^*\leq\cdots\leq x_m^*$. Given a confidence level $\delta\geq 8e^{-3n/16}$, set:
$$\epsilon=\frac{16\log(8/\delta)}{3n}$$
Let $\alpha=Y_{\epsilon n}^*$ and $\beta=Y_{(1-\epsilon)n}^*$, and define the estimator:
$$\hat{\mu}_{2n}=\frac{1}{n}\sum_{i=1}^n\phi_{\alpha,\beta}(X_i)$$
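A minimal sketch of this two-sample trimmed mean in Python (again my own illustration; rounding $\epsilon n$ to integer indices for the order statistics is one of several reasonable conventions):

```python
import numpy as np

def trimmed_mean(x, y, delta):
    """Two-sample trimmed mean: truncation levels come from y, averaging is over x.

    Assumes the two halves x and y have the same length n.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    eps = 16 * np.log(8 / delta) / (3 * n)
    y_sorted = np.sort(y)
    alpha = y_sorted[int(np.floor(eps * n))]          # roughly the (eps*n)-th order statistic
    beta = y_sorted[int(np.ceil((1 - eps) * n)) - 1]  # roughly the ((1-eps)*n)-th order statistic
    return np.mean(np.clip(x, alpha, beta))

rng = np.random.default_rng(0)
sample = rng.standard_t(df=2.5, size=10_000)
print(trimmed_mean(sample[:5_000], sample[5_000:], delta=0.01))
```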
Theorem: Let $X_1,\dots,X_n,Y_1,\dots,Y_n$ be iid copies of $X$ with $\mathbb{E}[X]=\mu$ and $\mathrm{Var}(X)=\sigma^2$. Let $\delta\in(0,1)$ be such that $n>(16/3)\log(8/\delta)$. Then, with probability $1-\delta$:
$$|\hat{\mu}_{2n}-\mu|\leq 9\sigma\sqrt{\frac{\log(8/\delta)}{n}}$$
Proof: The proof starts by showing that the truncation levels are close to the true quantiles. For $p\in(0,1)$, define the quantiles:
$$Q_p=\sup\{M\in\mathbb{R}\mid P\{X\geq M\}\geq 1-p\}$$
Assume that $X$ has a non-atomic distribution, so that $P\{X>Q_p\}=P\{X\geq Q_p\}=1-p$. Applying Bernstein's inequality, with probability at least $1-2\exp(-(3/16)\epsilon n)$ we have:
$$|\{i\in[n]:Y_i\geq Q_{1-2\epsilon}\}|\geq\epsilon n\quad\text{and}\quad|\{i\in[n]:Y_i\leq Q_{1-\epsilon/2}\}|\geq(1-\epsilon)n$$
So, with probability $1-2\exp(-(3/16)\epsilon n)$, we have:
$$Q_{1-2\epsilon}\leq Y_{(1-\epsilon)n}^*\leq Q_{1-\epsilon/2}\tag{4}$$
A symmetric argument yields:
$$Q_{\epsilon/2}\leq Y_{\epsilon n}^*\leq Q_{2\epsilon}\tag{5}$$
Next, we show that $|\mathbb{E}\phi_{\alpha,\beta}(X)-\mu|$ is small and that the estimator concentrates around its conditional mean. Consider the event $E$ that both Equations 4 and 5 hold; we have $P(E)\geq 1-4\exp(-(3/16)\epsilon n)$. Conditional on $E$, the truncation levels are sandwiched between population quantiles, which yields a bound of order $\sigma\sqrt{\epsilon}$ on the bias $|\mathbb{E}[\phi_{\alpha,\beta}(X_1)\mid Y_1^n]-\mu|$ (Equation 6). Moreover, conditional on $E$ (which depends only on $Y_1,\dots,Y_n$), the deviation $Z=\frac{1}{n}\sum_{i=1}^n\left(\phi_{\alpha,\beta}(X_i)-\mathbb{E}[\phi_{\alpha,\beta}(X_1)\mid Y_1^n]\right)$ is the average of centered random variables that are bounded pointwise by:
$$M=\max\left\{|Q_{\epsilon/2}-\mu|,\,|Q_{1-\epsilon/2}-\mu|\right\}\leq\sigma\sqrt{2/\epsilon}$$
Hence, with probability at least $1-\delta/2$, Bernstein's inequality implies:
$$Z\leq\sigma\sqrt{\frac{2\log(2/\delta)}{n}}+\frac{\log(2/\delta)\,\sigma\sqrt{2/\epsilon}}{n}\leq 3\sigma\sqrt{\frac{\log(2/\delta)}{n}}\tag{7}$$
Putting together Equation 6 and Equation 7, the result follows.
A unique property of the trimmed mean is that it is also robust to adversarial contamination.
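As a rough illustration of this robustness (my own experiment, reusing the `trimmed_mean` sketch above; the contamination scheme and numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5_000)
y = rng.normal(size=5_000)
x[:50] = 1e6   # adversarially corrupt 1% of the X half with huge outliers

print("empirical mean:", x.mean())                        # ruined by the outliers
print("trimmed mean:  ", trimmed_mean(x, y, delta=0.01))  # stays near the true mean 0
```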
Impossibility of δ-Independent Estimators
DLLL2020 prove that δ-independent estimators are impossible without assumptions beyond finite variance. A simple version of the result is as follows.
Theorem: For all $L\geq 0$ and every sample size $n$, no estimator can be simultaneously $L$-sub-Gaussian for both $\delta_1=\frac{1}{2e\sqrt{L^3+1}}$ and $\delta_2=2e^{-L^4/4}$ for all distributions with finite second moments.
Proof: The proof follows by considering the restricted class of Poisson distributions. Assume, by way of contradiction, that there exists an estimator $\hat{\mu}_n$ that is $L$-sub-Gaussian for both $\delta_1$ and $\delta_2$ for all Poisson distributions. Let $X_1,\dots,X_n$ be iid Poisson with parameter $1/n$ and let $Y_1,\dots,Y_n$ be iid Poisson with parameter $c/n$, where we set $c=L^3+1$. For simplicity, assume $c$ is an integer. By sub-Gaussianity:
Here the last line follows from the fact that $\sum_i Y_i$ is Poisson with parameter $c$, together with Stirling's formula. Note that the conditional joint distribution of $n$ independent $\mathrm{Poisson}(\lambda)$ random variables, conditioned on the event that their sum equals $c$, depends only on $c$ and not on $\lambda$. So we have $1/2\leq e\,c!\,\delta_2=2e\,c!\,e^{-L^4/4}$. But explicitly computing the right-hand side shows that it is less than $1/2$ for $L\geq 50$. Hence we have a contradiction.
A caveat to the above analysis is that δ-independent estimation becomes possible with slightly more knowledge. Specifically, Lepski's method can be used if one has access to non-trivial upper and lower bounds on the variance. Also, as shown in a previous section, the existence of higher moments suffices as well.