Estimators

info

We build estimators from sufficient statistics. These estimators have useful properties, and sometimes we can even determine whether they are the best possible.

A statistic is a function $t$ of the sample $x$ that does not depend directly on the unknown parameter $\theta$.

An estimator of $\nu(\theta)$ is a random variable $T_n$ that is measurable and computable from the sample $x$:

$$T_n=t(X):=\hat \nu$$

The estimator of the mean is the sample mean: $\hat \mu = \frac{1}{n}\sum_i X_i := \bar X$

note

An estimator does not only estimate parameters of a distribution! It can be any function that is a statistic. When it does estimate a parameter, we denote by $\hat \theta$ the estimator of the parameter $\theta$.

A statistic is sufficient for $\theta$ when $\mathbb P_\theta (X \mid t(X)=t(x))$ does not depend on $\theta$. This means the statistic carries all the information we need about $\theta$.

Suppose $X_1, ..., X_n \sim Exp(\lambda)$:

  • $t(X_1, ..., X_n) = X_1$ is a statistic
  • $t(X_1, ..., X_n) = \sum_i X_i$ is a sufficient statistic
  • $\hat \lambda = n/\sum_i X_i = n/t(X_1,...,X_n)$ is an estimator of $\lambda$
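
As a quick sanity check, the estimator above can be simulated. This is a minimal sketch, not from the notes; the true `lam` and the sample size are assumptions for the illustration:

```python
import random

# Draw an i.i.d. sample from Exp(lambda) and compute the estimator
# lambda_hat = n / sum_i X_i built from the sufficient statistic.
random.seed(0)

lam = 2.0          # true parameter (assumed for the illustration)
n = 100_000
sample = [random.expovariate(lam) for _ in range(n)]

t = sum(sample)    # sufficient statistic t(X) = sum_i X_i
lam_hat = n / t    # estimator of lambda

print(round(lam_hat, 2))  # close to the true lam = 2.0
```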

A statistic is sufficient for $\theta$ iff the density of the sample factorizes as $f(x;\theta) = h(x)g(t(x);\theta)$ (the factorization criterion).

The bias of an estimator is $b(\hat \nu)=\mathbb E(\hat \nu - \nu)=\mathbb E(\hat \nu) - \nu$

The estimator is unbiased when $b(\hat \nu)=0$

The variance of an estimator is $var(\hat \nu)=\mathbb E((\hat \nu -\mathbb E(\hat \nu))^2)$

The risk of an estimator is $R(\hat \nu)=\mathbb E(\ell(\hat \nu, \nu))$ with $\ell$ a loss function.

The Mean Squared Error (MSE) is the risk with the loss function $\ell(\hat \nu, \nu)=(\hat \nu - \nu)^2$

note

Any minimization of the risk involves a bias-variance trade-off.

Proof for the MSE:

$$\begin{align*} MSE(\hat \nu) &= \mathbb{E}((\hat \nu - \nu)^2) \\ &= \mathbb{E}((\hat \nu - \mathbb{E}(\hat \nu) + \mathbb{E}(\hat \nu) - \nu)^2) \\ &= \mathbb{E}((\hat \nu - \mathbb{E}(\hat \nu))^2 + (\mathbb{E}(\hat \nu) - \nu)^2 + 2(\hat \nu - \mathbb{E}(\hat \nu))(\mathbb{E}(\hat \nu) - \nu)) \\ &= \mathbb{E}((\hat \nu - \mathbb{E}(\hat \nu))^2) + \mathbb{E}((\mathbb{E}(\hat \nu) - \nu)^2) + 2\,\mathbb{E}(\hat \nu - \mathbb{E}(\hat \nu))(\mathbb{E}(\hat \nu) - \nu) \\ &= var(\hat \nu) + b(\hat \nu)^2 + 0 \end{align*}$$
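
The decomposition can be checked empirically. This sketch (not from the notes; the model, the deliberately biased estimator $0.9\,\bar X$, and all constants are assumptions) replicates an experiment many times and compares the two sides:

```python
import random
import statistics

# Empirical check of MSE = var + bias^2: estimate mu of N(mu, 1) with the
# deliberately biased estimator 0.9 * X_bar, replicated many times.
random.seed(1)

mu, n, reps = 5.0, 50, 20_000
estimates = []
for _ in range(reps):
    xbar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
    estimates.append(0.9 * xbar)          # biased estimator nu_hat

mse = sum((e - mu) ** 2 for e in estimates) / reps
var = statistics.pvariance(estimates)
bias = sum(estimates) / reps - mu

print(round(mse, 4), round(var + bias ** 2, 4))  # the two numbers agree
```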

Asymptotics

  • $\hat \nu$ is weakly consistent when $\forall \theta \in \Theta, \forall \varepsilon > 0, \lim_{n \to \infty} \mathbb P_\theta(|\hat \nu - \nu|>\varepsilon) = 0$, denoted $\hat \nu \xrightarrow{P} \nu$
  • $\hat \nu$ is strongly consistent (or almost surely consistent) when $\mathbb P_\theta(\lim_{n \to \infty}|\hat \nu - \nu|=0) = 1$, denoted $\hat \nu \xrightarrow{a.s.} \nu$
  • $\hat \nu$ is consistent in distribution when $\forall f$ continuous and bounded, $\lim_{n \to \infty} \mathbb E(f(\hat \nu))=\mathbb E(f(\nu))$, denoted $\hat \nu \xrightarrow{d} \nu$
  • $\hat \nu$ is consistent in risk when $\forall \theta \in \Theta, \lim_{n \to \infty}R(\hat \nu)=0$

For risks of the form $\mathbb E(|\hat \nu- \nu|^r)$ with $r\in \mathbb N$, we denote $\hat \nu \xrightarrow{L^r} \nu$

$\hat \nu \xrightarrow{L^r} \nu \Rightarrow \hat \nu \xrightarrow{P} \nu \Rightarrow \hat \nu \xrightarrow{d} \nu$ and $\hat \nu \xrightarrow{a.s.} \nu \Rightarrow \hat \nu \xrightarrow{P} \nu$

Asymptotic laws

If $X_1,..., X_n$ are i.i.d. with $\mathbb E(X_i) = \mu$, then (law of large numbers)

$$\frac{1}{n}\sum_i X_i \xrightarrow{P} \mu$$

If $X_1,..., X_n$ are i.i.d. with $\mathbb E(X_i) = \mu$ and $var(X_i) = \sigma^2< \infty$, then (central limit theorem)

$$\sqrt n (\bar X - \mu) \xrightarrow{d} \mathcal N(0, \sigma^2)$$
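
A small simulation sketch of the CLT (the $Exp(1)$ choice, so $\mu = \sigma^2 = 1$, and the constants are assumptions for the illustration): the scaled error $\sqrt n(\bar X - \mu)$ should have mean near $0$ and variance near $\sigma^2$.

```python
import random
import statistics
import math

# Replicate X_bar for samples of Exp(1) (mu = 1, sigma^2 = 1) and look at
# the distribution of sqrt(n) * (X_bar - mu).
random.seed(2)

n, reps = 200, 10_000
scaled = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    scaled.append(math.sqrt(n) * (xbar - 1.0))

# Mean near 0 and variance near sigma^2 = 1, as the CLT predicts.
print(round(statistics.mean(scaled), 2), round(statistics.pvariance(scaled), 2))
```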

If $X_1,..., X_n$ are i.i.d. with $\mathbb E(X_i) = \mu$, $var(X_i) = \sigma^2< \infty$ and $h$ is a function differentiable at $\mu$, then (delta method)

$$\sqrt n (h(\bar X) - h(\mu)) \xrightarrow{d} \mathcal N(0, \sigma^2 h'(\mu)^2)$$

If $u_n(X_n-X)\xrightarrow{d}Z$ with $u_n \rightarrow +\infty$ and $h$ is a function differentiable at $X$, then

$$u_n (h(X_n) - h(X)) \xrightarrow{d} Dh(X)\cdot Z$$
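
The delta method can also be seen in simulation. This sketch is an illustration under assumed choices: $X_i \sim Exp(1)$ so $\mu = \sigma = 1$, and $h(x) = x^2$ with $h'(\mu) = 2$, so the limiting variance should be $\sigma^2 h'(\mu)^2 = 4$.

```python
import random
import statistics
import math

# Distribution of sqrt(n) * (h(X_bar) - h(mu)) for h(x) = x^2 and Exp(1) data.
random.seed(3)

n, reps = 400, 5_000
scaled = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    scaled.append(math.sqrt(n) * (xbar ** 2 - 1.0))

# Empirical variance close to sigma^2 * h'(mu)^2 = 4.
print(round(statistics.pvariance(scaled), 1))
```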

Methods for creating estimators

Method of Moments

The $k$-th moment of $X$ is $m_k:=\mathbb E(X^k)$

We deduce the $k$-th empirical moment of $X$, which is $\hat m_k=\frac{1}{n}\sum_i X_i^k$

The aim is to express the parameter $\theta$ in terms of the moments of $X$ and then plug in the empirical moments to get $\hat \theta_{MM}$.

  1. Express the parameters $\theta$ in terms of the moments $m_k$. You may get a system if $dim(\theta) > 1$.
  2. Solve the system for $\theta$.
  3. Plug in the $\hat m_k$ to get $\hat \theta_{MM}$.

Let's find the parameters $\theta=(\mu, \sigma^2)$ of $\mathcal N(\mu, \sigma^2)$.

  1. $\mu = \mathbb E(X) = m_1$ and $\sigma^2 = var(X) = \mathbb E(X^2) - (\mathbb E(X))^2=m_2-m_1^2$
  2. Already solved
  3. $\hat \mu = \hat m_1= \frac{1}{n}\sum_i X_i$ and $\hat \sigma^2=\hat m_2 - \hat m_1^2 = \frac{1}{n}\sum_i X_i^2 - (\frac{1}{n}\sum_i X_i)^2$
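
The steps above can be sketched in code. The true `mu`, `sigma` and the sample size are assumptions for the illustration:

```python
import random

# Method of moments for N(mu, sigma^2): plug the empirical moments
# m1_hat and m2_hat into mu = m1 and sigma^2 = m2 - m1^2.
random.seed(4)

mu, sigma, n = 3.0, 2.0, 100_000
x = [random.gauss(mu, sigma) for _ in range(n)]

m1 = sum(x) / n                   # empirical first moment
m2 = sum(v * v for v in x) / n    # empirical second moment

mu_hat = m1
sigma2_hat = m2 - m1 ** 2

print(round(mu_hat, 1), round(sigma2_hat, 1))  # near mu = 3.0, sigma^2 = 4.0
```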
note

For complicated distributions, we often cannot compute the moments in closed form. We can try to compute them numerically.

Maximum Likelihood Estimation

The Maximum Likelihood Estimator is the estimator given by

$$\hat \theta_{MLE}:=\arg\max_{\theta \in \Theta}L(\theta;x)=\arg\max_{\theta \in \Theta}\ell_\theta$$

The method is the classic recipe for finding a maximum:

  1. Compute the gradient $\nabla \ell_\theta$.
  2. Find all critical points by solving $\nabla \ell_\theta=0$.
  3. Identify the maxima by checking that $\nabla^2 \ell_\theta<0$ (negative definite) at each critical point.
  4. Choose one maximum as $\hat \theta$.
note

Be careful: in real life, the likelihood often has many local extrema! Moreover, if the model is not regular, the method has to be modified.

Let's estimate the parameter $\theta$ of $\mathcal B(\theta)$:

  1. $\ell_\theta=\sum_i X_i \log(\theta) + \sum_i (1-X_i) \log(1-\theta) \Rightarrow \nabla \ell_\theta=\frac{\sum_i X_i}{\theta} - \frac{n -\sum_i X_i}{1-\theta}$
  2. $\nabla \ell_\theta=0 \Rightarrow \frac{\sum_i X_i}{\theta} = \frac{n -\sum_i X_i}{1-\theta} \Rightarrow \theta = \frac{1}{n}\sum_i X_i \Rightarrow \hat \theta_{MLE} = \frac{1}{n}\sum_i X_i$
  3. Check: $\nabla^2 \ell_\theta=-\frac{\sum_i X_i}{\theta^2} - \frac{n -\sum_i X_i}{(1-\theta)^2}<0$ OK!
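
The closed form can be cross-checked against a brute-force maximization of the log-likelihood. This is a sketch; the true `theta`, the sample size, and the grid resolution are assumptions:

```python
import random
import math

# Bernoulli MLE: the closed form theta_hat = mean(X) should match a
# grid search over the log-likelihood (the likelihood here is concave).
random.seed(5)

theta, n = 0.3, 10_000
x = [1 if random.random() < theta else 0 for _ in range(n)]
s = sum(x)

def log_lik(t):
    return s * math.log(t) + (n - s) * math.log(1 - t)

grid = [i / 1000 for i in range(1, 1000)]   # avoid t = 0 and t = 1
theta_grid = max(grid, key=log_lik)
theta_closed = s / n

print(round(theta_closed, 2), round(theta_grid, 2))  # the two agree
```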
note

For complicated models, we often cannot maximize the likelihood in closed form. We can do it numerically, but we must be careful not to fall into local extrema.

If $t$ is a sufficient statistic for $\theta$, then $\hat \theta_{MLE}$ is a function of $t$.

note

But $\hat \theta_{MLE}$ does not have to be sufficient itself.

$h(\hat \theta_{MLE})$ is the MLE of $h(\theta)$ (invariance property).

Suppose:

  • $H_1$: the model is identifiable
  • $H_2$: $\Theta$ is compact and $f(\theta;x) \in \mathcal C^0, \forall x \in \mathcal X$
  • $H_3$: $\forall \theta \in \Theta, h(x):=\sup_{\theta \in \Theta} |\log f_\theta(x)| \in L_1(\mathbb P_\theta)$

Then, $\hat \theta_{MLE}$ is strongly consistent.

Suppose $\hat \theta_{MLE}$ is consistent, the model is regular, and $I_1(\theta):= - \mathbb E_\theta(\nabla^2 \log f_\theta(X_1))$ is invertible.

Then,

$$\sqrt{n}(\hat \theta_{MLE} - \theta) \xrightarrow{d} \mathcal N(0, I_1(\theta)^{-1})$$

So by the delta method,

$$\sqrt{n}(h(\hat \theta_{MLE}) - h(\theta)) \xrightarrow{d} \mathcal N(0, I_1(\theta)^{-1}h'(\theta)^2)$$
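
The asymptotic variance can be checked numerically. This sketch uses assumed choices: $X_i \sim Exp(\lambda)$ with $\lambda = 2$, where $\hat \lambda_{MLE} = n/\sum_i X_i$ and $I_1(\lambda) = 1/\lambda^2$, so the variance of $\sqrt n(\hat \lambda_{MLE} - \lambda)$ should be near $I_1(\lambda)^{-1} = \lambda^2$:

```python
import random
import statistics
import math

# Replicate the exponential MLE and look at sqrt(n) * (lambda_hat - lambda);
# its empirical variance should be close to lam**2 = 4.
random.seed(6)

lam, n, reps = 2.0, 400, 5_000
scaled = []
for _ in range(reps):
    s = sum(random.expovariate(lam) for _ in range(n))
    scaled.append(math.sqrt(n) * (n / s - lam))

print(round(statistics.pvariance(scaled), 1))  # close to lam**2 = 4
```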
note

If you want to know where $I_1(\theta)$ comes from, check the Fisher Information section below.

Other methods

There are a lot of other methods; here is a non-exhaustive list:

You can find others in the Estimation Theory wiki page.

What is a good estimator?

In general

There is no universal way to say "I have the best estimator". Most of the time it depends on what you are looking for. Sometimes you can't accept bias, sometimes you need strong consistency, and sometimes you just want to minimize your risk.

But you know that mathematicians don't like the answer "it depends". So they created a function that characterizes the model: the Fisher Information. Sometimes it can help to find the best estimator!

With the Fisher Information

The score is the vector $\nabla \ell_\theta$

The score is centered, i.e. $\mathbb E(\nabla \ell_\theta) = 0$

For a regular model, the score is additive over independent observations, i.e. $\nabla\ell_\theta(X,Y) = \nabla\ell_\theta(X) + \nabla\ell_\theta(Y)$

The Fisher Information is the variance matrix of the score:

$$I(\theta) = var(\nabla \ell_\theta) = \mathbb E((\nabla \ell_\theta - \mathbb E(\nabla \ell_\theta))(\nabla \ell_\theta - \mathbb E(\nabla \ell_\theta))^\top) = \mathbb E(\nabla \ell_\theta \nabla \ell_\theta^\top)$$
note

Fisher's information is related to the precision with which the parameter is estimated.

note

If the model is i.i.d., we denote by $I_n(\theta)$ the Fisher Information of the sample.

Each observation gives the same amount of information, i.e. $I_n(\theta) = nI_1(\theta)$

For any statistic $t$, $I_t(\theta) \leq I_n(\theta)$

If the model is regular, then the Fisher Information is symmetric, positive semi-definite and

$$I_n(\theta) = - \mathbb E (\nabla^2 \ell_\theta(X))$$
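
The two expressions for the information can be compared numerically. This sketch uses an assumed model, $X \sim Poisson(\lambda)$ with $\lambda = 2$, where $\log f = x\log\lambda - \lambda - \log x!$, so the score is $x/\lambda - 1$, the second derivative is $-x/\lambda^2$, and $I_1(\lambda) = 1/\lambda$:

```python
import random
import math

# Monte Carlo estimates of I_1(lam) for Poisson(lam) from both formulas:
# E(score^2) and -E(second derivative of the log-density).
random.seed(7)

def poisson(lam):
    # Knuth's method; fine for small lam.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

lam, reps = 2.0, 200_000
sum_score_sq, sum_neg_hess = 0.0, 0.0
for _ in range(reps):
    x = poisson(lam)
    score = x / lam - 1.0        # d/dlam log f(x; lam)
    sum_score_sq += score ** 2
    sum_neg_hess += x / lam ** 2  # -(d^2/dlam^2) log f(x; lam)

i_from_score = sum_score_sq / reps
i_from_hess = sum_neg_hess / reps

print(round(i_from_score, 2), round(i_from_hess, 2))  # both near 1/lam = 0.5
```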

If the model is regular and $I_n(\theta)$ is invertible.

Then,

$$\forall T_n \text{ unbiased estimator of } \theta \text{ s.t. } \mathbb E(|T_n|)<\infty,\quad var(T_n) \geq I_n(\theta)^{-1}$$

And with $h$ a function,

$$\forall T_n \text{ unbiased estimator of } h(\theta) \text{ s.t. } \mathbb E(|T_n|)<\infty,\quad var(T_n) \geq Dh(\theta)I_n(\theta)^{-1}Dh(\theta)^\top$$

And,

$$\forall T_n \text{ biased estimator of } h(\theta) \text{ s.t. } \mathbb E(|T_n|)<\infty,\quad var(T_n) \geq (Dh(\theta)+Db(\theta))I_n(\theta)^{-1}(Dh(\theta)+Db(\theta))^\top$$

This lower bound is the Cramér-Rao bound!

An unbiased estimator $T_n$ is efficient when $var(T_n)$ attains the Cramér-Rao bound.
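
Here is a sketch of an efficient estimator (the model and constants are assumptions): for $\mathcal B(\theta)$, $\bar X$ is unbiased with $var(\bar X) = \theta(1-\theta)/n$, and $I_n(\theta) = n/(\theta(1-\theta))$, so $var(\bar X) = I_n(\theta)^{-1}$ and the sample mean attains the bound.

```python
import random
import statistics

# Compare the empirical variance of X_bar for Bernoulli(theta) samples
# with the Cramer-Rao bound I_n(theta)^{-1} = theta * (1 - theta) / n.
random.seed(8)

theta, n, reps = 0.3, 100, 20_000
means = [sum(1 if random.random() < theta else 0 for _ in range(n)) / n
         for _ in range(reps)]

emp_var = statistics.pvariance(means)
cr_bound = theta * (1 - theta) / n   # = I_n(theta)^{-1}

print(round(emp_var, 4), round(cr_bound, 4))  # the two match: X_bar is efficient
```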

$T_n$ is efficient iff the family of laws is an exponential family (i.e. $f(x; \theta)=\exp (a(x)\alpha(\theta) + \beta(\theta) + c(x))$) and $T_n$ has the form $A\sum_i a(X_i) + B$ with $h(\theta) = \mathbb E_\theta(T_n)$

Efficiency is the way to say "my estimator is the best" (among the unbiased ones)!

note

You can prove that your sufficient statistic is the best of the best (i.e. total, complete, etc.) with some additional properties (Lehmann-Scheffé, etc.), but to be honest I have never used it in practice, so I skip it 😅