
Introduction

info

Statistical learning is based on "well-formed" statistical models. This makes it easy to define a likelihood.

Probability and Statistics

A random variable (r.v.) $X$ is a measurable function from a probability space $(\Omega, \mathcal A, \mathbb P)$ to a measurable space $(E, \mathcal E)$.

If you don't understand anything, that's okay. You just need to remember that a random variable maps a large space we don't know about to our real, observable world. If you want to fully understand this definition, take a look at the Appendix.

note

A sample $x$ is a realization of the random variable $X$.

You can imagine that you're flipping a coin. You know the result is either heads or tails (i.e. $E = \{\text{heads}, \text{tails}\}$) without knowing the exact impulse you give the coin (i.e. $\Omega$).
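The coin-flip picture can be sketched in a few lines of code. This is only an illustration with made-up names: `omega` stands for a point of the hidden space $\Omega$ (the "impulse") and `X` plays the role of the random variable mapping it into $E$.

```python
import random

# Illustrative sketch: Omega is the hidden space of "impulses",
# X is the random variable mapping an impulse to an outcome in E.
E = ["heads", "tails"]

def X(omega: float) -> str:
    """Maps a hidden impulse omega in [0, 1) to an outcome in E."""
    return "heads" if omega < 0.5 else "tails"

omega = random.random()   # a draw from the hidden space Omega (never observed)
x = X(omega)              # a sample: one realization of X
print(x in E)             # the realization always lands in E
```

We never look at `omega` directly; we only ever see the realization `x`, exactly as in the definition above.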

The law of $X$ (or the probability distribution of $X$) is its probability measure $\mathbb P_X$ on $(E, \mathcal E)$: $\forall \varepsilon \in \mathcal E, \ \mathbb P_X(\varepsilon) := \mathbb P(X^{-1}(\varepsilon))$

If you still don't understand everything, that's okay too. You just need to remember that the law of $X$ is a function that describes how the values of $X$ are distributed. Many of these are already known and have names: Bernoulli, normal, gamma, etc. Part of the job of a probability/statistics researcher is to play with them. If you want to fully understand this definition, take a look at the Appendix.

Statistical models

Statistical modeling is the basis of all statistical inference. To model an experiment is to propose a theoretical law for the random variable $X = (X_1, \dots, X_n)$.

A statistical model $\mathcal M$ is the tuple $\mathcal M := (\mathcal X^n, \mathcal A^n, \mathcal P)$ where $(\mathcal X^n, \mathcal A^n)$ is a measurable space, $\mathcal P$ is a family of probability laws (i.e. $\mathcal P = (\mathbb P^n_\theta)_{\theta \in \Theta}$) and $n$ is simply the number of observations.

It may seem complicated at first, but reassuringly it's not. Let's look at an example through an exercise.

Give the statistical model of $m$ identical coin flips. A priori, we don't know whether the coins are balanced.
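As a hint for the exercise, here is a hedged sketch (not the unique way to write it): each flip lives in $\mathcal X = \{0, 1\}$, and since we don't know whether the coin is balanced, the family of laws is Bernoulli with unknown parameter $p \in [0, 1]$, so $\Theta = [0, 1]$ and the model is parametric. The function name `sample_coins` below is made up for illustration.

```python
import random

# Hedged sketch of the exercise's model: m i.i.d. coin flips,
# each Bernoulli(p) with unknown p in Theta = [0, 1].
def sample_coins(m: int, p: float, seed: int = 0) -> list:
    """Draw one realization x = (x_1, ..., x_m) from Bernoulli(p)^{(x) m}."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(m)]

x = sample_coins(m=10, p=0.3)
print(len(x))  # 10 observations, each in {0, 1}
```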

note

If the sample is i.i.d., we can use the shorthand notation: $\mathcal M$ is the tuple $\mathcal M := (\mathcal X, \mathcal A, (\mathbb P_\theta)_{\theta \in \Theta})$

$\mathcal M$ is said to be parametric when $\Theta$ has finite dimension.

$\mathcal M$ is said to be nonparametric when $\Theta$ has infinite dimension.

As nonparametric models are less restrictive than parametric ones, they can be used under a much wider range of assumptions.

$\mathcal M$ is identifiable when $\forall \theta, \theta' \in \Theta, \ \mathbb P_\theta = \mathbb P_{\theta'} \Rightarrow \theta = \theta'$

This assumption is essential for a good machine learning model and is often forgotten by data scientists. I hope the example below speaks for itself.

Assume $\mathcal M = (\mathbb R, \mathcal B, \mathcal P = (\mathcal N(\alpha + \beta, 1), \mathcal N(\gamma + \beta, 1)), \theta = (\alpha, \gamma, \beta))$. It is NOT identifiable because for $\theta = (1, 1, 0)$ and $\theta' = (0, 0, 1)$ we have $\mathcal N(\alpha + \beta, 1) = \mathcal N(\gamma + \beta, 1) = \mathcal N(1, 1)$. So if we build an ML algorithm to find the right parameters, it cannot choose between $\theta$ and $\theta'$ from the data.
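We can check this numerically. The sketch below (helper names are made up) evaluates the density under both parameter vectors from the example and confirms that they assign exactly the same likelihood to any observation, so no amount of data distinguishes them.

```python
import math

def normal_pdf(x: float, mean: float) -> float:
    """Density of N(mean, 1) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / 2) / math.sqrt(2 * math.pi)

def likelihood(theta, x):
    """Joint density of the two components of the example's model."""
    alpha, gamma, beta = theta
    # both components only depend on the sums alpha+beta and gamma+beta
    return normal_pdf(x, alpha + beta) * normal_pdf(x, gamma + beta)

theta1 = (1.0, 1.0, 0.0)   # theta  = (1, 1, 0)
theta2 = (0.0, 0.0, 1.0)   # theta' = (0, 0, 1)

x = 0.7
print(likelihood(theta1, x) == likelihood(theta2, x))  # True
```

Both parameter vectors collapse to the same means $\alpha + \beta = \gamma + \beta = 1$, which is exactly why the model is not identifiable.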

$\forall \mathbb P_\theta \in \mathcal P, \ \forall A \in \mathcal A, \ \mathbb P_\theta(A) = \int_A f_\theta(x)\,d\xi(x)$ with $\xi$ a positive $\sigma$-finite measure that dominates $\mathcal P$ (i.e. $\xi(A) = 0 \Rightarrow \mathbb P_\theta(A) = 0$). Such a model is said to be dominated.

Again, if you didn't understand everything: that's okay. Just remember that with "domination" we make sure the family of laws is not a set of horrible functions: thanks to it, every law in the family admits a density $f_\theta$!

A model is regular when:

  1. $\mathrm{supp}(f(\cdot; \theta))$ is independent of $\theta$
  2. $f(\cdot; \theta) \in \mathcal C^2$
  3. $\int f(x; \theta)\,d\xi(x)$ is twice differentiable and we can exchange differentiation and integration

Same idea: these conditions are just there to make sure the family of laws is chill!

The model with the family $\mathcal P = (\mathcal U(0, \theta))_{\theta > 0}$ is not regular because the first condition is not satisfied: the support $[0, \theta]$ depends on $\theta$.
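The moving support is easy to see concretely. In this sketch (illustrative names), the same point sits inside the support of $\mathcal U(0, \theta)$ for one value of $\theta$ and outside it for another:

```python
# Density of U(0, theta): its support [0, theta] moves with theta,
# which violates the first regularity condition above.
def uniform_pdf(x: float, theta: float) -> float:
    return 1.0 / theta if 0.0 <= x <= theta else 0.0

# x = 1.5 is in the support for theta = 2 but not for theta = 1:
print(uniform_pdf(1.5, 2.0) > 0)  # True
print(uniform_pdf(1.5, 1.0) > 0)  # False
```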

Likelihood

In a parametric dominated model with realizations $x_1, \dots, x_n$, the likelihood is the function: $L(\theta; x_1, \dots, x_n) = f_\theta(x_1, \dots, x_n)$

For an i.i.d. sample, $L(\theta; x_1, \dots, x_n) = \prod^n_{i=1} f_\theta(x_i)$

For an i.i.d. $n$-sample $x = (x_1, \dots, x_n)$ of Gaussian law $\mathcal N(\mu, \sigma^2)$, assuming that $\sigma^2$ is known, the likelihood of the model is $L(\mu; x) = \frac{1}{\sqrt{2\pi\sigma^2}^n} \exp\left(-\frac{\sum^n_{i=1}(x_i - \mu)^2}{2\sigma^2}\right)$
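This Gaussian likelihood can be sketched directly from the formula (the function name `gaussian_likelihood` is made up for illustration). Evaluating it shows that the sample mean makes the observed data more likely than a far-away value of $\mu$:

```python
import math

# Sketch of the Gaussian likelihood above, with sigma^2 known.
def gaussian_likelihood(mu: float, xs: list, sigma2: float = 1.0) -> float:
    n = len(xs)
    norm = (2 * math.pi * sigma2) ** (-n / 2)
    return norm * math.exp(-sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

xs = [1.2, 0.8, 1.1, 0.9]
xbar = sum(xs) / len(xs)  # the sample mean, here 1.0
# The data are more likely under mu = xbar than under mu = 0:
print(gaussian_likelihood(xbar, xs) > gaussian_likelihood(0.0, xs))  # True
```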

note

For simplicity, for an i.i.d. sample we often use the log-likelihood $\ell_\theta := \log L(\theta; x_1, \dots, x_n) = \sum^n_{i=1} \log f_\theta(x_i)$
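A quick numerical sanity check of this identity (illustrative sketch, with a standard-deviation-1 Gaussian density as $f_\theta$): the log of the product of densities matches the sum of the log-densities up to floating-point error.

```python
import math

def normal_pdf(x: float, mu: float) -> float:
    """Density of N(mu, 1), playing the role of f_theta."""
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

xs = [0.5, -0.2, 1.3]
mu = 0.4

# log L(theta; x) computed two ways:
log_of_product = math.log(math.prod(normal_pdf(x, mu) for x in xs))
sum_of_logs = sum(math.log(normal_pdf(x, mu)) for x in xs)
print(math.isclose(log_of_product, sum_of_logs))  # True
```

The sum form is preferred in practice: multiplying many small densities underflows to zero long before the corresponding sum of logs loses precision.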

It's important to understand what likelihood is, and how it differs from classical probability. It's the same function, seen from two different points of view. With probabilities, we know the framework in which we're doing the experiment: for example, I know that my coin is fair, and from there I can compute the probability of each result (the density is a function of $x$, the expected result). With likelihood, we don't know the precise framework; knowing the results, we try to assign to each possible framework a probability of having produced these results (the density is a function of $\theta$, the setting parameter).