
Introduction

info

Statistical learning is based on "well-formed" statistical models. This makes it easy to define a likelihood.

Probability and Statistics

A random variable (r.v.) $X$ is a measurable function from a probability space $(\Omega, \mathcal A, \mathbb P)$ to a measurable space $(E, \mathcal E)$.

If you don't understand anything, that's okay. You just need to remember that a random variable maps a large space we don't know about to our real, observable world. If you want to fully understand this definition, take a look at the Appendix.

note

A sample $x$ is a realization of the random variable $X$.

You can imagine that you're flipping a coin. You know the result is either heads or tails (i.e. $E = \{\text{heads}, \text{tails}\}$) without knowing the exact impulse you give the coin (i.e. $\Omega$).
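The coin-flip picture can be sketched in a few lines of code. This is only an illustration with made-up names: `omega` stands for a point of the hidden space $\Omega$ (the "impulse") and `X` plays the role of the random variable mapping it into $E$.

```python
import random

# Illustrative sketch: Omega is the hidden space of "impulses",
# X is the random variable mapping an impulse to an outcome in E.
E = ["heads", "tails"]

def X(omega: float) -> str:
    """Maps a hidden impulse omega in [0, 1) to an outcome in E."""
    return "heads" if omega < 0.5 else "tails"

omega = random.random()   # a draw from the hidden space Omega (never observed)
x = X(omega)              # a sample: one realization of X
print(x in E)             # the realization always lands in E
```

We never look at `omega` directly; we only ever see the realization `x`, exactly as in the definition above.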

The law of $X$ (or the probability distribution of $X$) is its probability measure $\mathbb P_X$ on $(E, \mathcal E)$: $\forall \varepsilon \in \mathcal E, \ \mathbb P_X(\varepsilon) := \mathbb P(X^{-1}(\varepsilon))$

If you still don't understand everything, that's okay too. You just need to remember that the law of $X$ is a function that describes how the values of $X$ are distributed. Many of these are already known and have names: Bernoulli, normal, gamma, etc. Part of the job of a probability/statistics researcher is to play with them. If you want to fully understand this definition, take a look at the Appendix.

Statistical models

Statistical modeling is the basis of all statistical inference. To model an experiment is to propose a theoretical law for the random variable $X = (X_1, \dots, X_n)$.

A statistical model $\mathcal M$ is the tuple $\mathcal M := (\mathcal X^n, \mathcal A^n, \mathcal P)$ where $(\mathcal X^n, \mathcal A^n)$ is a measurable space, $\mathcal P$ is a family of probability laws (i.e. $\mathcal P = (\mathbb P^n_\theta)_{\theta \in \Theta}$) and $n$ is simply the number of observations.

It may seem complicated at first, but reassuringly it's not. Let's look at an example through an exercise.

Give the statistical model of $m$ identical coin flips. A priori, we don't know whether the coins are balanced.
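As a hint for the exercise, here is a hedged sketch (not the unique way to write it): each flip lives in $\mathcal X = \{0, 1\}$, and since we don't know whether the coin is balanced, the family of laws is Bernoulli with unknown parameter $p \in [0, 1]$, so $\Theta = [0, 1]$ and the model is parametric. The function name `sample_coins` below is made up for illustration.

```python
import random

# Hedged sketch of the exercise's model: m i.i.d. coin flips,
# each Bernoulli(p) with unknown p in Theta = [0, 1].
def sample_coins(m: int, p: float, seed: int = 0) -> list:
    """Draw one realization x = (x_1, ..., x_m) from Bernoulli(p)^{(x) m}."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(m)]

x = sample_coins(m=10, p=0.3)
print(len(x))  # 10 observations, each in {0, 1}
```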

note

If the sample is i.i.d., we can use the shorthand notation: $\mathcal M$ is the tuple $\mathcal M := (\mathcal X, \mathcal A, (\mathbb P_\theta)_{\theta \in \Theta})$

$\mathcal M$ is said to be parametric when $\Theta$ has finite dimension.

$\mathcal M$ is said to be nonparametric when $\Theta$ has infinite dimension.

As nonparametric models are less restrictive than parametric ones, they can be used under a much wider range of assumptions.

$\mathcal M$ is identifiable when $\forall \theta, \theta' \in \Theta, \ \mathbb P_\theta = \mathbb P_{\theta'} \Rightarrow \theta = \theta'$

This assumption is essential for a good machine learning model and is often forgotten by data scientists. I hope the example below speaks for itself.

Assume $\mathcal M = (\mathbb R, \mathcal B, \mathcal P = (\mathcal N(\alpha + \beta, 1), \mathcal N(\gamma + \beta, 1)), \theta = (\alpha, \gamma, \beta))$. It is NOT identifiable because for $\theta = (1, 1, 0)$ and $\theta' = (0, 0, 1)$ we have $\mathcal N(\alpha + \beta, 1) = \mathcal N(\gamma + \beta, 1) = \mathcal N(1, 1)$. So if we build an ML algorithm to find the right parameters, it cannot choose between $\theta$ and $\theta'$ from the data.
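We can check this numerically. The sketch below (helper names are made up) evaluates the density under both parameter vectors from the example and confirms that they assign exactly the same likelihood to any observation, so no amount of data distinguishes them.

```python
import math

def normal_pdf(x: float, mean: float) -> float:
    """Density of N(mean, 1) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / 2) / math.sqrt(2 * math.pi)

def likelihood(theta, x):
    """Joint density of the two components of the example's model."""
    alpha, gamma, beta = theta
    # both components only depend on the sums alpha+beta and gamma+beta
    return normal_pdf(x, alpha + beta) * normal_pdf(x, gamma + beta)

theta1 = (1.0, 1.0, 0.0)   # theta  = (1, 1, 0)
theta2 = (0.0, 0.0, 1.0)   # theta' = (0, 0, 1)

x = 0.7
print(likelihood(theta1, x) == likelihood(theta2, x))  # True
```

Both parameter vectors collapse to the same means $\alpha + \beta = \gamma + \beta = 1$, which is exactly why the model is not identifiable.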

$\forall \mathbb P_\theta \in \mathcal P, \ \forall A \in \mathcal A, \ \mathbb P_\theta(A) = \int_A f_\theta(x)\,d\xi(x)$ with $\xi$ a positive $\sigma$-finite measure that dominates $\mathcal P$ (i.e. $\xi(A) = 0 \Rightarrow \mathbb P_\theta(A) = 0$). Such a model is said to be dominated.

Again, if you didn't understand everything: that's okay. Just remember that with "domination" we make sure the family of laws is not a set of horrible functions: thanks to it, every law in the family admits a density $f_\theta$!

A model is regular when:

  1. $\mathrm{supp}(f(\cdot; \theta))$ is independent of $\theta$
  2. $f(\cdot; \theta) \in \mathcal C^2$
  3. $\int f(x; \theta)\,d\xi(x)$ is twice differentiable and we can exchange differentiation and integration

Same idea: these conditions are just there to make sure the family of laws is chill!

The model with the family $\mathcal P = (\mathcal U(0, \theta))_{\theta > 0}$ is not regular because the first condition is not satisfied: the support $[0, \theta]$ depends on $\theta$.
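The moving support is easy to see concretely. In this sketch (illustrative names), the same point sits inside the support of $\mathcal U(0, \theta)$ for one value of $\theta$ and outside it for another:

```python
# Density of U(0, theta): its support [0, theta] moves with theta,
# which violates the first regularity condition above.
def uniform_pdf(x: float, theta: float) -> float:
    return 1.0 / theta if 0.0 <= x <= theta else 0.0

# x = 1.5 is in the support for theta = 2 but not for theta = 1:
print(uniform_pdf(1.5, 2.0) > 0)  # True
print(uniform_pdf(1.5, 1.0) > 0)  # False
```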

Likelihood

In a parametric dominated model with realizations $x_1, \dots, x_n$, the likelihood is the function: $L(\theta; x_1, \dots, x_n) = f_\theta(x_1, \dots, x_n)$

For an i.i.d. sample, $L(\theta; x_1, \dots, x_n) = \prod^n_{i=1} f_\theta(x_i)$

For an i.i.d. $n$-sample $x = (x_1, \dots, x_n)$ of Gaussian law $\mathcal N(\mu, \sigma^2)$, assuming that $\sigma^2$ is known, the likelihood of the model is $L(\mu; x) = \frac{1}{\sqrt{2\pi\sigma^2}^n} \exp\left(-\frac{\sum^n_{i=1}(x_i - \mu)^2}{2\sigma^2}\right)$
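This Gaussian likelihood can be sketched directly from the formula (the function name `gaussian_likelihood` is made up for illustration). Evaluating it shows that the sample mean makes the observed data more likely than a far-away value of $\mu$:

```python
import math

# Sketch of the Gaussian likelihood above, with sigma^2 known.
def gaussian_likelihood(mu: float, xs: list, sigma2: float = 1.0) -> float:
    n = len(xs)
    norm = (2 * math.pi * sigma2) ** (-n / 2)
    return norm * math.exp(-sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

xs = [1.2, 0.8, 1.1, 0.9]
xbar = sum(xs) / len(xs)  # the sample mean, here 1.0
# The data are more likely under mu = xbar than under mu = 0:
print(gaussian_likelihood(xbar, xs) > gaussian_likelihood(0.0, xs))  # True
```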

note

For simplicity, for an i.i.d. sample we often use the log-likelihood $\ell_\theta := \log L(\theta; x_1, \dots, x_n) = \sum^n_{i=1} \log f_\theta(x_i)$
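A quick numerical sanity check of this identity (illustrative sketch, with a standard-deviation-1 Gaussian density as $f_\theta$): the log of the product of densities matches the sum of the log-densities up to floating-point error.

```python
import math

def normal_pdf(x: float, mu: float) -> float:
    """Density of N(mu, 1), playing the role of f_theta."""
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

xs = [0.5, -0.2, 1.3]
mu = 0.4

# log L(theta; x) computed two ways:
log_of_product = math.log(math.prod(normal_pdf(x, mu) for x in xs))
sum_of_logs = sum(math.log(normal_pdf(x, mu)) for x in xs)
print(math.isclose(log_of_product, sum_of_logs))  # True
```

The sum form is preferred in practice: multiplying many small densities underflows to zero long before the corresponding sum of logs loses precision.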

It's important to understand what likelihood is, and how it differs from classical probability. It's the same function, seen from two different points of view. With probabilities, we know the framework in which we're doing the experiment: for example, I know that my coin is fair, and from there I can compute the probability of each result (the density is a function of $x$, the expected result). With likelihood, we don't know the precise framework; knowing the results, we try to assign to each possible framework a probability of having produced these results (the density is a function of $\theta$, the setting parameter).