Introduction

The feature space $\mathcal X$. Usually $\mathcal X = \mathbb R^d$, but you can also have qualitative variables.

The target space $\mathcal Y$. It can be either a continuum such as $\mathbb R$ (regression) or a countable set such as $\mathbb N$ (classification).

A dataset is a collection of couples $\mathcal D_n := (X_i, Y_i)_{i=1}^n$, where the $X_i, Y_i$ are random variables valued in $\mathcal X, \mathcal Y$.

note

You can split a dataset into a train and a test dataset, or into a train, a validation and a test dataset.
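As a minimal sketch of such a split (the function `split_dataset` and its fractions are illustrative, not from the notes):

```python
import numpy as np

def split_dataset(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the indices, then cut them into test / validation / train parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

# Toy dataset: 10 points in R^2 with scalar targets
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 6 2 2
```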

The hypothesis class is the set $\mathcal H = \{h : \mathcal X \rightarrow \mathcal Y ;\ h \text{ measurable}\}$, where $h$ is called a predictor.

A learning rule is a mapping from training data to hypotheses in a given hypothesis class, i.e. $\hat h : \mathcal D_n \rightarrow \mathcal H$.

note

By convention, we will not write the conditioning of the learning rule on the dataset: $\hat h(\mathcal D_n)(x) := \hat h(x)$.

A loss function is a map $\ell : \mathcal Y \times \mathcal Y \rightarrow \mathbb R^+$.
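Two standard instances of such a map, as a sketch (the function names are illustrative): the squared loss for regression and the 0-1 loss for classification.

```python
def squared_loss(y_pred, y):
    """Squared loss, for Y = R (regression)."""
    return (y_pred - y) ** 2

def zero_one_loss(y_pred, y):
    """0-1 loss, for a countable Y (classification)."""
    return float(y_pred != y)

print(squared_loss(2.5, 3.0))  # 0.25
print(zero_one_loss(1, 0))     # 1.0
```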

The risk of a predictor is $R(\hat h) = \mathbb E\left[\ell(\hat h(X), Y)\right]$.

note

Depending on the dataset that you use (i.e. train, validation, test) you can have different types of risk. The most important is the generalization risk, evaluated on the test dataset.

Most of the time, we don't know the law of the data, so we need to estimate $R(\hat h \mid \mathcal D_n)$ with the empirical risk $\hat{\mathcal R}(\hat h) = \frac{1}{n}\sum_{i=1}^n \ell(\hat h(x_i), y_i)$.
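A minimal sketch of this empirical average (the helper `empirical_risk` is illustrative):

```python
import numpy as np

def empirical_risk(h, X, y, loss):
    """hat R(h) = (1/n) * sum_i loss(h(x_i), y_i)."""
    return float(np.mean([loss(h(x), yi) for x, yi in zip(X, y)]))

# Constant predictor h(x) = 0.5, squared loss, tiny sample
X = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 1.0])
risk = empirical_risk(lambda x: 0.5, X, y, lambda p, t: (p - t) ** 2)
print(risk)  # mean of [0.25, 0.25, 0.25] = 0.25
```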

note

You can also estimate $R(\hat h)$ via cross-validation, bootstrap, etc.
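A sketch of the cross-validation variant: split the data into $K$ folds, train on $K-1$ of them, average the held-out empirical risks (the function `kfold_risk` and the mean-predictor learning rule are illustrative assumptions, not from the notes).

```python
import numpy as np

def kfold_risk(fit, X, y, loss, k=5, seed=0):
    """K-fold CV estimate of the risk: train on k-1 folds,
    evaluate the empirical risk on the held-out fold, average."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        h = fit(X[train_idx], y[train_idx])
        scores.append(np.mean([loss(h(x), t) for x, t in zip(X[folds[i]], y[folds[i]])]))
    return float(np.mean(scores))

# Toy learning rule: always predict the training mean (squared loss)
fit_mean = lambda X, y: (lambda x, m=y.mean(): m)
X = np.linspace(0, 1, 20)
y = 2 * X
cv = kfold_risk(fit_mean, X, y, lambda p, t: (p - t) ** 2, k=4)
print(cv)
```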

The Bayes risk is the best possible risk over the hypothesis class, i.e. $\mathcal R^* = \inf_{h \in \mathcal H} \mathbb E\left[\ell(h(X), Y)\right]$.

The excess risk is $R(h) - \mathcal R^*$.

$$R(\hat h) - \mathcal R^* = \left( \inf_{h \in \mathcal H} R(h) - \mathcal R^* \right) + \left( R(\hat h) - \inf_{h \in \mathcal H} R(h) \right)$$

note

That is the approximation/estimation error decomposition (the same idea as bias/variance).
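The decomposition can be seen numerically on a toy model. Here, unlike in the notes where $\mathcal H$ is all measurable functions, the learner is restricted to constant predictors, so the approximation error is nonzero; the model ($Y = X + \varepsilon$, squared loss), the sample sizes and the Monte Carlo estimation are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, n_test = 0.1, 50, 200_000

# Data model: X ~ U[0,1], Y = X + sigma * noise.
# Bayes predictor for squared loss: h*(x) = x, with R* = sigma^2.
X_test = rng.uniform(0, 1, n_test)
Y_test = X_test + sigma * rng.normal(size=n_test)

def risk(pred):
    """Monte Carlo estimate of E[(pred(X) - Y)^2] on a large test sample."""
    return float(np.mean((pred(X_test) - Y_test) ** 2))

bayes_risk = risk(lambda x: x)                        # ~ sigma^2
# Restricted class: constant predictors; best-in-class constant is E[Y] = 0.5,
# so the best-in-class risk is ~ Var(X) + sigma^2 = 1/12 + sigma^2.
best_in_class = risk(lambda x: np.full_like(x, 0.5))
# Learning rule: fit the constant to the mean of n training samples.
X_tr = rng.uniform(0, 1, n)
Y_tr = X_tr + sigma * rng.normal(size=n)
learned = risk(lambda x: np.full_like(x, Y_tr.mean()))

approx_err = best_in_class - bayes_risk   # ~ Var(X) = 1/12
est_err = learned - best_in_class         # small, shrinks as n grows
print("approximation error ~", approx_err)
print("estimation error    ~", est_err)
```

The approximation error is a property of the (here, constant) class and does not change with $n$, while the estimation error comes from fitting on a finite sample.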