
# Linear Regression

A regression function is an estimator of $\mathbb E(Y|X)$.

The regression function minimizes the quadratic risk $\mathcal R(h)=\mathbb E((h(X)-Y)^2)$:

$$
\begin{align*}
\mathcal R(h) &= \mathbb E((h(X)-Y)^2)\\
&= \mathbb E((h(X)-\mathbb E(Y|X)+\mathbb E(Y|X)-Y)^2) \\
&= \mathbb E((h(X)-\mathbb E(Y|X))^2+(\mathbb E(Y|X)-Y)^2+2(h(X)-\mathbb E(Y|X))(\mathbb E(Y|X)-Y)) \\
&= \mathbb E((h(X)-\mathbb E(Y|X))^2)+\mathbb E((\mathbb E(Y|X)-Y)^2)+2\,\mathbb E((h(X)-\mathbb E(Y|X))(\mathbb E(Y|X)-Y)) \\
&= \mathbb E((h(X)-\mathbb E(Y|X))^2)+\mathcal R(\mathbb E(Y|X))+0 \geq \mathcal R(\mathbb E(Y|X))
\end{align*}
$$

So the minimum is obtained for $h(X)=\mathbb E(Y|X)$ (the cross term vanishes by the tower property, conditioning on $X$).
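As an illustrative check (synthetic data, not from the notes): with $Y = 2X + \varepsilon$, the conditional mean $\mathbb E(Y|X) = 2X$ achieves a lower empirical quadratic risk than any other predictor, here compared against the arbitrary choice $h(x) = 1.5x$.

```python
import numpy as np

# Simulate Y = 2X + eps, so E(Y|X) = 2X.
rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=n)
Y = 2 * X + rng.normal(scale=0.5, size=n)

# Empirical quadratic risk of the conditional mean vs. another predictor.
risk_cond = np.mean((2 * X - Y) ** 2)      # h(X) = E(Y|X)
risk_other = np.mean((1.5 * X - Y) ** 2)   # some other linear predictor

assert risk_cond < risk_other
```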

Suppose that the target follows $Y = X^\top \beta + \varepsilon$ with:

  1. expectation linear in $\beta$: $\mathbb E(Y|X) = X^\top \beta$
  2. centered errors: $\mathbb E(\varepsilon | X) = 0$
  3. constant error variance (homoscedasticity): $var(\varepsilon | X) = \sigma^2$
  4. uncorrelated errors: $cov(\varepsilon_i, \varepsilon_j|X_i, X_j) = 0$
:::note

The hypothesis class is $\mathcal H := \{ h_\beta(x) = x^\top \beta \mid \beta \in \mathbb R^p \}$. We also denote $x^\top \beta = \eta$.

:::

Suppose that the target follows $Y = X^\top \beta + \varepsilon$ with:

  1. expectation linear in $\beta$: $\mathbb E(Y|X) = X^\top \beta$
  2. errors $\varepsilon_i \sim \mathcal N(0, \sigma^2)$ iid

Above, $X$ is a random vector. Below, we will use the experimental plan (design matrix) $X$, obtained by stacking the observations as rows: $X = \begin{pmatrix} X_1^\top \\ X_2^\top \\ \vdots \\ X_n^\top \end{pmatrix}$

:::note

The associated model is $\mathcal M=(\mathbb R^n, \mathcal B(\mathbb R^n), \mathcal P = \{\mathcal N(X\beta, \sigma^2 I_n)\}_{(\beta, \sigma^2)\in \mathbb R^p \times \mathbb R^*_+})$.

:::

The model is identifiable iff $X$ has full column rank; equivalently $Ker(X)=\{0\}$, equivalently $X$ is injective, equivalently the columns of $X$ are linearly independent.
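A quick numerical sketch of the condition (the two matrices are hypothetical examples, not from the notes): identifiability holds when `matrix_rank(X)` equals the number of columns $p$.

```python
import numpy as np

# X_ok has independent columns; in X_bad the second column is 2x the first.
X_ok = np.array([[1., 0.], [1., 1.], [1., 2.]])
X_bad = np.array([[1., 2.], [2., 4.], [3., 6.]])

rank_ok = np.linalg.matrix_rank(X_ok)    # full rank (p = 2): identifiable
rank_bad = np.linalg.matrix_rank(X_bad)  # rank deficient: not identifiable
```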

When should you do a linear regression? For each pair (feature, target), draw a scatter plot; if you see a linear correlation with the target for many features, bingo! Be careful: if two features are strongly correlated, you can drop one of them.

We can add a feature equal to $1$ for each observation because, for now, the fitted line goes through the origin. With this intercept we add a new parameter $\beta_0$ that represents the height of the line at the origin.

Normalizing the features is not mandatory, but it can be useful if you want to compare the entries of $\beta$. It can also help with numerical stability.
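Both preprocessing steps can be sketched as follows (the helper name `design_matrix` is illustrative, not from the notes): prepend a column of ones for the intercept, and optionally standardize the other columns.

```python
import numpy as np

def design_matrix(X, normalize=True):
    """Build the design matrix: optional standardization + intercept column."""
    X = np.asarray(X, dtype=float)
    if normalize:
        # Center and scale each feature so entries of beta are comparable.
        X = (X - X.mean(axis=0)) / X.std(axis=0)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])  # first column carries beta_0
```

The intercept column must be excluded from normalization, which is why it is appended after the scaling step.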

## Estimation

You have two ways to estimate $\theta$: from the residual sum of squares or from the likelihood. Both give the same estimator of $\beta$ and two equivalent estimators of $\sigma^2$. Keep in mind that the first one only requires the linear model, while the second needs the Gaussian assumption.

The residual sum of squares is the empirical risk, defined as $RSS(h):=\sum_i(Y_i-h(X_i))^2=||Y-h(X)||^2_2=||Y-X\beta||^2_2$

We want to find the learning rule $\hat h(X)=X\hat\beta$ that minimizes this risk $RSS$.

Ξ²^RSS:=arg⁑min⁑β∈Rp∣∣Yβˆ’Xβ∣∣22\hat \beta_{RSS}:=\arg\min_{\beta \in \mathbb R^p}||Y-X\beta||^2_2

ΞΈ^MLE:=(Ξ²^MLE,Οƒ^MLE2):=arg⁑max⁑θ∈RpΓ—R+βˆ—β„“ΞΈ(Y)\hat \theta_{MLE}:=(\hat\beta_{MLE}, \hat\sigma^2_{MLE}):=\arg\max_{\theta \in \mathbb R^p\times \mathbb R^*_+}\ell_\theta(Y)

If the model is identifiable, $\hat \beta=\hat \beta_{RSS}=\hat \beta_{MLE}=(X^\top X)^{-1}X^\top Y$

$$\nabla_\beta ||X \beta-Y||^2_2 = 2X^\top X \beta - 2X^\top Y = 0 \Rightarrow \beta = (X^\top X)^{-1}X^\top Y$$

The inverse exists thanks to identifiability! For all $v \in \mathbb R^p$ such that $X^\top Xv=0$, we get $v^\top X^\top Xv=0 \Rightarrow ||Xv||^2_2=0 \Rightarrow v=0$, so $Ker(X^\top X)=\{0\}$.

Check that it is a minimum: $\nabla^2_\beta ||X \beta-Y||^2_2=2X^\top X \succ 0$ (positive definite), OK!
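A minimal sketch of the closed form on synthetic data (names and values are illustrative): the normal-equations solution $(X^\top X)^{-1}X^\top Y$ matches `np.linalg.lstsq`, which is the numerically safer way to compute it.

```python
import numpy as np

# Synthetic identifiable design with a known beta.
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed form via the normal equations (solve instead of an explicit inverse).
beta_closed = np.linalg.solve(X.T @ X, X.T @ Y)
# Least-squares solver: same minimizer, better conditioning in practice.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

assert np.allclose(beta_closed, beta_lstsq)
```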

$b(\hat \beta)=0$ (the estimator is unbiased) and $var(\hat \beta)=\sigma^2(X^\top X)^{-1}$, which is the minimal variance among linear unbiased estimators (Gauss–Markov theorem).

Ξ²^=(X⊀X)βˆ’1X⊀Y=(X⊀X)βˆ’1X⊀(XΞ²+Ξ΅)=Ξ²+(X⊀X)βˆ’1X⊀Ρ\begin{align*} \hat \beta &= (X^\top X)^{-1}X^\top Y = (X^\top X)^{-1}X^\top (X\beta +\varepsilon) = \beta + (X^\top X)^{-1}X^\top \varepsilon \end{align*}

So,

var(Ξ²^)=var((X⊀X)βˆ’1X⊀Ρ)=(X⊀X)βˆ’1X⊀var(Ξ΅)((X⊀X)βˆ’1X⊀)⊀=Οƒ2(X⊀X)βˆ’1\begin{align*} var(\hat \beta) &= var((X^\top X)^{-1}X^\top \varepsilon) = (X^\top X)^{-1}X^\top var(\varepsilon) ((X^\top X)^{-1}X^\top)^\top = \sigma^2(X^\top X)^{-1} \end{align*}

The hat matrix is defined as the orthogonal projection onto $Im(X)$: $H_X:=X(X^\top X)^{-1}X^\top$, so that $\hat Y = H_X Y$.
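A numerical check (on a hypothetical design matrix) of the defining properties of $H_X = X(X^\top X)^{-1}X^\top$: it is symmetric, idempotent, fixes $Im(X)$, and its trace equals $p$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))  # n = 10 observations, p = 3 features
H = X @ np.linalg.inv(X.T @ X) @ X.T

assert np.allclose(H, H.T)         # symmetric
assert np.allclose(H @ H, H)       # idempotent: a projection
assert np.allclose(H @ X, X)       # leaves Im(X) unchanged
assert np.isclose(np.trace(H), 3)  # trace = rank = p
```

The trace identity is exactly what makes $tr(H_{X^\perp}) = n - p$ in the bias computation for $\hat\sigma^2$ below.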

If the model is identifiable, Οƒ^RSS2=RSS(h^RSS)/(nβˆ’p)\hat \sigma^2_{RSS}=RSS(\hat h_{RSS})/(n-p) is an unbiased estimator of Οƒ2\sigma^2.

If the model is identifiable and Gaussian, $\hat \sigma^2_{MLE}=RSS(\hat h_{MLE})/n$ is a biased estimator of $\sigma^2$.

Writing $\hat \varepsilon := Y - X\hat\beta_{RSS} = H_{X^\perp}\varepsilon$, where $H_{X^\perp} := I_n - H_X$ is the orthogonal projection onto $Im(X)^\perp$ (so $tr(H_{X^\perp}) = n-p$):

$$
\begin{align*}
\mathbb E(RSS(\hat h_{RSS})) &= \mathbb E(||X \hat \beta_{RSS}-Y||^2_2) = \mathbb E(||\hat \varepsilon||^2_2) = \mathbb E(tr(\hat \varepsilon^\top \hat \varepsilon)) \\
&= \mathbb E(tr(\hat \varepsilon \hat \varepsilon^\top)) = tr(\mathbb E(\hat \varepsilon \hat \varepsilon^\top)) = tr(H_{X^\perp}\, \mathbb E(\varepsilon \varepsilon^\top)\, H_{X^\perp}^\top) = \sigma^2\, tr(H_{X^\perp}) \\
&= \sigma^2(n-p)
\end{align*}
$$

For an identifiable linear model,

Ξ²^=(X⊀X)βˆ’1X⊀Y\hat \beta=(X^\top X)^{-1}X^\top Y and Οƒ^2=∣∣Yβˆ’XΞ²^∣∣22/(nβˆ’p)\hat \sigma^2=||Y-X\hat \beta||^2_2/(n-p)

but we don't have any information on their distributions.
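Putting the two estimators together on synthetic data (an illustrative simulation, not from the notes): with true noise scale $\sigma = 0.5$, the estimate $RSS/(n-p)$ should land close to $\sigma^2 = 0.25$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5000, 4
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
Y = X @ beta + rng.normal(scale=0.5, size=n)  # true sigma^2 = 0.25

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
rss = np.sum((Y - X @ beta_hat) ** 2)
sigma2_hat = rss / (n - p)  # unbiased estimator of sigma^2

assert 0.2 < sigma2_hat < 0.3
```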

For an identifiable Gaussian linear model, you have

Ξ²^=(X⊀X)βˆ’1X⊀Y∼N(Ξ²,Οƒ2(X⊀X)βˆ’1)\hat \beta=(X^\top X)^{-1}X^\top Y \sim \mathcal N(\beta, \sigma^2(X^\top X)^{-1})

and you can choose between

Οƒ^12=∣∣Yβˆ’XΞ²^∣∣22/(nβˆ’p)\hat \sigma_1^2=||Y-X\hat \beta||^2_2/(n-p) or Οƒ^22=∣∣Yβˆ’XΞ²^∣∣22/n\hat \sigma_2^2=||Y-X\hat \beta||^2_2/n

but in any case $\frac{n-p}{\sigma^2}\hat \sigma_1^2 =\frac{n}{\sigma^2}\hat \sigma_2^2 \sim \chi^2(n-p)$
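A Monte Carlo sanity check of this $\chi^2(n-p)$ distribution (an illustrative simulation under the Gaussian model): the statistic $\frac{n-p}{\sigma^2}\hat\sigma_1^2 = RSS/\sigma^2$ should have mean $n-p$ and variance $2(n-p)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 30, 3, 1.0
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix, fixed across replications

stats = []
for _ in range(2000):
    eps = rng.normal(scale=sigma, size=n)
    # RSS depends only on the projection of the noise onto Im(X)^perp.
    rss = eps @ (np.eye(n) - H) @ eps
    stats.append(rss / sigma**2)
stats = np.array(stats)

# chi2(n - p) has mean n - p and variance 2(n - p).
assert abs(stats.mean() - (n - p)) < 1.0
assert abs(stats.var() - 2 * (n - p)) < 10.0
```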

## Model Validation

### Confidence Interval

### Tests

### Error

#### $R^2$

#### Some plots

## New data

### Confidence Interval

### Cook distance

## Generalization

### To qualitative features

### To non-linear features