
Reading: Chapter 5, Chapter 6 sections 1-5, Chapter 7 sections 1-3.

STAT 350: Lecture 4

The Geometry of Least Squares

Mathematical Basics

Partitioned Matrices

Partitioned matrices are like ordinary matrices except that the entries are themselves matrices. They add and multiply (provided the dimensions match properly) just like regular matrices, but(!) you must remember that matrix multiplication is not commutative. Here is an example:

\begin{displaymath}A = \left[\begin{array}{c\vert c\vert c}
A_{11} & A_{12} & A_{13}
\\
\hline
A_{21} & A_{22} & A_{23}
\end{array} \right]
\end{displaymath}


\begin{displaymath}B = \left[\begin{array}{c\vert c}
B_{11} & B_{12}
\\
\hline
B_{21} & B_{22}
\\
\hline
B_{31} & B_{32}
\end{array} \right]
\end{displaymath}

You can think of A as a $2 \times 3$ matrix and B as a $3 \times 2$ matrix and then multiply them to get $C=AB$, a $2 \times 2$ matrix, as follows:

\begin{displaymath}AB =
\left[\begin{array}{c\vert c}
A_{11}B_{11} + A_{12}B_{21} + A_{13}B_{31} &
A_{11}B_{12} + A_{12}B_{22} + A_{13}B_{32}
\\
\hline
A_{21}B_{11} + A_{22}B_{21} + A_{23}B_{31} &
A_{21}B_{12} + A_{22}B_{22} + A_{23}B_{32}
\end{array} \right]
\end{displaymath}

BUT: this only works if each of the matrix products in the formulas makes sense. So $A_{11}$ must have the same number of columns as $B_{11}$ has rows, and many other similar restrictions apply.
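If you want to check the block rule numerically, here is a small sketch in Python/numpy; the block sizes and random entries are made up purely for illustration.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# A has 2 block-rows and 3 block-columns; B has 3 block-rows and 2 block-columns.
# The inner block sizes match: A_{ik} has as many columns as B_{kj} has rows.
A_blocks = [[rng.standard_normal((2, m)) for m in (3, 1, 2)] for _ in range(2)]
B_blocks = [[rng.standard_normal((m, 4)) for _ in range(2)] for m in (3, 1, 2)]

A = np.block(A_blocks)        # a 4 x 6 matrix
B = np.block(B_blocks)        # a 6 x 8 matrix

# Blockwise product: C_{ij} = sum_k A_{ik} B_{kj}
C_blocks = [[sum(A_blocks[i][k] @ B_blocks[k][j] for k in range(3))
             for j in range(2)] for i in range(2)]

print(np.allclose(np.block(C_blocks), A @ B))   # True
\end{verbatim}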

First application:

\begin{displaymath}X = \left[ X_1 \vert X_2 \vert \cdots \vert X_p \right]
\end{displaymath}

where each $X_i$ is a column of X. Then

\begin{displaymath}X\beta = \left[ X_1 \vert X_2 \vert \cdots \vert X_p \right]
\left[\begin{array}{c} \beta_1 \\ \vdots \\ \beta_p \end{array}\right] =
X_1 \beta_1 + X_2 \beta_2 + \cdots + X_p \beta_p
\end{displaymath}

which is a linear combination of the columns of X.

Definition: The column space of X, written ${\rm col}(X)$, is the vector space consisting of all linear combinations of the columns of X, also called the space ``spanned'' by the columns of X.

SO: $\hat\mu=X\hat\beta$ is in ${\rm col}(X)$.
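Here is a tiny numerical illustration of this column point of view; the matrix and the coefficients below are arbitrary.

\begin{verbatim}
import numpy as np

X = np.array([[1., 2., 0.],
              [1., 0., 3.],
              [1., 1., 1.],
              [1., 4., 2.]])           # columns X_1, X_2, X_3
beta = np.array([0.5, -1.0, 2.0])

# X beta is the linear combination beta_1 X_1 + beta_2 X_2 + beta_3 X_3
lin_comb = sum(beta[j] * X[:, j] for j in range(X.shape[1]))
print(np.allclose(X @ beta, lin_comb))   # True
\end{verbatim}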

Back to normal equations:

\begin{displaymath}X^TY = X^T X \hat\beta
\end{displaymath}

or

\begin{displaymath}X^T\left[ Y-X\hat\beta\right] = 0
\end{displaymath}

or

\begin{displaymath}\left[\begin{array}{c} X_1^T \\ \vdots \\ X_p^T \end{array}\right]
\left[ Y-X\hat\beta\right] = 0
\end{displaymath}

or

\begin{displaymath}X_i^T\left[ Y-X\hat\beta\right] = 0
\qquad i=1,\ldots,p
\end{displaymath}

or

\begin{displaymath}Y-X\hat\beta \perp \mbox{ every vector in } {\rm col}(X)
\end{displaymath}

Definition: $\hat\epsilon = Y-X\hat\beta$ is the fitted residual vector.

SO: $\hat\epsilon \perp {\rm col}(X)$ and $\hat\epsilon \perp \hat\mu$
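These orthogonality facts are easy to check numerically. The sketch below uses simulated data; none of the numbers come from the lecture.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.standard_normal(n)

# Least squares solution of the normal equations X^T X beta = X^T Y
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta_hat                  # the fitted residual vector

print(np.allclose(X.T @ resid, 0.0))             # X_i^T (Y - X beta_hat) = 0 for each column
print(np.isclose(resid @ (X @ beta_hat), 0.0))   # residuals perpendicular to mu_hat
\end{verbatim}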

Pythagoras' Theorem: If $a \perp b$ then

\begin{displaymath}\vert\vert a\vert\vert^2 + \vert\vert b\vert\vert^2 = \vert\vert a+b\vert\vert^2
\end{displaymath}

Definition: $\vert\vert a\vert\vert$ is the ``length'' or ``norm'' of a:

\begin{displaymath}\vert\vert a\vert\vert = \sqrt{\sum a_i^2} = \sqrt{a^Ta}
\end{displaymath}

Moreover, if $a,b,c,\ldots$ are all perpendicular then

\begin{displaymath}\vert\vert a\vert\vert^2 + \vert\vert b\vert\vert^2 + \cdots = \vert\vert a+b+\cdots\vert\vert^2
\end{displaymath}

Application:


\begin{align*}Y & = Y-X\hat\beta + X \hat\beta
\\
& = \hat\epsilon + \hat\mu
\end{align*}
so

\begin{displaymath}\vert\vert Y\vert\vert^2 = \vert\vert\hat\epsilon\vert\vert^2 + \vert\vert\hat\mu\vert\vert^2
\end{displaymath}

or

\begin{displaymath}\sum Y_i^2 = \sum \hat\epsilon_i^2 + \sum \hat\mu_i^2
\end{displaymath}

Definitions:

\begin{displaymath}\sum Y_i^2 = \mbox{Total Sum of Squares (unadjusted)}
\end{displaymath}


\begin{displaymath}\sum \hat\epsilon_i^2 = \mbox{Error or Residual Sum of Squares}
\end{displaymath}


\begin{displaymath}\sum \hat\mu_i^2 = \mbox{Regression Sum of Squares}
\end{displaymath}

We have several alternative formulas for the Regression SS:
\begin{align*}\sum \hat\mu_i^2 & = \hat\mu^T \hat\mu
\\
& = (X\hat\beta)^T (X \hat\beta)
\\
& = \hat\beta^T X^T X \hat\beta
\end{align*}
(Notice the matrix identity which I will use regularly: $(AB)^T = B^T A^T$.)
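The sum of squares identity and the alternative formulas for the Regression SS can be verified numerically; here is a sketch, again with simulated data.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, p = 25, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
mu_hat = X @ beta_hat
eps_hat = Y - mu_hat

total_ss = Y @ Y                  # sum Y_i^2, unadjusted
error_ss = eps_hat @ eps_hat      # Error / Residual SS
reg_ss = mu_hat @ mu_hat          # Regression SS

print(np.isclose(total_ss, error_ss + reg_ss))              # Pythagoras
print(np.isclose(reg_ss, beta_hat @ (X.T @ X) @ beta_hat))  # beta^T X^T X beta form
\end{verbatim}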

What is least squares?

Choose $\hat\beta$ to minimize

\begin{displaymath}\sum(Y_i-\hat\mu_i)^2 = \vert\vert Y-\hat\mu\vert\vert^2
\end{displaymath}

That is, to minimize $\vert\vert\hat\epsilon\vert\vert^2$. The resulting $\hat\mu$ is called the Orthogonal Projection of Y onto the column space of X.
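One standard way to compute this orthogonal projection explicitly is through the matrix $X(X^TX)^{-1}X^T$ (assuming $X^TX$ is invertible). This ``hat matrix'' notation is not used in these notes; the sketch below is only an illustration that it gives the same $\hat\mu$ as least squares.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
Y = rng.standard_normal(n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # projection matrix onto col(X)
mu_hat = H @ Y

print(np.allclose(H @ H, H))              # projecting twice changes nothing
print(np.allclose(H.T, H))                # orthogonal projections are symmetric

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(mu_hat, X @ beta_hat))  # same fitted values as least squares
\end{verbatim}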

Extension:

\begin{displaymath}X= \left[ X_1 \vert X_2 \right] \quad
\beta = \left[ \begin{array}{c} \beta_1 \\ \hline \beta_2 \end{array} \right]
\quad p=p_1+p_2
\end{displaymath}

Imagine we fit 2 models:

1.
The FULL model:

\begin{displaymath}Y=X\beta+\epsilon ( = X_1 \beta_1 +X_2 \beta_2 + \epsilon)
\end{displaymath}

2.
The REDUCED model:

\begin{displaymath}Y=X_1 \beta_1 +\epsilon
\end{displaymath}

If we fit the full model we get

\begin{displaymath}
\hat\beta_F \quad \hat\mu_F \quad \hat\epsilon_F \qquad \hat\epsilon_F \perp {\rm col}(X) \qquad (1)
\end{displaymath}

If we fit the reduced model we get

\begin{displaymath}
\hat\beta_R \quad \hat\mu_R \quad \hat\epsilon_R \qquad \hat\mu_R \in {\rm col}(X_1) \subset
{\rm col}(X) \qquad (2)
\end{displaymath}

Notice that

\begin{displaymath}
\hat\epsilon_F \perp \hat\mu_R \qquad (3)
\end{displaymath}

(The vector $\hat\mu_R$ is in the column space of $X_1$, so it is in the column space of X, and $\hat\epsilon_F$ is orthogonal to everything in the column space of X.) So:
\begin{align*}Y &= \hat\epsilon_F + \hat\mu_F
\\
&= \hat\epsilon_F + \hat\mu_R + (\hat\mu_F - \hat\mu_R) = \hat\epsilon_R + \hat\mu_R
\end{align*}

You know $\hat\epsilon_F \perp \hat\mu_R$ (from (3) above) and $\hat\epsilon_F \perp \hat\mu_F$ (from (1) above). So

\begin{displaymath}\hat\epsilon_F \perp \hat\mu_F- \hat\mu_R
\end{displaymath}

Also

\begin{displaymath}\hat\mu_R \perp \hat\epsilon_R = \hat\epsilon_F + (\hat\mu_F- \hat\mu_R)
\end{displaymath}

So
\begin{align*}0 &= (\hat\epsilon_F + \hat\mu_F- \hat\mu_R)^T \hat\mu_R
\\
&= \underbrace{\hat\epsilon_F^T\hat\mu_R}_0 + ( \hat\mu_F- \hat\mu_R)^T \hat\mu_R
\end{align*}
so

\begin{displaymath}\hat\mu_F- \hat\mu_R \perp \hat\mu_R
\end{displaymath}

Summary:

\begin{displaymath}Y= \hat\mu_R
+ (\hat\mu_F- \hat\mu_R) + \hat\epsilon_F
\end{displaymath}

where all three vectors on the Right Hand Side are perpendicular to each other. This gives:

\begin{displaymath}\vert\vert Y\vert\vert^2 = \vert\vert\hat\mu_R\vert\vert^2 + \vert\vert\hat\mu_F- \hat\mu_R\vert\vert^2 + \vert\vert\hat\epsilon_F\vert\vert^2
\end{displaymath}

which is an Analysis of Variance (ANOVA) table!
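Here is a sketch, with simulated data and an arbitrary split of X into $X_1$ and $X_2$, verifying that the three pieces really are perpendicular and that their squared lengths add up.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
n = 30
X1 = np.column_stack([np.ones(n), rng.standard_normal((n, 1))])   # reduced model
X2 = rng.standard_normal((n, 3))                                  # extra columns
X = np.column_stack([X1, X2])                                     # full model
Y = rng.standard_normal(n)

def fitted(M, y):
    b, *_ = np.linalg.lstsq(M, y, rcond=None)
    return M @ b

mu_F = fitted(X, Y)
mu_R = fitted(X1, Y)
eps_F = Y - mu_F

# The three pieces are mutually perpendicular ...
print(np.isclose(mu_R @ (mu_F - mu_R), 0.0))
print(np.isclose(eps_F @ (mu_F - mu_R), 0.0))
print(np.isclose(eps_F @ mu_R, 0.0))
# ... so their squared lengths add up to ||Y||^2
print(np.isclose(Y @ Y,
                 mu_R @ mu_R + (mu_F - mu_R) @ (mu_F - mu_R) + eps_F @ eps_F))
\end{verbatim}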

Here is the most basic version of the above:

\begin{displaymath}X= \left[ {\bf 1} \vert X_1\right ] \qquad Y_i = \beta_0 + \cdots + \epsilon_i
\end{displaymath}

The notation here is that

\begin{displaymath}{\bf 1} = \left[ \begin{array}{c} 1 \\ \vdots \\ 1 \end{array} \right]
\end{displaymath}

is a column vector with all entries equal to 1. The coefficient of this column, $\beta_0$, is called the ``intercept'' term in the model.

To find $\hat\mu_R$ we minimize

\begin{displaymath}\sum (Y_i - \beta_0)^2
\end{displaymath}

and get simply

\begin{displaymath}\hat\beta_0 = \bar{Y}
\end{displaymath}

and

\begin{displaymath}\hat\mu_R = \left[
\begin{array}{c} \bar{Y} \\ \vdots \\ \bar{Y} \end{array} \right]
\end{displaymath}

Our ANOVA identity is now
\begin{align*}\vert\vert Y\vert\vert^2 & = \vert\vert\hat\mu_R\vert\vert^2 + \vert\vert\hat\mu_F- \hat\mu_R\vert\vert^2 + \vert\vert\hat\epsilon_F\vert\vert^2
\\
& = n\bar{Y}^2 + \vert\vert\hat\mu_F- \hat\mu_R\vert\vert^2 + \vert\vert\hat\epsilon_F\vert\vert^2
\end{align*}
This identity is usually rewritten in subtracted form:

\begin{displaymath}\vert\vert Y\vert\vert^2 -n\bar{Y}^2 = \vert\vert\hat\mu_F- \hat\mu_R\vert\vert^2 + \vert\vert\hat\epsilon_F\vert\vert^2
\end{displaymath}

Remembering the identity $\sum (Y_i-\bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2$ we find

\begin{displaymath}\sum (Y_i-\bar{Y})^2 = \sum( \hat\mu_{F,i} - \bar{Y})^2 + \sum \hat\epsilon_{F,i}^2
\end{displaymath}

These terms are respectively the Adjusted or Corrected Total Sum of Squares, the Regression or Model Sum of Squares and the Error Sum of Squares.
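A minimal check of this corrected identity, once more with made-up data:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
n = 40
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
Y = X @ np.array([3.0, 1.0, -2.0]) + rng.standard_normal(n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
mu_F = X @ beta_hat
Ybar = Y.mean()

corrected_total = np.sum((Y - Ybar) ** 2)     # Adjusted / Corrected Total SS
model_ss = np.sum((mu_F - Ybar) ** 2)         # Regression / Model SS
error_ss = np.sum((Y - mu_F) ** 2)            # Error SS

print(np.isclose(corrected_total, model_ss + error_ss))   # True
\end{verbatim}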

The sum of squares decomposition in one example

See Lecture 2 for more about this example. There I showed the design matrix for the model

\begin{displaymath}Y_{ij} = \mu+\alpha_i+\epsilon_{ij}
\end{displaymath}

with $\alpha_4 = -(\alpha_1+\alpha_2+\alpha_3)$.

The data consist of blood coagulation times for 24 animals fed one of 4 different diets. In the following I write the data in a table and decompose the table into a sum of several tables. The 4 columns of the table correspond to Diets A, B, C and D. You should think of the entries in each table as being stacked up into a column vector, but the tables save space.

The design matrix can be partitioned into a column of 1s and 3 other columns. You should compute the product $X^TX$ and get

\begin{displaymath}\left[
\begin{array}{rrrr}
24 & -4 & -2 & -2 \\
-4 & 12 & 8 & 8 \\
-2 & 8 & 14 & 8 \\
-2 & 8 & 8 & 14
\end{array}\right]
\end{displaymath}

The vector $X^TY$ is just

\begin{displaymath}\left[ \sum_{ij} Y_{ij}\, , \, \sum_j Y_{1j} - \sum_j Y_{4j}\, , \,
\sum_j Y_{2j} - \sum_j Y_{4j}\, , \, \sum_j Y_{3j} - \sum_j Y_{4j}\right]
\end{displaymath}

The matrix $X^TX$ can be inverted using a program like Maple. I found that

\begin{displaymath}384 (X^TX)^{-1} = \left[
\begin{array}{rrrr}
17 & 7 & -1 & -1 \\
7 & 65 & -23 & -23 \\
-1 & -23 & 49 & -15 \\
-1 & -23 & -15 & 49
\end{array}\right]
\end{displaymath}
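Here is a sketch of that computation, assuming the effect coding implied by $\alpha_4 = -(\alpha_1+\alpha_2+\alpha_3)$ (diets A, B and C get a 1 in their own column and the diet D rows are coded $-1,-1,-1$) and group sizes 4, 6, 6 and 8 read off the tables below.

\begin{verbatim}
import numpy as np

sizes = [4, 6, 6, 8]           # numbers of animals on diets A, B, C, D

rows = []
for diet, n_i in enumerate(sizes):
    if diet < 3:
        effect = np.eye(3)[diet]    # diets A, B, C: a 1 in their own column
    else:
        effect = -np.ones(3)        # diet D rows coded -1, -1, -1
    for _ in range(n_i):
        rows.append(np.concatenate(([1.0], effect)))   # leading column of 1s
X = np.array(rows)             # the 24 x 4 design matrix

XtX = X.T @ X
print(XtX)                                   # matches the matrix displayed above
print(np.round(384 * np.linalg.inv(XtX)))    # matches 384 (X^T X)^{-1}
\end{verbatim}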

It now takes quite a bit of algebra to verify that the vector of fitted values can be computed by simply averaging the data in each column. That is, the vector of fitted values $\hat\mu$, written as a table, is

\begin{displaymath}\left[
\begin{array}{rrrr}
61 & 66 & 68 & 61 \\
61 & 66 & 68 & 61 \\
61 & 66 & 68 & 61 \\
61 & 66 & 68 & 61 \\
 & 66 & 68 & 61 \\
 & 66 & 68 & 61 \\
 & & & 61 \\
 & & & 61
\end{array}\right]
\end{displaymath}

On the other hand fitting the model with a design matrix consisting only of a column of 1s just leads to $\hat\mu_R$ (notation from the lecture) given by

\begin{displaymath}\left[
\begin{array}{rrrr}
64 & 64 & 64 & 64 \\
64 & 64 & 64 & 64 \\
64 & 64 & 64 & 64 \\
64 & 64 & 64 & 64 \\
 & 64 & 64 & 64 \\
 & 64 & 64 & 64 \\
 & & & 64 \\
 & & & 64
\end{array}\right]
\end{displaymath}

Now in class I gave the decomposition

\begin{displaymath}Y= \hat\mu_R
+ (\hat\mu_F- \hat\mu_R) + \hat\epsilon_F
\end{displaymath}

which corresponds to the following identity:

\begin{displaymath}\left[
\begin{array}{rrrr}
62 & 63 & 68 & 56 \\
60 & 67 & 66 & 62 \\
63 & 71 & 71 & 60 \\
59 & 64 & 67 & 61 \\
 & 65 & 68 & 63 \\
 & 66 & 68 & 64 \\
 & & & 63 \\
 & & & 59
\end{array}\right]
=
\left[
\begin{array}{rrrr}
64 & 64 & 64 & 64 \\
64 & 64 & 64 & 64 \\
64 & 64 & 64 & 64 \\
64 & 64 & 64 & 64 \\
 & 64 & 64 & 64 \\
 & 64 & 64 & 64 \\
 & & & 64 \\
 & & & 64
\end{array}\right]
+
\left[
\begin{array}{rrrr}
-3 & 2 & 4 & -3 \\
-3 & 2 & 4 & -3 \\
-3 & 2 & 4 & -3 \\
-3 & 2 & 4 & -3 \\
 & 2 & 4 & -3 \\
 & 2 & 4 & -3 \\
 & & & -3 \\
 & & & -3
\end{array}\right]
+
\left[
\begin{array}{rrrr}
1 & -3 & 0 & -5 \\
-1 & 1 & -2 & 1 \\
2 & 5 & 3 & -1 \\
-2 & -2 & -1 & 0 \\
 & -1 & 0 & 2 \\
 & 0 & 0 & 3 \\
 & & & 2 \\
 & & & -2
\end{array}\right]
\end{displaymath}

The sums of squares of the entries of each of these arrays are as follows. On the left hand side $62^2 + 63^2 +\cdots = 98644$; this is the uncorrected total sum of squares. The first term on the right hand side gives $24(64^2) = 98304$. This term sometimes appears in ANOVA tables as the Sum of Squares due to the Grand Mean, but it is usually subtracted from the total to produce the Total Sum of Squares that goes at the bottom of the table, often called the Corrected (or Adjusted) Total Sum of Squares. In this case the corrected sum of squares is the squared length of the table

\begin{displaymath}\left[
\begin{array}{rrrr}
-2 & -1 & 4 & -8 \\
-4 & 3 & 2 & -2 \\
-1 & 7 & 7 & -4 \\
-5 & 0 & 3 & -3 \\
 & 1 & 4 & -1 \\
 & 2 & 4 & 0 \\
 & & & -1 \\
 & & & -5
\end{array}\right]
\end{displaymath}

which is 340.

The second term on the right hand side of the equation has squared length $4(-3)^2 + 6(2)^2 + 6(4)^2 + 8(-3)^2 = 228$ (which is the Treatment Sum of Squares produced by SAS). The formula for this Sum of Squares is

\begin{displaymath}\sum_{i=1}^{I} \sum_{j=1}^{n_i} ({\bar X}_{i\cdot} - {\bar X}_{\cdot\cdot})^2
= \sum_{i=1}^{I} n_i ({\bar X}_{i\cdot} - {\bar X}_{\cdot\cdot})^2 \end{displaymath}

but I want you to see that the formula is just the squared length of the vector of individual sample means minus the grand mean. The last vector of the decomposition is called the residual vector and has squared length $1^2+(-3)^2 +0^2 +\cdots = 112$.

Corresponding to the decomposition of the total squared length of the data vector is a decomposition of its dimension, 24, into the dimensions of subspaces. The grand mean table is always a multiple of the single vector all of whose entries are 1; this spans a one dimensional space (which is just another way of saying that $\hat\mu_R$ is in the column space of the reduced model design matrix). The second vector, of deviations of the treatment means from the grand mean, lies in the three dimensional subspace of tables which are constant in each column and whose entries have total equal to 0. Similarly the vector of residuals lies in a 20 dimensional subspace: the set of all tables each of whose columns sums to 0. This decomposition of dimensions is the decomposition of degrees of freedom. So 24 = 1+3+20, and the degrees of freedom for treatment and error are 3 and 20 respectively. The vector whose squared length is the Corrected Total Sum of Squares lies in the 23 dimensional subspace of vectors whose entries sum to 0; this produces the 23 total degrees of freedom in the usual ANOVA table.
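Finally, here is a short script that reproduces the sums of squares and degrees of freedom above directly from the data table.

\begin{verbatim}
import numpy as np

# Blood coagulation times by diet, as laid out in the tables above
diets = {
    "A": [62, 60, 63, 59],
    "B": [63, 67, 71, 64, 65, 66],
    "C": [68, 66, 71, 67, 68, 68],
    "D": [56, 62, 60, 61, 63, 64, 63, 59],
}
Y = np.concatenate([np.asarray(v, dtype=float) for v in diets.values()])
labels = np.concatenate([np.full(len(v), i) for i, v in enumerate(diets.values())])

grand_mean = Y.mean()                                               # 64
group_means = np.array([Y[labels == i].mean() for i in range(4)])   # 61, 66, 68, 61

mu_R = np.full_like(Y, grand_mean)      # reduced (grand mean) fit
mu_F = group_means[labels]              # full fit: the column averages
eps_F = Y - mu_F                        # residual vector

print(Y @ Y)                            # 98644  uncorrected total SS
print(mu_R @ mu_R)                      # 98304  SS due to the grand mean
print((mu_F - mu_R) @ (mu_F - mu_R))    # 228    treatment SS
print(eps_F @ eps_F)                    # 112    error SS
print(np.sum((Y - grand_mean) ** 2))    # 340    corrected total SS = 228 + 112

# Degrees of freedom: 24 = 1 + 3 + 20
print(len(Y), len(diets) - 1, len(Y) - len(diets))   # 24 3 20
\end{verbatim}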



Richard Lockhart
1999-01-11