
STAT 350: Lecture 3

Reading: Chapter 5 sections 5-8, 10.

Fitting Linear Models

Fitting a linear model requires assumptions about the $\epsilon_i$s. The usual assumptions are:

1. $\epsilon_1,\ldots,\epsilon_n$ are independent. (Sometimes we assume only that ${\rm Cov}(\epsilon_i,\epsilon_j)=0$ for $i\neq j$; that is, we assume the errors are uncorrelated.)

2. Homoscedastic errors; all the variances are equal:

\begin{displaymath}{\rm Var}(\epsilon_1) =
{\rm Var}(\epsilon_2) = \cdots = \sigma^2
\end{displaymath}

3. Normal errors: $\epsilon_i \sim N(0,\sigma^2)$.

Remember: we already have assumed ${\rm E}(\epsilon_i) =0$.
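
As an illustration (the design, coefficients and $\sigma$ below are made-up numbers), here is a minimal S sketch that simulates data satisfying all three assumptions, that is, with independent homoscedastic normal errors:

# hypothetical example: simulate Y = X beta + epsilon with
# independent N(0, sigma^2) errors
n <- 20
X <- cbind(rep(1, n), 1:n)      # made-up design: intercept plus a trend
beta <- c(5, 2)                 # made-up true coefficients
sigma <- 1.5                    # made-up error standard deviation
epsilon <- rnorm(n, mean = 0, sd = sigma)   # assumptions 1-3
Y <- X %*% beta + epsilon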


Motivation for Least Squares

Choose $\hat\beta$ to make the fitted values $\hat\mu=X\hat\beta$ as close to the $Y_i$s as possible. There are many possible definitions of ``close''; we use LS = least squares.

To minimize $\sum_{i=1}^n (Y_i - \hat\mu_i)^2$ take derivatives with respect to each $\hat\beta_j$ and set them equal to 0:
\begin{align*}
\frac{\partial}{\partial\hat\beta_j} \sum_{i=1}^n (Y_i - \hat\mu_i)^2
&= \sum_{i=1}^n \frac{\partial}{\partial\hat\beta_j} (Y_i - \hat\mu_i)^2 \\
&= \sum_{i=1}^n \left[\frac{\partial}{\partial\hat\mu_i} (Y_i - \hat\mu_i)^2\right]
\frac{\partial \hat\mu_i}{\partial \hat\beta_j}
\end{align*}
But

\begin{displaymath}\frac{\partial}{\partial\hat\mu_i} (Y_i -
\hat\mu_i)^2 = -2 (Y_i-\hat\mu_i)
\end{displaymath}

and

\begin{displaymath}\hat\mu_i = \sum_{j=1}^p x_{ij} \hat\beta_j
\end{displaymath}

so

\begin{displaymath}\frac{\partial \hat\mu_i}{\partial \hat\beta_j} = x_{ij}
\end{displaymath}

Thus

\begin{displaymath}\frac{\partial}{\partial\hat\beta_j} \sum_{i=1}^n (Y_i - \hat\mu_i)^2
= -2 \sum_{i=1}^n x_{ij} (Y_i - \hat\mu_i)
\end{displaymath}

Set this equal to 0 and get the so-called normal equations:

\begin{displaymath}\sum_{i=1}^n Y_i x_{ij} = \sum_{i=1}^n \hat\mu_i x_{ij} \qquad
j=1,\ldots,p
\end{displaymath}

Finally remember that $\hat\mu_i = \sum_{k=1}^p x_{ik} \hat\beta_k$ to get

\begin{equation}
\sum_{i=1}^n Y_i x_{ij} = \sum_{i=1}^n \sum_{k=1}^p x_{ij}x_{ik}\hat\beta_k
\tag{1}
\end{equation}

This formula looks dreadful but, in fact, it is just a bunch of matrix-vector multiplications written out in summation notation. Note that it is a set of p linear equations in p unknowns $\hat\beta_1,\ldots,\hat\beta_p$.

First

\begin{displaymath}\sum_{i=1}^n x_{ij} x_{ik}
\end{displaymath}

is the dot product between the $j$th and $k$th columns of $X$. Another way to view this is as the $jk$th entry in the matrix $X^TX$:

\begin{displaymath}
X^TX = \left[ \begin{array}{cccc}
x_{11} & x_{21} & \cdots & x_{n1} \\
x_{12} & x_{22} & \cdots & x_{n2} \\
\vdots & \vdots & & \vdots \\
x_{1p} & x_{2p} & \cdots & x_{np}
\end{array}\right]
\left[ \begin{array}{cccc}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{array}\right]
\end{displaymath}

The $jk$th entry in this matrix product is clearly

\begin{displaymath}x_{1j} x_{1k} + x_{2j}x_{2k} + \cdots + x_{nj}x_{nk}
\end{displaymath}

so that the right hand side of (1) is

\begin{displaymath}\sum_{k=1}^p (X^TX)_{jk} \hat\beta_k
\end{displaymath}

which is just the matrix product

\begin{displaymath}((X^TX)\hat\beta)_j
\end{displaymath}

Now look at the left hand side of (1), namely $\sum_{i=1}^n Y_i x_{ij}$, which is just the dot product of $Y$ with the $j$th column of $X$, that is, the $j$th entry of $X^TY$:

\begin{displaymath}
X^TY = \left[ \begin{array}{cccc}
x_{11} & x_{21} & \cdots & x_{n1} \\
x_{12} & x_{22} & \cdots & x_{n2} \\
\vdots & \vdots & & \vdots \\
x_{1p} & x_{2p} & \cdots & x_{np}
\end{array}\right]
\left[ \begin{array}{c}
Y_1 \\ Y_2 \\ \vdots \\ Y_n
\end{array}\right]
= \left[ \begin{array}{c}
x_{11}Y_1 + x_{21} Y_2 + \cdots + x_{n1} Y_n \\
\vdots \\
x_{1p}Y_1 + x_{2p} Y_2 + \cdots + x_{np} Y_n
\end{array}\right]
\end{displaymath}

So the normal equations are

\begin{displaymath}(X^T Y)_j = (X^T X \hat\beta)_j
\end{displaymath}

or just

\begin{displaymath}X^T Y = X^T X \hat\beta
\end{displaymath}

This last formula is the usual way we write the normal equations.
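
One way to read the normal equations: they say that the residual vector $Y - X\hat\beta$ is orthogonal to every column of $X$. Here is a minimal S sketch checking this numerically (the design and responses below are made-up numbers, purely for illustration):

# hypothetical small example: solve the normal equations and check
# that X^T (Y - X betahat) is zero
X <- cbind(rep(1, 5), 1:5)            # made-up design matrix
Y <- c(2.1, 3.9, 6.2, 8.1, 9.8)       # made-up responses
betahat <- solve(t(X) %*% X, t(X) %*% Y)
t(X) %*% (Y - X %*% betahat)          # both entries are 0 up to rounding error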

Now solve these equations for $\hat\beta$. Let's look at the dimensions of the matrices first. $X^T$ is $p \times n$, $Y$ is $n \times 1$, and $X^TX$ is a $p \times n$ matrix multiplied by an $n\times p$ matrix, which produces a $p \times p$ matrix. If the matrix $X^TX$ has rank $p$ then $X^TX$ is not singular and its inverse $(X^TX)^{-1}$ exists. Then we can solve the equation

\begin{displaymath}X^T Y = X^T X \hat\beta
\end{displaymath}

for $\hat\beta$ by multiplying both sides by $(X^TX)^{-1}$ to get

\begin{displaymath}\hat\beta = (X^T X)^{-1} X^T Y
\end{displaymath}

This is the ordinary least squares estimator. See assignment 1 for an example with ${\rm rank}(X) < p$. See chapter 5 in the text for a review of matrices and vectors.
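
To see concretely what can go wrong when ${\rm rank}(X) < p$, here is a made-up illustration (not the assignment 1 example): if one column of $X$ is a copy of another, then $X^TX$ is singular and the normal equations cannot be solved by inverting it.

# hypothetical rank-deficient design: columns 2 and 3 are identical,
# so rank(X) = 2 < p = 3 and X^T X has no inverse
X <- cbind(rep(1, 5), 1:5, 1:5)
Y <- c(1.2, 1.9, 3.1, 3.8, 5.2)       # made-up responses
solve(t(X) %*% X, t(X) %*% Y)         # fails: X^T X is singular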

Normal Equations for Simple Linear Regression

Thermoluminescence Example

See Lecture 1 for the framework. Here I consider two models: a straight-line model in which the mean count is a linear function of dose, and a quadratic model which adds a squared-dose term.

First, I do general theoretical formulas, then I stick in numbers and do arithmetic:


\begin{displaymath}X= \left[\begin{array}{cc}
1 & D_1 \\ \vdots & \vdots \\ 1 & D_n \end{array}\right]
\end{displaymath}


\begin{displaymath}
X^T X =
\left[\begin{array}{ccc}
1 & \cdots & 1 \\ D_1 & \cdots & D_n
\end{array}\right]
\left[\begin{array}{cc}
1 & D_1 \\ \vdots & \vdots \\ 1 & D_n
\end{array}\right]
= \left[\begin{array}{cc}
n & \sum_{i=1}^n D_i \\
\sum_{i=1}^n D_i & \sum_{i=1}^n D_i^2
\end{array}\right]
\end{displaymath}


\begin{displaymath}
X^T Y =
\left[\begin{array}{ccc}
1 & \cdots & 1 \\ D_1 & \cdots & D_n
\end{array}\right]
\left[\begin{array}{c}
Y_1 \\ \vdots \\ Y_n
\end{array}\right]
= \left[\begin{array}{c}
\sum_{i=1}^n Y_i \\ \sum_{i=1}^n D_i Y_i
\end{array}\right]
\end{displaymath}


\begin{displaymath}
(X^T X)^{-1} = \frac{1}{n\sum D_i^2 - (\sum D_i)^2}
\left[\begin{array}{cc}
\sum_{i=1}^n D_i^2 & -\sum_{i=1}^n D_i \\
-\sum_{i=1}^n D_i & n
\end{array}\right]
\end{displaymath}


\begin{align*}
(X^T X)^{-1}X^T Y &=
\left[ \begin{array}{c}
\frac{\sum Y_i \sum D_i^2 - \sum D_i \sum D_i Y_i}{n\sum D_i^2 - (\sum D_i)^2}
\\ \\
\frac{n\sum D_i Y_i - \sum D_i \sum Y_i}{n\sum D_i^2 - (\sum D_i)^2}
\end{array}\right]
\\
&=
\left[ \begin{array}{c}
\bar{Y} - \frac{S_{DY}}{S_{DD}} \bar{D}
\\ \\
\frac{S_{DY}}{S_{DD}}
\end{array}\right]
\end{align*}
Here $S_{DD} = \sum_{i=1}^n (D_i - \bar{D})^2$ and $S_{DY} = \sum_{i=1}^n (D_i - \bar{D})(Y_i - \bar{Y})$.

The data are

Dose Count
   
0 27043
0 26902
0 25959
150 27700
150 27530
150 27460
420 30650
420 30150
420 29480
900 34790
900 32020
1800 42280
1800 39370
1800 36200
3600 53230
3600 49260
3600 53030
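
As a check on the shortcut formulas above, here is a minimal S sketch (the D and Y vectors are the Dose and Count columns of the table, typed in by hand) computing $S_{DD}$, $S_{DY}$ and the resulting slope and intercept; the answers should agree with the matrix calculations in the session below.

D <- c(0, 0, 0, 150, 150, 150, 420, 420, 420, 900, 900,
       1800, 1800, 1800, 3600, 3600, 3600)
Y <- c(27043, 26902, 25959, 27700, 27530, 27460, 30650, 30150, 29480,
       34790, 32020, 42280, 39370, 36200, 53230, 49260, 53030)
SDD <- sum((D - mean(D))^2)
SDY <- sum((D - mean(D)) * (Y - mean(Y)))
slope <- SDY / SDD                       # least squares slope
intercept <- mean(Y) - slope * mean(D)   # least squares intercept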

The design matrix for the linear model is

\begin{displaymath}X = \left[\begin{array}{rr}
1 & 0 \\
1 & 0 \\
1 & 0 \\
1 & 150 \\
1 & 150 \\
1 & 150 \\
1 & 420 \\
1 & 420 \\
1 & 420 \\
1 & 900 \\
1 & 900 \\
1 & 1800 \\
1 & 1800 \\
1 & 1800 \\
1 & 3600 \\
1 & 3600 \\
1 & 3600
\end{array}\right]
\end{displaymath}

You can compute $X^TX$ in Minitab or S-PLUS. That matrix has 4 entries, each of which is the dot product of 2 columns of $X$. For instance, the first column dotted with itself produces $1^2 + \cdots + 1^2 = 17$. Here is an example S session which reads in the data, produces the design matrices for the two models and computes $X^TX$.
[36]skekoowahts% S
S-PLUS : Copyright (c) 1988, 1995 MathSoft, Inc.
S : Copyright AT&T.
Version 3.3 Release 1 for Sun SPARC, SunOS 5.3 : 1995 
Working data will be in .Data 
#
# The data are in a file called linear. The !
# tells S that what follows is not an S command but a standard
# UNIX (or DOS) command
#
> !more linear
  Dose Count
    0 27043
    0 26902
    0 25959
  150 27700
  150 27530
  150 27460
  420 30650
  420 30150
  420 29480
  900 34790
  900 32020
 1800 42280
 1800 39370
 1800 36200
 3600 53230
 3600 49260
 3600 53030
#
# The function help(function) produces help for a function such as
# > help(read.table)
#
# Read in the data from a file.  The file has 18 lines:
# 17 lines of data and a first line which has the names of the variables.
# the function read.table reads such data and header=T warns S
# that the first line is variable names.  The first argument of
# read.table is a character string containing the name of the file
# to read from.
#
> x_read.table("linear",header=T)
> x
   Dose Count 
 1    0 27043
 2    0 26902
 3    0 25959
 4  150 27700
 5  150 27530
 6  150 27460
 7  420 30650
 8  420 30150
 9  420 29480
10  900 34790
11  900 32020
12 1800 42280
13 1800 39370
14 1800 36200
15 3600 53230
16 3600 49260
17 3600 53030
#
# the design matrix has a column of 1s and also
# a column consisting of the first column of x
# which is just the list of covariate values
# The notation x[,1] picks out the first column of x
#
> design.mat_cbind(rep(1,17),x[,1])
#
# To print out an object you type its name!
#
> design.mat
      [,1] [,2] 
 [1,]    1    0
 [2,]    1    0
 [3,]    1    0
 [4,]    1  150
 [5,]    1  150
 [6,]    1  150
 [7,]    1  420
 [8,]    1  420
 [9,]    1  420
[10,]    1  900
[11,]    1  900
[12,]    1 1800
[13,]    1 1800
[14,]    1 1800
[15,]    1 3600
[16,]    1 3600
[17,]    1 3600
#
# Compute X^T X  -- this uses %*% to multiply matrices
# and t(x) to compute the transpose of a matrix x.
#
> xprimex <- t(design.mat)%*% design.mat
> xprimex
      [,1]     [,2] 
[1,]    17    19710
[2,] 19710 50816700
#
# Compute X^T Y
#
> xprimey <- t(design.mat)%*% x[,2] 
> xprimey
          [,1] 
[1,]    593054
[2,] 882452100
#
# Next compute the least squares estimates by solving the
# normal equations
#
> solve(xprimex,xprimey)
             [,1] 
[1,] 26806.734691
[2,]     6.968012
#
# solve(A,b) computes the solution of Ax=b for A a 
# square matrix and b a vector. Of course x=A^{-1}b.
#
#
# The next piece of code regresses the variable Count on 
# Dose taking the data from x.
#
> lm( Count~Dose,data=x)
Call:
lm(formula = Count ~ Dose, data = x)

Coefficients:
 (Intercept)     Dose 
    26806.73 6.968012

Degrees of freedom: 17 total; 15 residual
Residual standard error: 1521.238 
#
# Notice that the estimates agree with our calculations
# The residual standard error is the usual estimate of sigma
# namely the square root of the Mean Square for Error.
#
#
# Now we add a third column to fit the quadratic model
#
> design.mat2_cbind(design.mat,design.mat[,2]^2)
#
# Here is X^T X
#
> t(design.mat2)%*% design.mat2
         [,1]         [,2]         [,3] 
[1,]       17        19710 5.081670e+07
[2,]    19710     50816700 1.591544e+11
[3,] 50816700 159154389000 5.367847e+14
#
# Here is X^T Y
#
> t(design.mat2)%*% x[,2]
             [,1] 
[1,] 5.930540e+05
[2,] 8.824521e+08
[3,] 2.469275e+12
#
# But the following illustrates the dangers of doing computations
# blindly on the computer.  The trouble is that the design matrix 
# has a third column which is so much larger than the first two.
#
> solve(t(design.mat2)%*% design.mat2, t(design.mat2)%*% x[,2])
Error in solve.qr(a, b): apparently singular matrix
Dumped
#
# However, good packages know numerical techniques which avoid
# the danger.
#
> lm(Count ~ Dose+I(Dose^2),data=x)
Call:
lm(formula = Count ~ Dose + I(Dose^2), data = x)

Coefficients:
 (Intercept)     Dose     I(Dose^2) 
    26718.11 7.240314 -7.596867e-05

Degrees of freedom: 17 total; 14 residual
Residual standard error: 1571.277 
> 
#
# WARNING: you can't tell from the size of an estimate, such as
# that of beta_3, whether or not it is important -- you have to
# compare it to the values of the corresponding covariate and to
# its standard error
#
> q()
# Used to quit S:  pay attention to () -- that part is essential!
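
As a follow-up to the warning above: in S the standard errors can be obtained with summary(); for example (a sketch, output not shown here)

> fit <- lm(Count ~ Dose + I(Dose^2), data = x)
> summary(fit)

prints each coefficient together with its standard error and t-value, so that an estimate such as $\hat\beta_3$ can be compared to its standard error.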





Richard Lockhart
1999-01-12