STAT 350: Lecture 22

Reading:

Goodness-of-fit: Pure Error Sum of Squares

If several observations are available at each (or at least sufficiently many) of the covariate combinations in a data set, we can carry out an extra sum of squares F-test to see if our regression model is adequate. Suppose that $x_1,\ldots,x_K$ are the distinct rows of the design matrix and suppose we have $n_1$ observations for which the covariate values are those in $x_1$, $n_2$ observations with covariate pattern $x_2$, and so on. Of course $n_1+\cdots+n_K = n$. We compare our final fitted model with a so-called saturated model by an extra sum of squares F-test. To be precise, we let $\alpha_1$ be the mean value of Y when the covariate pattern is $x_1$, $\alpha_2$ the mean corresponding to $x_2$, and so on. Relabel the n data points as $Y_{i,j}; j=1,\ldots,n_i; i=1,\ldots,K$ and fit a one way ANOVA model to the $Y_{i,j}$. The error sum of squares for this FULL model is

\begin{displaymath}ESS_{FULL}= \sum_{i=1}^K \sum_{j=1}^{n_i}(Y_{i,j}-\bar{Y}_{i,\cdot})^2
\end{displaymath}

Here $\bar{Y}_{i,\cdot}$ is the average of the $n_i$ observations with covariate pattern $x_i$. This ESS is called the pure error sum of squares because we have not assumed any particular relation between the mean of Y and the covariate vector x. We form the F statistic for testing the overall quality of our model by computing the "lack of fit SS" as

\begin{displaymath}ESS_{Restricted} - ESS_{FULL}
\end{displaymath}

where the restricted model is the final model whose fit we are checking.
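
To make the test explicit, the standard extra sum of squares statistic can be written out as follows. The symbol p, for the number of regression parameters (including the intercept) in the restricted model, is notation introduced here and does not appear elsewhere in these notes.

\begin{displaymath}F = \frac{(ESS_{Restricted} - ESS_{FULL})/(K-p)}{ESS_{FULL}/(n-K)}
\end{displaymath}

Under the usual normal theory assumptions, and if the restricted model is correct, this statistic has an F distribution with $K-p$ and $n-K$ degrees of freedom; large values suggest lack of fit. In the plaster example that follows, $K=9$, $n=18$ and $p=3$, giving 6 and 9 degrees of freedom.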

As an example, return to the plaster hardness data of Lecture 12. There are 9 different covariate patterns, corresponding to all possible combinations of the 3 levels of SAND and the 3 levels of FIBRE. There are two ways to compute the pure error sum of squares: create a new variable with 9 levels which labels the 9 categories, or fit a two way ANOVA with interactions:

DATA (columns: SAND, FIBRE, COMBIN, HARDNESS, STRENGTH)
0 0 1 61 34
0 0 1 63 16
15 0 2 67 36
15 0 2 69 19
30 0 3 65 28
30 0 3 74 17
0 25 4 69 49
0 25 4 69 48
15 25 5 69 43
15 25 5 74 29
30 25 6 74 31
30 25 6 72 24
0 50 7 67 55
0 50 7 69 60
15 50 8 69 45
15 50 8 74 43
30 50 9 74 22
30 50 9 74 48
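
As a cross-check on the definition of $ESS_{FULL}$, the pure error sum of squares can be computed directly from the data listing above. The following is a minimal sketch in Python, not part of the course's SAS code; the function name `pure_error_ss` is hypothetical, and only the COMBIN and HARDNESS columns are used.

```python
from collections import defaultdict

# (combin, hardness) pairs copied from the DATA listing above.
data = [
    (1, 61), (1, 63), (2, 67), (2, 69), (3, 65), (3, 74),
    (4, 69), (4, 69), (5, 69), (5, 74), (6, 74), (6, 72),
    (7, 67), (7, 69), (8, 69), (8, 74), (9, 74), (9, 74),
]

def pure_error_ss(pairs):
    """Return (ESS_FULL, error df) for a one way ANOVA on the group labels."""
    groups = defaultdict(list)
    for label, y in pairs:
        groups[label].append(y)
    ess = 0.0
    for ys in groups.values():
        mean = sum(ys) / len(ys)                  # Y-bar_{i,.}
        ess += sum((y - mean) ** 2 for y in ys)   # within-group squared deviations
    df = len(pairs) - len(groups)                 # n - K
    return ess, df

ess_full, df_full = pure_error_ss(data)
print(ess_full, df_full)  # 73.5 9
```

The result agrees with the Error line (9 df, SS 73.5) that SAS reports for the two saturated fits.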

SAS CODE

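  options pagesize=60 linesize=80;
  /* Columns: sand, fibre, cell label (1-9), hardness, strength */
  data plaster;
  infile 'plaster1.dat';
  input sand fibre combin hardness strength;
  /* Restricted model: hardness linear in sand and fibre */
  proc glm  data=plaster;
   model hardness = sand fibre;
  run;
  /* Full model, version 1: two way ANOVA with interaction */
  proc glm  data=plaster;
   class sand fibre;
   model hardness = sand | fibre ;
  run;
  /* Full model, version 2: one way ANOVA on the 9 cell labels */
  proc glm  data=plaster;
   class combin;
   model hardness = combin;
  run;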

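EDITED OUTPUT

The three tables come, in order, from the three PROC GLM steps: the regression on SAND and FIBRE, the two way ANOVA with interaction, and the one way ANOVA on COMBIN. The last two fits are the same saturated model, so their tables agree.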

                           Sum of         Mean
Source           DF       Squares       Square  F Value  Pr > F
Model             2  167.41666667  83.70833333    11.53  0.0009
Error            15  108.86111111   7.25740741
Corrected Total  17  276.27777777
_______________________________________________________________
                           Sum of         Mean
Source           DF       Squares       Square  F Value  Pr > F
Model             8  202.77777778  25.34722222     3.10  0.0557
Error             9   73.50000000   8.16666667
Corrected Total  17  276.27777778
_______________________________________________________________
                           Sum of         Mean
Source           DF       Squares       Square  F Value  Pr > F
Model             8  202.77777778  25.34722222     3.10  0.0557
Error             9   73.50000000   8.16666667
Corrected Total  17  276.27777778

From the output we can put together a summary ANOVA table:

Source             df       SS       MS      F     P
Model               2  167.417   83.708
Lack of Fit         6   35.361    5.894  0.722  0.64
Pure Error          9   73.500    8.167
Total (Corrected)  17  276.278
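
The Lack of Fit line is obtained by subtraction from the two Error lines in the SAS output. A minimal sketch of that arithmetic in Python (not part of the course's SAS code; the function name `lack_of_fit` is hypothetical, and the inputs are copied from the output above):

```python
# Error sums of squares and degrees of freedom read from the SAS output above.
ess_restricted, df_restricted = 108.86111111, 15   # model hardness = sand fibre
ess_full, df_full = 73.5, 9                         # saturated (9-cell) model

def lack_of_fit(ess_r, df_r, ess_f, df_f):
    """Return (SS, df, MS, F) for the lack-of-fit line of the ANOVA table."""
    ss = ess_r - ess_f          # ESS_Restricted - ESS_FULL
    df = df_r - df_f            # difference in error degrees of freedom
    ms = ss / df
    f = ms / (ess_f / df_f)     # lack-of-fit MS over pure error MS
    return ss, df, ms, f

ss, df, ms, f = lack_of_fit(ess_restricted, df_restricted, ess_full, df_full)
print(f"Lack of Fit: df={df} SS={ss:.3f} MS={ms:.3f} F={f:.3f}")
# Lack of Fit: df=6 SS=35.361 MS=5.894 F=0.722
```

The P value in the table is the upper tail area of the F distribution with 6 and 9 degrees of freedom beyond 0.722. A value as large as 0.64 gives no evidence that the model with SAND and FIBRE as linear terms fits worse than the saturated model.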


Richard Lockhart
1999-03-03