No Title

STAT 350: Lecture 19

Reading:

A modelling exercise: The SCENIC data set, continued

See Lecture 18 for plots of the data. Here we fit several models and discuss interpretation of the coefficients.

I begin by thinking about how the variables should influence Risk. It seems natural that risk of infection would increase with length of stay. Of course this data provides only ecological correlations, that is, average risk for a whole group of patients is being related to average stay. Nothing in the data can indicate clearly whether or not patients with long stays are actually getting infected more often. Nevertheless, it seems reasonable that if we could hold other features of the hospital environment constant then hospitals with long average stays would expose their patients to more risk, that is, would have a higher probability of infection. Similar remarks should be made about all the discussion below.

The variable Age would be expected to be positively correlated with risk of infection since, presumably, older patients are more susceptible.

The two variables Chest and Culture are measures of how hard the hospital looks for otherwise unsuspected infection. If you look harder you should find more so that the coefficients of these variables in any regression model would be expected to be positive.

The variables Beds, Census and Nurses all measure the size of the hospital. It is not clear if a big hospital should put a patient at risk of infection or not. However, I point out that the relations between these three variables might make a difference. A hospital with high Census compared to Beds may be overcrowded and so more likely to support infections. A hospital with lots of Nurses relative to patients might be expected to be kept in better condition and so lower the risk.

The variable Facilities seems to measure the sophistication of medical treatment at the hospital. This might be positive or it might suggest more exotic diseases among the patient population so I can't guess intelligently about the direction of the effect.

We begin by fitting a model using all the continuous predictors, that is ignoring only SCHOOL and REGION. Here is Splus code and results.

> fit.full <- lm(Risk ~ Stay + Age + Culture + Chest + Beds + Census +
Nurses + Facilities, data=scenic)
> summary(fit.full)

Call: lm(formula = Risk ~ Stay + Age + Culture + Chest + Beds + Census +
             Nurses + Facilities, data = scenic)
Residuals:
    Min      1Q  Median     3Q   Max 
 -2.222 -0.5918 0.01833 0.5453 2.556

Coefficients:
              Value Std. Error t value Pr(>|t|) 
(Intercept) -0.7473  1.2076    -0.6188  0.5374 
       Stay  0.1769  0.0691     2.5621  0.0118 
        Age  0.0162  0.0223     0.7285  0.4679 
    Culture  0.0470  0.0107     4.3719  0.0000 
      Chest  0.0120  0.0055     2.1943  0.0304 
       Beds -0.0014  0.0027    -0.5340  0.5945 
     Census  0.0007  0.0035     0.2097  0.8343 
     Nurses  0.0019  0.0018     1.0873  0.2794 
 Facilities  0.0163  0.0102     1.5978  0.1131 

Residual standard error: 0.959 on 104 degrees of freedom
Multiple R-Squared: 0.5251 
F-statistic: 14.37 on 8 and 104 degrees of freedom, the p-value is 6.117e-14 

Correlation of Coefficients:
           (Intercept)    Stay     Age Culture   Chest    Beds  Census  Nurses 
      Stay -0.0146
 
       Age -0.8622     -0.3347
 
   Culture -0.1606     -0.2930  0.3042
 
     Chest -0.1914     -0.3039  0.0281 -0.2911
 
      Beds -0.0377      0.2726 -0.0775 -0.0551  0.0077
 
    Census  0.0536     -0.4625  0.1369  0.1530  0.0657 -0.8766
 
    Nurses  0.0041      0.2486 -0.0410 -0.1935 -0.0657 -0.1681 -0.2265
 
Facilities -0.1544     -0.0475 -0.0232 -0.0363 -0.0613 -0.2009  0.0670 -0.2229

The output suggests that Stay, Culture and Chest are important predictors but none of the others are. So we fit next the model retaining only these 3 predictors.

> fit.1 <- lm( Risk ~ Stay + Culture + Chest, data=scenic)
> summary(fit.1)

Call: lm(formula = Risk ~ Stay + Culture + Chest, data = scenic)
Residuals:
    Min      1Q   Median    3Q   Max 
 -2.181 -0.7678 -0.04002 0.696 2.594

Coefficients:
             Value Std. Error t value Pr(>|t|) 
(Intercept) 0.3092 0.5425     0.5700  0.5698  
       Stay 0.2450 0.0540     4.5349  0.0000  
    Culture 0.0494 0.0103     4.8008  0.0000  
      Chest 0.0110 0.0056     1.9839  0.0498  

Residual standard error: 0.99 on 109 degrees of freedom
Multiple R-Squared: 0.4696 
F-statistic: 32.16 on 3 and 109 degrees of freedom, the p-value is 5.662e-15 

Correlation of Coefficients:
        (Intercept)    Stay Culture 
   Stay -0.6633                    
Culture  0.1766     -0.1963        
  Chest -0.4611     -0.2848 -0.3435

We can compare these models in Splus by carrying out an extra Sum of Squares F-test. I have edited the output to make it fit - changing the columns labelled Terms to the shorthand FULL and REDUCED

> anova(fit.1,fit.full)
Analysis of Variance Table

Response: Risk

    Terms     Resid.  Df    RSS       Test Df  Sum of Sq    F Value      Pr(F) 
1   REDUCED	109	   106.8206  
2   FULL	104	    95.6398     5       11.18079    2.431627    0.03972748

You see that the test of the hypothesis of no influence of the 5 extra variables suggests that you cannot delete them all, even though individual t-tests for the 5 individual coefficients are not significant. To see what can happen consider adding each of the three variables which measure size

> fit.b <-  lm(Risk ~ Stay + Culture + Chest + Beds , data = scenic)
> fit.c <- lm(Risk ~ Stay + Culture + Chest + Census, data = scenic)
> fit.n <- lm(Risk ~ Stay + Culture + Chest + Nurses, data = scenic)
> summary(fit.b)

Call: lm(formula = Risk ~ Stay + Culture + Chest + Beds, data = scenic)
Residuals:
    Min      1Q  Median   3Q   Max 
 -1.993 -0.7365 0.05695 0.66 2.291

Coefficients:
             Value Std. Error t value Pr(>|t|) 
(Intercept) 0.4149 0.5309     0.7816  0.4362  
       Stay 0.1845 0.0578     3.1940  0.0018  
    Culture 0.0480 0.0101     4.7701  0.0000  
      Chest 0.0130 0.0055     2.3761  0.0193  
       Beds 0.0013 0.0005     2.5516  0.0121  

Residual standard error: 0.9658 on 108 degrees of freedom
Multiple R-Squared: 0.4997 
F-statistic: 26.97 on 4 and 108 degrees of freedom, the p-value is
1.554e-15 

Correlation of Coefficients:
        (Intercept)    Stay Culture   Chest 
   Stay -0.6351                            
Culture  0.1714     -0.1558                
  Chest -0.4439     -0.3155 -0.3475        
   Beds  0.0780     -0.4099 -0.0560  0.1424
> summary(fit.c)

Call: lm(formula = Risk ~ Stay + Culture + Chest + Census, data = scenic)
Residuals:
    Min      1Q  Median     3Q   Max 
 -1.984 -0.7584 0.07387 0.6545 2.447

Coefficients:
             Value Std. Error t value Pr(>|t|) 
(Intercept) 0.5233 0.5357     0.9770  0.3307  
       Stay 0.1719 0.0599     2.8694  0.0049  
    Culture 0.0484 0.0101     4.8199  0.0000  
      Chest 0.0132 0.0055     2.3952  0.0183  
     Census 0.0017 0.0007     2.5656  0.0117  

Residual standard error: 0.9655 on 108 degrees of freedom
Multiple R-Squared: 0.5 
F-statistic: 27 on 4 and 108 degrees of freedom, the p-value is 1.554e-15 

Correlation of Coefficients:
        (Intercept)    Stay Culture   Chest 
   Stay -0.6504                            
Culture  0.1683     -0.1542                
  Chest -0.4270     -0.3189 -0.3452        
 Census  0.1558     -0.4757 -0.0384  0.1497
> summary(fit.n)

Call: lm(formula = Risk ~ Stay + Culture + Chest + Nurses, data = scenic)
Residuals:
    Min      1Q  Median     3Q   Max 
 -1.958 -0.7093 0.02961 0.5473 2.453

Coefficients:
             Value Std. Error t value Pr(>|t|) 
(Intercept) 0.3703 0.5241     0.7065  0.4814  
       Stay 0.1936 0.0549     3.5263  0.0006  
    Culture 0.0456 0.0100     4.5505  0.0000  
      Chest 0.0127 0.0054     2.3480  0.0207  
     Nurses 0.0021 0.0007     2.9949  0.0034  

Residual standard error: 0.9556 on 108 degrees of freedom
Multiple R-Squared: 0.5102 
F-statistic: 28.13 on 4 and 108 degrees of freedom, the p-value is 5.551e-16 

Correlation of Coefficients:
        (Intercept)    Stay Culture   Chest 
   Stay -0.6417                            
Culture  0.1700     -0.1450                
  Chest -0.4544     -0.3008 -0.3519        
 Nurses  0.0389     -0.3126 -0.1276  0.1013

You should note that each of the measures of size is significant; that is, it appears that you should add each of them. But because the three measures of size are quite multicolinear you get no additional predictive power out of including more than one of them in the model. Look what happens when I add both Census and Beds:

> fit.bc <- lm(Risk ~ Stay + Culture + Chest + Census + Beds, data = scenic)
> summary(fit.bc)

Call: lm(formula = Risk ~ Stay + Culture + Chest + Census + Beds, data = scenic)
Residuals:
    Min      1Q  Median     3Q   Max
 -1.986 -0.7498 0.07771 0.6563 2.386

Coefficients:
             Value Std. Error t value Pr(>|t|)
(Intercept) 0.4841 0.5742     0.8431  0.4010
       Stay 0.1760 0.0637     2.7611  0.0068
    Culture 0.0483 0.0101     4.7610  0.0000
      Chest 0.0131 0.0055     2.3797  0.0191
     Census 0.0011 0.0034     0.3243  0.7463
       Beds 0.0005 0.0026     0.1956  0.8453

Residual standard error: 0.9699 on 107 degrees of freedom
Multiple R-Squared: 0.5002
F-statistic: 21.42 on 5 and 107 degrees of freedom, the p-value is 8.549e-15

Correlation of Coefficients:
        (Intercept)    Stay Culture   Chest  Census
   Stay -0.6906
Culture  0.1887     -0.1749
  Chest -0.3926     -0.3080 -0.3417
 Census  0.3715     -0.4140  0.0810  0.0513
   Beds -0.3492      0.3301 -0.0906 -0.0215 -0.9794

Notice again that neither the coefficient of Beds nor that of Census is significantly different from 0. The reason for the confusion is in the matrix of correlations (which is (X^TX)^-1 converted to correlation form by dividing the ijth entry by the square root of the product of the ith and jth diagonal entries). In particular the $\beta$ components for Beds and Census are very highly negatively correlated.

Here is Splus code for an ANOVA table comparing the model with Beds and Census to the model without them.

> anova(fit.1,fit.bc)
Analysis of Variance Table

Response: Risk

                                   Terms Resid. Df      RSS         Test Df Sum of Sq  F Value      Pr(F) 
1                 Stay + Culture + Chest       109 106.8206
2 Stay + Culture + Chest + Census + Beds       107 100.6482 +Census+Beds  2  6.17243 3.280984 0.04140679

Finally look what happens if we put in Beds and Nurses.

> fit.nb <- lm(Risk ~ Stay + Culture + Chest + Nurses + Beds, data = scenic)
> summary(fit.nb)

Call: lm(formula = Risk ~ Stay + Culture + Chest + Nurses + Beds, data = scenic)
Residuals:
    Min      1Q  Median     3Q   Max
 -1.959 -0.7159 0.05094 0.5114 2.513

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)  0.3523  0.5290     0.6660  0.5069
       Stay  0.1998  0.0582     3.4312  0.0009
    Culture  0.0451  0.0102     4.4383  0.0000
      Chest  0.0125  0.0055     2.2807  0.0245
     Nurses  0.0026  0.0017     1.5528  0.1234
       Beds -0.0004  0.0012    -0.3336  0.7394

Residual standard error: 0.9596 on 107 degrees of freedom
Multiple R-Squared: 0.5107
F-statistic: 22.34 on 5 and 107 degrees of freedom, the p-value is 2.776e-15

Correlation of Coefficients:
        (Intercept)    Stay Culture   Chest  Nurses
   Stay -0.6371
Culture  0.1819     -0.1818
  Chest -0.4364     -0.3217 -0.3285
 Nurses -0.0763      0.1693 -0.1821 -0.0678
   Beds  0.1019     -0.3230  0.1422  0.1211 -0.9079

Notice particularly that Beds now has a negative coefficient (though not significantly so).

SUMMARY: Because this is an observational study you cannot interpret the parameter estimates as measuring the amount by which the response would be expected to increase if you increased the corresponding variable by 1 unit. The trouble is that in the population, large values of Beds go with, usually, larger values of Census. So the coefficient of Beds measures something rather more like: how much would the response increase if I increased Beds by 1 unit and all the other covariates changed as you would expect from the relations in the variables seen in this population. Thus you can't really base a policy decision about building more beds in a hospital on the sign of the coefficient of Beds in a regression model like this.

$next$ $up$ $previous$

Richard Lockhart
1999-01-07