
STAT 350: Lecture 33

Theory of Generalized Linear Models

If Y has a Poisson distribution with parameter $\mu$ then

\begin{displaymath}P(Y=y) = \frac{\mu^ye^{-\mu}}{y!}
\end{displaymath}

for y a non-negative integer. We can use the method of maximum likelihood to estimate $\mu$ if we have a sample $Y_1,\ldots, Y_n$ of independent Poisson random variables all with mean $\mu$. If we observe $Y_1=y_1$, $Y_2=y_2$ and so on, then the likelihood function is

\begin{displaymath}P(Y_1=y_1,\ldots,Y_n=y_n) = \prod_{i=1}^n \frac{\mu^{y_i}e^{-
\mu}}{y_i!} = \frac{\mu^{\sum y_i} e^{-n\mu}}{\prod y_i!}
\end{displaymath}

This function of $\mu$ can be maximized by maximizing its logarithm, the log likelihood function. We set the derivative of the log likelihood with respect to $\mu$ equal to 0, getting

\begin{displaymath}\frac{d}{d\mu} [\sum y_i \log \mu - n\mu - \sum \log(y_i!)]=
\sum y_i /\mu- n= 0 \, .
\end{displaymath}

This is the likelihood equation and its solution $\hat\mu =
\bar{y}$ is the maximum likelihood estimate of $\mu$.
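As a quick numerical check, the following short sketch in R (a close relative of the Splus mentioned at the end of these notes) evaluates the log likelihood on a grid of values of $\mu$ and confirms that the maximizer is essentially the sample mean; the data vector y is invented purely for illustration.

  y <- c(2, 0, 3, 1, 4, 2, 1)     # made-up Poisson counts
  # log likelihood: sum(y)*log(mu) - n*mu - sum(log(y!))
  loglik <- function(mu) sum(y) * log(mu) - length(y) * mu - sum(lgamma(y + 1))
  mu.grid <- seq(0.5, 4, by = 0.01)
  mu.grid[which.max(sapply(mu.grid, loglik))]   # maximizer over the grid
  mean(y)                                       # the MLE, ybar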

In a regression problem the $Y_i$ will all have different means $\mu_i$. Our log-likelihood is now

\begin{displaymath}\sum y_i \log\mu_i -\sum \mu_i - \sum \log(y_i!)
\end{displaymath}

If we treat all n of the $\mu_i$ as unknown parameters we can maximize the log likelihood by setting each of the n partial derivatives with respect to $\mu_k$ for k from 1 to n equal to 0. The kth of these n equations is just

\begin{displaymath}y_k/\mu_k -1 = 0 \, .
\end{displaymath}

This leads to $\hat\mu_k=y_k$. In glm jargon this model is the saturated model.

A more useful model is one with fewer than n parameters but more than one. A typical glm model is

\begin{displaymath}\mu_i = \exp(x_i^T\beta)
\end{displaymath}

where $x_i$ is the vector of covariate values for the ith observation (often including an intercept term, just as in standard linear regression).

In this case the log-likelihood is

\begin{displaymath}\sum y_i x_i^T\beta- \sum \exp(x_i^T\beta) - \sum\log(y_i!)
\end{displaymath}

which should be treated as a function of $\beta$ and maximized.
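Written as an R function this log-likelihood is very short; in the sketch below the names X, y and beta are placeholders, with X the $n \times p$ matrix whose ith row is $x_i^T$, y the vector of counts and beta the parameter vector.

  # Poisson regression log likelihood as a function of beta
  pois.loglik <- function(beta, X, y) {
    eta <- drop(X %*% beta)                 # linear predictors x_i^T beta
    sum(y * eta) - sum(exp(eta)) - sum(lgamma(y + 1))
  }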

The derivative of this log-likelihood with respect to $\beta_k$ is

\begin{displaymath}\sum y_i x_{ik} - \sum \exp(x_i^T\beta) x_{ik} = \sum (y_i -
\mu_i) x_{ik}
\end{displaymath}

If $\beta$ has p components then setting these p derivatives equal to 0 gives the likelihood equations.
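These equations generally have no closed-form solution (see below), but they are easy to solve numerically. As an illustration, the R function glm fits the log-linear Poisson model; the sketch below uses made-up data and then checks that the fit satisfies the likelihood equations, i.e. that $\sum (y_i-\hat\mu_i)x_{ik}$ is essentially 0 for each k.

  set.seed(1)                                # invented data for illustration
  x <- seq(0, 2, length = 20)
  y <- rpois(20, exp(0.5 + 0.8 * x))         # true beta = (0.5, 0.8)
  fit <- glm(y ~ x, family = poisson)        # solves the likelihood equations
  X <- model.matrix(fit)                     # columns: intercept and x
  t(X) %*% (y - fitted(fit))                 # both entries essentially zero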

It is no longer possible to solve the likelihood equations analytically; instead we must settle for numerical techniques. One common technique is called iteratively re-weighted least squares. For a Poisson variable with mean $\mu_i$ the variance is $\sigma_i^2=\mu_i$. Ignore for a moment the fact that if we knew $\sigma_i$ we would know $\mu_i$, and consider fitting our model by weighted least squares with the $\sigma_i^2$ known. We would minimize (see our discussion of weighted least squares)

\begin{displaymath}\sum
\frac{ (Y_i-\mu_i)^2}{\sigma_i^2}
\end{displaymath}

by taking the derivative with respect to $\beta_k$; again ignoring the fact that $\sigma_i^2$ depends on $\beta_k$, we would get

\begin{displaymath}-2\sum \frac{(Y_i - \mu_i)\partial
\mu_i/\partial\beta_k}{\sigma_i^2} = 0
\end{displaymath}

But the derivative of $\mu_i$ with respect to $\beta_k$ is $\mu_ix_{ik}$ and replacing $\sigma_i^2$ by $\mu_i$ we get the equation

\begin{displaymath}\sum(Y_i - \mu_i)x_{ik} = 0
\end{displaymath}

exactly as before. This motivates the following estimation scheme.

1.
Begin with a guess for the standard deviations $\sigma_i$ (taking them all equal to 1 is simple).

2.
Do (non-linear) weighted least squares using the guessed weights. Get estimated regression parameters $\hat\beta$.

3.
Use these to compute estimated variances $\hat\sigma_i^2$. Go back to do weighted least squares with these weights.

4.
Iterate (repeat steps 2 and 3 over and over) until the estimates stop changing appreciably.

NOTE: if the iteration converges then the final estimate is a fixed point of the algorithm, and such a fixed point solves the equation

\begin{displaymath}\sum(Y_i - \mu_i)x_{ik} = 0
\end{displaymath}

derived above.
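The sketch below shows one bare-bones way to carry out such an iteration in R for the Poisson log-linear model. It follows the usual computational organization of iteratively re-weighted least squares: at each pass the model is linearized around the current $\hat\beta$, a working response is formed, and a weighted least squares step is taken with weights based on the current estimated variances $\hat\sigma_i^2 = \hat\mu_i$. This is meant only to show the idea, not to replace glm; the vectors x and y are assumed to be those from the earlier made-up example.

  irls.poisson <- function(y, X, tol = 1e-8, maxit = 50) {
    beta <- rep(0, ncol(X))                  # crude starting value
    for (it in 1:maxit) {
      eta <- drop(X %*% beta)
      mu  <- exp(eta)                        # current fitted means
      w   <- mu                              # weights from Var(Y_i) = mu_i
      z   <- eta + (y - mu) / mu             # working (linearized) response
      beta.old <- beta
      beta <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
      if (max(abs(beta - beta.old)) < tol) break
    }
    beta
  }
  irls.poisson(y, cbind(1, x))       # agrees with glm(y ~ x, family = poisson)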

Non-linear least squares

In a regression model of the form

\begin{displaymath}Y_i = \mu_i +\epsilon_i
\end{displaymath}

we call the model linear if $\mu_i = x_i^T\beta$ for some covariates $x_i$ and a parameter $\beta$. Often, however, we have non-linear models such as $\mu_i =\beta_0(1-\exp(- \beta_1 x_i))$ or, in general, $\mu_i = f(x_i,\beta)$ for some function f which need not be linear in $\beta$. If the errors are homoscedastic normal errors then the maximum likelihood estimate of $\beta$ is the least squares estimate, minimizing $\sum(Y_i - \mu_i)^2$. The value $\hat\beta$ then solves

\begin{displaymath}\sum (Y_i-\mu_i) \frac{\partial \mu_i}{\partial\beta} = 0 \, .
\end{displaymath}

This equation usually cannot be solved analytically; in general it must be solved numerically, by iteration. One approach is:

1.
Begin with an initial guess for the parameter estimates, say $\hat\beta^{(0)}$ (there is no general method available for this step).

2.
Linearize the function $\mu_i(\beta)$ around $\hat\beta^{(0)}$ by doing a Taylor expansion. This involves writing

\begin{displaymath}\mu_i(\beta) \approx \mu_i(\hat\beta^{(0)}) + x_i^T (\beta -\hat\beta^{(0)})
\end{displaymath}

where $x_i$ is the vector of partial derivatives of $\mu_i$ with respect to the entries of $\beta$ (evaluated at $\hat\beta^{(0)}$).

3.
Replace $\mu_i$ in the sum of squares by this approximation; you are then trying to minimize

\begin{displaymath}\sum (Y_i - \mu_i(\hat\beta^{(0)})- x_i^T(\beta -\hat\beta^{(0)}))^2
\end{displaymath}

with respect to $\beta$.

4.
This is now an ordinary least squares problem in which we regress $Y_i^*=Y_i - \mu_i(\hat\beta^{(0)})$ on $x_i^T$. Let $X_0$ be the $n \times p$ matrix with ith row $x_i^T=\partial\mu_i/\partial\beta$, where the derivative is evaluated at $\hat\beta^{(0)}$. Then we are to minimize

\begin{displaymath}\Vert Y^*-X_0\delta\Vert^2\end{displaymath}

where $\delta = \beta -\hat\beta^{(0)}$. The solution is $\hat\delta = (X_0^TX_0)^{-1}X_0^TY^*$. We compute this $\hat\delta$.

5.
Update $\hat\beta^{(0)}$ to $\hat\beta^{(1)}= \hat\beta^{(0)}+\hat\delta$.

6.
Recompute the matrix of derivatives at the new value $\hat\beta^{(1)}$, get a new $\hat\delta$, then a $\hat\beta^{(2)}$, and so on.

7.
Continue until the estimates change little or the sum of squares changes little.
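A compact R version of this Gauss-Newton loop is sketched below. The user supplies f, a function returning the vector of means $\mu_i(\beta)$, and grad, a function returning the $n \times p$ matrix whose ith row is $\partial\mu_i/\partial\beta$ evaluated at the current $\beta$; the names are placeholders for this sketch.

  gauss.newton <- function(y, f, grad, beta0, tol = 1e-8, maxit = 100) {
    beta <- beta0                            # step 1: initial guess
    for (it in 1:maxit) {
      mu    <- f(beta)                       # fitted means at current beta
      X0    <- grad(beta)                    # matrix of partial derivatives
      ystar <- y - mu                        # working response Y*
      delta <- drop(solve(t(X0) %*% X0, t(X0) %*% ystar))   # OLS step
      beta  <- beta + delta                  # step 5: update the estimate
      if (max(abs(delta)) < tol) break       # step 7: stop when steps are tiny
    }
    beta
  }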

Example

In thermoluminescence dating (see Lecture 1) a common model is

\begin{displaymath}Y_i = \beta_1(1-\exp(-(d_i-\beta_2)/\beta_3)) + \epsilon_i
\end{displaymath}

Thus in this case

\begin{displaymath}\mu_i(\beta) = \beta_1(1-\exp(-(d_i-\beta_2)/\beta_3))
\end{displaymath}

The derivatives of $\mu_i$ with respect to the entries of $\beta$ are

\begin{displaymath}\frac{\partial\mu_i}{\partial\beta_1} = (1-\exp(-(d_i-\beta_2)/\beta_3))
\end{displaymath}


\begin{displaymath}\frac{\partial\mu_i}{\partial\beta_2} = -\beta_1\exp(-(d_i-\beta_2)/\beta_3)/\beta_3
\end{displaymath}

and

\begin{displaymath}\frac{\partial\mu_i}{\partial\beta_3} = -\beta_1(d_i-\beta_2)
\exp(-(d_i-\beta_2)/\beta_3)/\beta_3^2
\end{displaymath}
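Coded for use with the Gauss-Newton sketch above, the mean function and these three derivatives might look as follows; here d is assumed to be the vector of doses $d_i$ and Y the vector of observed responses, already in the workspace, and the starting values are hypothetical.

  # mean function; beta = (beta1, beta2, beta3), d = vector of doses
  f.tl <- function(beta) beta[1] * (1 - exp(-(d - beta[2]) / beta[3]))
  grad.tl <- function(beta) {
    e <- exp(-(d - beta[2]) / beta[3])
    cbind(1 - e,                                       # d mu / d beta1
          -beta[1] * e / beta[3],                      # d mu / d beta2
          -beta[1] * (d - beta[2]) * e / beta[3]^2)    # d mu / d beta3
  }
  # e.g. gauss.newton(Y, f.tl, grad.tl, beta0 = c(500, -50, 200))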

Software and computing details

In SAS the procedure proc nlin will do non-linear least squares. You need to specify quite a few ingredients. The model statement must give the exact functional form of $\mu_i(\beta)$. The parms statement specifies the initial guess for $\beta$. You will usually have a number of der statements specifying the derivatives of the predicted values with respect to each parameter and, for some of the algorithms, even second derivatives. (You get a choice of 5 different iterative algorithms; the one described above is called Gauss-Newton.)

In Splus the function nls can be used to do the same thing.
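For instance, in R (where nls works in much the same way) a call fitting the thermoluminescence model of the example might look like the sketch below; the data frame dating, its variables Y and d, and the starting values are all hypothetical.

  fit <- nls(Y ~ b1 * (1 - exp(-(d - b2) / b3)),
             data  = dating,
             start = list(b1 = 500, b2 = -50, b3 = 200))
  summary(fit)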


Richard Lockhart
1999-04-09