
STAT 804: Notes on Lecture 9

Fitting higher order autoregressions

For the model

\begin{displaymath}X_t- \mu = \sum_1^p a_i(X_{t-i}-\mu) + \epsilon_t
\end{displaymath}

we will use conditional likelihood again. Let $\phi$ denote the vector $(a_1,\ldots, a_p)^t$. We now condition on the first p values of X, namely $X_0,\ldots,X_{p-1}$, and use

\begin{displaymath}\ell_c(\phi,\mu,\sigma) = -\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\mu - \sum_1^p a_i(X_{t-i} - \mu)\right]^2
-(T-p)\log(\sigma)
\end{displaymath}

If we estimate $\mu$ using $\bar{X}$ we find that we are trying to maximize

\begin{displaymath}-\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\bar{X} - \sum_1^p a_i(X_{t-i} - \bar{X})\right]^2
-(T-p)\log(\sigma)
\end{displaymath}

To estimate $a_1,\ldots,a_p$, then, we merely minimize the sum of squares

\begin{displaymath}\sum_p^{T-1} \hat\epsilon_t^2 = \sum_p^{T-1} \left[ X_t-\bar{X} - \sum_1^p
a_i(X_{t-i} - \bar{X})\right]^2
\end{displaymath}

This is a straightforward regression problem. We regress the response vector

\begin{displaymath}\left[\begin{array}{c} X_p -\bar{X} \\ \vdots \\ X_{T-1} -\bar{X} \end{array} \right]
\end{displaymath}

on the design matrix

\begin{displaymath}\left[\begin{array}{ccc}
X_{p-1} -\bar{X}& \cdots & X_0 -\bar{X}
\\
\vdots & & \vdots
\\
X_{T-2} -\bar{X}& \cdots & X_{T-p-1} -\bar{X}
\end{array}\right]
\end{displaymath}
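
As an illustration, here is a minimal sketch of this regression fit in Python (assuming numpy; the function and variable names are hypothetical, with x a length-T array holding $X_0,\ldots,X_{T-1}$):

\begin{verbatim}
import numpy as np

def fit_ar_centered(x, p):
    xbar = x.mean()
    xc = x - xbar                      # centered series
    y = xc[p:]                         # response: X_p - Xbar, ..., X_{T-1} - Xbar
    # design matrix: column i holds X_{t-i} - Xbar for t = p, ..., T-1
    Z = np.column_stack([xc[p - i: len(x) - i] for i in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ phi
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (len(x) - p))  # conditional MLE of sigma
    return phi, xbar, sigma_hat
\end{verbatim}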

An alternative to estimating $\mu$ by $\bar{X}$ is to define $\alpha = \mu(1-\sum a_i)$ and then recognize that

\begin{displaymath}\ell(\alpha,\phi,\sigma) = -\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\alpha - \sum_1^p a_iX_{t-i}\right]^2
-(T-p)\log(\sigma)
\end{displaymath}

is maximized by regressing the vector

\begin{displaymath}\left[\begin{array}{c} X_p \\ \vdots \\ X_{T-1} \end{array} \right]
\end{displaymath}

on the design matrix (together with a column of ones for the intercept $\alpha$)

\begin{displaymath}\left[\begin{array}{ccc}
X_{p-1} & \cdots & X_0
\\
\vdots & & \vdots
\\
X_{T-2} & \cdots & X_{T-p-1}
\end{array}\right]
\end{displaymath}

From $\hat\alpha$ and $\hat\phi$ we would get an estimate for $\mu$ by

\begin{displaymath}\hat\mu = \frac{\hat\alpha}{1-\sum \hat{a}_i}\end{displaymath}
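
A corresponding sketch for this parametrization (same assumptions as before: numpy, hypothetical names) regresses $X_t$ on an intercept and the lagged values and then recovers $\hat\mu$:

\begin{verbatim}
import numpy as np

def fit_ar_intercept(x, p):
    T = len(x)
    y = x[p:]                                    # X_p, ..., X_{T-1}
    lags = np.column_stack([x[p - i: T - i] for i in range(1, p + 1)])
    Z = np.column_stack([np.ones(T - p), lags])  # intercept column for alpha
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    alpha_hat, a_hat = coef[0], coef[1:]
    mu_hat = alpha_hat / (1.0 - a_hat.sum())     # mu-hat = alpha-hat / (1 - sum a_i)
    return alpha_hat, a_hat, mu_hat
\end{verbatim}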

Notice that if we put

\begin{displaymath}Z=\left[\begin{array}{ccc}
X_{p-1} -\bar{X}& \cdots & X_0 -\bar{X}
\\
\vdots & & \vdots
\\
X_{T-2} -\bar{X}& \cdots & X_{T-p-1} -\bar{X}
\end{array}\right]
\end{displaymath}

then

\begin{displaymath}Z^tZ \approx T \left[\begin{array}{cccc}
\hat{C}(0) & \hat{C}(1) & \cdots & \hat{C}(p-1)
\\
\hat{C}(1) & \hat{C}(0) & \cdots & \hat{C}(p-2)
\\
\vdots & \vdots & \ddots & \vdots
\\
\hat{C}(p-1) & \cdots & \hat{C}(1) & \hat{C}(0)
\end{array}\right]
\end{displaymath}

and if

\begin{displaymath}Y=\left[\begin{array}{c} X_p -\bar{X} \\ \vdots \\ X_{T-1} -\bar{X}
\end{array} \right]
\end{displaymath}

then

\begin{displaymath}Z^tY \approx T \left[\begin{array}{c} \hat{C}(1) \\ \vdots \\ \hat{C}(p)
\end{array}\right]
\end{displaymath}

so that the normal equations (from least squares)

\begin{displaymath}Z^t Z \phi = Z^t Y
\end{displaymath}

are nearly the Yule-Walker equations again.
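
To see this numerically, the following sketch (assuming numpy and scipy, and taking $\hat{C}(k)$ to be the usual sample autocovariance $T^{-1}\sum(X_t-\bar{X})(X_{t+k}-\bar{X})$) solves the Yule-Walker system; for a long series the solution should be nearly the least-squares estimate of $\phi$ computed above:

\begin{verbatim}
import numpy as np
from scipy.linalg import toeplitz

def c_hat(x, k):
    # sample autocovariance at lag k (divisor T)
    xbar, T = x.mean(), len(x)
    return np.sum((x[:T - k] - xbar) * (x[k:] - xbar)) / T

def yule_walker(x, p):
    c = np.array([c_hat(x, k) for k in range(p + 1)])
    # p x p Toeplitz matrix of C-hat(0), ..., C-hat(p-1); right-hand side C-hat(1), ..., C-hat(p)
    return np.linalg.solve(toeplitz(c[:p]), c[1:])
\end{verbatim}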

Full maximum likelihood

To compute a full MLE of $\theta=(\mu,\phi,\sigma)$ you generally begin by finding preliminary estimates $\hat\theta$, say by one of the conditional likelihood methods above, and then iterate via Newton-Raphson or some other scheme for numerical maximization of the log-likelihood.
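
Here is a sketch of what this looks like for an AR(1), assuming numpy and scipy and hypothetical names: the exact Gaussian log-likelihood adds the stationary $N(\mu,\sigma^2/(1-a^2))$ density of $X_0$ to the conditional terms, and is maximized numerically starting from the conditional estimates.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def ar1_neg_loglik(theta, x):
    mu, a, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    if abs(a) >= 1:                     # enforce stationarity crudely
        return np.inf
    e = x[1:] - mu - a * (x[:-1] - mu)  # one-step prediction errors
    ll = (0.5 * np.log(1 - a ** 2)      # stationary density of X_0 ...
          - (1 - a ** 2) * (x[0] - mu) ** 2 / (2 * sigma2)
          - np.sum(e ** 2) / (2 * sigma2)
          - len(x) * (log_sigma + 0.5 * np.log(2 * np.pi)))
    return -ll

# Starting values, e.g. from the conditional fit above:
#   theta0 = [x.mean(), a_cond, np.log(sigma_cond)]
#   fit = minimize(ar1_neg_loglik, theta0, args=(x,), method="Nelder-Mead")
\end{verbatim}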

Fitting MA(q) models

Here we consider the model with known mean (in practice this means we estimate $\hat\mu=\bar{X}$ and subtract it from all the observations):

\begin{displaymath}X_t = \epsilon_t - b_1 \epsilon_{t-1} - \cdots - b_q\epsilon_{t-q}
\end{displaymath}

In general X has a $MVN(0,\Sigma)$ distribution and, letting $\psi$ denote the vector of $b_i$'s, we find

\begin{displaymath}\ell(\psi,\sigma) = -\frac{1}{2}\left[\log(\det(\Sigma))
+ X^t \Sigma^{-1} X\right]
\end{displaymath}

Here X denotes the column vector of all the data. As an example consider q=1 so that

\begin{displaymath}\Sigma = \left[\begin{array}{ccccc}
\sigma^2(1+b_1^2) & -b_1\sigma^2 & 0 & \cdots & 0
\\
-b_1\sigma^2 & \sigma^2(1+b_1^2) & -b_1\sigma^2 & \cdots & 0
\\
\vdots & \ddots & \ddots & \ddots & \vdots
\\
0 & \cdots & -b_1\sigma^2 & \sigma^2(1+b_1^2) & -b_1\sigma^2
\\
0 & \cdots & 0 & -b_1\sigma^2& \sigma^2(1+b_1^2)
\end{array}\right]
\end{displaymath}

It is not so easy to work with the determinant and inverse of matrices like this. Instead we try to mimic the conditional inference approach above, but with a twist: we now condition on something we haven't observed, namely $\epsilon_{-1}$.
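
Before doing that, note that for modest T the exact likelihood above can at least be evaluated numerically by brute force; a minimal sketch assuming numpy (constants omitted, as in the formula above):

\begin{verbatim}
import numpy as np

def ma1_exact_loglik(b1, sigma, x):
    T = len(x)
    # the banded covariance matrix Sigma displayed above
    Sigma = sigma ** 2 * ((1 + b1 ** 2) * np.eye(T)
                          - b1 * (np.eye(T, k=1) + np.eye(T, k=-1)))
    sign, logdet = np.linalg.slogdet(Sigma)
    quad = x @ np.linalg.solve(Sigma, x)
    return -0.5 * (logdet + quad)
\end{verbatim}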

Notice that
\begin{align*}X_0 & = \epsilon_0 - b\epsilon_{-1}
\\
X_1 & = \epsilon_1 - b\epsilon_0 = \epsilon_1 - b(X_0 + b\epsilon_{-1})
\\
& \;\;\vdots
\\
X_{T-1} & = \epsilon_{T-1} - b(X_{T-2} + b (X_{T-3} + \cdots + b(X_0 + b\epsilon_{-1})\cdots))
\end{align*}
Now imagine that the data were actually

\begin{displaymath}\epsilon_{-1}, X_0,\ldots,X_{T-1}
\end{displaymath}

Then the same idea we used for an AR(1) would give
\begin{align*}\ell(b,\sigma) & = \log(f(\epsilon_{-1},\sigma)) +
\log(f(X_0,\ldots,X_{T-1}\vert\epsilon_{-1},b,\sigma))
\\
& = \log(f(\epsilon_{-1},\sigma)) +
\sum_0^{T-1} \log(f(X_t\vert X_{t-1},\ldots,X_0,\epsilon_{-1},b,\sigma))
\end{align*}
The parameters are listed in the conditions in this formula merely to indicate which terms depend on which parameters. For Gaussian $\epsilon$s the terms in this likelihood are squares as usual (plus logarithms of $\sigma$) leading to
\begin{align*}\ell(b,\sigma) = &
-\frac{\epsilon_{-1}^2}{2\sigma^2} - \log(\sigma)
\\
& - \sum_0^{T-1}\left[\frac{1}{2\sigma^2}\left(X_t - \psi X_{t-1}
- \psi^2 X_{t-2} - \cdots -\psi^{t+1} \epsilon_{-1}\right)^2+\log(\sigma)\right]
\end{align*}
We will estimate the parameters by maximizing this function after getting rid of $\epsilon_{-1}$ somehow.

Method A: Put $\epsilon_{-1}=0$, since 0 is its most probable value, and maximize

\begin{displaymath}-T\log(\sigma) - \frac{1}{2\sigma^2} \sum_0^{T-1} \left[X_t - \psi X_{t-1}
- \psi^2 X_{t-2} - \cdots -\psi^{t}X_0\right]^2
\end{displaymath}

Notice that for large T the coefficient $\psi^{t+1}$ of $\epsilon_{-1}$ is close to 0 for most t (assuming the process is invertible, so that $\vert\psi\vert < 1$), and the few remaining terms are negligible relative to the total.
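
In practice the residuals for Method A are computed recursively rather than by expanding the powers: for the model $X_t = \epsilon_t - b\epsilon_{t-1}$ we have $\epsilon_t = X_t + b\epsilon_{t-1}$, started at $\epsilon_{-1}=0$. A minimal sketch, assuming numpy and scipy (names hypothetical):

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

def css_ma1(b, x):
    # conditional sum of squares with epsilon_{-1} = 0
    e_prev, ss = 0.0, 0.0
    for xt in x:
        e = xt + b * e_prev           # epsilon_t = X_t + b * epsilon_{t-1}
        ss += e ** 2
        e_prev = e
    return ss

def fit_ma1_css(x):
    res = minimize_scalar(css_ma1, bounds=(-0.999, 0.999),
                          args=(x,), method="bounded")
    b_hat = res.x
    sigma_hat = np.sqrt(css_ma1(b_hat, x) / len(x))
    return b_hat, sigma_hat
\end{verbatim}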

Method B: Backcasting is the process of guessing $\epsilon_{-1}$ on the basis of the data; we replace $\epsilon_{-1}$ in the log likelihood by

\begin{displaymath}\text{E}(\epsilon_{-1}\vert X_0,\ldots,X_{T-1})\, .
\end{displaymath}

The problem is that this quantity depends on b and $\sigma$.

We will use the EM algorithm to solve this problem. The algorithm can be applied when we have (real or imaginary) missing data. Suppose the data we have is X, some other data we didn't get is Y, and Z=(X,Y). It often happens that we can think of a Y we didn't observe in such a way that the likelihood for the complete data set Z would be simple. In that case we can try to maximize the likelihood for X by following an iterative algorithm first discussed in detail by Dempster, Laird and Rubin. Each iteration has two steps:

1.
The E or Expectation step. We ``estimate'' the missing data Y by computing $\text{E}(Y\vert X)$. Technically, we are supposed to estimate the likelihood function based on Z. Factor the density of Z as

\begin{displaymath}f_Z = f_{Y\vert X} f_X
\end{displaymath}

and take logs to get

\begin{displaymath}\ell(\theta\vert Z) = \log(f_{Y\vert X}) + \ell(\theta\vert X)
\end{displaymath}

We actually estimate the log conditional density (which is a function of $\theta$) by computing

\begin{displaymath}\text{E}_{\theta_0}(\log(f_{Y\vert X})\vert X)
\end{displaymath}

Notice the subscript $\theta_0$ on $\text{E}$. This indicates that you have to know the parameter to compute the conditional expectation. Notice too that there is another $\theta$ in the conditional expectation: the log conditional density has a parameter in it.

2.
The M or Maximization step. We then maximize our estimate of $\ell(\theta\vert Z)$ to get a new value $\theta_1$ for $\theta$. Go back to step 1 with this $\theta_1$ replacing $\theta_0$ and iterate.

To get started we need a preliminary estimate. Now look at our problem: in our case the missing quantity Y is $\epsilon_{-1}$. Rather than work with the log-likelihood directly we work with Y. Our preliminary estimate of Y is 0. We use this value to estimate $\theta$ as above, getting an estimate $\theta_0$. Then we compute $\text{E}_{\theta_0}(\epsilon_{-1}\vert X)$, replace $\epsilon_{-1}$ in the log-likelihood above by this conditional expectation, and re-estimate $\theta$. Then iterate. This process of guessing $\epsilon_{-1}$ is called backcasting.
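
A minimal sketch of this backcasting iteration for the MA(1), assuming numpy and scipy (names hypothetical). The backcast uses the fact that $\epsilon_{-1}$ and X are jointly normal with $\text{Cov}(\epsilon_{-1},X_0)=-b\sigma^2$ and $\text{Cov}(\epsilon_{-1},X_t)=0$ for t>0, so that $\text{E}(\epsilon_{-1}\vert X) = -b\sigma^2\,(\Sigma^{-1}X)_0$ with $\Sigma$ the MA(1) covariance matrix given earlier:

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

def backcast(b, sigma, x):
    # E(eps_{-1} | X) = -b * sigma^2 * (Sigma^{-1} X)_0 for the MA(1)
    T = len(x)
    Sigma = sigma ** 2 * ((1 + b ** 2) * np.eye(T)
                          - b * (np.eye(T, k=1) + np.eye(T, k=-1)))
    return -b * sigma ** 2 * np.linalg.solve(Sigma, x)[0]

def css(b, x, eps_m1):
    # sum of squares with epsilon_{-1} fixed at its current backcast
    e_prev, ss = eps_m1, eps_m1 ** 2
    for xt in x:
        e = xt + b * e_prev           # epsilon_t = X_t + b * epsilon_{t-1}
        ss += e ** 2
        e_prev = e
    return ss

def fit_ma1_backcast(x, n_iter=10):
    eps_m1 = 0.0                      # start as in Method A
    for _ in range(n_iter):
        res = minimize_scalar(css, bounds=(-0.999, 0.999),
                              args=(x, eps_m1), method="bounded")
        b = res.x
        sigma = np.sqrt(css(b, x, eps_m1) / (len(x) + 1))
        eps_m1 = backcast(b, sigma, x)   # re-estimate epsilon_{-1}
    return b, sigma
\end{verbatim}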






Richard Lockhart
1999-10-12