STAT 804: Notes on Lecture 8

Fitting ARIMA(p,d,q) models to data

Fitting the I part is easy: we simply difference d times. The same observation applies to seasonal multiplicative models. Thus to fit an ARIMA(p,d,q) model to X you compute $Y=(I-B)^d X$ (shortening your data set by d observations) and then fit an ARMA(p,q) model to Y. From now on we assume that d=0.
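
As a small illustration (the series x below is simulated purely for the example, and the names are mine, not part of the notes), the differencing step is one line in Python/numpy:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=200))  # hypothetical series; a random walk, so d = 1 is natural
d = 1
y = np.diff(x, n=d)                  # y = (I-B)^d x; len(y) == len(x) - d
# ...now fit an ARMA(p,q) model to y by the methods described below.
\end{verbatim}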

Simplest case: fitting the AR(1) model

\begin{displaymath}X_t = \mu+\rho(X_{t-1}-\mu) + \epsilon_t
\end{displaymath}

We must estimate 3 parameters: $\mu, \rho$ and $\sigma^2=\text{Var}(\epsilon_t)$.

Our basic strategy: the full likelihood is generally rather complicated, so we will use conditional likelihoods and ad hoc estimates of some parameters to simplify the situation.

The likelihood: Gaussian data

If the errors $\epsilon$ are normal then so is the series X. In general the vector $X=(X_0,\ldots,X_{T-1})^t$ has a $MVN({\bf\mu},\Sigma)$ distribution, where $\Sigma_{ij} = C(i-j)$ and ${\bf\mu}$ is a vector all of whose entries are $\mu$. The joint density of X is

\begin{displaymath}f_X(x) = \frac{1}{(2\pi)^{T/2} \det(\Sigma)^{1/2}} \exp\left\{-\frac{1}{2}
(x-{\bf\mu})^t \Sigma^{-1} (x-{\bf\mu})\right\}
\end{displaymath}

so that the log likelihood is

\begin{displaymath}\ell(\mu,a_1,\ldots,a_p,b_1,\ldots,b_q,\sigma) =
-\frac{1}{2}\left[
(x-{\bf\mu})^t \Sigma^{-1} (x-{\bf\mu}) + \log(\det(\Sigma))\right]
\end{displaymath}

Here I have indicated precisely (for an ARMA(p,q)) the parameters on which the quantity depends.

It is possible to carry out full maximum likelihood by maximizing the quantity in question numerically. In general this is hard, however.

Here I indicate some standard tactics. In your homework I will be asking you to carry through this analysis for one particular model.

The AR(1) model

Consider the model

\begin{displaymath}X_t-\mu = \rho(X_{t-1} - \mu) + \epsilon_t
\end{displaymath}

This model formula permits us to write down the joint density of X in a simpler way:

\begin{displaymath}f_X = f_{X_{T-1}\vert X_{T-2},\ldots,X_0} f_{X_{T-2}\vert X_{T-3},\ldots,X_0}\cdots f_{X_1\vert X_0} f_{X_0}
\end{displaymath}

Each of the conditional densities is simply

\begin{displaymath}f_{X_k\vert X_{k-1},\ldots,X_0} (x_k\vert x_{k-1},\ldots,x_{0}) =
g\left[x_k-\mu-\rho(x_{k-1}-\mu)\right]
\end{displaymath}

where g is the density of an individual $\epsilon$. For iid $N(0,\sigma^2)$ errors this gives a log likelihood which is

\begin{displaymath}\ell(\mu,\rho,\sigma) = - \frac{1}{2\sigma^2}\sum_1^{T-1} \left[x_k-\mu-\rho(x_{k-1}-\mu)\right]^2
-(T-1)\log(\sigma) + \log(f_{X_0}(x_0))
\end{displaymath}

Now for a stationary series I showed that $X_t \sim N(\mu,\sigma^2/(1-\rho^2))$ so that

\begin{displaymath}\log(f_{X_0}(x_0)) = -\frac{1-\rho^2}{2\sigma^2}(x_0-\mu)^2 -\log(\sigma) + \frac{1}{2}\log(1-\rho^2)
\end{displaymath}

This makes
\begin{align*}\ell(\mu,\rho,\sigma) & = - \frac{1}{2\sigma^2} \left\{
\sum_1^{T-1}\left[x_k-\mu-\rho(x_{k-1}-\mu)\right]^2 +(1-\rho^2)(x_0-\mu)^2\right\}
\\
& \qquad -T\log(\sigma) + \frac{1}{2}\log(1-\rho^2)
\end{align*}
We can maximize this over $\mu$ and $\sigma$ explicitly. First

\begin{displaymath}\frac{\partial}{\partial\mu} \ell = \frac{1}{\sigma^2} \left\{
\sum_1^{T-1}\left[x_k-\mu-\rho(x_{k-1}-\mu)\right](1-\rho)
+ (1-\rho^2)(x_0-\mu)\right\}
\end{displaymath}

Set this equal to 0 to find
\begin{align*}\hat\mu(\rho) & = \frac{ (1-\rho)\sum_1^{T-1}(x_k -\rho x_{k-1})
+(1-\rho^2)x_0}{(1-\rho)^2(T-1)+(1-\rho^2)}
\\
& = \frac{\sum_1^{T-1}(x_k -\rho x_{k-1}) +(1+\rho)x_0}{1+\rho+
(1-\rho)(T-1)}
\end{align*}
Notice that this estimate is free of $\sigma$ and that if T is large we may drop the $1+\rho$ in the denominator and the term involving $x_0$ in the numerator to get

\begin{displaymath}\hat\mu(\rho) \approx \frac{\sum_1^{T-1}(x_k -\rho x_{k-1})}{(T-1)(1-\rho)}
\end{displaymath}

Finally, the numerator is actually

\begin{displaymath}\sum_0^{T-1} x_k - x_0 -\rho(\sum_0^{T-1} x_k -x_{T-1}) = (1-\rho)
\sum_0^{T-1} x_k -x_0 +\rho x_{T-1}
\end{displaymath}

The last two terms here are smaller than the sum; if we neglect them we get

\begin{displaymath}\hat\mu(\rho) \approx \bar{X} \, .
\end{displaymath}

Now compute

\begin{displaymath}\frac{\partial}{\partial\sigma} \ell = \frac{1}{\sigma^3}\left\{
\sum_1^{T-1}\left[x_k-\mu-\rho(x_{k-1}-\mu)\right]^2 +(1-\rho^2)(x_0-\mu)^2\right\}
-\frac{T}{\sigma}
\end{displaymath}

and set this to 0 to find

\begin{displaymath}\hat\sigma^2(\rho) = \frac{\left\{
\sum_1^{T-1}\left[x_k-\hat\mu(\rho)-\rho(x_{k-1}-\hat\mu(\rho))\right]^2
+(1-\rho^2)(x_0-\hat\mu(\rho))^2\right\}}{T}
\end{displaymath}

When $\rho$ is known it is easy to check that $(\hat\mu(\rho),\hat\sigma(\rho))$ maximizes $\ell(\mu,\rho,\sigma)$.

To find $\hat\rho$ you now plug $\hat\mu(\rho)$ and $\hat\sigma(\rho)$ into $\ell$ (getting the so-called profile likelihood $\ell(\hat\mu(\rho),\rho,\hat\sigma(\rho))$) and maximize over $\rho$. Having thus found $\hat\rho$, the MLEs of $\mu$ and $\sigma$ are simply $\hat\mu(\hat\rho)$ and $\hat\sigma(\hat\rho)$.

It is worth observing that fitted residuals can then be calculated:

\begin{displaymath}\hat\epsilon_t = (X_t-\hat\mu) - \hat\rho(X_{t-1}-\hat\mu)
\end{displaymath}

(There are only T-1 of them since you cannot easily estimate $\epsilon_0$.) Note, too, that the formula for $\hat\sigma^2$ simplifies to

\begin{displaymath}\hat\sigma^2 = \frac{ \sum_1^{T-1} \hat\epsilon_t^2
+(1-\hat\rho^2)(x_0-\hat\mu)^2}{T} \approx \frac{ \sum_1^{T-1}
\hat\epsilon_t^2 }{T} \, .
\end{displaymath}

In general, we simplify the maximum likelihood problem in several ways: by using conditional likelihoods, by plugging in ad hoc estimates such as $\bar{X}$ for $\mu$, and by neglecting terms which are negligible when T is large.

Notice that we have made a great many suggestions for simplifications and adjustments. This is typical of statistical research: many ideas, only slightly different from each other, are suggested and compared. In practice there is likely to be very little difference between the methods. I am asking you in a homework problem to investigate the differences between several of these methods on a single data set.

Higher order autoregressions

For the model

\begin{displaymath}X_t- \mu = \sum_1^p a_i(X_{t-i}-\mu) + \epsilon_t
\end{displaymath}

we will use conditional likelihood again. Let $\phi$ denote the vector $(a_1,\ldots, a_p)^t$. Now we condition on the first p values of X and use

\begin{displaymath}\ell_c(\phi,\mu,\sigma) = -\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\mu - \sum_1^p a_i(X_{t-i} - \mu)\right]^2
-(T-p)\log(\sigma)
\end{displaymath}

If we estimate $\mu$ using $\bar{X}$ we find that we are trying to maximize

\begin{displaymath}-\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\bar{X} - \sum_1^p a_i(X_{t-i} - \bar{X})\right]^2
-(T-p)\log(\sigma)
\end{displaymath}

To estimate $a_1,\ldots,a_p$ then we merely minimize the sum of squares

\begin{displaymath}\sum_p^{T-1} \hat\epsilon_t^2 = \sum_p^{T-1} \left[ X_t-\bar{X} - \sum_1^p
a_i(X_{t-i} - \bar{X})\right]^2
\end{displaymath}

This is a straightforward regression problem. We regress the response vector

\begin{displaymath}\left[\begin{array}{c} X_p -\bar{X} \\ \vdots \\ X_{T-1} -\bar{X} \end{array} \right]
\end{displaymath}

on the design matrix

\begin{displaymath}\left[\begin{array}{ccc}
X_{p-1} -\bar{X}& \cdots & X_0 -\bar{X}
\\
\vdots & \ddots & \vdots
\\
X_{T-2} -\bar{X}& \cdots & X_{T-p-1} -\bar{X}
\end{array}\right]
\end{displaymath}

An alternative to estimating $\mu$ by $\bar{X}$ is to define $\alpha = \mu(1-\sum a_i)$ and then recognize that

\begin{displaymath}\ell(\alpha,\phi,\sigma) = -\frac{1}{2\sigma^2}
\sum_p^{T-1} \left[ X_t-\alpha - \sum_1^p a_iX_{t-i}\right]^2
-(T-p)\log(\sigma)
\end{displaymath}

is maximized by regressing the vector

\begin{displaymath}\left[\begin{array}{c} X_p \\ \vdots \\ X_{T-1} \end{array} \right]
\end{displaymath}

on the design matrix

\begin{displaymath}\left[\begin{array}{ccc}
X_{p-1} & \cdots & X_0
\\
\vdots & \ddots & \vdots
\\
X_{T-2} & \cdots & X_{T-p-1}
\end{array}\right]
\end{displaymath}

From $\hat\alpha$ and $\hat\phi$ we would get an estimate for $\mu$ by

\begin{displaymath}\hat\mu = \frac{\hat\alpha}{1-\sum \hat{a}_i}\end{displaymath}
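
The same least-squares calculation with an explicit intercept looks like this (again a sketch, with my own names):

\begin{verbatim}
import numpy as np

def fit_ar_ls_intercept(x, p):
    """Conditional least-squares AR(p) fit with an intercept alpha instead of centring."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    y = x[p:]
    Z = np.column_stack([np.ones(T - p)] + [x[p - i:T - i] for i in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    alpha, phi = coef[0], coef[1:]
    mu_hat = alpha / (1 - phi.sum())      # mu-hat = alpha-hat / (1 - sum of a_i-hat)
    return mu_hat, phi
\end{verbatim}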

Notice that if we put

\begin{displaymath}Z=\left[\begin{array}{ccc}
X_{p-1} -\bar{X}& \cdots & X_0 -\bar{X}
\\
\vdots & \ddots & \vdots
\\
X_{T-2} -\bar{X}& \cdots & X_{T-p-1} -\bar{X}
\end{array}\right]
\end{displaymath}

then

\begin{displaymath}Z^tZ \approx T \left[\begin{array}{cccc}
\hat{C}(0) & \hat{C}(1) & \cdots & \hat{C}(p-1)
\\
\hat{C}(1) & \hat{C}(0) & \cdots & \hat{C}(p-2)
\\
\vdots & \ddots & \ddots & \vdots
\\
\hat{C}(p-1) & \cdots & \hat{C}(1) & \hat{C}(0)
\end{array}\right]
\end{displaymath}

and if

\begin{displaymath}Y=\left[\begin{array}{c} X_p -\bar{X} \\ \vdots \\ X_{T-1} -\bar{X}
\end{array} \right]
\end{displaymath}

then

\begin{displaymath}Z^tY \approx T \left[\begin{array}{c} \hat{C}(1) \\ \vdots \\ \hat{C}(p)
\end{array}\right]
\end{displaymath}

so that the normal equations (from least squares)

\begin{displaymath}Z^t Z \phi = Z^t Y
\end{displaymath}

are nearly the Yule-Walker equations again.
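
A sketch of the Yule-Walker version of the calculation (names mine); for moderate to large T its answer is very close to the least-squares estimates above:

\begin{verbatim}
import numpy as np

def yule_walker(x, p):
    """Solve the Yule-Walker equations built from sample autocovariances C-hat(k)."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    xc = x - x.mean()
    c = np.array([np.sum(xc[:T - k] * xc[k:]) / T for k in range(p + 1)])  # C-hat(0),...,C-hat(p)
    R = np.array([[c[abs(i - j)] for j in range(p)] for i in range(p)])    # ~ Z'Z / T (Toeplitz)
    r = c[1:]                                                              # ~ Z'Y / T
    return np.linalg.solve(R, r)                                           # a_1, ..., a_p
\end{verbatim}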

Full maximum likelihood

To compute a full MLE of $\theta=(\mu,\phi,\sigma)$ you generally begin by finding preliminary estimates $\hat\theta$, say by one of the conditional likelihood methods above, and then iterate via Newton-Raphson or some other scheme for numerical maximization of the log likelihood.
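
For the AR(1) case, for instance, the exact log likelihood written out earlier can simply be handed to a general-purpose optimizer. In the sketch below (names mine) scipy's Nelder-Mead routine stands in for Newton-Raphson.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def ar1_neg_loglik(theta, x):
    """Exact Gaussian AR(1) negative log likelihood (additive constants dropped)."""
    mu, rho, log_sigma = theta
    if not -1 < rho < 1:
        return np.inf
    sigma2 = np.exp(2 * log_sigma)
    ss = (np.sum((x[1:] - mu - rho * (x[:-1] - mu)) ** 2)
          + (1 - rho ** 2) * (x[0] - mu) ** 2)
    return 0.5 * ss / sigma2 + 0.5 * len(x) * np.log(sigma2) - 0.5 * np.log(1 - rho ** 2)

def ar1_full_mle(x, mu0, rho0, sigma0):
    """Polish preliminary (e.g. conditional-likelihood) estimates by numerical maximization."""
    x = np.asarray(x, dtype=float)
    start = np.array([mu0, rho0, np.log(sigma0)])
    fit = minimize(ar1_neg_loglik, start, args=(x,), method="Nelder-Mead")
    mu, rho, log_sigma = fit.x
    return mu, rho, np.exp(log_sigma)
\end{verbatim}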

Fitting MA(q) models

Here we consider the model with known mean (generally this will mean we estimate $\hat\mu = \bar{X}$ and subtract the mean from all the observations):

\begin{displaymath}X_t = \epsilon_t - b_1 \epsilon_{t-1} - \cdots - b_q\epsilon_{t-q}
\end{displaymath}

In general X has a $MVN(0,\Sigma)$ distribution and, letting $\psi$ denote the vector of $b_i$s, we find

\begin{displaymath}\ell(\psi,\sigma) = -\frac{1}{2}\left[\log(\det(\Sigma))
+ X^t \Sigma^{-1} X\right]
\end{displaymath}

Here X denotes the column vector of all the data. As an example consider q=1 so that

\begin{displaymath}\Sigma = \left[\begin{array}{ccccc}
\sigma^2(1+b_1^2) & -b_1\sigma^2 & 0 & \cdots & 0
\\
-b_1\sigma^2 & \sigma^2(1+b_1^2) & -b_1\sigma^2 & \cdots & 0
\\
\vdots & \ddots & \ddots & \ddots & \vdots
\\
0 & \cdots & -b_1\sigma^2 & \sigma^2(1+b_1^2) & -b_1\sigma^2
\\
0 & \cdots & 0 & -b_1\sigma^2& \sigma^2(1+b_1^2)
\end{array}\right]
\end{displaymath}

It is not so easy to work with the determinant and inverse of matrices like this. Instead we try to mimic the conditional inference approach above, but with a twist: we now condition on something we have not observed, namely $\epsilon_{-1}$.
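
For small T one can of course do the brute-force calculation anyway; a numpy sketch for q = 1 (my own function name), mainly to make explicit the determinant and inverse work the conditioning trick is designed to avoid:

\begin{verbatim}
import numpy as np

def ma1_exact_loglik(x, b1, sigma):
    """Brute-force exact Gaussian log likelihood for a mean-zero MA(1)."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    # tridiagonal Sigma: C(0) = sigma^2 (1 + b1^2) on the diagonal, C(1) = -b1 sigma^2 off it
    Sigma = (sigma ** 2 * (1 + b1 ** 2)) * np.eye(T) \
        - b1 * sigma ** 2 * (np.eye(T, k=1) + np.eye(T, k=-1))
    _, logdet = np.linalg.slogdet(Sigma)
    quad = x @ np.linalg.solve(Sigma, x)
    return -0.5 * (logdet + quad)        # constants in 2*pi dropped, as in the notes
\end{verbatim}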

Notice that
\begin{align*}X_0 & = \epsilon_0 - b\epsilon_{-1}
\\
X_1 & = \epsilon_1 - b\epsilon_0 = \epsilon_1 - b(X_0 + b\epsilon_{-1})
\\
& \;\;\vdots
\\
X_{T-1} & = \epsilon_{T-1} - b(X_{T-2} + b (X_{T-3} + \cdots + b(X_0 + b
\epsilon_{-1})\cdots))
\end{align*}
Now imagine that the data were actually

\begin{displaymath}\epsilon_{-1}, X_0,\ldots,X_{T-1}
\end{displaymath}

Then the same idea we used for an AR(1) would give
\begin{align*}\ell(b,\sigma) & = \log(f(\epsilon_{-1},\sigma)) +
\log(f(X_0,\ldots,X_{T-1}\vert\epsilon_{-1},b,\sigma))
\\
& = \log(f(\epsilon_{-1},\sigma)) +
\sum_0^{T-1} \log(f(X_t\vert X_{t-1},\ldots,X_0,\epsilon_{-1},b,\sigma))
\end{align*}
The parameters are listed in the conditions in this formula merely to indicate which terms depend on which parameters. For Gaussian $\epsilon$s the terms in this likelihood are squares as usual (plus logarithms of $\sigma$) leading to
\begin{align*}\ell(b,\sigma) = &
-\frac{\epsilon_{-1}^2}{2\sigma^2} - \log(\sigma)
\\
& -\sum_0^{T-1}\left[\frac{1}{2\sigma^2}\left(X_t - \psi X_{t-1}
- \psi^2 X_{t-2} - \cdots - \psi^t X_0 -\psi^{t+1} \epsilon_{-1}\right)^2+\log(\sigma)\right]
\end{align*}
We will estimate the parameters by maximizing this function after getting rid of $\epsilon_{-1}$ somehow.

Method A: Put $\epsilon_{-1}=0$ since 0 is the most probable value and maximize

\begin{displaymath}-T\log(\sigma) - \frac{1}{2\sigma^2} \sum_0^{T-1} \left[X_t - \psi X_{t-1}
- \psi^2 X_{t-2} - \cdots -\psi^{t}X_0\right]^2
\end{displaymath}

Notice that for large T the coefficient of $\epsilon_{-1}$ is close to 0 for most t, and the few remaining terms are negligible relative to the total.
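
A sketch of Method A for an MA(1) (names mine): the sum of squares is computed recursively via $\epsilon_t = X_t + b_1\epsilon_{t-1}$, which is just the model $X_t = \epsilon_t - b_1\epsilon_{t-1}$ rearranged, and a grid search over $b_1$ does the maximization.

\begin{verbatim}
import numpy as np

def ma1_css(x, ngrid=1999):
    """Method A: set eps_{-1} = 0 and minimize the conditional sum of squares over b1."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    best = None
    for b1 in np.linspace(-0.99, 0.99, ngrid):
        eps, ss = 0.0, 0.0               # eps_{-1} set to 0, its most probable value
        for xt in x:
            eps = xt + b1 * eps          # eps_t = X_t + b1 * eps_{t-1}
            ss += eps ** 2
        if best is None or ss < best[0]:
            best = (ss, b1)
    ss, b1_hat = best
    return b1_hat, ss / T                # sigma^2-hat = SS / T
\end{verbatim}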

Method B: Backcasting is the process of guessing $\epsilon_{-1}$ on the basis of the data; we replace $\epsilon_{-1}$ in the log likelihood by

\begin{displaymath}\text{E}(\epsilon_{-1}\vert X_0,\ldots,X_{T-1})\, .
\end{displaymath}

The problem is that this quantity depends on b and $\sigma$.
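
For fixed values of b and $\sigma$ the backcast itself is a routine joint-normal calculation, since $\text{Cov}(\epsilon_{-1},X_0)=-b\sigma^2$ and $\text{Cov}(\epsilon_{-1},X_t)=0$ for $t\ge 1$. The sketch below (names mine) makes the dependence on b and $\sigma$ explicit.

\begin{verbatim}
import numpy as np

def backcast_eps(x, b1, sigma):
    """E(eps_{-1} | X_0, ..., X_{T-1}) for a mean-zero MA(1), at fixed b1 and sigma."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    Sigma = (sigma ** 2 * (1 + b1 ** 2)) * np.eye(T) \
        - b1 * sigma ** 2 * (np.eye(T, k=1) + np.eye(T, k=-1))
    c = np.zeros(T)
    c[0] = -b1 * sigma ** 2              # Cov(eps_{-1}, X_0); all other covariances are 0
    return c @ np.linalg.solve(Sigma, x) # E(eps_{-1} | X) = c' Sigma^{-1} X
\end{verbatim}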

We will use the EM algorithm to solve this problem.


Richard Lockhart
1999-10-07