STAT 804: Notes on Lecture 7

Model identification summary: The simplest model identification tactic is to look for either a pure MA or pure AR model. To do so:

1.
Estimate the sample autocorrelation function. Plot $\widehat{ACF}(h)$ against h. If the process is an MA(q) then the ACF will be 0 after lag q.

2.
Estimate the sample partial autocorrelation function. Plot the estimated PACF(h) against h. If the process is an AR(p) then the PACF will be 0 after lag p.
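
A minimal numpy sketch of both diagnostics follows; the function names and the choice of maximum lag are mine, for illustration only, and the PACF is computed by the standard Durbin-Levinson recursion.

\begin{verbatim}
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations r(1), ..., r(max_lag)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    c0 = np.dot(x, x) / len(x)
    return np.array([np.dot(x[h:], x[:-h]) / (len(x) * c0)
                     for h in range(1, max_lag + 1)])

def sample_pacf(x, max_lag):
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    r = sample_acf(x, max_lag)
    phi = np.array([r[0]])
    pacf = [r[0]]
    for k in range(2, max_lag + 1):
        a = (r[k - 1] - np.dot(phi, r[k - 2::-1])) / (1.0 - np.dot(phi, r[:k - 1]))
        phi = np.append(phi - a * phi[::-1], a)
        pacf.append(a)
    return np.array(pacf)
\end{verbatim}

Plotting these two estimates against the lag h and looking for the lag beyond which they are negligible implements the two tactics above.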

Non-stationary processes

If X is not stationary we will need to transform X to find a related stationary series. In this course we will consider two sorts of non-stationarity -- non-constant mean and integration.

Non-constant mean: If $\text{E}(X_t)$ is not constant we hope to model $\mu_t=\text{E}(X_t)$ using a small number of parameters and then model $Y_t = X_t-\mu_t$ as a stationary series. Three common structures for $\mu_t$ are linear, polynomial and periodic trends:

Linear trend: Suppose

\begin{displaymath}\mu_t = \alpha+\beta t
\end{displaymath}

We seek to estimate $\alpha$ and $\beta$ in order to analyze $Y_t = X_t - \alpha-\beta t$.

Method 1: regression (detrending). We regress $X_t$ on t to get $\hat\alpha$ and $\hat\beta$ and analyze

\begin{displaymath}\hat{Y}_t =X_t-\hat\alpha-\hat\beta t
\end{displaymath}

We hope

\begin{displaymath}\hat{Y} \approx Y
\end{displaymath}
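
A minimal numpy sketch of Method 1; the simulated series, its coefficients and the seed are illustrative only, not part of the notes.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
T = 200
t = np.arange(T)
x = 2.0 + 0.5 * t + rng.normal(size=T)   # illustrative series with a linear trend

V = np.column_stack([np.ones(T), t])     # regression of X_t on t
alpha_hat, beta_hat = np.linalg.lstsq(V, x, rcond=None)[0]
y_hat = x - alpha_hat - beta_hat * t     # the detrended series
\end{verbatim}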

Method 2: differencing. Define
\begin{align*}W_t & = X_t-X_{t-1}
\\
& = \left[(I-B)X\right]_t
\end{align*}
Then
\begin{align*}W_t & = \left[(\alpha+\beta t +Y_t) - (\alpha+\beta (t-1) +Y_{t-1})\right]
\\
& = \beta+Y_{t}-Y_{t-1}
\end{align*}
Since this is a linear filter applied to the series Y it is stationary if Y is. BUT, it might be stationary even if Y is not. Suppose that $\epsilon_t$ is an iid mean 0 sequence. Then

\begin{displaymath}Y_t = \sum_1^t \epsilon_j
\end{displaymath}

is a random walk. Notice that $Y_t-Y_{t-1}=\epsilon_t$ is stationary. To see that $Y_t$, the random walk, is not stationary notice that $\text{Var}(Y_t) = t\text{Var}(\epsilon_1)$.

These random walk models are common in Economics. In physics they are used in the limit of very small time increments - this leads to Brownian motion.
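
A quick simulation illustrating both facts (the sample sizes and the seed are arbitrary):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
eps = rng.normal(size=(5000, 200))     # 5000 independent noise sequences of length 200
y = np.cumsum(eps, axis=1)             # random walks Y_t = eps_1 + ... + eps_t

# Var(Y_t) = t Var(eps_1): the variance grows with t, so Y is not stationary.
print(y[:, [9, 49, 199]].var(axis=0))  # roughly 10, 50, 200

# The differenced series Y_t - Y_{t-1} = eps_t is the noise itself, hence stationary.
print(np.diff(y, axis=1).var())        # roughly 1
\end{verbatim}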

Definition: X satisfies an ARIMA(p,d,q) model if

\begin{displaymath}\phi(B) \nabla^d X = \psi(B) \epsilon
\end{displaymath}

where

1.
$\epsilon$ is white noise,

2.
$\phi$ is a polynomial of degree p,

3.
$\psi$ is a polynomial of degree q,

4.
$\nabla = I-B$ is the differencing operator.

Remark: If $X_t = \mu_t+Y_t$ where Y is stationary and $\mu_t$ is a polynomial in t of degree less than or equal to d then $\nabla^d X$ is stationary. (So a cubic shaped trend could be removed by differencing 3 times.)

WARNING: it is a common mistake in students' data analyses to over difference. When you difference a stationary ARMA(p,q) you introduce a unit root into the moving average polynomial; the result is not invertible and cannot be written as an infinite order autoregression.

Detrending: Define a response vector

\begin{displaymath}U = \left[\begin{array}{c} X_0 \\ \vdots \\ X_{T-1} \end{array} \right]
\end{displaymath}

and a design matrix by

\begin{displaymath}V=\left[\begin{array}{cc} 1 & 0 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & T-1
\end{array}\right]
\end{displaymath}

and write

\begin{displaymath}U = V\theta +Y
\end{displaymath}

with

\begin{displaymath}Y = \left[\begin{array}{c} Y_0 \\ \vdots \\ Y_{T-1} \end{array} \right]
\end{displaymath}

and

\begin{displaymath}\theta = \left[\begin{array}{c} \alpha \\ \beta\end{array}\right] \, .
\end{displaymath}

Then we estimate by

\begin{displaymath}\hat\theta = (V^TV)^{-1} V^T U
\end{displaymath}

Our detrended series is
\begin{align*}\hat{Y} & = U-V\hat\theta
\\
&= (I-V(V^TV)^{-1} V^T)U
\end{align*}
Now since $\text{Cov}(Y) \neq \sigma^2 I$, ordinary least squares is not strictly appropriate. In general, in the model

\begin{displaymath}U=V\theta+W
\end{displaymath}

where W has mean 0 and variance covariance matrix $\Sigma$, the minimum variance linear estimator of $\theta$ is the generalized least squares estimate

\begin{displaymath}\hat\theta_{GLS} = (V^T \Sigma^{-1} V)^{-1} V^T \Sigma^{-1} U \, .
\end{displaymath}

This estimate is unbiased and has variance

\begin{displaymath}(V^T \Sigma^{-1} V)^{-1}
\end{displaymath}

For this model the ordinary least squares estimate is also unbiased and has variance

\begin{displaymath}(V^TV)^{-1} V^T \Sigma V (V^TV)^{-1}
\end{displaymath}
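
The two variance formulas can be compared directly; the linear-trend design and the AR(1)-type $\Sigma$ used below are illustrative assumptions, not part of the notes.

\begin{verbatim}
import numpy as np

T = 100
V = np.column_stack([np.ones(T), np.arange(T)])          # design matrix from the trend model

rho = 0.6                                                # assumed AR(1)-type covariance for W
lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
Sigma = rho ** lags / (1 - rho ** 2)

VtV_inv = np.linalg.inv(V.T @ V)
var_ols = VtV_inv @ V.T @ Sigma @ V @ VtV_inv            # (V'V)^{-1} V' Sigma V (V'V)^{-1}
var_gls = np.linalg.inv(V.T @ np.linalg.solve(Sigma, V)) # (V' Sigma^{-1} V)^{-1}

print(np.diag(var_ols))   # OLS variances: at least as large as ...
print(np.diag(var_gls))   # ... the GLS variances, though often not by much
\end{verbatim}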

The problem in our context (it is almost always a problem) is that you can only use $\hat\theta_{GLS}$ if you know $\Sigma$. In practice you won't know $\Sigma$ until you have removed a trend, selected a suitable ARMA model and estimated the parameters. The natural proposal is to follow an iterative process (sketched in code after the list):

1.
Fit the linear model by ordinary least squares.

2.
Compute the residual series.

3.
Identify a suitable ARMA(p,q) model.

4.
Fit the parameters of this model.

5.
Using the estimates compute an estimated variance covariance of the residual process.

6.
Refit the linear model in step 1 using generalized least squares with the variance covariance matrix estimated in the previous step, then go back to step 2.

The process is repeated until the estimates stop changing in any important way.
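
Here is a numpy sketch of the iteration. To keep it self-contained, step 3 is replaced by simply assuming an AR(1) model for the residuals and steps 4-5 use crude moment estimates; those shortcuts are mine, not the notes' prescription.

\begin{verbatim}
import numpy as np

def iterative_gls_detrend(x, n_iter=5):
    """OLS -> residual model -> GLS iteration for a linear trend."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    V = np.column_stack([np.ones(T), np.arange(T)])
    theta = np.linalg.solve(V.T @ V, V.T @ x)           # step 1: ordinary least squares
    for _ in range(n_iter):
        resid = x - V @ theta                           # step 2: residual series
        rho = np.corrcoef(resid[1:], resid[:-1])[0, 1]  # steps 3-4: assume AR(1), estimate rho
        lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
        Sigma = np.var(resid) * rho ** lags             # step 5: estimated covariance C(h) = C(0) rho^|h|
        SiV = np.linalg.solve(Sigma, V)                 # step 6: generalized least squares
        theta = np.linalg.solve(V.T @ SiV, SiV.T @ x)
    return theta
\end{verbatim}

In practice the loop would be stopped once successive estimates of $\theta$ agree to the accuracy of interest.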

Folklore: There is evidence that in many common ARMA error models the OLS estimator has a variance not much larger than that of GLS.

Seasonal Non-stationarity

Every winter the measured (as opposed to the seasonally adjusted rate usually reported) unemployment rate in Canada rises. A simple model which has this feature has a non-stationary mean of the form

\begin{displaymath}\mu_{t+S}=\mu_t
\end{displaymath}

Typically the value of S is 12 for monthly data or 4 for quarterly data (common in economic data). Normally, $\mu_0,\ldots,\mu_{S-1}$ are not all the same, of course.

Definition: Deseasonalization is the process of transforming X to eliminate this sort of seasonal variation in the mean.

Method 1: Regression. Estimate

\begin{displaymath}\hat\mu_t = \frac{X_t+X_{t+S} + \cdots }{\text{number of terms}}
\end{displaymath}

and then

\begin{displaymath}\hat{Y}_t = X_t - \hat\mu_t
\end{displaymath}

This is ordinary least squares regression with

\begin{displaymath}\theta = \left[\begin{array}{c} \mu_0 \\ \vdots \\ \mu_{S-1}
\end{array}\right]
\end{displaymath}

and

\begin{displaymath}V=\left[\begin{array}{ccccc} 1 & 0 & 0 & \cdots & 0
\\
0 & 1 & 0 & \cdots & 0
\\
\vdots & & & & \vdots
\\
0 & 0 & 0 & \cdots & 1
\\
1 & 0 & 0 & \cdots & 0
\\
\vdots & \vdots & \vdots & \vdots & \vdots
\end{array}\right]_{T \times S}
\end{displaymath}

(Row t of V has a 1 in the column corresponding to the season of time t and zeros elsewhere.)
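
A numpy sketch of this seasonal-mean fit (monthly data with S = 12; the simulated series is illustrative):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
S, n_years = 12, 10
T = S * n_years
mu = np.tile(rng.normal(size=S), n_years)     # periodic mean with mu_{t+S} = mu_t
x = mu + 0.3 * rng.normal(size=T)             # illustrative monthly series

mu_hat = np.array([x[s::S].mean() for s in range(S)])   # average over years, season by season
y_hat = x - mu_hat[np.arange(T) % S]                    # deseasonalized series
\end{verbatim}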

Method 2: seasonal differencing. Define

\begin{displaymath}W_t = X_{t+S} - X_t \, .
\end{displaymath}

Since $\mu_{t+S}=\mu_t$, this W is stationary whenever Y is. BUT: if Y was already an ARMA process then W has a unit root in its moving average polynomial (the over-differencing problem again).
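
In code the seasonal difference is a lag-S subtraction (S = 12 and the series below are again illustrative):

\begin{verbatim}
import numpy as np

S = 12
rng = np.random.default_rng(3)
x = np.tile(rng.normal(size=S), 10) + 0.3 * rng.normal(size=10 * S)  # seasonal series as above
w = x[S:] - x[:-S]    # W_t = X_{t+S} - X_t: the periodic mean cancels
\end{verbatim}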

Definition: A multiplicative $ARIMA(p,d,q) \times (P,D,Q)_S$ model has the form:

\begin{displaymath}\Phi(B^S) \phi(B) (I-B^S)^D (I-B)^d X = \Psi(B^S) \psi(B) \epsilon
\end{displaymath}

where $\Phi,\phi,\Psi,\psi$ are polynomials of degree P,p,Q,q respectively with all roots outside the unit circle (and other technical conditions - see assignment 2, question 1) and $\epsilon$ is white noise.

As an example consider the model


\begin{displaymath}(1-\Phi_1 B^{12})(1-\phi_1B)X = (1-\Psi_1 B^{12} )(1-\psi_1B) \epsilon
\end{displaymath}

which is an $ARIMA(1,0,1) \times (1,0,1)_{12}$ model. Multiplying this out we get

\begin{displaymath}(I -\phi_1 B -\Phi_1B^{12} + \phi_1\Phi_1 B^{13}) X
=
(I -\psi_1 B -\Psi_1B^{12} + \psi_1\Psi_1 B^{13}) \epsilon
\end{displaymath}

which looks like an ARMA(13,13). The key point is that this new model has only 4 parameters (plus the variance of $\epsilon$) instead of the 26 for a general ARMA(13,13).
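
The expansion can be checked by convolving the coefficient sequences of the two backshift polynomials; the numerical values of $\phi_1$ and $\Phi_1$ below are arbitrary.

\begin{verbatim}
import numpy as np

phi1, Phi1 = 0.5, 0.3                                     # arbitrary illustrative values
regular = np.zeros(2);   regular[[0, 1]] = 1.0, -phi1     # coefficients of 1 - phi_1 B
seasonal = np.zeros(13); seasonal[[0, 12]] = 1.0, -Phi1   # coefficients of 1 - Phi_1 B^12

product = np.convolve(seasonal, regular)  # coefficients of the product, powers 0 to 13
print(np.nonzero(product)[0])             # lags 0, 1, 12, 13 only
print(product[[0, 1, 12, 13]])            # 1, -phi_1, -Phi_1, phi_1 * Phi_1
\end{verbatim}

The same check applied to the right hand side gives the moving average coefficients; either way only the four parameters $\phi_1,\Phi_1,\psi_1,\Psi_1$ appear.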

Fitting ARIMA(p,d,q) models to data

Fitting the I part is easy: we simply difference d times. The same observation applies to the seasonal multiplicative model. Thus to fit an ARIMA(p,d,q) model to X you compute $Y =(I-B)^d X$ (shortening your data set by d observations) and then fit an ARMA(p,q) model to Y. So from here on we assume that d=0.

Simplest case: fitting the AR(1) model

\begin{displaymath}X_t = \mu+\rho(X_{t-1}-\mu) + \epsilon_t
\end{displaymath}

We must estimate 3 parameters: $\mu, \rho$ and $\sigma^2=\text{Var}(\epsilon_t)$.

Our basic strategy will be as follows: the full likelihood is generally rather complicated, so we will use conditional likelihoods and ad hoc estimates of some parameters to simplify the situation.

The likelihood: Gaussian data

If the errors $\epsilon$ are normal then so is the series X. In general the vector $X=(X_0,\ldots,X_{T-1})^t$ has a $MVN({\bf\mu},\Sigma)$ distribution where $\Sigma_{ij} = C(i-j)$ (with C the autocovariance function of X) and ${\bf\mu}$ is a vector all of whose entries are $\mu$. The joint density of X is

\begin{displaymath}f_X(x) = \frac{1}{(2\pi)^{T/2} \det(\Sigma)^{1/2}} \exp\left\{-\frac{1}{2}
(x-{\bf\mu})^t \Sigma^{-1} (x-{\bf\mu})\right\}
\end{displaymath}

so that the log likelihood, dropping an additive constant, is

\begin{displaymath}\ell(\mu,a_1,\ldots,a_p,b_1,\ldots,b_q,\sigma) =
-\frac{1}{2}\left[
(x-{\bf\mu})^t \Sigma^{-1} (x-{\bf\mu}) + \log(\det(\Sigma))\right]
\end{displaymath}

Here I have indicated precisely (for an ARMA(p,q)) the parameters on which the quantity depends.

It is possible to carry out full maximum likelihood by maximizing the quantity in question numerically. In general this is hard, however.
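
For concreteness, here is how the log likelihood above could be evaluated for a given $\Sigma$ (a numpy sketch; building $\Sigma$ requires the autocovariances C(h) of the candidate ARMA model, and the additive constant in $\log(2\pi)$ is dropped as in the display):

\begin{verbatim}
import numpy as np

def cov_matrix(acvf):
    """Sigma with entries C(i-j), given autocovariances C(0), ..., C(T-1)."""
    acvf = np.asarray(acvf, dtype=float)
    T = len(acvf)
    return acvf[np.abs(np.subtract.outer(np.arange(T), np.arange(T)))]

def gaussian_loglik(x, mu, Sigma):
    """-(1/2)[(x - mu)' Sigma^{-1} (x - mu) + log det Sigma]."""
    resid = np.asarray(x, dtype=float) - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    quad = resid @ np.linalg.solve(Sigma, resid)
    return -0.5 * (quad + logdet)
\end{verbatim}

Maximizing this over all the ARMA parameters at once is the numerical problem referred to above.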

Here I indicate some standard tactics. In your homework I will be asking you to carry through this analysis for one particular model.

The AR(1) model

Consider the model

\begin{displaymath}X_t-\mu = \rho(X_{t-1} - \mu) + \epsilon_t
\end{displaymath}

This model formula permits us to write down the joint density of X in a simpler way:

\begin{displaymath}f_X = f_{X_{T-1}\vert X_{T-2},\ldots,X_0} f_{X_{T-2}\vert X_{T-3},\ldots,X_0}\cdots f_{X_1\vert X_0} f_{X_0}
\end{displaymath}

Each of the conditional densities is simply

\begin{displaymath}f_{X_k\vert X_{k-1},\ldots,X_0} (x_k\vert x_{k-1},\ldots,x_{0}) =
g\left[x_k-\mu-\rho(x_{k-1}-\mu)\right]
\end{displaymath}

where g is the density of an individual $\epsilon$. For iid $N(0,\sigma^2)$ errors this gives, dropping additive constants, a log likelihood which is

\begin{displaymath}\ell(\mu,\rho,\sigma) = - \frac{1}{2\sigma^2}\sum_1^{T-1} \left[x_k-\mu-\rho(x_{k-1}-\mu)\right]^2
-(T-1)\log(\sigma) + \log(f_{X_0}(x_0))
\end{displaymath}

Now for a stationary series I showed that $X_t \sim N(\mu,\sigma^2/(1-\rho^2))$ so that

\begin{displaymath}\log(f_{X_0}(x_0)) = -\frac{1-\rho^2}{2\sigma^2}(x_0-\mu)^2 -\log(\sigma) + \frac{1}{2}\log(1-\rho^2)
\end{displaymath}

This makes
\begin{align*}\ell(\mu,\rho,\sigma) & = - \frac{1}{2\sigma^2} \left\{
\sum_1^{T-1}\left[x_k-\mu-\rho(x_{k-1}-\mu)\right]^2
+(1-\rho^2)(x_0-\mu)^2\right\}
\\
& \qquad -T\log(\sigma) + \frac{1}{2}\log(1-\rho^2)
\end{align*}
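
Written as code, the exact AR(1) log likelihood just displayed is as follows (additive constants in $\log(2\pi)$ are again dropped; the function name is mine):

\begin{verbatim}
import numpy as np

def ar1_loglik(x, mu, rho, sigma):
    """Exact Gaussian AR(1) log likelihood from the display above."""
    x = np.asarray(x, dtype=float)
    e = (x[1:] - mu) - rho * (x[:-1] - mu)        # terms x_k - mu - rho (x_{k-1} - mu)
    ss = np.sum(e ** 2) + (1.0 - rho ** 2) * (x[0] - mu) ** 2
    return -ss / (2.0 * sigma ** 2) - len(x) * np.log(sigma) + 0.5 * np.log(1.0 - rho ** 2)
\end{verbatim}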
We can maximize this over $\mu$ and $\sigma$ explicitly. First

\begin{displaymath}\frac{\partial}{\partial\mu} \ell = \frac{1}{\sigma^2} \left\{
\sum_1^{T-1}\left[x_k-\mu-\rho(x_{k-1}-\mu)\right](1-\rho)
+ (1-\rho^2)(x_0-\mu)\right\}
\end{displaymath}

Set this equal to 0 to find
\begin{align*}\hat\mu(\rho) & = \frac{ (1-\rho)\sum_1^{T-1}(x_k -\rho x_{k-1})
+(1-\rho^2)x_0}{(1-\rho)^2(T-1)+1-\rho^2}
\\
& = \frac{\sum_1^{T-1}(x_k -\rho x_{k-1}) +(1+\rho)x_0}{1+\rho+
(1-\rho)(T-1)}
\end{align*}
Notice that this estimate is free of $\sigma$ and that if T is large we may drop the $1+\rho$ in the denominator and the term involving $x_0$ in the numerator to get

\begin{displaymath}\hat\mu(\rho) \approx \frac{\sum_1^{T-1}(x_k -\rho x_{k-1})}{(T-1)(1-\rho)}
\end{displaymath}

Finally, the numerator is actually

\begin{displaymath}\sum_0^{T-1} x_k - x_0 -\rho(\sum_0^{T-1} x_k -x_{T-1}) = (1-\rho)
\sum_0^{T-1} x_k -x_0 +\rho x_{T-1}
\end{displaymath}

The last two terms here are smaller than the sum; if we neglect them we get

\begin{displaymath}\hat\mu(\rho) \approx \bar{X} \, .
\end{displaymath}

Now compute

\begin{displaymath}\frac{\partial}{\partial\sigma} \ell = \frac{1}{\sigma^3}\left\{
\sum_1^{T-1}\left[x_k-\mu-\rho(x_{k-1}-\mu)\right]^2 +(1-\rho^2)(x_0-\mu)^2\right\}
-\frac{T}{\sigma}
\end{displaymath}

and set this to 0 to find

\begin{displaymath}\hat\sigma^2(\rho) = \frac{
\sum_1^{T-1}\left[x_k-\hat\mu(\rho)-\rho(x_{k-1}-\hat\mu(\rho))\right]^2
+(1-\rho^2)(x_0-\hat\mu(\rho))^2}{T}
\end{displaymath}

For each fixed $\rho$ it is easy to check that $(\hat\mu(\rho),\hat\sigma(\rho))$ maximizes $\ell(\mu,\rho,\sigma)$ over $\mu$ and $\sigma$.

To find $\hat\rho$ you now plug $\hat\mu(\rho)$ and $\hat\sigma(\rho)$ into $\ell$ (getting the so-called profile likelihood $\ell(\hat\mu(\rho),\rho,\hat\sigma(\rho))$) and maximize over $\rho$. Having thus found $\hat\rho$ the mles of $\mu$ and $\sigma$ are simply $\hat\mu(\hat\rho)$ and $\hat\sigma(\hat\rho)$.

It is worth observing that fitted residuals can then be calculated:

\begin{displaymath}\hat\epsilon_t = (X_t-\hat\mu) - \hat\rho(X_{t-1}-\hat\mu)
\end{displaymath}

(There are only T-1 of them since you cannot easily estimate $\epsilon_0$.) Note, too, that the formula for $\hat\sigma^2$ simplifies to

\begin{displaymath}\hat\sigma^2 = \frac{ \sum_1^{T-1} \hat\epsilon_t^2
+(1-\hat\rho^2)(x_0-\hat\mu)^2}{T} \approx \frac{ \sum_1^{T-1}
\hat\epsilon_t^2 }{T} \, .
\end{displaymath}
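
Putting the pieces together, here is a minimal sketch of the whole AR(1) fit: the closed forms for $\hat\mu(\rho)$ and $\hat\sigma(\rho)$, a simple grid search over $\rho$ for the profile likelihood (an arbitrary choice; any one-dimensional optimizer would do), and the fitted residuals.

\begin{verbatim}
import numpy as np

def ar1_profile_fit(x):
    """Profile-likelihood fit of the AR(1) model by grid search over rho."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    best = None
    for rho in np.linspace(-0.99, 0.99, 397):
        mu = (np.sum(x[1:] - rho * x[:-1]) + (1 + rho) * x[0]) / (1 + rho + (1 - rho) * (T - 1))
        e = (x[1:] - mu) - rho * (x[:-1] - mu)
        sigma2 = (np.sum(e ** 2) + (1 - rho ** 2) * (x[0] - mu) ** 2) / T
        # profile log likelihood, dropping terms that do not involve rho
        ll = -0.5 * T * np.log(sigma2) + 0.5 * np.log(1 - rho ** 2)
        if best is None or ll > best[0]:
            best = (ll, rho, mu, np.sqrt(sigma2))
    _, rho_hat, mu_hat, sigma_hat = best
    resid = (x[1:] - mu_hat) - rho_hat * (x[:-1] - mu_hat)   # the T - 1 fitted residuals
    return rho_hat, mu_hat, sigma_hat, resid
\end{verbatim}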





Richard Lockhart
1999-11-01