
STAT 804: Notes on Lecture 6

The Multivariate Normal Distribution

Definition: $Z \in R^1 \sim N(0,1)$ iff

\begin{displaymath}f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}
\end{displaymath}

Definition: $Z \in R^p \sim MVN_p(0,I)$ iff $Z=(Z_1,\ldots,Z_p)^t$ (a column vector, for later use) with the $Z_i$ independent and each $Z_i\sim N(0,1)$.

In this case,
\begin{align*}f_Z(z_1,\ldots,z_p) &= \prod \frac{1}{\sqrt{2\pi}} e^{-z_i^2/2}
\\
& = (2\pi)^{-p/2} \exp\{ -z^t z/2\}
\end{align*}
where the superscript t denotes matrix transpose.

Definition: $X\in R^p$ has a multivariate normal distribution if it has the same distribution as $AZ+\mu$ for some $\mu\in R^p$, some $p\times q$ matrix of constants A and $Z\sim MVN_q(0,I)$.

Lemma: The matrix A can be taken to be square with no loss of generality.

Proof: The simplest proof involves multivariate characteristic functions:

\begin{displaymath}\phi_X(t) = \text{E}(e^{it^T X})
\end{displaymath}

You check that for $Z \sim MVN_p(0,I)$ we have

\begin{displaymath}\phi_Z(t) = e^{-t^Tt/2}
\end{displaymath}
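Indeed, by independence and the univariate characteristic function $\text{E}(e^{itZ_j}) = e^{-t^2/2}$,

\begin{displaymath}\phi_Z(t) = \text{E}\left(e^{i\sum_j t_j Z_j}\right)
= \prod_{j=1}^p \text{E}\left(e^{it_j Z_j}\right)
= \prod_{j=1}^p e^{-t_j^2/2} = e^{-t^Tt/2}
\end{displaymath}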

Then
\begin{align*}\phi_{AZ+\mu}(u) & = \text{E}(e^{iu^T(AZ+\mu)})
\\
& = e^{iu^T\mu} \phi_Z(A^Tu)
\\
& = e^{iu^T\mu-u^TAA^Tu/2}
\end{align*}
Since the result depends only on $\mu$ and $\Sigma=AA^T$, the distribution is the same for any two matrices A giving the same $AA^T$. In particular, given an A which is $p\times q$, we need only use ideas of Cholesky decomposition to find a square B with $BB^T=AA^T$. I omit a proof.
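As a concrete illustration (my own sketch, not part of the original notes; the names Sigma, mu and nsim are made up), here is S-Plus/R-style code that builds a square matrix B with $BB^T=\Sigma$ from the Cholesky decomposition and simulates $X = BZ + \mu$:

# Illustrative sketch: simulate MVN(mu, Sigma) as B Z + mu, where B is a
# square root of Sigma obtained from the Cholesky decomposition.
Sigma <- matrix(c(2, 1, 1, 3), nrow=2)   # example covariance matrix
mu <- c(1, -1)                           # example mean vector
B <- t(chol(Sigma))     # chol() gives upper triangular R with t(R) %*% R = Sigma,
                        # so B = t(R) satisfies B %*% t(B) = Sigma
nsim <- 1000
Z <- matrix(rnorm(2*nsim), nrow=2)       # each column is a N(0, I) vector in R^2
X <- B %*% Z + mu                        # mu recycles down columns: each column of X
                                         # is a draw from MVN(mu, Sigma)
var(t(X))                                # sample covariance; should be close to Sigma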

If the matrix A is singular then X will not have a density. If A is invertible then we can derive the multivariate normal density by the change of variables formula:

\begin{displaymath}X=AZ+\mu \Leftrightarrow Z=A^{-1}(X-\mu)
\end{displaymath}

We get

\begin{displaymath}f_X(x) = (2\pi)^{-p/2}\text{det}(\Sigma)^{-1/2}
\exp\{ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\}
\end{displaymath}

where $\Sigma=AA^t$.
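In more detail, the change of variables step runs as follows: the Jacobian of $z = A^{-1}(x-\mu)$ is $\det(A^{-1})$ and $\det(\Sigma) = \det(AA^T) = \det(A)^2$, so

\begin{align*}f_X(x) & = f_Z(A^{-1}(x-\mu))\, \vert\det(A^{-1})\vert
\\
& = (2\pi)^{-p/2} \exp\{-(x-\mu)^T (A^{-1})^T A^{-1} (x-\mu)/2\}\, \vert\det(A)\vert^{-1}
\\
& = (2\pi)^{-p/2} \det(\Sigma)^{-1/2} \exp\{-(x-\mu)^T \Sigma^{-1}(x-\mu)/2\}
\end{align*}
using $(A^{-1})^T A^{-1} = (AA^T)^{-1} = \Sigma^{-1}$.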

Properties of the MVN distribution

1.
If $X \sim MVN(\mu,\Sigma)$ then $AX+b \sim
MVN(A\mu+b, A\Sigma A^T)$; this follows easily from our definition. (Work with the definition in terms of $AZ+\mu$, not with densities.)

2.
All margins are multivariate normal: if

\begin{displaymath}X = \left[\begin{array}{c} X_1\\ X_2\end{array} \right]
\end{displaymath}


\begin{displaymath}\mu = \left[\begin{array}{c} \mu_1\\ \mu_2\end{array} \right]
\end{displaymath}

and

\begin{displaymath}\Sigma = \left[\begin{array}{cc} \Sigma_{11} & \Sigma_{12}
\\
\Sigma_{21} & \Sigma_{22} \end{array} \right]
\end{displaymath}

then $X \sim MVN(\mu,\Sigma)$ implies that $X_1\sim MVN(\mu_1,\Sigma_{11})$. This is really a special case of the previous property.

3.
All conditionals are normal: the conditional distribution of $X_1$ given $X_2=x_2$ is $MVN(\mu_1+\Sigma_{12}\Sigma_{22}^{-1}
(x_2-\mu_2),\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})$. To prove this you have to do algebra with the definition of the conditional density. It is possible to give a sensible formula even if $\Sigma_{22}$ is singular. (A small numerical sketch of this formula follows the list.)
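Here is a small S-Plus/R-style sketch of that conditioning formula (my own illustration; the numbers in Sigma, mu and x2 are made up):

# Illustrative sketch: conditional mean and variance of X_1 given X_2 = x2
# for a 3-dimensional MVN, with block 1 = first coordinate, block 2 = last two.
mu <- c(0, 0, 0)
Sigma <- matrix(c(2,   1,   0.5,
                  1,   3,   1,
                  0.5, 1,   1.5), nrow=3, byrow=T)
i1 <- 1
i2 <- 2:3
S11 <- Sigma[i1, i1, drop=F]
S12 <- Sigma[i1, i2, drop=F]
S22 <- Sigma[i2, i2]
x2 <- c(1, -1)                                    # observed value of X_2
cond.mean <- mu[i1] + S12 %*% solve(S22) %*% (x2 - mu[i2])
cond.var  <- S11 - S12 %*% solve(S22) %*% t(S12)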

Partial Correlations

If X is an AR(p) process then

\begin{displaymath}X_t = \sum_{j=1}^p a_j X_{t-j} + \epsilon_t
\end{displaymath}

If you hold $X_{t-1},\ldots,X_{t-p}$ fixed then $X_t$ is $\epsilon_t$ plus a fixed constant. Since $\epsilon_t$ is independent of $X_s$ for all $s<t$, we see that given $X_{t-1},\ldots,X_{t-p}$ the variables $X_t$ and $X_{t-r}$ for $r>p$ are independent. This would imply

\begin{displaymath}\text{Cov}(X_t,X_{t-q}\vert X_{t-1}, \ldots,X_{t-p})=0
\end{displaymath}

for all $q>p$. When the process X is Gaussian, our results above on conditional distributions guarantee that we can calculate this partial autocovariance from the variance-covariance matrix of the vector $(X_t,X_{t-q},X_{t-1}, \ldots,X_{t-p})$. If we partition this vector into a first piece with 2 entries and a second piece with p entries, we find that the conditional variance-covariance matrix of $(X_t,X_{t-q})$ given the others is

\begin{displaymath}\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
\end{displaymath}

In this formula $\Sigma_{11}$ is the $2 \times 2$ unconditional covariance matrix of $(X_t,X_{t-q})$, namely,

\begin{displaymath}\left[\begin{array}{cc}C(0) & C(q) \\ C(q) & C(0) \end{array}\right]
\end{displaymath}

The matrix $\Sigma_{22}$ is the $p \times p$ Toeplitz matrix with C(0) down the diagonal, C(1) down the first sub- and super-diagonals, and so on. The matrix $\Sigma_{12}=\Sigma_{21}^T$ is the $2 \times p$ matrix

\begin{displaymath}\left[\begin{array}{ccc}
C(1) & \cdots & C(p)
\\
C(q-1)& \cdots & C(q-p)
\end{array}\right]
\end{displaymath}

Once you have calculated the resulting $2 \times 2$ matrix you pick out the off-diagonal element; this is the partial autocovariance. Even when the data are not Gaussian we will use this piece of arithmetic with the autocovariance to define the partial autocovariance.
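To make this arithmetic concrete, here is an S-Plus/R-style sketch (my own; the function name pacov and its arguments are made up) which computes the partial autocovariance of $X_t$ and $X_{t-q}$ given $X_{t-1},\ldots,X_{t-p}$ from a vector of autocovariances:

# Illustrative sketch: partial autocovariance of (X_t, X_{t-q}) given
# X_{t-1}, ..., X_{t-p}. acov is a vector with acov[h+1] = C(h) for
# h = 0, ..., q; assumes 1 <= p < q.
pacov <- function(acov, q, p = q - 1) {
  Cfun <- function(h) acov[abs(h) + 1]                    # Cfun(h) returns C(h)
  Sigma11 <- matrix(c(Cfun(0), Cfun(q), Cfun(q), Cfun(0)), nrow=2)
  Sigma12 <- rbind(Cfun(1:p), Cfun(q - (1:p)))            # 2 x p block
  Sigma22 <- matrix(Cfun(outer(1:p, 1:p, "-")), nrow=p)   # p x p Toeplitz block
  cond <- Sigma11 - Sigma12 %*% solve(Sigma22) %*% t(Sigma12)
  cond[1, 2]                      # off-diagonal entry: the partial autocovariance
}

With p = q - 1 (the default here) this is the partial autocovariance appearing in the partial autocorrelation defined next.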

To use this idea we define the partial autocorrelation function to be

\begin{displaymath}PACF(h) =\text{Corr}(X_t,X_{t-h}\vert X_{t-1},\ldots,X_{t-h+1})
\end{displaymath}

If $h=1$ then there are no Xs between $X_t$ and $X_{t-h}$, so there is nothing to condition on. This makes PACF(1)=ACF(1).

Qualitative idea: we examine a plot of the sample partial autocorrelation function (obtained by replacing $C_X$ by $\hat{C}_X$ in the definition of PACF). For an AR(p) process this sample plot ought to drop suddenly to 0 for $h>p$.

AR(1) example

Now I calculate PACF(2) for a mean 0 AR(1) process. We have $C_X(h) = C_X(0) \rho^{\vert h\vert}$. Let

\begin{displaymath}\left[\begin{array}{c}Y \\ Z\end{array}\right]
=
\left[\begin{array}{c} X_t \\ X_{t-2} \\ X_{t-1}\end{array}\right]
\end{displaymath}

with $Y=(X_t,X_{t-2})^T$ and $Z= X_{t-1}$. Then the vector $(X_t,X_{t-2},X_{t-1})^T$ has a $MVN(0,\Sigma)$ distribution with

\begin{displaymath}\Sigma = C_X(0)\left[\begin{array}{ccc}
1 & \rho^2 & \rho
\\
\rho^2 & 1 & \rho
\\
\rho & \rho & 1
\end{array}\right]
\end{displaymath}

Following the partitioning into Y and Z we find

\begin{displaymath}\Sigma_{11} = C_X(0)\left[\begin{array}{cc}
1 & \rho^2 \\ \rho^2 & 1\end{array}\right]
\end{displaymath}


\begin{displaymath}\Sigma_{12} = C_X(0)\left[\begin{array}{c}\rho\\ \rho \end{array} \right]
\end{displaymath}

and

\begin{displaymath}\Sigma_{22} = C_X(0) \left[ 1\right]
\end{displaymath}

We find
\begin{align*}\text{Cov}(X_t,X_{t-2} \vert X_{t-1}) & = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
\\
& = C_X(0)\left[\begin{array}{cc}
1 & \rho^2 \\ \rho^2 & 1 \end{array}\right]
- C_X(0)\left[\begin{array}{cc}
\rho^2 & \rho^2 \\ \rho^2 & \rho^2 \end{array}\right]
\\
& = C_X(0)\left[\begin{array}{cc}
1-\rho^2 & 0 \\ 0 & 1-\rho^2 \end{array}\right]
\end{align*}
Now to compute the correlation we divide the conditional covariance by the product of the two conditional SDs, that is:
\begin{align*}PACF(2) & = \frac{\text{Cov}(X_t,X_{t-2} \vert X_{t-1})}{\sqrt{
\text{Var}(X_t \vert X_{t-1})\text{Var}(X_{t-2} \vert X_{t-1})}}
\\
& =
\frac{0}{\sqrt{(1-\rho^2)(1-\rho^2)}}
\\
& = 0
\end{align*}
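As a quick numerical check (illustrative only, not part of the original notes), the same arithmetic with, say, $\rho = 0.6$ and $C_X(0)=1$:

# Illustrative check that PACF(2) = 0 for an AR(1) with rho = 0.6 and C(0) = 1.
rho <- 0.6
Sigma11 <- matrix(c(1, rho^2, rho^2, 1), nrow=2)  # Var of (X_t, X_{t-2})
Sigma12 <- matrix(c(rho, rho), nrow=2)            # Cov of (X_t, X_{t-2}) with X_{t-1}
Sigma22 <- matrix(1, nrow=1)                      # Var of X_{t-1}
Sigma11 - Sigma12 %*% solve(Sigma22) %*% t(Sigma12)
# diagonal entries 1 - rho^2 = 0.64, off-diagonal entries 0, so PACF(2) = 0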

Consider the problem of predicting $X_t$ knowing $X_{t-1},\ldots,X_{t-k}$. In this course we choose a predictor $\hat{X}_t=f(X_{t-1},\ldots,X_{t-k})$ to minimize the mean squared prediction error:

\begin{displaymath}\text{E}[(X_t-\hat{X}_t)^2]
\end{displaymath}

The solution is to take

\begin{displaymath}\hat{X}_t = \text{E}(X_t\vert X_{t-1},\ldots,X_{t-k})
\end{displaymath}

For multivariate normal data this predictor is the usual regression solution:

\begin{displaymath}\text{E}(X_t\vert X_{t-1},\ldots,X_{t-k}) = \mu+A\left[\begin{array}{c}
X_{t-1} - \mu \\ \vdots \\ X_{t-k}-\mu \end{array}\right]
\end{displaymath}

where the matrix A is $\Sigma_{12}\Sigma_{22}^{-1}$ in the notation of the earlier notes. This can be written in the form

\begin{displaymath}\hat{X}_t= \mu + \sum_{i=1}^k \alpha_i(X_{t-i}-\mu)
\end{displaymath}

for suitable constants $\alpha_i$. Let $Y_t=X_t-\mu$. The $\alpha_i$ minimize
\begin{align*}\text{E}[(Y_t-\sum\alpha_i Y_{t-i})^2] & =
\text{E}(Y_t^2) - 2 \sum_i \alpha_i \text{E}(Y_t Y_{t-i}) + \sum_{ij} \alpha_i\alpha_j \text{E}(Y_{t-i}Y_{t-j})
\\
& = C(0) -2\sum_i \alpha_i C(i) + \sum_{ij} \alpha_i\alpha_j C(i-j)
\end{align*}
Take the derivative with respect to $\alpha_m$ and set the resulting k quantities equal to 0, giving

\begin{displaymath}C(m) = \sum_j \alpha_j C(m-j)
\end{displaymath}

or, dividing by C(0),

\begin{displaymath}\rho(m) = \sum_j \alpha_j \rho(m-j)
\end{displaymath}
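In matrix form, these k equations say that $\alpha=(\alpha_1,\ldots,\alpha_k)^T$ solves

\begin{displaymath}\left[\begin{array}{cccc}
1 & \rho(1) & \cdots & \rho(k-1)
\\
\rho(1) & 1 & \cdots & \rho(k-2)
\\
\vdots & \vdots & \ddots & \vdots
\\
\rho(k-1) & \rho(k-2) & \cdots & 1
\end{array}\right]
\left[\begin{array}{c}\alpha_1 \\ \vdots \\ \alpha_k\end{array}\right]
=
\left[\begin{array}{c}\rho(1) \\ \vdots \\ \rho(k)\end{array}\right]
\end{displaymath}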

(These are the Yule-Walker equations again!) It is a fact that the solution $\alpha_m$ is

\begin{displaymath}\text{Corr}(X_t,X_{t-m}\vert X_{t-1},\ldots,X_{t-k}\text{ but not } X_{t-m})
\end{displaymath}

so that $\alpha_k$ (the last $\alpha$) is PACF(k).

Remark: This makes it look as if you would compute the first 40 values of the PACF by solving 40 different problems and picking out the last $\alpha$ in each one, but in fact there is an explicit recursive algorithm to compute these quantities.
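The recursion in question is usually taken to be the Durbin-Levinson algorithm. Here is a sketch of it (my own illustration; the function name and arguments are made up) which computes PACF(1), ..., PACF(length(rho)) from the autocorrelations $\rho(1), \rho(2), \ldots$:

# Illustrative Durbin-Levinson sketch: rho[h] = ACF(h) for h = 1, 2, ...;
# returns the vector of PACF values at the same lags.
durbin.levinson <- function(rho) {
  h.max <- length(rho)
  pacf.vals <- numeric(h.max)
  phi <- rho[1]               # prediction coefficients for the current order
  v <- 1 - rho[1]^2           # prediction error variance (correlation scale)
  pacf.vals[1] <- rho[1]
  if (h.max > 1) for (m in 2:h.max) {
    phi.mm <- (rho[m] - sum(phi * rho[(m-1):1])) / v   # new last coefficient
    phi <- c(phi - phi.mm * rev(phi), phi.mm)          # update all coefficients
    v <- v * (1 - phi.mm^2)
    pacf.vals[m] <- phi.mm
  }
  pacf.vals
}

For an AR(1) with $\rho(h)=\rho^h$ this returns $\rho, 0, 0, \ldots$, in agreement with the calculation above.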

Estimation of PACF

To estimate the PACF you can either estimate the ACF and do the arithmetic above with estimates instead of theoretical values or minimize a sample version of the mean squared prediction error:

The Mean Squared Prediction Error is

\begin{displaymath}\text{E}\left\{\left[ X_t - \mu - \sum_{j=1}^k \alpha_j (X_{t-j} - \mu)\right]^2\right\}
\end{displaymath}

We estimate $\mu$ in this formula with $\bar{X}$ and then replace the expected value with an average over the data so that we minimize

\begin{displaymath}\sum_{t=k}^{T-1}\left[ X_t -\bar{X}-\sum_{j=1}^k \alpha_j (X_{t-j} - \bar{X})\right]^2/T
\end{displaymath}

This is just a least squares problem regressing the vector

\begin{displaymath}Y = \left[\begin{array}{c} X_k-\bar{X} \\ \vdots \\ X_{T-1} - \bar{X}
\end{array}\right]
\end{displaymath}

on the design matrix

\begin{displaymath}Z = \left[\begin{array}{ccc}
X_{k-1} - \bar{X} & \cdots & X_0 - \bar{X}
\\
\vdots & & \vdots
\\
X_{T-2} - \bar{X} & \cdots & X_{T-k-1} - \bar{X}
\end{array}\right]
\end{displaymath}

The estimate of PACF(k) is the kth entry in

\begin{displaymath}\hat\alpha = (Z^TZ)^{-1}Z^T Y \, .
\end{displaymath}
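As an illustration of this regression recipe (my own sketch; lsfit is the built-in least squares function, while the function name pacf.by.regression and its arguments are made up):

# Illustrative sketch: estimate PACF(k) by regressing the centred series on
# its k previous centred values and taking the last coefficient.
# x is the observed series (indexed 1, ..., TT here rather than 0, ..., T-1).
pacf.by.regression <- function(x, k) {
  TT <- length(x)
  xbar <- mean(x)
  y <- x[(k+1):TT] - xbar                      # response vector
  Z <- matrix(0, nrow = TT - k, ncol = k)      # design matrix of lagged values
  for (j in 1:k) Z[, j] <- x[(k+1-j):(TT-j)] - xbar
  fit <- lsfit(Z, y, intercept = F)            # least squares with no intercept
  fit$coef[k]                                  # k-th coefficient estimates PACF(k)
}

The value returned should be close to (though not identical to) the lag-k value plotted by acf(x, type="partial").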

Some example plots

What follows are plots of 5 series, their sample ACFs and their sample PACFs. The series are:

1.
A generated MA(2) series: $X_t = \epsilon_t +0.8 \epsilon_{t-1} -0.9 \epsilon_{t-2}$

2.
A generated AR(2) series $X_t = \epsilon_t + X_{t-1} - 0.99X_{t-2}$

3.
A generated AR(3) series: $X_t = \epsilon_t +0.8 X_{t-1} - X_{t-2}/3 +.8X_{t-3}/\sqrt{3}$

4.
A series of monthly sunspot counts over a 200 year period.

5.
A series of annual rainfall measurements in New York City.

Here is the S-Plus code I used to make the following plots. Note the use of the function acf, which calculates and plots the ACF and PACF.

#
# Comments begin with #
#
# Begin by generating the series.  The AR series generated are NOT
# stationary.  But I generate 10000 values from the recurrence relation
# and use the last 500. Asymptotic stationarity should guarantee that
# these last 500 are pretty stationary.
#
n<- 10000
ep <- rnorm(n)
ma2 <- ep[3:502]+0.8*ep[2:501]-0.9*ep[1:500]
ar2 <- rep(0,n)
ar2[1:2] <- ep[1:2]
for(i in 3:n) {ar2[i] <- ep[i] + ar2[i-1] - 0.99*ar2[i-2]}
ar2 <- ar2[(n-499):n]
ar2 <- ts(ar2)
ar3 <- rep(0,n)
ar3[1:3] <- ep[1:3]
for(i in 4:n) {ar3[i] <- ep[i] + 0.8*ar3[i-1] - ar3[i-2]/3 + (0.8/1.712)*ar3[i-3]}
ar3 <- ar3[(n-499):n]
ar3 <- ts(ar3)
#
# The next line turns on a graphics device -- in this case the 
# graph will be made in postscript in a file called ma2.ps. It
# will come out in portrait, not landscape, format.
#
postscript(file="ma2.ps",horizontal=F)
#
# The next line says to put 3 pictures 
# in a single column on the plot
#
par(mfcol=c(3,1))
tsplot(ma2,main="MA(2) series")
acf(ma2)
acf(ma2,type="partial")
#
# When you finish a picture you turn 
# off the graphics device
# with the next line.
#
dev.off()
#
# 
#
postscript(file="ar2.ps",horizontal=F)
par(mfcol=c(3,1))
tsplot(ar2,main="AR(2) series")
acf(ar2)
acf(ar2,type="partial")
dev.off()
#
#
postscript(file="ar3.ps",horizontal=F)
par(mfcol=c(3,1))
tsplot(ar3,main="AR(3) series")
acf(ar3)
acf(ar3,type="partial")
dev.off()
#
#
postscript(file="sunspots.ps",horizontal=F)
par(mfcol=c(3,1))
tsplot(sunspots,main="Sunspots series")
acf(sunspots,lag.max=480)
acf(sunspots,lag.max=480,type="partial")
dev.off()
#
#
postscript(file="rain.nyc1.ps",horizontal=F)
par(mfcol=c(3,1))
tsplot(rain.nyc1,main="New York Rain Series")
acf(rain.nyc1)
acf(rain.nyc1,type="partial")
dev.off()

Sunspot data

New York City Rainfall

You should notice:


Richard Lockhart
1999-10-01