Unit Root Test
We’ve been applying different orders of differencing on simulated data by looking at the time series plot and ACF plot. In this chapter, we introduce the DickeyFuller unitroot test
, or the DF test, that helps us check whether differencing is necessary or not. The test shouldn’t be used as the single source of truth, but rather an addition to the existing plotting methods to decide whether to apply differencing.
DickeyFuller test
If $X_t$ is either an AR(1) model or an AR(1) with some mean trend:
$$ X_t = \phi X_{t1} + Z_t, \quad X_t = \delta + \phi X_{t1} + Z_t, \quad X_t = \delta_0 + \delta_1 t + \phi X_{t1} + Z_t, $$
then the null hypothesis of the unitroot test is $H_0: \phi = 1$ (nonstationary)^{1} and the alternative hypothesis is $H_1: \phi < 1$ (stationary). If we fail to reject $H_0$, then the series might be a random walk and the stochastic trend could be removed by differencing. If we reject the null, then the time series is stationary and no differencing is necessary.
The test statistic is
$$ DF = t\text{ratio} = \frac{\hat\phi  1}{se(\hat\phi)} $$
where $\hat\phi$ is the LSE. The pvalue could be calculated by a nonstandard asymptotic distribution, which is repeatedly generating data (resampling / bootstrapping) from a random walk to get the distribution of $\hat\phi$. In R,


Augmented DF unitroot test
The DF test has a strong model assumption, and it’s very rare for a real time series to be explained by a random walk. If we observe an increasing pattern in the time series plot, and a slowly decaying pattern in the ACF plot, but the DF test says no differencing is needed, then maybe the data is not generated from the specific model.
The augmented DickeyFuller
(ADF) test relaxes the assumption to an AR(p) model:
$$ \nabla X_t = c_t + \beta X_{t1} + \sum_{i=1}^{p1} \phi_i \nabla X_{ti} + Z_t $$
The null hypothesis is $H_0: \beta_0$ (nonstationary), and the alternative hypothesis is $H_1: \beta < 0$. Similarly, if we fail to reject the null, then differencing might be necessary. If we reject $H_0$, then the series might be stationary or is an explosive AR^{2}.
In the following section we’re going to see how the ADF test works on different datasets.


By default, the alternative
parameter is set as “stationary” in the adf.test()
function. With a pvalue smaller than 0.01, we reject the null and conclude that the process is stationary. We could also check if the data is an explosive AR:


And clearly it’s not. What if we use this on an actual explosive AR series?


With a pvalue smaller than 0.01, we reject the null and conclude that we have a nonstationary explosive AR series.
Overdifferencing
Unnecessary differencing can cause artificial autocorrelation and make the model fitting process complicated. For example, suppose $X_t$ is a random walk, then
$$ \nabla X_t = X_t  X_{t1} = Z_t $$
But what happens after second order differencing?
$$ \nabla^2 X_t = \nabla(Z_t) = Z_t  Z_{t1} $$
With first order differencing we can simply fit a white noise, but after another round differencing an MA(1) model would be needed. Always keep the principle of parsimony
in mind!
Another example with $X_t \sim WN$. With some manipulation,
$$
\begin{gathered}
X_t = Z_t, X_{t1} = Z_{t1} \\
0.5 X_{t1} + 0.5 Z_{t1} = 0 \\
\Rightarrow X_t = 0.5 X_{t1} + Z_t + 0.5 Z_{t1}
\end{gathered}
$$
The WN could be modeled as an ARMA(1, 1). If we have a choice between the two, WN is preferred by the principle of parsimony.
Model selection
We’ll test what we’ve learnt on removing the mean trend on the Lake Huron dataset, which contains the yearly levels of Lake Huron in the years 18751972.
The mean seems to be decreasing over the years, and the variance seems to be slightly increasing. The ACF is more of a tailoff pattern than a slowly decreasing one. If we assume stationarity and fit an AR(2) model with an intercept:


Of course we could also assume nonstationarity and fit ARIMA models of different orders. So how do we choose the “best” model? In general, we should
 minimize AIC or BIC,
 principle of parsimony, and
 maximize prediction ability – there’s many criteria available, such as
MAPE
(mean absolute percentage error).
The MAPE is given by
$$ MAPE = \frac{1}{T^*} \sum_{l=1}^{T^*}\left \frac{X_{t+l}  X_t(l)}{X_{t+l}} \right $$
where $T^*$ is the length of the training time series. The MAPE is a measure of prediction accuracy of our forecasting model. It has a very intuitive interpretation in terms of relative error.
To calculate the MAPE, we need to split our series into two parts: one for training and one for testing. Instead of a random split as often seen in machine learning methods, we cut our series at some point and use the remaining data for testing^{3}.
Let’s try this in R on the Lake Huron dataset. We’re going to fit four different models, starting from the AR(2) model:


If we examine the diagnosis plots, there seems to be a spike at lag 9 for the residuals:
So we also consider the following models:
$$
\begin{gathered}
(1  \phi_1 B  \phi_2 B^2  \phi_9 B^9)X_t = Z_t \\
(1  \phi_1 B  \phi_2 B^2)X_t = (1 + \theta_9 B^9)Z_t \\
(1  \phi_1 B)X_t = (1 + \theta_1 B)Z_t
\end{gathered}
$$
where ARMA(1, 1) is considered because it’s a competing model with AR(2).


Here only the NA
s in the fixed
parameter will be varied. We should also check the diagnosis plots for each model, but the figures are omitted for the sake of brevity. To calculate the BIC, we can use the BIC()
function on the model objects. To calculate the MAPE, the following snippet can be used:


We summarize our results in the table below. The rank of each model is given in bold for each criterion. Overall ARMA(1, 1) seems like the best model as it ranks first or second in most cases, and is a simpler model than MA(9).
Criteria  AR(2)  AR(9)  MA(9)  ARMA(1, 1) 

AIC  193.8007 (4)  193.6452 (3)  190.973 (1)  191.9973 (2) 
BIC  203.7101 (3)  206.0319 (4)  203.3597 (2)  201.9066 (1) 
# parameters  3 (1)  4 (3)  4 (3)  3 (1) 
MAPE  0.001737 (2)  0.001964 (4)  0.001711 (1)  0.001815 (3) 
After we’ve decided which model to use, we should go back and use the entire dataset including the testing data. The final model is
$$ (1  0.7449B)(X_t  579.0555) = (1 + 0.3206B)Z_t, \quad \hat\sigma^2 = 0.4749 $$
Nov 17  Spectral Analysis  9 min read 
Nov 02  Decomposition and Smoothing Methods  14 min read 
Oct 13  Seasonal Time Series  17 min read 
Oct 09  Variability of Nonstationary Time Series  13 min read 
Oct 05  ARIMA Models  9 min read 