
Statistical forecasting involves using historical data to predict future outcomes. The goal is to build a stochastic model that captures the underlying patterns in the data, and how those have changed over time, and use it to make predictions about future values. I have developed many statistical forecasting models that are now widely used in diverse domains including retail, energy, demography, and tourism.

The strength of statistical forecasting lies in its ability to provide quantifiable predictions based on historical data, and to include uncertainty in those predictions. Statistical forecasts are typically presented as probability distributions of future outcomes, which can be used to make decisions that account for the uncertainty in the predictions.

Most recently I have been developing tools that allow statistical forecasts to be produced for thousands of related time series, such as sales forecasts for individual products in a retail store, or electricity demand forecasts for small regions in a country. The disaggregated data tends to be noisy and sparse, and the models need to be able to share information across related series to make accurate predictions.

Because of the assumption that the future will continue to evolve similarly to the past, statistical forecasting tends to perform best when: (a) there is substantial historical data; (b) the underlying environment is relatively stable; and (c) the aim is to produce short-term forecasts. When there is limited historical data, or when the data generating process is subject to significant structural changes, statistical forecasting can struggle to make accurate predictions.

I am also interested in using statistical forecasting techniques to identify anomalies, such as sudden changes in the underlying process, or unusual patterns in the data. This can be useful for detecting fraud, monitoring the health of complex systems, or identifying recording errors in data.

There is sometimes confusion between “forecast variance” and “forecast error variance”. In fact, they are the same thing.

Suppose we have a probabilistic forecast with estimated mean $\hat{\mu}_{T+h}$ and estimated variance $\hat{\sigma}^2_{T+h}$. That is, our estimated distribution of the future observation $y_{T+h}$, based on the data $y_1,\dots,y_T$, has mean $\hat{\mu}_{T+h}$ and variance $\hat{\sigma}^2_{T+h}$. The **forecast variance** refers to $\hat{\sigma}^2_{T+h}$, the estimated variance of the forecast distribution.

Now let our point forecast be denoted by $\hat{y}_{T+h}$. In most cases, the point forecast is the mean of the forecast distribution, i.e., $\hat{y}_{T+h} = \hat{\mu}_{T+h}$; but occasionally it might be the median, or some other quantity.

The **forecast error variance** refers to the variance of the forecast error, i.e., $\text{Var}(y_{T+h} - \hat{y}_{T+h})$. Ignoring any estimation error, this is clearly the same thing as $\hat{\sigma}^2_{T+h}$, regardless of whether the point forecast is the mean of the forecast distribution or not.

Occasionally I see the term “forecast variance” used to mean $\text{Var}(\hat{y}_{T+h})$; i.e., the variance of the point forecast estimate. However, that quantity is rarely of interest to a forecaster — we are usually interested in the distribution of a future observation, not in the distribution of the point forecast estimate.
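The equivalence is easy to check numerically. Here is a minimal simulation sketch (an iid normal setting with illustrative variable names, not code from any particular package):

```r
set.seed(1)
# Training data used to estimate the forecast distribution N(muhat, sigmahat2)
train <- rnorm(1000, mean = 5, sd = 2)
muhat <- mean(train)       # point forecast = mean of forecast distribution
sigmahat2 <- var(train)    # the forecast variance

# Many realisations of the future observation, and their forecast errors
future <- rnorm(1e5, mean = 5, sd = 2)
errors <- future - muhat

# The forecast error variance matches the forecast variance
# (up to simulation and estimation error)
c(forecast_variance = sigmahat2,
  forecast_error_variance = var(errors))
```

The same check works if `muhat` is replaced by any other constant point forecast, such as the median of `train`, since subtracting a constant leaves the variance unchanged.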

Forecast reconciliation ensures forecasts of time series in a hierarchy adhere to aggregation constraints, enabling aligned decision-making. While forecast reconciliation can enhance overall accuracy in hierarchical or grouped structures, the most substantial improvements occur in series with initially poor-performing base forecasts. Nevertheless, certain series may experience deteriorations in reconciled forecasts. In practical settings, series in a structure often exhibit poor base forecasts due to model misspecification or low forecastability. To prevent their negative impact, we propose two categories of forecast reconciliation methods that incorporate time series selection based on out-of-sample and in-sample information, respectively. These methods keep “poor” base forecasts unused in forming reconciled forecasts, while adjusting weights allocated to the remaining series accordingly when generating bottom-level reconciled forecasts. Additionally, our methods ameliorate disparities stemming from varied estimates of the base forecast error covariance matrix, alleviating challenges associated with estimator selection. Empirical evaluations through two simulation studies and applications using Australian labour force and domestic tourism data demonstrate improved forecast accuracy, particularly evident in higher aggregation levels, longer forecast horizons, and cases involving model misspecification.

Forecasting interrupted time series data is a major challenge for forecasting teams, especially in light of events such as the COVID-19 pandemic. This paper investigates several strategies for dealing with interruptions in time series forecasting, including highly adaptable models, intervention models, marking interrupted periods as missing, forecasting what may have been, downweighting the interruption period, and ensemble models. Each approach offers specific advantages and disadvantages, such as adaptability, memory retention, data integrity, flexibility, and accuracy. We evaluate the effectiveness of these strategies using two actual datasets that were interrupted by COVID-19, and we provide recommendations for how to handle these interruptions. This work contributes to the literature on time series forecasting, offering insights for academics and practitioners dealing with interrupted data in numerous domains.

Collections of time series formed via aggregation are prevalent in many fields. These are commonly referred to as hierarchical time series and may be constructed cross-sectionally across different variables, temporally by aggregating a single series at different frequencies, or even generalised beyond aggregation as time series that respect linear constraints. When forecasting such time series, a desirable condition is for forecasts to be coherent: to respect the constraints. The past decades have seen substantial growth in this field with the development of reconciliation methods that ensure coherent forecasts and improve forecast accuracy. This paper serves as a comprehensive review of forecast reconciliation and an entry point for researchers and practitioners dealing with hierarchical time series. The scope of the article includes perspectives on forecast reconciliation from machine learning, Bayesian statistics and probabilistic forecasting, as well as applications in economics, energy, tourism, retail demand and demography.

Talk to be given at Australian Statistical Conference, Wollongong, Australia.

Accurate forecasts of ambulance demand are crucial inputs when planning and deploying staff and fleet. Such demand forecasts are required at national, regional and sub-regional levels, and must take account of the nature of incidents and their priorities. These forecasts are often generated independently by different teams within the organization. As a result, forecasts at different levels may be inconsistent, resulting in conflicting decisions and a lack of coherent coordination in the service. To address this issue, we exploit the hierarchical and grouped structure of the demand time series, and apply forecast reconciliation methods to generate both point and probabilistic forecasts that are coherent and use all the available data at all levels of disaggregation.

The methods are applied to daily incident data from a UK ambulance service, from October 2015 to July 2019, disaggregated by nature of incident, priority, managing health board, and control area.

We use an ensemble of forecasting models, and show that the resulting forecasts are better than any individual forecasting model. We validate the forecasting approach using time-series cross-validation.

Forecast reconciliation is a post-forecasting process that involves transforming a set of incoherent forecasts into coherent forecasts which satisfy a given set of linear constraints for a multivariate time series. We extend the current state-of-the-art cross-sectional probabilistic forecast reconciliation approach to encompass a cross-temporal framework, where temporal constraints are also applied. Our proposed methodology employs both parametric Gaussian and non-parametric bootstrap approaches to draw samples from an incoherent cross-temporal distribution. To improve the estimation of the forecast error covariance matrix, we propose using multi-step residuals, especially in the time dimension where the usual one-step residuals fail. We evaluate the proposed methods through a detailed simulation study that investigates their theoretical and empirical properties. We further assess the effectiveness of the proposed cross-temporal reconciliation approach by applying it to two empirical forecasting experiments, using the Australian GDP and the Australian Tourism Demand datasets. For both applications, we show that the optimal cross-temporal reconciliation approaches significantly outperform the incoherent base forecasts in terms of the Continuous Ranked Probability Score and the Energy Score. Overall, our study expands and unifies the notation for cross-sectional, temporal and cross-temporal reconciliation, thus extending and deepening the probabilistic cross-temporal framework. The results highlight the potential of the proposed cross-temporal forecast reconciliation methods in improving the accuracy of probabilistic forecasting models.

The AIC returned by `TSLM()` is different from that returned by `lm()`. Why?

I get this question a lot, so I thought it might help to explain some issues with AIC calculation.

First, the equation for the AIC is given by
$$\text{AIC} = 2k - 2\log(L),$$
where $L$ is the likelihood of the model and $k$ is the number of parameters that are estimated (including the error variance). For a linear regression model with iid $\text{N}(0,\sigma^2)$ errors, fitted to $n$ observations, the log-likelihood can be written as
$$\log(L) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} e_i^2,$$
where $e_i$ is the residual for the $i$th observation. The AIC is then
$$\text{AIC} = 2k + n\log(2\pi\sigma^2) + \frac{1}{\sigma^2}\sum_{i=1}^{n} e_i^2.$$
Since we don’t know $\sigma^2$, we estimate it using the mean squared error (the maximum likelihood estimator), $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} e_i^2$, giving
$$\text{AIC} = 2k + n\log(\hat{\sigma}^2) + c,$$
where $c = n\log(2\pi) + n$ is a constant that depends only on the sample size and not on the model. This constant is often ignored. Thus, different software implementations can lead to different AIC values for the same model, since they may include or exclude the constant $c$.

Now, let’s look at what R returns in a simple case using the `lm()` function.

```
set.seed(2023)
library(fpp3)
df <- tibble(
  time = seq(100),
  x = rnorm(100),
  y = x + rnorm(100)
)
fit <- lm(y ~ x, data = df)
AIC(fit)
```

`[1] 275.6267`

We can check how this is calculated by computing it ourselves.

```
mse <- mean(residuals(fit)^2)
n <- length(residuals(fit))
k <- length(fit$coefficients) + 1 # +1 for the error variance
# With constant
2*k + n*log(mse) + n*log(2*pi) + n
```

`[1] 275.6267`

```
# Without constant
2*k + n*log(mse)
```

`[1] -8.161047`

Clearly, `AIC()` applied to the output from `lm()` is using the version with the constant.

Now compare that with what we obtain using the `TSLM()` function from the fable package.

```
df |>
  as_tsibble(index = time) |>
  model(TSLM(y ~ x)) |>
  glance() |>
  pull(AIC)
```

`[1] -8.161047`

This is the AIC without the constant.

The situation is even more confusing with ARIMA models, and some other model classes, because some functions use approximations to the likelihood, rather than the exact likelihood.

Thus, AIC values can be compared across models fitted using the same functions, but not necessarily when models have been fitted using different functions.
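To illustrate why comparisons within one function are still safe, note that the constant $c$ cancels whenever two models are compared on the same data under the same convention, so model rankings are unaffected. A small sketch (the data are regenerated here so the chunk is self-contained):

```r
set.seed(2023)
df <- data.frame(x = rnorm(100))
df$y <- df$x + rnorm(100)
fit1 <- lm(y ~ x, data = df)   # regression model
fit2 <- lm(y ~ 1, data = df)   # intercept-only model

# Difference in AIC with the constant, as computed by AIC()
AIC(fit1) - AIC(fit2)

# Difference in AIC without the constant: identical, since c cancels
n <- nrow(df)
k1 <- length(coef(fit1)) + 1
k2 <- length(coef(fit2)) + 1
(2 * k1 + n * log(mean(residuals(fit1)^2))) -
  (2 * k2 + n * log(mean(residuals(fit2)^2)))
```

Both differences are the same number, which is why either convention can be used for model selection, provided it is applied consistently.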

Does it make any sense to compute p-values for prediction intervals?

I received this email today:

My team recently used some techniques found in your writings to perform forecasts … Our work has been well received by reviewers, but one commenter asked two questions that I was hoping you may be able to provide insight on.

First, they wanted to know if we could provide P-values for our prediction intervals. In our work, we said, “Observed rates were deemed significantly different from expected rates when they did not fall within the 95% PI.” This same language has been used by others published in the same journal. I am curious to hear your thoughts on giving P-values for these PIs and what the appropriate method for doing so would be (if any).

Second, they asked about making a correction for multiple comparisons. … I believe we could apply a Bonferroni correction to the PIs, but that feels too liberal. Moreover, I am curious if this is even called for given our statement of what is deemed significant and the fact that our prediction interval construction relies on a non-parametric method.

Here is my reply:

I don’t think this makes any sense. A p-value is the probability of obtaining observations at least as extreme as those observed given a null hypothesis. What’s the hypothesis here? In forecasting, we don’t usually have a hypothesis. Instead, we fit a model to the data, and make predictions based on the model. I guess you could make the null hypothesis “The future observations come from the forecast distributions”, and then the p-value for each future time period would be the probability of the tails beyond the observations. But it is well-known that the estimated prediction intervals are almost always too narrow due to them not taking into account all sources of variance. So the size of this test would not be well-calibrated. I think you’re better off pushing back rather than trying to meet the request.

A Bonferroni correction assumes independence between the intervals, and that is not true for PIs from a forecasting model. The future forecast errors are all correlated (with the strength of the correlation depending on the model and the DGP). Usually we just say that these are pointwise PIs, and so we expect 5% of observations to fall outside the 95% prediction intervals. It is possible to generate uniform PIs, which contain 95% of all future sample paths, but this is a little tricky due to the correlations between horizons. It could be done via simulation: simulate 1000 future sample paths and compute the envelope that contains 950 of them.
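Here is a rough base-R sketch of that simulation idea, using a toy AR(1) process in place of a fitted model (the coefficients and horizons are illustrative):

```r
set.seed(1)
h <- 12; nsim <- 1000
# Simulate 1000 future sample paths from a toy AR(1) process
paths <- replicate(nsim, {
  e <- rnorm(h)
  y <- numeric(h)
  for (t in seq_len(h)) y[t] <- 0.8 * (if (t == 1) 0 else y[t - 1]) + e[t]
  y
})  # h x nsim matrix

# Pointwise 95% intervals: 5% of observations fall outside at each horizon
pw <- apply(paths, 1, quantile, probs = c(0.025, 0.975))

# Fraction of complete sample paths lying entirely inside the pointwise band
# (typically well below 0.95, because of correlation across horizons)
mean(colMeans(paths >= pw[1, ] & paths <= pw[2, ]) == 1)

# Uniform band: widen symmetric quantiles until 950 paths fit entirely
alpha <- 0.025
repeat {
  band <- apply(paths, 1, quantile, probs = c(alpha, 1 - alpha))
  if (mean(colMeans(paths >= band[1, ] & paths <= band[2, ]) == 1) >= 0.95) break
  alpha <- alpha / 1.5
}
```

In practice the sample paths would come from the fitted forecasting model (e.g., via a `simulate()` or `generate()` method) rather than a hand-coded AR(1).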

It sounds like the reviewers are only familiar with inferential statistics, and not with predictive modelling. You could point them to Shmueli’s excellent 2010 paper “To explain or predict”, highlighting the differences between the two paradigms.

Forecast combinations have flourished remarkably in the forecasting community and, in recent years, have become part of the mainstream of forecasting research and activities. Combining multiple forecasts produced from the single (target) series is now widely used to improve accuracy through the integration of information gleaned from different sources, thereby mitigating the risk of identifying a single “best” forecast. Combination schemes have evolved from simple combination methods without estimation, to sophisticated methods involving time-varying weights, nonlinear combinations, correlations among components, and cross-learning. They include combining point forecasts, and combining probabilistic forecasts. This paper provides an up-to-date review of the extensive literature on forecast combinations, together with reference to available open-source software implementations. We discuss the potential and limitations of various methods and highlight how these ideas have developed over time. Some important issues concerning the utility of forecast combinations are also surveyed. Finally, we conclude with current research gaps and potential insights for future research.

When using a training/test split, or time-series cross-validation, are you choosing a specific model or a model class?

This question arises almost every time I teach a forecasting workshop, and it was raised again in the following email I received today:

I have a time series that I have split into training and test datasets with an 80%-20% ratio. I fit a series of different models (ETS, BATS, ARIMA, NN etc) to the training data and generate my forecasts from each model. When evaluating the forecasts against the test set I find the model that gives the best outcome is an ARIMA(1,1,1) that was selected using the auto.arima function. My question is this, should I proceed to fit an ARIMA(1,1,1) to the whole data set, or should I use the auto.arima function again which may give me a slightly different (p,d,q) order as there is now an extra 20% of unseen data available to the forecast model? Any guidance would be greatly received.

This is a common issue. If you only have one class of model to consider (e.g., only ETS or only ARIMA), then it is easy enough to select the model on all available data using the AIC, and use the selected model to forecast the future. But if you are selecting between model classes, then you need to use either a training/test split, or (preferably) a time-series cross-validation procedure.

If you use time-series cross-validation, then there would usually be different models selected for each training set, and the cross-validated error is a measure of how well the model class works for your data. In that case, there is no single model for the training data, and you are selecting the *model class* rather than a specific model. This makes it clear that you should then apply the selected model class to all the data, when forecasting beyond the end of the available data. In other words, if you choose ARIMA over ETS, then you would then fit an ARIMA model to all the data, and use that model to forecast the future.

You can think of a simple training/test split as a special case of time-series cross-validation, where there is a single fold. So the same argument applies. That is, you are selecting the model class that works best for your data, and so you should apply that model class to all the data, when forecasting beyond the end of the available data.

This example also illustrates why it is important to use a time-series cross-validation procedure, rather than a simple training/test split. In this case, the ARIMA model was selected because it happened to work best for the particular training/test split that was used. But if a different split had been used, then a different model might have been selected. So the model selection is not stable. By averaging over multiple folds using a time-series cross-validation procedure, you can get a more stable estimate of the model class that works best for your data.
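In fpp3/fable notation, the workflow looks something like this (the dataset, window sizes, and forecast horizons are illustrative):

```r
library(fpp3)

# Time-series cross-validation: stretching training windows
cv <- aus_production |>
  stretch_tsibble(.init = 100, .step = 4)

# Fit both model classes to every fold; each fold may select a
# different ETS form or a different ARIMA order
fits <- cv |>
  model(ets = ETS(Beer), arima = ARIMA(Beer))

# Compare the cross-validated accuracy of the model *classes*
fits |>
  forecast(h = 4) |>
  accuracy(aus_production)

# Then refit the winning class to all the data and forecast ahead;
# the automatic order selection is run again on the full series
aus_production |>
  model(ARIMA(Beer)) |>
  forecast(h = 8)
```

The key point is visible in `fits`: there is one fitted model per fold, so the accuracy comparison is about the classes `ETS` and `ARIMA`, not about any single `(p,d,q)` specification.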

Collections of time series that are formed via aggregation are prevalent in many fields. These are commonly referred to as hierarchical time series and may be constructed cross-sectionally across different variables, temporally by aggregating a single series at different frequencies, or may even be generalised beyond aggregation as time series that respect linear constraints. When forecasting such time series, a desirable condition is for forecasts to be coherent, that is to respect the constraints. The past decades have seen substantial growth in this field with the development of reconciliation methods that not only ensure coherent forecasts but can also improve forecast accuracy. This talk provides an overview of recent work on forecast reconciliation.

One of the most challenging aspects for managers when building a forecasting system is choosing how to aggregate the data at different levels. This is frequently done without the manager knowing how these choices can compromise the system’s accuracy. This paper illustrates these compromises by comparing different structures and aggregation criteria. Our paper proposes and empirically tests a framework on how to build a coherent and more accurate forecasting system. The framework’s first phase compares different time series forecasting methods, including statistical, “standard” machine learning, and deep learning. Results show that one of the statistical methods (autoregressive integrated moving average, or, for short, ARIMA) outperforms machine and deep learning methods. The second phase compares different combinations of aggregation criteria, structures of the forecasting system, and coherent forecast methods (i.e., adjustments to the forecasts at different levels of aggregation). The results show that using different criteria and structures indeed impacts predictions’ accuracy. When it is necessary to disaggregate the forecast, our results show that it is best to add more information in a grouped structure, adjusted by a bottom-up method. This combination provides the best performance, i.e., the lowest mean absolute scaled error (MASE) in most nodes, compared to the other structures and coherent forecast methods used. The results also suggest that aggregating the time series further by geographical regions is essential to improve accuracy when forecasting products’ and channels’ sales.

This paper discusses the use of forecast reconciliation with stock price time series and the corresponding stock index. The individual stock price series may be grouped using known meta-data or other clustering methods. We propose a novel forecasting framework that combines forecast reconciliation and clustering, to lead to better forecasts of both the index and the individual stock price series. The proposed approach is applied to the Dow Jones Industrial Average Index and its component stocks. The results demonstrate empirically that reconciliation improves forecasts of the stock market index and its constituents.

Over the past 6 months, George Athanasopoulos and I have added videos to most sections of the 3rd edition of our textbook *Forecasting: principles and practice*.

We have taught from the book many times, but this year we decided to pre-record short videos for each section. Our students often prefer a video explanation to reading the textbook, and we thought other readers might appreciate hearing from us as well.

These videos are embedded in most sections of the book. So far, we’ve covered the sections that we include in our own courses, but we hope to eventually have videos for all sections. Most of these were done in a single take, so they are sometimes a little rough, but hopefully still useful.

You can view the entire playlist on YouTube.

*(Postponed until further notice)*

**Distinguished Lecture Series for the International Institute of Forecasters**

It is common to forecast at different levels of aggregation. For example, a retail company will want national forecasts, state forecasts, and store-level forecasts. And they will want them for all products, for groups of products, and for individual products. It is natural to want the forecasts to be “coherent” — that is, that the forecasts add up in the same way as the data. For example, forecasts of regional sales should add up to forecasts of state sales, which should in turn add up to give a forecast for national sales. Coherent forecasts are needed to allow effective planning, such as the allocation of resources across an organization based on forecasts of sales. Coherent forecasts are also more accurate than incoherent forecasts.

Over the past 15 years, forecast reconciliation methods have been developed to ensure forecasts are coherent. Forecasts at all levels of aggregation are produced, and the results are “reconciled” so they are consistent with each other.

I will provide an up-to-date overview of this area, and show how reconciliation methods can lead to better forecasts and better forecasting practices.

*(Click title for slides)*

**Hierarchical time series and forecast reconciliation**

- Hyndman et al. (2011)
- Wickramasuriya, Athanasopoulos, and Hyndman (2019)
- Hyndman, Lee, and Wang (2016)

**Perspectives on forecast reconciliation**

- Di Fonzo and Girolimetto (2022)
- Panagiotelis et al. (2021)
- Wickramasuriya (2021)
- van Erven and Cugliari (2015)
- Ben Taieb and Koo (2019)
- Wickramasuriya, Turlach, and Hyndman (2020)
- Zhang et al. (2022)
- Ashouri, Hyndman, and Shmueli (2022)
- Spiliotis et al. (2021)

**Probabilistic forecast reconciliation**

- Ben Taieb, Taylor, and Hyndman (2021)
- Panagiotelis et al. (2023)
- Corani, Azzimonti, and Rubattu (2023)
- Rostami-Tabar and Hyndman (2023)

**Temporal and cross-temporal forecast reconciliation**

- Athanasopoulos et al. (2017)
- Di Fonzo and Girolimetto (2023)
- Girolimetto et al. (2023)

Ashouri, Mahsa, Rob J Hyndman, and Galit Shmueli. 2022. “Fast Forecast Reconciliation Using Linear Models.” *J Computational & Graphical Statistics* 31 (1): 263–82. robjhyndman.com/publications/lhf.

Athanasopoulos, George, Rob J Hyndman, Nikolaos Kourentzes, and Fotios Petropoulos. 2017. “Forecasting with Temporal Hierarchies.” *European J Operational Research* 262 (1): 60–74. robjhyndman.com/publications/temporal-hierarchies/.

Ben Taieb, Souhaib, and Bonsoo Koo. 2019. “Regularized Regression for Hierarchical Forecasting Without Unbiasedness Conditions.” In *Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*, 1337–47. KDD ’19. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3292500.3330976.

Ben Taieb, Souhaib, James W Taylor, and Rob J Hyndman. 2021. “Hierarchical Probabilistic Forecasting of Electricity Demand with Smart Meter Data.” *J American Statistical Association* 116 (533): 27–43. robjhyndman.com/publications/hpf-electricity/.

Corani, Giorgio, Dario Azzimonti, and Nicolo Rubattu. 2023. “Probabilistic Reconciliation of Count Time Series.” *International Journal of Forecasting*.

Di Fonzo, Tommaso, and Daniele Girolimetto. 2022. “Forecast Combination-Based Forecast Reconciliation: Insights and Extensions.” *International Journal of Forecasting* forthcoming. https://doi.org/10.1016/j.ijforecast.2022.07.001.

———. 2023. “Cross-Temporal Forecast Reconciliation: Optimal Combination Method and Heuristic Alternatives.” *International Journal of Forecasting* 39 (1): 39–57. https://doi.org/10.1016/j.ijforecast.2021.08.004.

Girolimetto, Daniele, George Athanasopoulos, Tommaso Di Fonzo, and Rob J Hyndman. 2023. “Cross-Temporal Probabilistic Forecast Reconciliation.” robjhyndman.com/publications/ctprob.html.

Hyndman, Rob J, Roman A Ahmed, George Athanasopoulos, and Han Lin Shang. 2011. “Optimal Combination Forecasts for Hierarchical Time Series.” *Computational Statistics & Data Analysis* 55 (9): 2579–89. robjhyndman.com/publications/hierarchical/.

Hyndman, Rob J, Alan Lee, and Earo Wang. 2016. “Fast Computation of Reconciled Forecasts for Hierarchical and Grouped Time Series.” *Computational Statistics & Data Analysis* 97: 16–32. robjhyndman.com/publications/hgts/.

Panagiotelis, Anastasios, Puwasala Gamakumara, George Athanasopoulos, and Rob J Hyndman. 2021. “Forecast Reconciliation: A Geometric View with New Insights on Bias Correction.” *International J Forecasting* 37 (1): 343–59. robjhyndman.com/publications/hierarchical-geometry.

———. 2023. “Probabilistic Forecast Reconciliation: Properties, Evaluation and Score Optimisation.” *European J Operational Research* 306 (2): 693–706. robjhyndman.com/publications/coherentprob/.

Rostami-Tabar, Bahman, and Rob J Hyndman. 2023. “Hierarchical Time Series Forecasting in Emergency Medical Services.” robjhyndman.com/publications/fem.html.

Spiliotis, Evangelos, Mahdi Abolghasemi, Rob J Hyndman, Fotios Petropoulos, and Vassilios Assimakopoulos. 2021. “Hierarchical Forecast Reconciliation with Machine Learning.” *Applied Soft Computing* 112: 107756. robjhyndman.com/publications/hfrml/.

van Erven, Tim, and Jairo Cugliari. 2015. “Game-Theoretically Optimal Reconciliation of Contemporaneous Hierarchical Time Series Forecasts.” In *Modeling and Stochastic Learning for Forecasting in High Dimension*, edited by Anestis Antoniadis, Jean-Michel Poggi, and Xavier Brossat, 297–317. Cham: Springer International Publishing. hal.inria.fr/hal-00920559.

Wickramasuriya, Shanika L. 2021. “Properties of Point Forecast Reconciliation Approaches.” *arXiv Preprint arXiv:2103.11129*. arxiv.org/abs/2103.11129.

Wickramasuriya, Shanika L, George Athanasopoulos, and Rob J Hyndman. 2019. “Optimal Forecast Reconciliation for Hierarchical and Grouped Time Series Through Trace Minimization.” *J American Statistical Association* 114 (526): 804–19. robjhyndman.com/publications/mint/.

Wickramasuriya, Shanika L, Berwin A Turlach, and Rob J Hyndman. 2020. “Optimal Non-Negative Forecast Reconciliation.” *Statistics & Computing* 30 (5): 1167–82. robjhyndman.com/publications/nnmint/.

Zhang, Bohan, Yanfei Kang, Anastasios Panagiotelis, and Feng Li. 2022. “Optimal Reconciliation with Immutable Forecasts.” *European Journal of Operational Research* forthcoming.

Global Forecasting Models (GFM) that are trained across a set of multiple time series have shown superior results in many forecasting competitions and real-world applications compared with univariate forecasting approaches. One aspect of the popularity of statistical forecasting models such as ETS and ARIMA is their relative simplicity and interpretability (in terms of relevant lags, trend, seasonality, and others), while GFMs typically lack interpretability, especially towards particular time series. This reduces the trust and confidence of the stakeholders when making decisions based on the forecasts without being able to understand the predictions. To mitigate this problem, in this work, we propose a novel local model-agnostic interpretability approach to explain the forecasts from GFMs. We train simpler univariate surrogate models that are considered interpretable (e.g., ETS) on the predictions of the GFM on samples within a neighbourhood that we obtain through bootstrapping or straightforwardly as the one-step-ahead global black-box model forecasts of the time series which needs to be explained. We then evaluate the explanations for the forecasts of the global models in both qualitative and quantitative aspects such as accuracy, fidelity, stability and comprehensibility, and are able to show the benefits of our approach.

The Ljung-Box test is widely used to test for autocorrelation remaining in the residuals after fitting a model to a time series. In this post, I look at the degrees of freedom used in such tests.

Suppose an ARMA($p$,$q$) model is fitted to a time series of length $T$, giving a series of residuals $e_1,\dots,e_T$, and let the autocorrelations of this residual series be denoted by $r_1, r_2, \dots$. The first $\ell$ autocorrelations are used to construct the statistic
$$Q = T(T+2) \sum_{k=1}^{\ell} \frac{r_k^2}{T-k}.$$

This statistic was discussed by Box and Pierce (1970), who argued that if $T$ is large, and the model parameters correspond to the true data generating process, then $Q$ has a $\chi^2$ distribution with $\ell$ degrees of freedom. Later, Ljung and Box (1978) showed that if the model is correct, but with unknown parameters, then $Q$ has a $\chi^2$ distribution with $\ell - p - q$ degrees of freedom.
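The statistic is easy to compute directly and check against R’s built-in `Box.test()` (a quick sketch using simulated white noise as the “residual” series):

```r
set.seed(1)
e <- rnorm(250)   # residual series (white noise here)
n <- length(e)
l <- 10           # number of autocorrelations used

# Residual autocorrelations r_1, ..., r_l (drop the lag-0 value)
r <- acf(e, lag.max = l, plot = FALSE)$acf[-1]

# Ljung-Box statistic computed from the formula
Q <- n * (n + 2) * sum(r^2 / (n - seq_len(l)))

# Compare with R's built-in computation
Box.test(e, lag = l, type = "Ljung-Box")$statistic
```

The two values agree; `Box.test()` then converts the statistic to a p-value using a $\chi^2$ distribution with `lag - fitdf` degrees of freedom, which is exactly where the choice of degrees of freedom enters.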

These days, the Ljung-Box test is applied to a lot more models than non-seasonal ARMA models, and it is not clear what the degrees of freedom should be for other models. For example:

- What if the model includes an intercept term? Should that be included in the degrees of freedom calculation?
- What about a seasonal ARIMA model? Do we just count all coefficients?
- Or a regression with ARMA errors? Should we include the regression coefficients when computing the degrees of freedom?
- Or an ETS model? Do we count just the smoothing parameters, or do we include the states as well, or something else?

Not long ago, I had naively assumed that the correct degrees of freedom would be $\ell - K$, where $K$ is the number of parameters estimated. I am in good company, because Andrew Harvey (1990, p259) made exactly the same conjecture. That was what was coded in the `forecast::checkresiduals()` function prior to v8.21, and how the test was applied in Hyndman and Athanasopoulos (2018) and Hyndman and Athanasopoulos (2021) until February 2023. But a recent GitHub discussion with Achim Zeileis convinced me that it is incorrect.

Let’s look at a few examples. For each model, we will simulate 5000 series, each of length 250 observations. For each series, we compute the p-value of a Ljung-Box test with $\ell = 10$, using $\ell - K$ degrees of freedom for different values of $K$. Under the null hypothesis of uncorrelated residuals, the p-values should have a uniform distribution.

```
library(forecast)
library(ggplot2)
set.seed(0)

# Function to simulate p-values given a DGP model and
# a function to fit the model to a time series
simulate_pvalue <- function(model, fit_fn, l = 10) {
  ## Simulate series
  if (is.null(model$xreg)) {
    y <- simulate(model, n = 250)
  } else {
    y <- simulate(model, xreg = model$xreg, n = 250)
  }
  if (inherits(model, "ets")) {
    # If multiplicative errors, fix non-positive values
    if (model$components[1] == "M") {
      y[y <= 0] <- 1e-5
    }
  }
  ## Fit model
  m <- fit_fn(y)
  ## Compute p-values for K = 0, 1, 2, 3 (passed to fitdf)
  pv <- purrr::map_vec(0:3, function(k) {
    Box.test(residuals(m), lag = l, fitdf = k, type = "Ljung-Box")$p.value
  })
  names(pv) <- paste("K =", 0:3)
  return(pv)
}

# Function to replicate the above function
simulate_pvalues <- function(model, fit_fn, nsim = 5000, l = 10) {
  purrr::map_dfr(seq(nsim), function(x) {
    simulate_pvalue(model, fit_fn, l = l)
  })
}

# Histograms of p-values
hist_pvalues <- function(pv) {
  pv |>
    tidyr::pivot_longer(cols = dplyr::everything()) |>
    ggplot(aes(x = value)) +
    geom_histogram(bins = 30, boundary = 0) +
    facet_grid(. ~ name) +
    labs(title = "P value distributions")
}

# A nice table of the size of the test
table_pvalues <- function(pv) {
  sizes <- c(0.01, 0.05, 0.1)
  tibble::tibble(`test size` = sizes) |>
    dplyr::bind_cols(
      purrr::map_df(pv, function(x) ecdf(x)(sizes))
    ) |>
    knitr::kable()
}
```

We will simulate from an ARIMA(2,0,0) model with a non-zero mean. For the Ljung-Box test, we will consider $K = 0, 1, 2, 3$. Note that $K=0$ corresponds to the original proposal of Box and Pierce (1970), $K=2$ counts only the ARMA coefficients, and $K=3$ counts all parameters estimated in the model (including the mean). The resulting distributions of the p-values are shown below.

```
model <- Arima(sqrt(lynx), order = c(2, 0, 0))
fit_fn <- function(y) {
  Arima(y, order = c(2, 0, 0), include.mean = TRUE)
}
arima_pvalues <- simulate_pvalues(model, fit_fn)
hist_pvalues(arima_pvalues)
```

`table_pvalues(arima_pvalues)`

| test size | K = 0 | K = 1 | K = 2 | K = 3 |
|---|---|---|---|---|
| 0.01 | 0.0046 | 0.0072 | 0.0124 | 0.0220 |
| 0.05 | 0.0226 | 0.0354 | 0.0534 | 0.0818 |
| 0.10 | 0.0474 | 0.0678 | 0.1016 | 0.1514 |

Clearly the test with $K=2$ is better than the alternatives. The table shows the empirical size of the test for different nominal levels, and the empirical sizes are closest to the nominal sizes when $K=2$. So we shouldn’t count the mean when computing the degrees of freedom.

Next, we will simulate from an ARIMA(0,1,1)(0,1,1)$_{12}$ model, often called the “airline model” due to its application to the air passenger series in Box et al. (2016). In fact, our DGP for the simulations will be a model fitted to the `AirPassengers` data set. Again, we consider $K = 0, 1, 2, 3$. There are two parameters to be estimated.

```
model <- Arima(log(AirPassengers), order = c(0, 1, 1), seasonal = c(0, 1, 1))
fit_fn <- function(y) {
  Arima(y, order = c(0, 1, 1), seasonal = c(0, 1, 1))
}
sarima_pvalues <- simulate_pvalues(model, fit_fn)
hist_pvalues(sarima_pvalues)
```

`table_pvalues(sarima_pvalues)`

| test size | K = 0 | K = 1 | K = 2 | K = 3 |
|---|---|---|---|---|
| 0.01 | 0.0100 | 0.0170 | 0.0282 | 0.0456 |
| 0.05 | 0.0484 | 0.0684 | 0.1018 | 0.1452 |
| 0.10 | 0.0882 | 0.1218 | 0.1774 | 0.2526 |

Interesting. Although there are two parameters here, the tests with $K=0$ and $K=1$ do better than $K=2$. I would have expected $K=2$ to be the right choice, but the test with $K=2$ has empirical size about twice the nominal size.

As a guess, perhaps the seasonal parameter isn’t having much effect when $\ell=10$, since the seasonal MA term mainly affects autocorrelations around lag 12. We can test what happens for larger $\ell$ by setting $\ell=24$ (covering two years), and repeating the exercise.

```
sarima_pvalues <- simulate_pvalues(model, fit_fn, l=24)
hist_pvalues(sarima_pvalues)
```

`table_pvalues(sarima_pvalues)`

| test size | K = 0 | K = 1 | K = 2 | K = 3 |
|---|---|---|---|---|
| 0.01 | 0.0134 | 0.0164 | 0.0216 | 0.0278 |
| 0.05 | 0.0492 | 0.0614 | 0.0748 | 0.0940 |
| 0.10 | 0.0864 | 0.1066 | 0.1312 | 0.1590 |

I was expecting $K=2$ to do best there, but not so. $K=0$ is the most uniform, and gives empirical sizes closest to the nominal sizes, with the results getting worse as $K$ increases. Perhaps always setting $K=0$ would be a sensible strategy for ARIMA models containing seasonal components. This needs some theoretical analysis.

We will simulate from a linear trend model with AR(1) errors. Here, $K=1$ counts only the ARMA coefficients, while $K=3$ counts all parameters estimated (the AR coefficient, the intercept, and the trend coefficient). The resulting distributions of the p-values are shown below.

```
model <- Arima(10 + seq(250) / 10 + arima.sim(list(ar = 0.7), n = 250),
  order = c(1, 0, 0), xreg = seq(250))
fit_fn <- function(y) {
  Arima(y, order = c(1, 0, 0), include.constant = TRUE, xreg = seq(250))
}
regarima_pvalues <- simulate_pvalues(model, fit_fn)
hist_pvalues(regarima_pvalues)
```

`table_pvalues(regarima_pvalues)`

| test size | K = 0 | K = 1 | K = 2 | K = 3 |
|---|---|---|---|---|
| 0.01 | 0.0072 | 0.0104 | 0.0176 | 0.0294 |
| 0.05 | 0.0310 | 0.0532 | 0.0782 | 0.1200 |
| 0.10 | 0.0696 | 0.0996 | 0.1478 | 0.2150 |

The test with $K=1$ looks the most uniform, with the size of the test closest to the nominal values. So counting only the ARMA coefficients seems to be correct here.

Next, we will consider a linear trend model with iid errors. That is the same as the previous model, but with a simpler error structure.

```
model <- Arima(10 + seq(250) / 10 + rnorm(250),
  order = c(0, 0, 0), xreg = seq(250))
fit_fn <- function(y) {
  Arima(y, include.constant = TRUE, xreg = seq(250))
}
trend_pvalues <- simulate_pvalues(model, fit_fn)
hist_pvalues(trend_pvalues)
```

`table_pvalues(trend_pvalues)`

| test size | K = 0 | K = 1 | K = 2 | K = 3 |
|---|---|---|---|---|
| 0.01 | 0.0148 | 0.0214 | 0.0338 | 0.0554 |
| 0.05 | 0.0580 | 0.0844 | 0.1204 | 0.1746 |
| 0.10 | 0.1054 | 0.1478 | 0.2062 | 0.2816 |

The test with $K=0$ looks best. If we think of a regression model as a RegARIMA model with ARIMA(0,0,0) errors, this is consistent with the previous results, setting $K = p + q = 0$.

Now let’s try an ETS(A,N,N) model, again using 5000 series each of length 250. If we count only the smoothing parameter, $K=1$, but if we count all estimated parameters (including the initial state), $K=2$. The distributions of p-values are shown below.

```
model <- ets(fma::strikes, "ANN")
fit_fn <- function(y) {
  ets(y, model = "ANN", damped = FALSE)
}
ets_pvalues <- simulate_pvalues(model, fit_fn)
hist_pvalues(ets_pvalues)
```

`table_pvalues(ets_pvalues)`

| test size | K = 0 | K = 1 | K = 2 | K = 3 |
|---|---|---|---|---|
| 0.01 | 0.0064 | 0.0112 | 0.0174 | 0.0316 |
| 0.05 | 0.0334 | 0.0522 | 0.0778 | 0.1196 |
| 0.10 | 0.0702 | 0.0960 | 0.1442 | 0.2098 |

$K=1$ looks about right. That makes sense, as an ETS(A,N,N) model is equivalent to an ARIMA(0,1,1) model, which has a single ARMA coefficient.
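The equivalence can be verified directly from the ETS(A,N,N) state space equations,

$$
y_t = \ell_{t-1} + \varepsilon_t, \qquad \ell_t = \ell_{t-1} + \alpha\varepsilon_t,
$$

so that

$$
y_t - y_{t-1} = (\ell_{t-1} - \ell_{t-2}) + \varepsilon_t - \varepsilon_{t-1}
             = \varepsilon_t - (1-\alpha)\varepsilon_{t-1}.
$$

That is, the differenced series follows an MA(1) process, giving an ARIMA(0,1,1) model with $\theta_1 = \alpha - 1$: one ARMA coefficient, consistent with $K=1$.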

Next, let’s try an ETS(M,N,N) model, which has no ARIMA equivalent, but which has one smoothing parameter and one initial state to estimate.

```
model <- ets(fma::strikes, "MNN")
fit_fn <- function(y) {
  ets(y, model = "MNN", damped = FALSE)
}
ets_pvalues <- simulate_pvalues(model, fit_fn)
hist_pvalues(ets_pvalues)
```

`table_pvalues(ets_pvalues)`

| test size | K = 0 | K = 1 | K = 2 | K = 3 |
|---|---|---|---|---|
| 0.01 | 0.0054 | 0.0084 | 0.0172 | 0.0308 |
| 0.05 | 0.0334 | 0.0548 | 0.0814 | 0.1274 |
| 0.10 | 0.0708 | 0.1062 | 0.1520 | 0.2136 |

Again, $K=1$ appears to be the best.

An ETS(A,A,N) model is equivalent to an ARIMA(0,2,2) model, so I expect this one to need $K=2$.

```
model <- ets(fma::strikes, "AAN")
fit_fn <- function(y) {
  ets(y, model = "AAN", damped = FALSE)
}
ets_pvalues <- simulate_pvalues(model, fit_fn)
hist_pvalues(ets_pvalues)
```

`table_pvalues(ets_pvalues)`

| test size | K = 0 | K = 1 | K = 2 | K = 3 |
|---|---|---|---|---|
| 0.01 | 0.0034 | 0.0062 | 0.0106 | 0.0172 |
| 0.05 | 0.0180 | 0.0308 | 0.0514 | 0.0802 |
| 0.10 | 0.0430 | 0.0658 | 0.1008 | 0.1486 |

This time, my conjecture is correct, and $K=2$ works well.

Finally, we will check a seasonal ETS model.

```
model <- ets(log(AirPassengers), model = "AAA", damped = FALSE)
fit_fn <- function(y) {
  ets(y, model = "AAA", damped = FALSE)
}
ets_pvalues <- simulate_pvalues(model, fit_fn)
hist_pvalues(ets_pvalues)
```

`table_pvalues(ets_pvalues)`

| test size | K = 0 | K = 1 | K = 2 | K = 3 |
|---|---|---|---|---|
| 0.01 | 0.0178 | 0.0276 | 0.0396 | 0.0602 |
| 0.05 | 0.0634 | 0.0916 | 0.1360 | 0.1942 |
| 0.10 | 0.1206 | 0.1672 | 0.2244 | 0.3078 |

Here there are 3 smoothing parameters and 13 initial states to estimate. So I was expecting $K=3$ (counting only the smoothing parameters) to do best, but it is the worst. Instead, $K=0$ is the best. I’m not sure what to make of this result.

Based only on this empirical evidence:

- For ARIMA models, use $\ell - p - q$ degrees of freedom (i.e., $K = p + q$).
- For seasonal ARIMA models, it appears that $K=0$ gives the best results.
- For regression with ARIMA errors, use $\ell - p - q$ degrees of freedom, not counting the regression coefficients.
- For OLS regression, use $\ell$ degrees of freedom ($K=0$).
- For non-seasonal ETS models, set $K$ equal to the number of smoothing parameters.
- For seasonal ETS models, use $K=0$.

The last two of these appear to be contradictory, and it is not clear why.

It seems like this might be a good project for a PhD student to explore. In particular, can these suggestions based on empirical evidence be supported theoretically? It would also be good to explore other models such as TBATS, ARFIMA, NNETAR, etc.

For now, I might avoid teaching the Ljung-Box test, and just get students to look at the ACF plot of the residuals instead.
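To complement the ACF plot with something numerical, here is a rough sketch in Python (the `acf_check` helper is my own illustration, not part of any package), flagging lags whose residual autocorrelation falls outside the usual $\pm 1.96/\sqrt{T}$ white-noise bounds drawn on such plots:

```python
import numpy as np

def acf_check(resid, nlags=24, crit=1.96):
    # Sample autocorrelations r_1, ..., r_nlags of the residual series
    e = np.asarray(resid, dtype=float)
    T = len(e)
    e = e - e.mean()
    denom = np.sum(e**2)
    r = np.array([np.sum(e[k:] * e[:-k]) / denom for k in range(1, nlags + 1)])
    # Large-sample white-noise bounds, as drawn on a standard ACF plot
    bound = crit / np.sqrt(T)
    return np.flatnonzero(np.abs(r) > bound) + 1  # lags exceeding the bounds

# With white-noise residuals, roughly 5% of lags are flagged by chance
rng = np.random.default_rng(0)
flagged = acf_check(rng.standard_normal(250))
```

Of course, this inherits the same multiple-testing issue as eyeballing the plot: with 24 lags, one or two excursions beyond the bounds are expected even for white noise.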

- Kim et al. (2004) show that the usual $\chi^2$ approximation does not hold for a unit root AR(1) model with ARCH errors.
- McLeod and Li (1983) consider the equivalent test applied to the autocorrelations of squared residuals, and show that $K=0$ (i.e., $\ell$ degrees of freedom) is appropriate there.
- Mahdi (2016) discusses a variation of the Ljung-Box test for seasonal ARIMA models, considering only the autocorrelations at the seasonal lags.
- Several other portmanteau tests (i.e., tests based on multiple autocorrelations) are available, and perhaps we should be using them rather than the older Ljung-Box test. See Mahdi (2021) for some recent developments.

Box, George E P, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. 2016. *Time Series Analysis: Forecasting and Control*. 5th ed. John Wiley & Sons.

Box, George E P, and David A Pierce. 1970. “Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models.” *Journal of the American Statistical Association* 65 (332): 1509–26. https://doi.org/10.2307/2284333.

Harvey, Andrew C. 1990. *Forecasting, Structural Time Series Models and the Kalman Filter*. Cambridge University Press.

Hyndman, Rob J, and George Athanasopoulos. 2018. *Forecasting: Principles and Practice*. 2nd ed. Melbourne, Australia: OTexts. OTexts.org/fpp2.

———. 2021. *Forecasting: Principles and Practice*. 3rd ed. Melbourne, Australia: OTexts. OTexts.org/fpp3.

Kim, Eunhee, Jeongcheol Ha, Youngsook Jeon, and Sangyeol Lee. 2004. “Ljung-Box Test in Unit Root AR-ARCH Model.” *Communications for Statistical Applications and Methods* 11 (2): 323–27. https://doi.org/10.5351/ckss.2004.11.2.323.

Ljung, Greta M, and George E P Box. 1978. “On a Measure of Lack of Fit in Time Series Models.” *Biometrika* 65 (2): 297–303. https://doi.org/10.1093/biomet/65.2.297.

Mahdi, Esam. 2016. “Portmanteau Test Statistics for Seasonal Serial Correlation in Time Series Models.” *SpringerPlus* 5 (1): 1485. https://doi.org/10.1186/s40064-016-3167-4.

———. 2021. “New Goodness-of-Fit Tests for Time Series Models.” http://arxiv.org/abs/2008.08176.

McLeod, A I, and W K Li. 1983. “Diagnostic Checking ARMA Time Series Models Using Squared-Residual Autocorrelations.” *Journal of Time Series Analysis* 4 (4): 269–73. https://doi.org/10.1111/j.1467-9892.1983.tb00373.x.

Providing forecasts for ultra-long time series plays a vital role in various activities, such as investment decisions, industrial production arrangements, and farm management. This paper develops a novel distributed forecasting framework to tackle the challenges of forecasting ultra-long time series using the industry-standard MapReduce framework. The proposed model combination approach facilitates distributed time series forecasting by combining the local estimators of time series models delivered by worker nodes and minimizing a global loss function. In this way, instead of unrealistically assuming that the data generating process (DGP) of an ultra-long time series stays invariant, we make assumptions only about the DGP of subseries spanning shorter time periods. We investigate the performance of the proposed approach with AutoRegressive Integrated Moving Average (ARIMA) models using a real data application as well as numerical simulations. Compared with directly fitting the whole data with ARIMA models, our approach improves forecasting accuracy and computational efficiency for both point forecasts and prediction intervals, especially at longer forecast horizons. Moreover, we explore some potential factors that may affect its forecasting performance.
