View code at: https://github.com/apang782/CTSA

View my other projects at: https://apang782.github.io/


Abstract

Is candy production seasonal? Does production follow regular trends? Considering the copious amounts of candy consumed during Halloween and weight loss being one of the most popular New Year’s resolutions, one would expect regular patterns of production every year. Using the “US Candy Production by Month” dataset from kaggle.com, this project fits the best (S)ARIMA model to the data, using time series techniques both to investigate these patterns and to forecast future production values. In addition to forecasting, the project explores whether a seasonal component is present and whether there is an ongoing trend in candy production.

This dataset uses the 2012 industrial production (IP) index as the unit for output values.

Introduction

To evaluate the validity of the model’s predicted future values, the main dataset is subsetted into training and test sets. The model itself is based on the training set (values from Jan 1972 to Aug 2016, a total of 536 values), while the test set (containing values from Sep 2016 to Aug 2017 – a total of 12 values) is used to evaluate model predictions. The Box-Cox transformation is used to normalize the data.
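A minimal R sketch of this split; the file and column names follow the Kaggle CSV and are assumptions here:

```r
# File and column names are assumptions based on the Kaggle CSV
candy <- read.csv("candy_production.csv")
full <- ts(candy$IPG3113N, start = c(1972, 1), frequency = 12)

# "c" mirrors the report's naming; calls to the function c() still resolve normally
c <- window(full, end = c(2016, 8))         # training: Jan 1972 - Aug 2016 (536 values)
c.test <- window(full, start = c(2016, 9))  # test: Sep 2016 - Aug 2017 (12 values)
```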

To answer whether a seasonal component is present:
- Present seasonality is removed by differencing at the period of the data (here: 12 months, or 1 year).

To answer whether there is an ongoing trend in candy production:
- Trend is then removed from the data by taking a difference at lag 1.

After each transformation, the variance of the series is computed to verify that the transformations and differencing are appropriate. The stationarity of the data is also checked before proceeding. Then, the ACF and PACF of the data are examined to identify potentially viable models. Maximum likelihood estimation and AICc are used to choose the models with the most potential. Model diagnostics and spectral analysis are performed to ensure the model is representative of the data. The estimated parameter values are also reviewed to check for model stationarity and invertibility. The final model is then used to predict future values, and the predictions are compared to the true values from the test set.

Conclusions

This project shows an increasing trend in candy production in the United States over the years, along with seasonality within each year. Each transformation and differencing step was supported by a reduction in variance and by statistical tests. The final model does not entirely match the predictions drawn from the ACF/PACF plots and does not pass every test of normality. Nevertheless, its predictions are accurate when judged against the subsetted test values.

Acknowledgments

Rachael Tatman’s dataset “US Candy Production by Month” on kaggle.com is acknowledged as the data source of this project:
https://www.kaggle.com/rtatman/us-candy-production-by-month

R and RStudio are acknowledged as the software used to make this project possible.

Plotting the Time Series

A set seed value is used for reproducibility. Initial observations of the truncated time series suggest both trend (upward) and seasonality (relatively similar patterns at regular intervals through time). At a quick glance, some similarity appears at the head and tail ends of the graph, as well as around 2010. The histogram suggests non-normal data due to a slight skew. Both transformations and differencing are expected to be necessary. No sharp changes in behavior are apparent apart from the sharp downward spike towards the start of the data.
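A sketch of these initial plots; the seed value itself is arbitrary, as the original is not shown:

```r
set.seed(2020)  # arbitrary fixed seed for reproducibility

op <- par(mfrow = c(1, 2))
ts.plot(c, ylab = "IP index (2012 = 100)", main = "US candy production (training set)")
hist(c, xlab = "IP index", main = "Histogram of training values")
par(op)
```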

Transforming the Data

## [1] 0.7474747

The Box-Cox plot above gives a 95% confidence interval for the best lambda value to use for the transformation. The calculated best 𝜆 to use is 0.7474747. The formula used for transforming the data is:

\[y_i^{(\lambda)} = \frac{y_i^\lambda - 1}{\lambda}\]

## [1] 330.5025
## [1] 32.69409

The variance of the transformed time series is much smaller than that of the original (32.69409 vs 330.5025). The histograms show that the transformation has increased the normality of the data. The time series “tc” will be used from here instead of “c”.
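A sketch of the transformation using MASS::boxcox(), regressing on a time index; the lambda grid here is an assumption, chosen so that the reported value 0.7474747 (= 74/99) is attainable:

```r
library(MASS)

t_idx <- as.numeric(time(c))
bc <- boxcox(c ~ t_idx, lambda = seq(0, 1, length = 100))  # draws the 95% CI
lambda <- bc$x[which.max(bc$y)]      # 0.7474747 here

tc <- (1 / lambda) * (c^lambda - 1)  # the Box-Cox formula above
var(c); var(tc)                      # 330.5025 vs 32.69409
```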

The decompositions of both the base and the transformed time series display an approximately increasing linear trend present in the data, confirming initial considerations. The transformation did not change the seasonality and trend present in the data, as expected.

The ACFs and PACFs of the transformed and untransformed data echo the conclusions drawn from the decompositions. It is evident from the ACF plots that seasonality is present.

Removing Seasonality

To remove the seasonality present in the data, a difference at lag 12 is taken (the period for monthly data is 12 months, or one year). Further differencing will be explored to ensure seasonality is removed.

## [1] 32.694087  4.657606 11.404850

Differencing at lag 12 once yields a decreased variance (4.657606 from 32.69409 – the variance of the transformed TS). Differencing at 12 again increases the variance (11.40485), indicating overdifferencing. The time series “tcs” will be used moving forward.
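In R, the seasonal difference and the variance checks reduce to a few lines:

```r
tcs <- diff(tc, lag = 12)   # one seasonal difference at lag 12
var(tc)                     # 32.694087
var(tcs)                    # 4.657606  -- variance drops
var(diff(tcs, lag = 12))    # 11.404850 -- rises again: overdifferencing
```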

Removing Trend

To remove the linear trend from the data, a difference at lag 1 is taken. Further differencing will be investigated to ensure trend is removed. In addition, differencing at lag 2 and beyond will be tested to check for the possibility of a quadratic trend or otherwise as the trend shown from the decomposition was not completely linear.

## [1] 4.657606
##           once    twice
## lag 1 2.181647 5.235023
## lag 2 3.462656 8.595181
## lag 3 4.279379 4.279379
## lag 4 5.238942 4.279379

From the above output, differencing once at lag 1 provides the smallest variance (2.181647 from time series “tcs1” vs 4.657606 – the variance of the time series without trend removed). This confirms that the trend seen in the decomposition earlier is linear. Because taking the difference at lag 1 again increases the variance, taking the difference once is appropriate and does not suffer from overdifferencing.
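The corresponding check for the trend difference:

```r
tcs1 <- diff(tcs, lag = 1)  # one difference at lag 1 removes the linear trend
var(tcs)                    # 4.657606
var(tcs1)                   # 2.181647 -- smallest variance
var(diff(tcs1, lag = 1))    # variance increases: stop at one difference
```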

Relating current findings back to the context of the dataset itself, an increasing linear trend suggests increasing production volume through the years, and by extension – consumption.

Visualizing the Effects of Differencing

The changes to the time series are visualized above. The ∇1∇12 transformed TS does not have the increasing linear trend of the base data. This time series also appears the most stationary visually.

Both sets of ACF/PACF plots display similar information. It is evident from the ACF plots that the seasonality seen in the undifferenced time series does not appear in the ∇1∇12 transformed TS.

The changes in the histogram from both the transformation and the differencing are visualized above. The histogram becomes increasingly Gaussian and regular with each step. The ∇1∇12 transformed time series yields the most normal, symmetric, and centered plot – confirming the conclusions drawn above.

The decomposition for the ∇1∇12 transformed TS shows a nearly “flat” trend, which was not seen in the previous decompositions. This further confirms the removal of trend from the time series, supporting earlier inferences.

Confirming Stationarity of tcs1 = ∇1∇12tr.(U_t)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  tcs1
## Dickey-Fuller = -7.8651, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

To confirm the stationarity of the above time series beyond visual inspection, the augmented Dickey–Fuller test is employed (via adf.test()), with the presence of a unit root as the null hypothesis. Because the p-value is 0.01 and the test statistic is negative, the null hypothesis (non-stationarity) is rejected in favor of the alternative hypothesis (stationarity). This test confirms the earlier visual inspection of the time series and validates the procedures used.
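The test is a single call from the tseries package:

```r
library(tseries)

adf.test(tcs1)  # H0: unit root (non-stationary); p-value of 0.01 rejects H0
```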

Model Identification Through ACF and PACF of tcs1 = ∇1∇12tr.(U_t)

Statistically significant values for both the ACF and PACF are seen where lines extend past the confidence interval, which is denoted by the blue dashed lines.
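The plots come from the base acf() and pacf() functions; the lag.max value here is an assumption:

```r
op <- par(mfrow = c(1, 2))
acf(tcs1, lag.max = 60, main = "ACF of differenced series")
pacf(tcs1, lag.max = 60, main = "PACF of differenced series")
par(op)
```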

Analysis of the ACF plot can confirm proper differencing and indicate the number (q) of moving average (MA) terms to be used in the model, as well as seasonal effects. The lag at which the ACF plot cuts off denotes the number of MA terms in the model. Here, because the autocorrelations are not uniformly small, overdifferencing did not occur. Furthermore, because there is no strong positive autocorrelation, underdifferencing did not occur either.

The spikes at lags 1, 2, 4, and 9 are statistically significant (i.e., not white noise) because they extend past the confidence interval band. The spike at lag 1 is most prominent, while those at lags 2, 4, and 9 take smaller values. The weak spike at lag 9 should not be relevant for the resulting model due to the principle of parsimony (a parsimonious model aims to predict using the smallest number of parameters – in other words, the simplest model that “works”). Thus, q=1, q=2, and q=4 are noted as relevant values for further inquiry.

The ACF plot also holds insights into seasonal moving average effects. The above output indicates that spikes are present at seasonal lags 12 and 24 (lags that are multiples of 12). Thus, Q=1 and Q=2 are recorded as values to be used in modeling.

The PACF plot indicates the number of autoregressive (AR) terms (the p value) to be used in the model at the lag where the values cut off. Values of the PACF fall back inside the confidence interval after lags 1, 2, 4, 9, and 11. Following the principle of parsimony, lags 1, 2, and 4 will be further considered. It is, however, worth noting that only lags 1, 2, 9, and 11 have strong spikes. Thus, p=1, p=2, and p=4 are documented as values to consider for the model.

Like the ACF plot, the PACF plot holds information on seasonal autoregressive effects. The plot shows seasonal lag spikes corresponding to P=1 and P=2 – noted as values to examine for model construction.

Because both the ACF and PACF plots taper toward zero, both AR and MA parameters are expected in the chosen model. Based on the above conclusions, the principal model to investigate is SARIMA(4,1,4)x(1,1,2)[12].

Exploring Potential Models

##      [,1]               [,2]         
## [1,] "1643.37093913101" "3,1,3 2,1,1"
## [2,] "1643.46793596495" "3,1,4 2,1,1"
## [3,] "1672.43395636354" "4,1,1 1,1,1"

Based on the above matrix of sorted AICc values, the best models are:
1. (3,1,3)x(2,1,1) with AICc = 1643.37093913101
2. (3,1,4)x(2,1,1) with AICc = 1643.46793596495
3. (4,1,1)x(1,1,1) with AICc = 1672.43395636354

The first two models do not completely match the predictions drawn from observing the ACF and PACF plots. This may be due to interaction between the AR and MA components. The seasonal portions of the models do match initial expectations. The first two models will be investigated further because they have very similar AICc values. If no major performance differences are found between the two, the first will be chosen because it is slightly simpler (principle of parsimony).
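A sketch of the kind of grid search that produces the matrix above, fitting by maximum likelihood on the transformed series tc and applying the small-sample AICc correction by hand; the exact search ranges are assumptions:

```r
n <- length(tc)
results <- NULL
for (p in 0:4) for (q in 0:4) for (P in 0:2) for (Q in 0:2) {
  fit <- tryCatch(arima(tc, order = c(p, 1, q),
                        seasonal = list(order = c(P, 1, Q), period = 12),
                        method = "ML"),
                  error = function(e) NULL)  # skip models that fail to converge
  if (is.null(fit)) next
  k <- length(fit$coef) + 1                         # parameters incl. sigma^2
  aicc <- AIC(fit) + 2 * k * (k + 1) / (n - k - 1)  # AICc correction
  results <- rbind(results, c(aicc, sprintf("%d,1,%d %d,1,%d", p, q, P, Q)))
}
head(results[order(as.numeric(results[, 1])), ], 3)  # three smallest AICc values
```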

##          ar1          ar2          ar3          ma1          ma2          ma3 
## -0.497412810 -0.242485932  0.656160108  0.292594742 -0.006089481 -0.867089677 
##         sar1         sar2         sma1 
##  0.025823790 -0.035384319 -0.720873006
## [1]  0.8036395+0.7651591i -1.2377261-0.0000000i  0.8036395-0.7651591i
## [1]  1.1530578-0.0000000i -0.5800404+0.8147077i -0.5800404-0.8147077i
## [1]  5.693531-0i -4.963722+0i
## [1] 1.387207+0i

Recall that if the roots of the AR polynomial lie outside the unit circle, then the process has an MA(∞) representation, implying stationarity and causality.

The output values show that all roots of the polynomials lie outside the unit circle, suggesting invertibility as well as stationarity.
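A sketch of the root check for the first model. Under R’s arima() convention the AR polynomial is phi(B) = 1 - ar1*B - ..., so the AR coefficients are sign-flipped before polyroot(), while the MA coefficients (1 + ma1*B + ...) enter as-is:

```r
fit1 <- arima(tc, order = c(3, 1, 3),
              seasonal = list(order = c(2, 1, 1), period = 12), method = "ML")
fit1$coef

# All moduli must exceed 1 (roots outside the unit circle)
Mod(polyroot(c(1, -fit1$coef[c("ar1", "ar2", "ar3")])))  # AR: stationary/causal
Mod(polyroot(c(1,  fit1$coef[c("ma1", "ma2", "ma3")])))  # MA: invertible
Mod(polyroot(c(1, -fit1$coef[c("sar1", "sar2")])))       # seasonal AR
Mod(polyroot(c(1,  fit1$coef["sma1"])))                  # seasonal MA
```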

##         ar1         ar2         ar3         ma1         ma2         ma3 
## -0.29335181 -0.00694476  0.85953354  0.04327768 -0.15752606 -0.94670473 
##         ma4        sar1        sar2        sma1 
##  0.13915581  0.03787631 -0.02236972 -0.71188464
## [1]  0.5823589+0.8165312i -1.1566381-0.0000000i  0.5823589-0.8165312i
## [1]  1.0298466+0.0000000i -0.5825122+0.8163204i -0.5825122-0.8163204i
## [4]  6.9383773+0.0000000i
## [1]  7.586035-0i -5.892839+0i
## [1] 1.404722+0i

Like the first model, the second model has all of its polynomial roots outside the unit circle, suggesting invertibility of its MA portion and stationarity/causality of its AR portion.

Because both models hold similar properties, the model SARIMA(3,1,3)x(2,1,1) will be used over SARIMA(3,1,4)x(2,1,1) because it is simpler.

Thus, the final model to be used for forecasting, written in algebraic form (with the MA polynomials in R’s sign convention, θ(B) = 1 + θ1B + ..., consistent with the root calculations above), is: \[(1+0.497412810B+0.242485932B^2-0.656160108B^3)(1-0.025823790B^{12}+0.035384319B^{24})\nabla_1\nabla_{12}\,\mathrm{BoxCox}(X_t)\\ =\\ (1+0.292594742B-0.006089481B^2-0.867089677B^3)(1-0.720873006B^{12})Z_t\]

Model Diagnostics

The residuals of the model seem to be both uncorrelated and random. There are no significant lags in the ACF and PACF plots. The histogram takes a Gaussian shape, and most of the points on the Q-Q plot fall on the line, with the exception of tail values. Furthermore, all the residuals lie within the blue band on the ACF and PACF plots, suggesting white noise residuals.

## 
##  Box-Pierce test
## 
## data:  fr
## X-squared = 4.7234, df = 6, p-value = 0.5797
## 
##  Box-Ljung test
## 
## data:  fr
## X-squared = 4.8045, df = 6, p-value = 0.5691
## 
##  Box-Ljung test
## 
## data:  fr^2
## X-squared = 65.252, df = 12, p-value = 2.449e-09
## 
##  Shapiro-Wilk normality test
## 
## data:  fr
## W = 0.98542, p-value = 3.362e-05
## 
## Call:
## ar(x = fr, aic = T, order.max = NULL, method = c("yule-walker"))
## 
## 
## Order selected 0  sigma^2 estimated as  1.237

The model passes the Box-Pierce (whether residuals are white noise) and Ljung-Box (whether groups of autocorrelations are nonzero) tests at lag 6, as the p-values for both tests are large (above 0.05). However, the model fails the Shapiro-Wilk normality test and the McLeod-Li test (for the presence of heteroskedasticity), as the p-values from those tests are very small (below 0.05). Thus, there does appear to be autoregressive conditional heteroskedasticity. In addition, fitting an AR model to the residuals by Yule-Walker selects order 0, further supporting white-noise residuals.
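A sketch of these diagnostics, continuing from the fit above; the lag choices match the degrees of freedom in the output:

```r
fr <- residuals(fit1)

Box.test(fr, lag = 6, type = "Box-Pierce")    # white-noise residuals?
Box.test(fr, lag = 6, type = "Ljung-Box")     # grouped autocorrelations zero?
Box.test(fr^2, lag = 12, type = "Ljung-Box")  # McLeod-Li: ARCH effects?
shapiro.test(fr)                              # normality of residuals
ar(fr, aic = TRUE, method = "yule-walker")    # order 0 selected -> white noise
```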

Spectral Analysis

As seen in the periodogram above, there is a large spike around 0.08 – indicative of the period of the series (1/12 = 0.0833333). Because there are no other spikes, there do not appear to be any other periods.

## [1] 0.5226995

From the p-value above, the model passes Fisher’s exact g test for periodicity.

The Kolmogorov-Smirnov (KS) test for periodicity echoes the conclusion drawn from the Fisher’s test above. The residuals seem to be Gaussian white noise.
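A sketch of these spectral checks; fisher.g.test() is from the GeneCycle package, and cpgram() (base R) draws the cumulative periodogram with Kolmogorov-Smirnov bands:

```r
library(GeneCycle)

spec.pgram(as.numeric(tc), log = "no", main = "Periodogram")  # spike near 1/12
fisher.g.test(as.numeric(fr))  # large p-value: no hidden periodicity
cpgram(fr, main = "Cumulative periodogram of residuals")      # KS bands
```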

Because the model passes all spectral analysis tests and a number of model diagnostic tests, the model is deemed satisfactory for forecasting.

Forecasting

The twelve values subsetted before fitting the model are used here to test the performance of the model and the validity of the predictions.

##       Lower     Pred    Upper   Actual
## 1  41.60864 43.86207 46.11550 43.35904
## 2  45.36166 48.23789 51.11412 46.30079
## 3  45.08617 48.30957 51.53298 45.63908
## 4  44.37994 47.84262 51.30530 45.52194
## 5  40.46373 44.10746 47.75119 43.40411
## 6  39.45177 43.23758 47.02339 44.62046
## 7  38.70257 42.61501 46.52745 42.10165
## 8  37.34538 41.37091 45.39645 42.78006
## 9  35.81766 39.94409 44.07053 41.07814
## 10 36.19575 40.42144 44.64712 41.78580
## 11 36.67147 40.99037 45.30926 41.28489
## 12 38.17520 42.58162 46.98805 44.80054

The forecasted values are reasonable, as all of the true (test) values fall within the prediction intervals. The point estimates are also close to the true values.
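A sketch of how the table above can be produced, back-transforming forecasts from the Box-Cox scale; the 1.96 multiplier assumes approximately normal prediction errors:

```r
pred <- predict(fit1, n.ahead = 12)

inv_bc <- function(y) (lambda * y + 1)^(1 / lambda)  # invert the Box-Cox transform

data.frame(Lower  = inv_bc(pred$pred - 1.96 * pred$se),
           Pred   = inv_bc(pred$pred),
           Upper  = inv_bc(pred$pred + 1.96 * pred$se),
           Actual = as.numeric(c.test))
```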

Conclusion

The procedures in this project confirm the presence of trend and seasonality in the process of fitting a model for prediction. The data was successfully transformed to be more Gaussian for model fitting, and the trend and seasonality (determined to be yearly) were removed; the Box-Cox transformation and the variances taken at each step supported these results. The differenced series passed the augmented Dickey-Fuller test and was deemed suitable for modelling. After exploring potential models, the best candidates were chosen based on the AICc criterion, and the final model (SARIMA(3,1,3)x(2,1,1)[12]) was shown to be viable in its diagnostics, passing spectral analysis testing and most model diagnostic tests. The roots of its AR and MA polynomials, all lying outside the unit circle, further support this conclusion. The forecasts are also considered successful, as all test values fall within the prediction intervals. Room for doubt remains, given that the model does not fully match the ACF/PACF predictions and fails both the McLeod-Li and Shapiro-Wilk tests. It is entirely possible that there are features of the data that the model did not capture; heavy-tailed models may be a better fit.