In this case study, we will learn about time series analysis for a production operation. Time series analysis and modelling have many business and social applications. It is extensively used to forecast company sales, product demand, stock market trends, agricultural production etc. The fundamental idea for time series analysis is to decompose the original time series (sales, stock market trends, etc.) into several independent components. Typically, business time series are divided into the following four components:
- Trend – overall direction of the series i.e. upwards, downwards etc.
- Seasonality – monthly or quarterly patterns
- Cycle – long-term business cycles, they usually come after 5 or 7 years
- Irregular remainder – random noise left after extraction of all the components
Interference of these components produces the final series.
Free Step-by-step Guide To Become A Data Scientist
Subscribe and get this detailed guide absolutely FREE
ARIMA is a combination of 3 parts i.e.
Integrated (I) – Subtract time series from its lagged series to extract trends from the data. In this pass of ARIMA juicer, we extract trend(s) from the original time series data. Differencing is one of the most commonly used mechanisms for extraction of trends. Here, the original series is subtracted from its lagged series e.g. November’s sales values are subtracted with October’s values to produce trend-less residual series. The formulae for different orders of differencing are as follow:
No Differencing (d=0):
1st Differencing (d=1):
2nd Differencing (d=1)
The residual data of most time series usually become trend-less after the first order differencing which is represented as ARIMA (0,1,0). Notice, AR (p), and MA (q) values in this notation are 0 and the integrated (I) value has order one. If the residual series still has a trend it is further differenced and is called 2nd order differencing. This trend-less series is called stationary on mean series i.e. mean or average value for series does not change over time.
AutoRegressive (AR) – extract the influence of the previous periods’ values on the current period. After the time series data is made stationary through the integrated (I) pass, the AR part of the ARIMA juicer gets activated. As the name auto-regression suggests, here we try to extract the influence of the values of previous periods on the current period (e.g. the influence of the September and October’s sales value on the November’s sales). This is done through developing a regression model with the time lagged period values as independent or predictor variables. The general form of the equation for this regression model is shown below.
AR model of order 1 i.e. p=1 or ARIMA (1,0,0) is represented by the following regression equation:
Let’s understand AR models using the case below:
The current GDP of a country say y(t) is dependent on the last year’s GDP i.e. y(t – 1). The hypothesis being that the total cost of production of products & services in a country in a fiscal year (known as GDP) is dependent on the set up of manufacturing plants / services in the previous year and the newly set up industries / plants / services in the current year. But the primary component of the GDP is the former one. Hence, we can formally write the equation of GDP as:
This equation is known as AR(1) formulation. The numeral one (1) denotes that the next instance is solely dependent on the previous instance. The alpha is a coefficient which we seek to minimize the error function. Notice that Y(t- 1) is indeed linked to Y(t-2) in the same fashion. Hence, any shock to Y(t) will gradually fade off in future.
Moving Average (MA) – extract the influence of the previous period’s error terms on the current period’s error. Finally, the last component of ARIMA juicer i.e. MA involves finding relationships between the previous periods’ error terms on the current period’s error term. Keep in mind, this moving average (MA) has nothing to do with moving average we learned about in the previous article on time series decomposition. Moving Average (MA) part of ARIMA is developed with the following simple multiple linear regression values with the lagged error values as independent or predictor variables.
MA model of order 1 i.e. q=1 or ARIMA(0,0,1) is represented by the following regression equation
Forecasting with ARIMA
Forecasting is the difficult process of predicting the future, based on past and present data. One of the most common methods for this is the ARIMA model, which stands for AutoRegressive Integrated Moving Average. In an ARIMA model there are 3 parameters that are used to help model the major aspects of a times series: seasonality, trend, and noise. These parameters are labelled p, d, and q.
p is the parameter associated with the auto-regressive aspect of the model, which incorporates past values. For example, forecasting that if it rained a lot over the past few days, you state it’s likely that it will rain tomorrow as well.
d is the parameter associated with the integrated part of the model, which effects the amount of differencing to apply to a time series. You can imagine an example of this as forecasting that the amount of rain tomorrow will be similar to the amount of rain today, if the daily amounts of rain have been similar over the past few days.
q is the parameter associated with the moving average part of the model.
The methods we will employ in this notebook example will only take in data from a uni-variate time series. That means we really are only considering the relationship between the y-axis value the x-axis time points. We’re not considering outside factors that may be affecting the time series.
One of the most important features of a time series is variation. Variations are patterns in the times series data. A time series that has patterns that repeat over known and fixed periods of time is said to have seasonality. Seasonality is a general term for variations that periodically repeat in data. In general, we think of variations as 4 categories.: Seasonal, Cyclic, Trend, and Irregular fluctuations. Seasonal variation is usually defined as variation that is annual in period, such as swimsuit sales being lower in winter and higher in summer. Cyclic Variation is a variation that occurs at other fixed periods, such as the daily variation in temperature. Both Seasonal and Cyclic variation would be examples of seasonality in a time series data set. Trends are long-term changes in the mean level, relative to the number of observations.
#Import tools and libraries import numpy as np import pandas as pd import statsmodels.api as sm import matplotlib.pyplot as plt %matplotlib inline # Read the data df=pd.read_csv('monthly-milk-production-pounds-p.csv') # Check the data structure df.head()
df.columns=['Month','Milk in pounds per cow'] df.head()
# Check missing data
Length: 169, dtype: bool
# Drop the row having missing data (row 168) df.drop(168,axis=0,inplace=True) # Change time to year-month-date format df['Month'] = pd.to_datetime(df['Month']) df.tail()
Output:DatetimeIndex([‘1962-01-01’, ‘1962-02-01’, ‘1962-03-01’, ‘1962-04-01’, ‘1962-05-01’, ‘1962-06-01’, ‘1962-07-01’, ‘1962-08-01’, ‘1962-09-01’, ‘1962-10-01’, … ‘1975-03-01’, ‘1975-04-01’, ‘1975-05-01’, ‘1975-06-01’, ‘1975-07-01’, ‘1975-08-01’, ‘1975-09-01’, ‘1975-10-01’, ‘1975-11-01’, ‘1975-12-01′], dtype=’datetime64[ns]’, name=’Month’, length=168, freq=None)
# Line plot of Month df.plot()
# check columns df.columns Output: Index(['Milk in pounds per cow'], dtype='object') timeseries = df['Milk in pounds per cow'] # Decomposition of multiplicative time series decomposition = sm.tsa.seasonal_decompose(timeseries, model='multiplicative') # Visualisation fig = decomposition.plot() fig.set_figwidth(12) fig.set_figheight(8) fig.suptitle('Decomposition of multiplicative time series') plt.show()
The following are some of our key observations from this analysis:
1) Trend: Looks quite similar to a straight line hence we could have easily used linear regression to estimate the trend in this data.
2) Seasonality: as discussed, seasonal plot displays a fairly consistent month-on-month pattern. The monthly seasonal components are average values for a month after removal of trend. Trend is removed from the time series using the following formula:
3) Irregular Remainder (random): is the residual left in the series after removal of trend and seasonal components. Remainder is calculated using the following formula:
The expectations from remainder component are that it should look like white noise i.e. displays no pattern at all.
#Time series plot timeseries.rolling(12).mean().plot(label='12 Month Rolling Mean') timeseries.plot() plt.legend()
Hypothesis Testing for Stationary:
We can use the Augmented Dickey-Fuller unit root test. In statistics and econometrics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis that a unit root is present in a time series sample i.e. time series sample is non-stationary. The alternative hypothesis is different depending on which version of the test is used, but is usually stationary or trend-stationary. Basically, we are trying to figure out whether to: accept the Null Hypothesis H0 (that the time series has a unit root, indicating it is non-stationary) or, reject H0 and go with the Alternative Hypothesis (that the time series has no unit root and is stationary). We end up deciding this based on the p-value return. A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis. Let’s run the Augmented Dickey-Fuller test on our data:
# Import stats tools and libraries from statsmodels.tsa.stattools import adfuller #create a function to perform the hypothesis def adf_check(time_series): result=adfuller(time_series) print('Augmented Dicley Fuller test:') labels=['ADF Test Statistics','p-value','#Lags used','Number of Observation used'] for value,label in zip(result,labels): print(label +':'+str(value)) if result<=0.05: print("Strong evidence against the null hypothesis, reject the null hypothesis. Data is stationary") else: print("Weak evidence against the null hypothesis, accept the null hypothesis. Data is not stationary") adf_check(df['Milk in pounds per cow'])
Augmented Dicley Fuller test:ADF Test Statistics:-1.3038115874221299p-value:0.6274267086030314#Lags used:13Number of Observation used:154Weak evidence against the null hypothesis, accept the null hypothesis. Data is not stationary
df['Mil First Difference'] = df['Milk in pounds per cow']- df['Milk in pounds per cow'].shift(1) #Dropping NA values and performing the hypothesis-1 adf_check(df['Mil First Difference'].dropna()) Output:Augmented Dicley Fuller test:ADF Test Statistics:-3.0549955586530606p-value:0.03006800400178645#Lags used:14Number of Observation used:152Strong evidence against the null hypothesis, reject the null hypothesis. Data is stationary #Line plot of the seasonal Mil First Difference df['Mil First Difference'].plot()
df['Seasonal Difference'] = df['Milk in pounds per cow'] - df['Milk in pounds per cow'].shift(12) #Line plot of seasonal difference df['Seasonal Difference'].plot()
#Dropping NA values and performing the hypothesis-2 adf_check(df['Seasonal Difference'].dropna())
Augmented Dicley Fuller test:ADF Test Statistics:-2.3354193143593984p-value:0.16079880527711332#Lags used:12Number of Observation used:143Weak evidence against the null hypothesis, accept the null hypothesis. Data is not stationary
df['Seasonal First Difference']=df['Mil First Difference']-df['Mil First Difference'].shift(12) #Dropping NA values and performing the hypothesis-3 adf_check(df['Seasonal First Difference'].dropna())
Augmented Dicley Fuller test:ADF Test Statistics:-5.0380022749219755p-value:1.865423431878904e-05#Lags used:11Number of Observation used:143Strong evidence against the null hypothesis, reject the null hypothesis. Data is stationary
#Line plot of the seasonal first difference df['Seasonal First Difference'].plot()
# Import ARIMA model from statsmodels.tsa.arima_model import ARIMA # Building and fitting the model model=sm.tsa.statespace.SARIMAX(df['Milk in pounds per cow'],order=(1, 0, 0),seasonal_order=(1,1,1,12)) results=model.fit() # Prediction df['forecast']=results.predict(start=150,end=168,dynamic=True) df[['Milk in pounds per cow','forecast']].plot(figsize=(12,8))
difference = (df[150:168][‘Milk in pounds per cow’] – model.score(df[150:168][‘Milk in pounds per cow’]))
ddd = difference**2