Data Science and Artificial Intelligence

Time Series Case Study

A case study of implementing ARIMA

In this case study, we will learn about time series analysis for a production operation. Time series analysis and modelling have many business and social applications. It is extensively used to forecast company sales, product demand, stock market trends, agricultural production etc. The fundamental idea for time series analysis is to decompose the original time series (sales, stock market trends, etc.) into several independent components. Typically, business time series are divided into the following four components:

  • Trend – overall direction of the series i.e. upwards, downwards etc.
  • Seasonality – monthly or quarterly patterns
  • Cycle – long-term business cycles, they usually come after 5 or 7 years
  • Irregular remainder – random noise left after extraction of all the components

Interference of these components produces the final series.

Free Step-by-step Guide To Become A Data Scientist

Subscribe and get this detailed guide absolutely FREE

ARIMA Modeling

ARIMA is a combination of 3 parts i.e.

Integrated (I) – Subtract time series from its lagged series to extract trends from the data. In this pass of ARIMA juicer, we extract trend(s) from the original time series data. Differencing is one of the most commonly used mechanisms for extraction of trends. Here, the original series is subtracted from its lagged series e.g. November’s sales values are subtracted with October’s values to produce trend-less residual series. The formulae for different orders of differencing are as follow:

No Differencing (d=0):

1st Differencing (d=1):

2nd Differencing (d=1)

The residual data of most time series usually become trend-less after the first order differencing which is represented as ARIMA (0,1,0). Notice, AR (p), and MA (q) values in this notation are 0 and the integrated (I) value has order one. If the residual series still has a trend it is further differenced and is called 2nd order differencing. This trend-less series is called stationary on mean series i.e. mean or average value for series does not change over time.

AutoRegressive (AR) – extract the influence of the previous periods’ values on the current period. After the time series data is made stationary through the integrated (I) pass, the AR part of the ARIMA juicer gets activated. As the name auto-regression suggests, here we try to extract the influence of the values of previous periods on the current period (e.g. the influence of the September and October’s sales value on the November’s sales). This is done through developing a regression model with the time lagged period values as independent or predictor variables. The general form of the equation for this regression model is shown below.

AR model of order 1 i.e. p=1 or ARIMA (1,0,0) is represented by the following regression equation:

Let’s understand AR models using the case below:

The current GDP of a country say y(t) is dependent on the last year’s GDP i.e. y(t – 1). The hypothesis being that the total cost of production of products & services in a country in a fiscal year (known as GDP) is dependent on the set up of manufacturing plants / services in the previous year and the newly set up industries / plants / services in the current year. But the primary component of the GDP is the former one. Hence, we can formally write the equation of GDP as:

This equation is known as AR(1) formulation. The numeral one (1) denotes that the next instance is solely dependent on the previous instance. The alpha is a coefficient which we seek to minimize the error function. Notice that Y(t- 1) is indeed linked to Y(t-2) in the same fashion. Hence, any shock to Y(t) will gradually fade off in future.

Moving Average (MA) – extract the influence of the previous period’s error terms on the current period’s error. Finally, the last component of ARIMA juicer i.e. MA involves finding relationships between the previous periods’ error terms on the current period’s error term. Keep in mind, this moving average (MA) has nothing to do with moving average we learned about in the previous article on time series decomposition. Moving Average (MA) part of ARIMA is developed with the following simple multiple linear regression values with the lagged error values as independent or predictor variables.

MA model of order 1 i.e. q=1 or ARIMA(0,0,1) is represented by the following regression equation

Forecasting with ARIMA

Forecasting is the difficult process of predicting the future, based on past and present data. One of the most common methods for this is the ARIMA model, which stands for AutoRegressive Integrated Moving Average. In an ARIMA model there are 3 parameters that are used to help model the major aspects of a times series: seasonality, trend, and noise. These parameters are labelled pd, and q.

p is the parameter associated with the auto-regressive aspect of the model, which incorporates past values. For example, forecasting that if it rained a lot over the past few days, you state it’s likely that it will rain tomorrow as well.

d is the parameter associated with the integrated part of the model, which effects the amount of differencing to apply to a time series. You can imagine an example of this as forecasting that the amount of rain tomorrow will be similar to the amount of rain today, if the daily amounts of rain have been similar over the past few days.

q is the parameter associated with the moving average part of the model.

The methods we will employ in this notebook example will only take in data from a uni-variate time series. That means we really are only considering the relationship between the y-axis value the x-axis time points. We’re not considering outside factors that may be affecting the time series.

Variation

One of the most important features of a time series is variation. Variations are patterns in the times series data. A time series that has patterns that repeat over known and fixed periods of time is said to have seasonality. Seasonality is a general term for variations that periodically repeat in data. In general, we think of variations as 4 categories.: Seasonal, Cyclic, Trend, and Irregular fluctuations. Seasonal variation is usually defined as variation that is annual in period, such as swimsuit sales being lower in winter and higher in summer. Cyclic Variation is a variation that occurs at other fixed periods, such as the daily variation in temperature. Both Seasonal and Cyclic variation would be examples of seasonality in a time series data set. Trends are long-term changes in the mean level, relative to the number of observations.

#Import tools and libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
# Read the data
df=pd.read_csv('monthly-milk-production-pounds-p.csv')
# Check the data structure
df.head()

Output:

 

 

 

 

 

 

df.tail()

Output:

 

 

 

 

 

 

 

 

df.columns=['Month','Milk in pounds per cow']
df.head()

Output:

 

 

 

 

 

 

 

 

# Check missing data

df.isna().any(1)

Output:

 

 

 

 

 

 

Output:

0      False

1      False

2      False

3      False

4      False

5      False

6      False

7      False

8      False

9      False

10     False

11     False

12     False

13     False

14     False

15     False

16     False

17     False

18     False

19     False

20     False

21     False

22     False

23     False

24     False

25     False

26     False

27     False

28     False

29     False

139    False

140    False

141    False

142    False

143    False

144    False

145    False

146    False

147    False

148    False

149    False

150    False

151    False

152    False

153    False

154    False

155    False

156    False

157    False

158    False

159    False

160    False

161    False

162    False

163    False

164    False

165    False

166    False

167    False

168     True

Length: 169, dtype: bool

# Drop the row having missing data (row 168)
df.drop(168,axis=0,inplace=True)
# Change time to year-month-date format
df['Month'] =  pd.to_datetime(df['Month'])
df.tail()

Output:

 

 

 

 

 

 

df.set_index('Month',inplace=True)
df.index

Output:DatetimeIndex([‘1962-01-01’, ‘1962-02-01’, ‘1962-03-01’, ‘1962-04-01’,               ‘1962-05-01’, ‘1962-06-01’, ‘1962-07-01’, ‘1962-08-01’,               ‘1962-09-01’, ‘1962-10-01’,               …               ‘1975-03-01’, ‘1975-04-01’, ‘1975-05-01’, ‘1975-06-01’,               ‘1975-07-01’, ‘1975-08-01’, ‘1975-09-01’, ‘1975-10-01’,               ‘1975-11-01’, ‘1975-12-01′],              dtype=’datetime64[ns]’, name=’Month’, length=168, freq=None)

# Line plot of Month 
df.plot()

 

 

 

 

 

 

df.head()

Output:

 

 

 

 

 

 

# check columns 
df.columns
Output: Index(['Milk in pounds per cow'], dtype='object')
timeseries = df['Milk in pounds per cow']
# Decomposition of multiplicative time series
decomposition = sm.tsa.seasonal_decompose(timeseries, model='multiplicative')
# Visualisation
fig = decomposition.plot()
fig.set_figwidth(12)
fig.set_figheight(8)
fig.suptitle('Decomposition of multiplicative time series')
plt.show()

 

 

 

 

 

 

 

 

 

 

 

 

The following are some of our key observations from this analysis:

1) Trend: Looks quite similar to a straight line hence we could have easily used linear regression to estimate the trend in this data.

2) Seasonality: as discussed, seasonal plot displays a fairly consistent month-on-month pattern. The monthly seasonal components are average values for a month after removal of trend. Trend is removed from the time series using the following formula:

3) Irregular Remainder (random): is the residual left in the series after removal of trend and seasonal components. Remainder is calculated using the following formula:

The expectations from remainder component are that it should look like white noise i.e. displays no pattern at all.

 

#Time series plot
timeseries.rolling(12).mean().plot(label='12 Month Rolling Mean')
timeseries.plot()
plt.legend()

 

 

 

 

 

 

Hypothesis Testing for Stationary:

We can use the Augmented Dickey-Fuller unit root test. In statistics and econometrics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis that a unit root is present in a time series sample i.e. time series sample is non-stationary. The alternative hypothesis is different depending on which version of the test is used, but is usually stationary or trend-stationary. Basically, we are trying to figure out whether to: accept the Null Hypothesis H0 (that the time series has a unit root, indicating it is non-stationary) or, reject H0 and go with the Alternative Hypothesis (that the time series has no unit root and is stationary). We end up deciding this based on the p-value return. A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis. Let’s run the Augmented Dickey-Fuller test on our data:

# Import stats tools and libraries
from statsmodels.tsa.stattools import adfuller
#create a function to perform the hypothesis
def adf_check(time_series):
result=adfuller(time_series)
print('Augmented Dicley Fuller test:')
labels=['ADF Test Statistics','p-value','#Lags used','Number of Observation used']
for value,label in zip(result,labels):
print(label +':'+str(value))
if result[1]<=0.05:
print("Strong evidence against the null hypothesis, reject the null hypothesis. Data is stationary")
else:
print("Weak evidence against the null hypothesis, accept the null hypothesis. Data is not stationary")
adf_check(df['Milk in pounds per cow'])

Output:

Augmented Dicley Fuller test:ADF Test Statistics:-1.3038115874221299p-value:0.6274267086030314#Lags used:13Number of Observation used:154Weak evidence against the null hypothesis, accept the null hypothesis. Data is not stationary

df['Mil First Difference'] = df['Milk in pounds per cow']- df['Milk in pounds per cow'].shift(1)
#Dropping NA values and performing the hypothesis-1
adf_check(df['Mil First Difference'].dropna())
Output:Augmented Dicley Fuller test:ADF Test Statistics:-3.0549955586530606p-value:0.03006800400178645#Lags used:14Number of Observation used:152Strong evidence against the null hypothesis, reject the null hypothesis. Data is stationary
 
#Line plot of the seasonal Mil First Difference
df['Mil First Difference'].plot()

 

 

 

 

 

 

 

df['Seasonal Difference'] = df['Milk in pounds per cow'] - df['Milk in pounds per cow'].shift(12)
#Line plot of seasonal difference
df['Seasonal Difference'].plot()

 

 

 

 

 

 

 

#Dropping NA values and performing the hypothesis-2
adf_check(df['Seasonal Difference'].dropna())

Output:

Augmented Dicley Fuller test:ADF Test Statistics:-2.3354193143593984p-value:0.16079880527711332#Lags used:12Number of Observation used:143Weak evidence against the null hypothesis, accept the null hypothesis. Data is not stationary

df['Seasonal First Difference']=df['Mil First Difference']-df['Mil First Difference'].shift(12)
#Dropping NA values and performing the hypothesis-3
adf_check(df['Seasonal First Difference'].dropna())

Output:

Augmented Dicley Fuller test:ADF Test Statistics:-5.0380022749219755p-value:1.865423431878904e-05#Lags used:11Number of Observation used:143Strong evidence against the null hypothesis, reject the null hypothesis. Data is stationary

#Line plot of the seasonal first difference
df['Seasonal First Difference'].plot()

 

 

 

 

 

 

 

ARIMA Modelling:

# Import ARIMA model
from statsmodels.tsa.arima_model import ARIMA
# Building and fitting the model
model=sm.tsa.statespace.SARIMAX(df['Milk in pounds per cow'],order=(1, 0,           0),seasonal_order=(1,1,1,12))
results=model.fit()
# Prediction
df['forecast']=results.predict(start=150,end=168,dynamic=True)
df[['Milk in pounds per cow','forecast']].plot(figsize=(12,8))

 

 

 

 

 

 

 

 

 

 

 

df.tail()

Output:

 

 

 

difference = (df[150:168][‘Milk in pounds per cow’] – model.score(df[150:168][‘Milk in pounds per cow’]))

ddd = difference**2

np.sqrt(ddd).sum()

Output: 1019063702.3493897

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

prateek

An alumnus of the NIE-Institute Of Technology, Mysore, Prateek is an ardent Data Science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a Subject-matter expert in the field of Big Data, Hadoop ecosystem, and Spark.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close