Multivariate Time Series using Auto ARIMA
A time series is a sequence of data points recorded over time at equal intervals, such as every minute, hour, day, week, month, or year. Examples of time series data include annual budgets, company sales, weather records, air traffic, COVID-19 caseloads, foreign exchange rates, and stock prices.
A time series model analyzes time series values and identifies hidden patterns. Eventually, the model predicts future time series values based on previously observed/historical values.
In this tutorial, we will build a multivariate time series model. The model will learn using multiple variables. We will create the model using Auto ARIMA.
Table of contents
 Prerequisites
 Getting started with Auto ARIMA
 Understanding the ARIMA model
 How to remove nonstationarity components in a time series
 Explaining ARIMA initials
 Why do we use Auto ARIMA?
 Energy consumption dataset
 Plotting the 'demand' column
 Plotting subplots
 Checking for missing or null values
 Imputing missing values
 Dataset resampling
 Implementing the Auto ARIMA model
 Initialize the auto arima function
 Splitting the time series dataset
 Fitting the Auto ARIMA model
 Using the Auto ARIMA model to make predictions
 Predicting the test data frame
 Predict the unseen future time series values
 Plotting the future predicted values
 Conclusion
 Further reading
Prerequisites
To follow the time series concepts explained in this tutorial, you should be familiar with:
 Introduction to time series
 Time Series decomposition
 Building a simple time series application
 How to run the Python code in Google Colab
Getting started with Auto ARIMA
Auto ARIMA is a time series library that automates the process of building a model using ARIMA. Auto ARIMA applies the concepts of ARIMA in modeling and forecasting.
Auto ARIMA automatically finds the best parameters of an ARIMA model. To follow along with this tutorial, you have to understand the concepts of the ARIMA model.
Understanding the ARIMA model
AutoRegressive Integrated Moving Average (ARIMA) is a time series model that identifies hidden patterns in time series values and makes predictions. For example, an ARIMA model can predict future stock prices after analyzing previous stock prices.
Also, an ARIMA model assumes that the time series data is stationary. Before implementing the ARIMA model, we will remove the nonstationarity components in the time series.
How to remove nonstationarity components in a time series
A nonstationary time series is a series whose statistical properties change over time. Trend and seasonality components are common sources of nonstationarity. Removing them makes the series stationary so that we can apply the ARIMA model.
In a stationary series, the mean and variance remain constant over time. Removing the trend and seasonal components is what stabilizes these properties. We remove nonstationarity in a time series through differencing.
The differencing technique subtracts the previous time series value from the present value. We may have to repeat the process of differencing multiple times until we output a stationary time series.
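As a small illustration of differencing, the sketch below uses a made-up series with a steady upward trend (the values are hypothetical and not taken from the energy dataset). The Pandas diff() method subtracts the previous value from each value.

```python
import pandas as pd

# Hypothetical series with an upward trend (illustrative values only)
series = pd.Series([10, 12, 15, 17, 20, 22, 25])

# First-order differencing: y[t] - y[t-1]; the first result is NaN, so we drop it
first_diff = series.diff().dropna()

# Differencing a second time, for cases where one pass is not enough
second_diff = first_diff.diff().dropna()

print(first_diff.tolist())   # [2.0, 3.0, 2.0, 3.0, 2.0, 3.0]
print(second_diff.tolist())  # [1.0, -1.0, 1.0, -1.0, 1.0]
```

After one pass, the values no longer climb; they hover around a constant level, which is the behavior we expect from a stationary series.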
An ARIMA model has three initials: AR, I, and MA. These initials represent the three submodels that form a single uniform model.
Explaining ARIMA initials
AR: Auto Regression. I: Integrated. MA: Moving Average.
They have the following functionalities:

Auto Regression submodel: This submodel uses past values to make future predictions.

Integrated submodel: This submodel performs differencing to remove any nonstationarity in the time series.

Moving Average submodel: This submodel uses past errors to make a prediction.
Each submodel is controlled by one parameter of the overall ARIMA model. The parameters use the following notation:

p: It is the order of the Auto Regression (AR) submodel. It refers to the number of past values that the model uses to make predictions.

d: It is the number of times the series is differenced to remove nonstationary components.

q: It is the order of the Moving Average (MA) submodel. It refers to the number of past forecast errors that the model uses to make predictions.
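For reference, the three parameters fit together in the standard ARIMA(p, d, q) equation. The symbols are conventional textbook notation rather than names from this tutorial: y'_t is the series after differencing d times, c is a constant, the phi terms are the AR coefficients, the theta terms are the MA coefficients, and epsilon_t is the forecast error at time t.

```latex
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p}
         + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}
         + \varepsilon_t
```

With p=1, d=0, and q=1, for example, the forecast depends on one past value and one past error of the undifferenced series.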
Why do we use Auto ARIMA?
Before we build an ARIMA model, we have to choose the p, d, and q values. Traditionally, we find suitable values using statistical tests and plots, such as the Autocorrelation Function (ACF) plot and the Partial Autocorrelation Function (PACF) plot.
The process of using statistical plots is usually tedious and time-consuming. Many people have difficulties interpreting these plots to find good parameter values. A wrong interpretation leads to poorly chosen p, d, and q values, which affects the ARIMA model's overall performance.
Auto ARIMA searches for the p, d, and q values automatically. It fits several candidate models and keeps the one with the best score on an information criterion such as the AIC. This usually gives a well-chosen model without manual plot reading.
Auto ARIMA simplifies the process of building a time series model using the ARIMA model. Now we know how an ARIMA model works and how Auto ARIMA applies its concepts. We will start exploring the time series dataset.
Energy consumption dataset
We will use the energy consumption dataset to build the Auto ARIMA model. The dataset shows the energy demand from 2012 to 2017 recorded in an hourly interval.
Download the time series dataset using this link. After downloading the time series dataset, we will load it using the Pandas library.
import pandas as pd
To load the energy consumption dataset, run this code:
df = pd.read_csv('energy_consumption.csv')
To visualize the dataset, use this code:
df
Energy consumption dataset output:
From this output, we have the timeStamp, demand, precip, and temp columns. The columns are the variables that will build the time series model.
The time series is multivariate since it has three time-dependent variables (demand, precip, and temp). The columns have the following functions:
 The timeStamp column shows the time of recording.
 The demand column shows the hourly energy consumption.
 The precip and temp columns correlate with the demand column.
Converting the timeStamp column
We need to convert the timeStamp column to the DateTime format. It will enable us to perform time series analysis and operations on this column. We will use the pd.to_datetime function.
df['timeStamp']=pd.to_datetime(df['timeStamp'])
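To see what the conversion changes, the sketch below applies pd.to_datetime to a tiny frame. The two rows are hypothetical and only mimic the shape of the dataset's timeStamp strings.

```python
import pandas as pd

# Hypothetical rows; the real dataset is loaded from the CSV file instead
sample = pd.DataFrame({
    "timeStamp": ["2012-01-01 00:00:00", "2012-01-01 01:00:00"],
    "demand": [4937.5, 4752.1],
})

# Convert the text column into real datetime values
sample["timeStamp"] = pd.to_datetime(sample["timeStamp"])

# A datetime column passes this check and exposes the .dt accessor
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["timeStamp"])
hours = sample["timeStamp"].dt.hour.tolist()

print(is_datetime)  # True
print(hours)        # [0, 1]
```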
Plotting the demand column
Since we are forecasting the demand, we plot this column to visualize the data points. It will enable us to check for trends or seasonality in the time series. We will use the Plotly Express Python module to plot the line chart.
We import the Plotly Express Python module as follows:
import plotly.express as px
To plot the demand column, use the following code:
fig = px.line(df, x='timeStamp', y='demand', title='Energy Consumption')
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(step="all")
        ])
    )
)
fig.show()
It plots the following line chart:
From the output above, the dataset has seasonality (repetitive cycles), so it looks nonstationary. A visual check is subjective, though, so we still need a statistical check. The Augmented Dickey-Fuller (ADF) test assesses stationarity in our dataset more objectively.
If we find the dataset is nonstationary after the ADF test, we will have to perform differencing to make it stationary. Auto ARIMA performs differencing automatically. The next step is to set the timeStamp column as the index.
el_df=df.set_index('timeStamp')
We set the timeStamp column as the index for better interaction with the data frame. The resample() method that we use later also needs a datetime index to group the rows by time.
Plotting subplots
The subplots will show the time-dependent variables in the dataset. We will visualize the demand, precip, and temp columns.
el_df.plot(subplots=True)
It produces the following subplots:
Checking for missing or null values
We need to check for missing values in the dataset. Missing values affect the model and lead to inaccurate forecast results.
print ("\nMissing values : ", df.isnull().any())
Output:
From the output, all the columns have missing values. We will handle the missing values using data imputation. It ensures we have a complete time series dataset.
Imputing missing values
We will first impute the missing values in the demand column. We will use the ffill() method, which fills each gap with the last observed value.
# Impute the 'demand' column
df['demand'] = df['demand'].ffill()
# Impute the 'temp' column
df['temp'] = df['temp'].ffill()
# Impute the 'precip' column
df['precip'] = df['precip'].ffill()
To learn more about handling missing values in time series using data imputation, go through this article.
We check again for missing values to know if we have handled the issue successfully.
print ("\nMissing values : ", df.isnull().any())
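The forward-fill step can be tried on a small frame first. The readings below are hypothetical and only illustrate how each gap takes the last observed value.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with gaps (illustrative values only)
sample = pd.DataFrame({
    "demand": [100.0, np.nan, np.nan, 130.0],
    "temp": [20.0, 21.0, np.nan, 23.0],
})

# Forward fill copies the last observed value into each gap
filled = sample.ffill()

print(filled["demand"].tolist())    # [100.0, 100.0, 100.0, 130.0]
print(filled["temp"].tolist())      # [20.0, 21.0, 21.0, 23.0]
print(filled.isnull().any().any())  # False
```

A gap at the very start of a column has no earlier value to copy, so forward fill leaves it missing. That case needs a different strategy, such as a backward fill.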
Dataset resampling
The time series has many data points that may be difficult to analyze and visualize. We need to resample the time series by aggregating it to monthly intervals. We will have fewer data points that are easier to analyze.
The resample() method groups the data points in the time series by month, and mean() averages the values within each month.
el_df.resample('M').mean()
Dataset resampling output:
Let's plot new subplots of the resampled dataset.
Plotting new subplots
We plot the new subplot as follows:
el_df.resample('M').mean().plot(subplots=True)
The new subplots show the resampled dataset. It will be easier to model these fewer data points. We will save the resampled dataset in a new variable.
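To see what the aggregation does on a small scale, the sketch below resamples two months of hypothetical hourly values. It labels each month by its first day using the 'MS' alias, which is an assumption made here because the name of the month-end alias differs between Pandas versions.

```python
import pandas as pd

# Hypothetical hourly index covering January and February 2012 (60 days)
index = pd.date_range("2012-01-01", periods=60 * 24, freq=pd.Timedelta(hours=1))

# Made-up demand: the hour of the day plus 100 times the month number
values = (index.hour + 100.0 * index.month).to_numpy()
hourly = pd.DataFrame({"demand": values}, index=index)

# Group the hourly rows by month and average each group
monthly = hourly.resample("MS").mean()

print(len(hourly), "hourly rows ->", len(monthly), "monthly rows")
print(monthly)
```

The 1,440 hourly rows collapse into two monthly averages, 111.5 for January and 211.5 for February.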
Saving the resampled dataset
We save the resampled dataset as follows:
final_df=el_df.resample('M').mean()
We will use this dataset to train the time series model. We can now start implementing the Auto ARIMA model.
Implementing the Auto ARIMA model
We implement the Auto ARIMA model using the pmdarima time series library. This library provides the auto_arima() function that automatically searches for the optimal parameter values.
To install pmdarima, use this command:
!pip install pmdarima
After the installation, we import it as follows:
import pmdarima as pm
The next step is to initialize the auto_arima() function.
Initialize the auto arima function
We initialize the auto_arima() function as follows:
model = pm.auto_arima(final_df['demand'],
                      m=12, seasonal=True,
                      start_p=0, start_q=0, max_order=4, test='adf',
                      error_action='ignore',
                      suppress_warnings=True,
                      stepwise=True, trace=True)
In the auto_arima() function, we pass final_df, which is our resampled dataset. We select the demand column since this is the variable we want to forecast.
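The function fits multiple candidate models with different p and q values, with d chosen by a stationarity test, and then keeps the best-scoring one. By default, it uses a stepwise search that moves from one candidate toward better-scoring neighbors. With stepwise=False, it can instead fit every combination within the given limits, similar to a Grid Search, or a random sample of them when random=True.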
The auto_arima() function also has the following parameters:

m=12: It is the number of observations in one seasonal cycle. We use 12 because the resampled data is monthly and there are 12 months in a year.

start_p=0: It is the value of p from which the search starts.

start_q=0: It is the value of q from which the search starts.

max_order=4: It caps the combined order p + q + P + Q of a candidate model, where P and Q are the seasonal counterparts of p and q. The pmdarima documentation describes this cap as applying when the search is not stepwise.

test='adf': It selects the Augmented Dickey-Fuller (ADF) test to check for stationarity in our dataset. If the dataset is nonstationary after the ADF test, the auto_arima() function will automatically generate the d value for differencing. If the dataset is stationary, it sets d=0 (no need for differencing).

error_action='ignore': It skips candidate models that fail to fit instead of stopping the search.

suppress_warnings=True: It hides the warnings raised during the parameter search.

stepwise=True: It runs the stepwise search to find the optimal parameters. Fitting every parameter combination is more exhaustive, but it is slow. We opt to use the stepwise search since it is faster.

trace=True: It prints each candidate model and its score as the search runs.
When you run this code, the function will search the parameter combinations and produce the following output:
From the output above, the best model is ARIMA(1,0,1) (p=1, d=0, and q=1). The function automatically sets d=0 because the ADF test found the dataset is stationary.
We had previously observed the time series dataset plots to have seasonality. Therefore, we thought the time series was nonstationary, hence a need for differencing.
The ADF test, however, checks for a unit root rather than for seasonality, and it found the series stationary in that sense. A statistical test is also more objective than observing/visualizing the plots. That is why the function sets d=0, and there is no need for differencing.
After initializing the auto_arima() function, the next step is to split the time series dataset.
Splitting the time series dataset
We split the time series dataset into a training data frame and a test data frame as follows:
train = final_df[(final_df.index >= '2012-01-31') & (final_df.index <= '2017-04-30')]
The code selects the data points from 2012-01-31 to 2017-04-30 for model training. We get the data points for model testing using the following code:
test = final_df[final_df.index > '2017-04-30']
The data points after 2017-04-30 are for model testing. To display the test data points, use this code:
test
From the output, the test data frame has four data points.
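The same date-based split can be checked on a stand-in frame. The sketch below builds a hypothetical monthly index with the same date range as the resampled dataset and made-up demand values, then confirms the sizes of the two parts.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame with month-end dates from 2012-01 to 2017-08
periods = pd.period_range("2012-01", "2017-08", freq="M")
index = periods.to_timestamp(how="end").normalize()
monthly = pd.DataFrame({"demand": np.arange(len(index), dtype=float)}, index=index)

# Everything up to the cutoff is training data; everything after it is test data
cutoff = pd.Timestamp("2017-04-30")
train = monthly[monthly.index <= cutoff]
test = monthly[monthly.index > cutoff]

print(len(train), len(test))  # 64 4
print(test.index[0].date())   # 2017-05-31
```

We split by date instead of shuffling the rows because a time series model must only learn from observations that come before the ones it predicts.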
Let's fit the Auto ARIMA model to the train data frame.
Fitting the Auto ARIMA model
Fitting the Auto ARIMA model to the train data frame will enable the model to learn from the time series dataset. The final model will make future predictions.
model.fit(train['demand'])
After training, it produces the following output:
We train the model using the train data frame. It also uses the optimal p, d, and q parameter values during training. Let's use the model to make predictions.
Using the Auto ARIMA model to make predictions
The Auto ARIMA model will first predict the period covered by the test data frame. It will also forecast/predict the unseen future time series values.
Predicting the test data frame
We predict the test data frame as follows:
forecast=model.predict(n_periods=4, return_conf_int=True)
n_periods=4: It represents the number of data points in the test data frame that the model will predict. To see the predicted values, use this code:
forecast
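Because return_conf_int=True is set, predict() returns a pair: the point forecasts and a two-column array holding the lower and upper confidence bounds. That is why the conversion step indexes forecast[0]. The sketch below uses hypothetical numbers in place of real model output to show how both parts could be kept in one data frame. Wrapping each part in np.asarray makes the sketch indifferent to whether the predictions arrive as an array or as a Pandas Series.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the (predictions, intervals) pair; not real model output
predictions = np.array([5500.0, 6100.0, 6600.0, 6400.0])
intervals = np.array([
    [5200.0, 5800.0],
    [5700.0, 6500.0],
    [6100.0, 7100.0],
    [5800.0, 7000.0],
])
forecast = (predictions, intervals)

# Month-end dates matching the four test data points
dates = pd.to_datetime(["2017-05-31", "2017-06-30", "2017-07-31", "2017-08-31"])

# Unpack the pair and keep the bounds next to the point forecasts
point, bounds = forecast
forecast_table = pd.DataFrame(
    {
        "Prediction": np.asarray(point),
        "Lower": np.asarray(bounds)[:, 0],
        "Upper": np.asarray(bounds)[:, 1],
    },
    index=dates,
)
print(forecast_table)
```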
We need to convert the predicted values to a Pandas data frame. It will be easier to plot the Pandas data frame using Matplotlib.
forecast_df = pd.DataFrame(forecast[0],index = test.index,columns=['Prediction'])
To see the Pandas data frame, run this code:
forecast_df
It produces this output:
The next step is to plot the Pandas data frame using Matplotlib.
Plotting the Pandas data frame
We import Matplotlib as follows:
import matplotlib.pyplot as plt
We plot the line chart as follows:
pd.concat([final_df['demand'],forecast_df],axis=1).plot()
It produces the following line chart:
From the line chart above:
 The blue line is the actual energy demand.
 The orange line is the predicted energy demand.
The blue and orange lines are close to each other, which suggests that the Auto ARIMA model's predictions track the actual demand well.
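A visual comparison can be backed by an error metric. The sketch below computes the mean absolute error (MAE) and the mean absolute percentage error (MAPE) on hypothetical actual and predicted values, which are not results from this tutorial's model.

```python
import numpy as np

# Hypothetical actual and predicted monthly demand (not real results)
actual = np.array([5500.0, 6200.0, 6800.0, 6500.0])
predicted = np.array([5400.0, 6300.0, 6600.0, 6500.0])

errors = actual - predicted

# MAE: the average size of the error, in the same unit as the demand
mae = np.mean(np.abs(errors))

# MAPE: the average error as a percentage of the actual value
mape = np.mean(np.abs(errors) / actual) * 100

print(f"MAE: {mae:.1f}")   # MAE: 100.0
print(f"MAPE: {mape:.2f}%")
```

With the tutorial's data frames, the inputs would be test['demand'] and forecast_df['Prediction'].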
We can now use this model to predict unseen future values.
Predict the unseen future time series values
To predict/forecast the unseen future values, use this code:
forecast1=model.predict(n_periods=8, return_conf_int=True)
forecast_range=pd.date_range(start='2017-05-31', periods=8,freq='M')
n_periods=8: It represents the number of data points the model will predict in the future. The future dates start from 2017-05-31. We also need to convert the predicted values to a Pandas data frame.
forecast1_df = pd.DataFrame(forecast1[0],index =forecast_range,columns=['Prediction'])
Finally, we plot the future predicted values using Matplotlib.
Plotting the future predicted values
To plot the future predicted values, use the following code:
pd.concat([final_df['demand'],forecast1_df],axis=1).plot()
It produces the following line chart:
From the line chart above:
 The blue line represents the actual energy demand.
 The orange line represents the predicted energy demand.
The orange line also shows the unseen future predictions. The forecast maintains the general pattern of the historical demand, which suggests that the Auto ARIMA model has captured it reasonably well.
Conclusion
In this tutorial, we have learned how to build a multivariate time series model with Auto ARIMA. We explored how the Auto ARIMA model works and how it automatically finds the best parameters of an ARIMA model.
Finally, we implemented the Auto ARIMA model. We used the Auto ARIMA model to find the p, d, and q values.
We used the trained Auto ARIMA model to predict the energy demand on the test data frame and the unseen future time series values. The predictions closely followed the actual demand in the plotted line charts.
You can get the complete Python implementation of this tutorial in Google Colab here
Further reading
 Auto ARIMA documentation
 Pmdarima documentation
 What is auto ARIMA?
 ARIMA Model time series forecasting
 ARIMA model definition
 Hyperparameter Tuning
 Random Search technique
 Grid Search technique
 ARIMA model guide
 ARIMA models
Peer Review Contributions by: Willies Ogola