This tutorial contains a brief overview of statistical and machine learning methods for time series forecasting, along with experiments and a comparative analysis of Long Short-Term Memory (LSTM) based architectures for this problem. Single-layer, two-layer, and bidirectional single-layer LSTM models are considered. The Metro Interstate Traffic Volume Data Set from the UCI Machine Learning Repository and the PyTorch deep learning framework are used for the analysis.
- Time series forecasting methods overview
- Dataset
- Metro Traffic Prediction using LSTM-based recurrent neural network
- Summary
Time series forecasting methods overview
A time series is a set of observations, each one recorded at a specific time \(t\). These can be weather observations, for example records of the temperature over a month, observations of currency quotes during the day, or any other process aggregated over time. Time series forecasting can be defined as the act of predicting the future by understanding the past. A forecasting model can rely on a single variable (the univariate case) or take more than one variable into consideration (the multivariate case).
Stochastic Linear Models:
- Autoregressive (AR);
- Moving Average (MA);
- Autoregressive Moving Average (ARMA);
- Autoregressive Integrated Moving Average (ARIMA);
- Seasonal ARIMA (SARIMA).
For the above family of models, the stationarity condition must be satisfied. Loosely speaking, a stochastic process is stationary if its statistical properties do not change over time.
Stochastic Non-Linear Models:
- Nonlinear Autoregressive Exogenous model (NARX);
- Autoregressive Conditional Heteroskedasticity (ARCH);
- Generalized Autoregressive Conditional Heteroskedasticity (GARCH).
Machine Learning Models:
- Hidden Markov Model;
- Least-square SVM (LS-SVM);
- Dynamic Least-square SVM (DLS-SVM);
- Feedforward Neural Network (FNN);
- Time Lagged Neural Network (TLNN);
- Seasonal Artificial Neural Network (SANN);
- Recurrent Neural Networks (RNN).
Dataset
We will use the Metro Interstate Traffic Volume Data Set from the UC Irvine Machine Learning Repository, which hosts a large number of datasets for various machine learning tasks. We will investigate how weather and holiday features influence metro traffic in the US.
Attribute Information:
Attribute | Type | Description
---|---|---
holiday | Categorical | US national holidays plus regional holiday, Minnesota State
temp | Numeric | Average temperature in Kelvin
rain_1h | Numeric | Amount of rain in mm that occurred in the hour
snow_1h | Numeric | Amount of snow in mm that occurred in the hour
clouds_all | Numeric | Percentage of cloud cover
weather_main | Categorical | Short textual description of the current weather
weather_description | Categorical | Longer textual description of the current weather
date_time | DateTime | Hour of the data collected in local CST time
traffic_volume | Numeric | Hourly I-94 ATR 301 reported westbound traffic volume
Our target variable will be traffic_volume. We will consider the multivariate, multi-step prediction case with an LSTM-based recurrent neural network architecture.
Metro Traffic Prediction using LSTM-based recurrent neural network
I used Google Colab notebooks to run the experiments. For convenience, I mounted my Google Drive, where the files are stored.
In [1]:
Out [1]:
Mounted at /content/drive
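The cell above is presumably the standard Colab mount call; a minimal sketch:

```python
# Mount Google Drive so files stored there are visible under /content/drive.
from google.colab import drive

drive.mount('/content/drive')
```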
In [2]:
In [3]:
In [4]:
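Cells In [2]–In [4] most likely import the libraries and read the CSV file. A minimal sketch, assuming the file name and its location on the mounted drive (both are assumptions):

```python
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# The path is an assumption; point it at wherever the CSV is stored on the drive.
DATA_PATH = '/content/drive/MyDrive/Metro_Interstate_Traffic_Volume.csv'
df = pd.read_csv(DATA_PATH)
```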
Exploratory Data Analysis (EDA) and Scaling
In [5]:
Out [5]:
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
Categorical features: holiday, weather_main, weather_description.
Continuous features: temp, rain_1h, snow_1h, clouds_all.
Target variable: traffic_volume
In [6]:
Out [6]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 holiday 48204 non-null object
1 temp 48204 non-null float64
2 rain_1h 48204 non-null float64
3 snow_1h 48204 non-null float64
4 clouds_all 48204 non-null int64
5 weather_main 48204 non-null object
6 weather_description 48204 non-null object
7 date_time 48204 non-null object
8 traffic_volume 48204 non-null int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
Checking for NaN values
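A one-line check of this kind would produce the per-column counts below:

```python
# Count missing values per column; all zeros means nothing needs imputing.
df.isna().sum()
```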
In [7]:
Out [7]:
holiday 0
temp 0
rain_1h 0
snow_1h 0
clouds_all 0
weather_main 0
weather_description 0
date_time 0
traffic_volume 0
Here I take two categorical variables into consideration, ‘holiday’ and ‘weather_main’, and one-hot encode them.
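A minimal sketch of that step with pandas get_dummies (the exact call in the notebook is not shown):

```python
# One-hot encode the two categorical variables; all other columns pass through unchanged.
df = pd.get_dummies(df, columns=['holiday', 'weather_main'])
```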
In [8]:
In [9]:
Out [9]:
temp | rain_1h | snow_1h | clouds_all | weather_description | date_time | traffic_volume | holiday_Columbus Day | holiday_Independence Day | holiday_Labor Day | holiday_Martin Luther King Jr Day | holiday_Memorial Day | holiday_New Years Day | holiday_None | holiday_State Fair | holiday_Thanksgiving Day | holiday_Veterans Day | holiday_Washingtons Birthday | weather_main_Clouds | weather_main_Drizzle | weather_main_Fog | weather_main_Haze | weather_main_Mist | weather_main_Rain | weather_main_Smoke | weather_main_Snow | weather_main_Squall | weather_main_Thunderstorm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 288.28 | 0.0 | 0.0 | 40 | scattered clouds | 2012-10-02 09:00:00 | 5545 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 289.36 | 0.0 | 0.0 | 75 | broken clouds | 2012-10-02 10:00:00 | 4516 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 289.58 | 0.0 | 0.0 | 90 | overcast clouds | 2012-10-02 11:00:00 | 4767 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 290.13 | 0.0 | 0.0 | 90 | overcast clouds | 2012-10-02 12:00:00 | 5026 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 291.14 | 0.0 | 0.0 | 75 | broken clouds | 2012-10-02 13:00:00 | 4918 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
In [10]:
Out [10]:
(48204, 28)
Here we can see outliers in the ‘temp’ variable; let's filter them out with the interquartile range (IQR) method.
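A sketch of the IQR computation and the filter on ‘temp’, assuming the usual 1.5×IQR fences (the exact fences and columns used in the notebook are assumptions):

```python
# Per-column interquartile range (Q3 - Q1) over the numeric columns.
numeric = df.select_dtypes(include=[np.number])
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1

# Keep only rows whose 'temp' lies within the 1.5 * IQR fences.
low = q1['temp'] - 1.5 * iqr['temp']
high = q3['temp'] + 1.5 * iqr['temp']
df = df[(df['temp'] >= low) & (df['temp'] <= high)]
```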
In [11]:
Out [11]:
In [12]:
Out [12]:
temp 19.646
rain_1h 0.000
snow_1h 0.000
clouds_all 89.000
traffic_volume 3740.000
holiday_Columbus Day 0.000
holiday_Independence Day 0.000
holiday_Labor Day 0.000
holiday_Martin Luther King Jr Day 0.000
holiday_Memorial Day 0.000
holiday_New Years Day 0.000
holiday_None 0.000
holiday_State Fair 0.000
holiday_Thanksgiving Day 0.000
holiday_Veterans Day 0.000
holiday_Washingtons Birthday 0.000
weather_main_Clouds 1.000
weather_main_Drizzle 0.000
weather_main_Fog 0.000
weather_main_Haze 0.000
weather_main_Mist 0.000
weather_main_Rain 0.000
weather_main_Smoke 0.000
weather_main_Snow 0.000
weather_main_Squall 0.000
weather_main_Thunderstorm 0.000
dtype: float64
In [13]:
In [14]:
Out [14]:
In [15]:
Normalizing the features
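The values in Out [19] below look like z-scores for the continuous columns only, so the scaling step was probably along these lines (the choice of scaler is an assumption):

```python
from sklearn.preprocessing import StandardScaler

# Standardize only the continuous columns; the one-hot columns stay as 0/1.
cont_cols = ['temp', 'clouds_all', 'traffic_volume']
scaler = StandardScaler()
df[cont_cols] = scaler.fit_transform(df[cont_cols])
```

Keeping the fitted scaler around also makes it possible to map the model's predictions back to the original traffic_volume scale later.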
In [16]:
In [17]:
In [18]:
In [19]:
Out [19]:
temp | clouds_all | traffic_volume | holiday_Columbus Day | holiday_Independence Day | holiday_Labor Day | holiday_Martin Luther King Jr Day | holiday_Memorial Day | holiday_New Years Day | holiday_None | holiday_State Fair | holiday_Thanksgiving Day | holiday_Veterans Day | holiday_Washingtons Birthday | weather_main_Clouds | weather_main_Drizzle | weather_main_Fog | weather_main_Haze | weather_main_Mist | weather_main_Rain | weather_main_Smoke | weather_main_Snow | weather_main_Squall | weather_main_Thunderstorm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.581321 | -0.261692 | 1.145383 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0.668789 | 0.638749 | 0.628729 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0.686606 | 1.024652 | 0.754754 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0.731150 | 1.024652 | 0.884797 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0.812948 | 0.638749 | 0.830570 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48199 | 0.190147 | 0.638749 | 0.140191 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
48200 | 0.134265 | 1.024652 | -0.242404 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
48201 | 0.131835 | 1.024652 | -0.554707 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
48202 | 0.080003 | 1.024652 | -0.910691 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
48203 | 0.082432 | 1.024652 | -1.159730 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
48194 rows × 24 columns
Preparing the Training Dataset and Visualizations
Here we will consider the case of predicting multiple future points given a window of past history.
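The shapes printed in Out [23] below suggest a sliding window of 120 past hours over all 24 features, with the next 32 hourly traffic_volume values as the target. A sketch of such a windowing function (the function and variable names are my own):

```python
def make_windows(features, target, history=120, horizon=32):
    """Slice the series into (history, n_features) inputs and (horizon,) targets."""
    xs, ys = [], []
    for i in range(len(features) - history - horizon + 1):
        xs.append(features[i:i + history])
        ys.append(target[i + history:i + history + horizon])
    return np.array(xs, dtype=np.float32), np.array(ys, dtype=np.float32)

# Features are all 24 scaled columns; the target is the scaled traffic_volume.
X, y = make_windows(df.values.astype(np.float32),
                    df['traffic_volume'].values.astype(np.float32))
```

The same function would be applied separately to the training, validation, and test slices of the series to obtain the shapes reported below.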
In [20]:
In [21]:
In [22]:
In [23]:
Out [23]:
Train input features shape : (39280, 120, 24)
Train output shape : (39280, 32)
Validation input features shape : (5582, 120, 24)
Validation output shape : (5582, 32)
Test input features shape : (1860, 120, 24)
Test output shape : (1860, 32)
In [24]:
LSTM Time Series Predictor Model
For more efficient modelling of long sequences, I chose the LSTM architecture here. An LSTM layer consists of cells of the kind shown below. Two vectors are inputs to the LSTM cell: the new input vector from the data \(x_t\) and the hidden state vector \(h_{t-1}\), obtained from the hidden state of this cell at the previous time step. Besides the hidden state vector, the LSTM cell maintains a “memory vector” or cell state \(C_t\), and its updates are controlled by several gates.
The cell state at time step \(t\) is a linear combination of the cell state at the previous time step, \(C_{t-1}\), weighted by the forget gate, and the new candidate cell state \(\tilde{C}_t\), weighted by the input gate. Where the values of the forget gate \(f_t\) are close to zero, the corresponding components of \(C_{t-1}\) are forgotten; where the values of the input gate vector are large, the new candidate values are added to what is already stored in memory.
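For reference, the standard LSTM cell updates (without peepholes) are:
\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i),\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o),\\
\tilde{C}_t &= \tanh(W_C x_t + U_C h_{t-1} + b_C),\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,\\
h_t &= o_t \odot \tanh(C_t),
\end{aligned}
\]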
where,
\(i_t\) - input gate;
\(f_t\) - forget gate;
\(o_t\) - output gate;
\(\tilde{C_t}\) - new candidate cell state;
\(C_t\) - cell state;
\(h_t\) - block output.
Peephole connections are often added to LSTM cells, letting the gates also see the cell state.
In [25]:
In [26]:
In [27]:
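A minimal sketch of such a predictor in PyTorch (the class name, hidden size, and the choice of predicting from the last hidden state are my assumptions; the notebook's exact definition is not shown):

```python
import torch
import torch.nn as nn

class LSTMPredictor(nn.Module):
    """Maps a (batch, history, n_features) window to `horizon` future values."""
    def __init__(self, n_features=24, hidden_size=64, horizon=32,
                 num_layers=1, bidirectional=False):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=bidirectional)
        out_dim = hidden_size * (2 if bidirectional else 1)
        self.head = nn.Linear(out_dim, horizon)

    def forward(self, x):
        out, _ = self.lstm(x)           # out: (batch, history, out_dim)
        return self.head(out[:, -1])    # predict the horizon from the last step

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LSTMPredictor().to(device)
```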
Training and Evaluation Helper Functions
In [28]:
In [29]:
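A sketch of the training and evaluation helpers, assuming MSE loss, Adam, and DataLoaders built from the window tensors (X_train, y_train, X_val, y_val are hypothetical names for the arrays prepared above; only the printed format mirrors the logs below):

```python
from torch.utils.data import DataLoader, TensorDataset

def run_epoch(model, loader, criterion, optimizer=None):
    """One pass over a DataLoader; trains when an optimizer is given, else evaluates."""
    training = optimizer is not None
    model.train(training)
    total = 0.0
    with torch.set_grad_enabled(training):
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            loss = criterion(model(xb), yb)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * len(xb)
    return total / len(loader.dataset)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_loader = DataLoader(TensorDataset(torch.tensor(X_train), torch.tensor(y_train)),
                          batch_size=256, shuffle=True)
val_loader = DataLoader(TensorDataset(torch.tensor(X_val), torch.tensor(y_val)),
                        batch_size=256)

best_val = float('inf')
for epoch in range(1, 11):
    train_loss = run_epoch(model, train_loader, criterion, optimizer)
    val_loss = run_epoch(model, val_loader, criterion)
    print(f'Epoch: {epoch}, train loss: {train_loss:.5f}')
    print(f'Epoch: {epoch}, val loss: {val_loss:.5f}')
    if val_loss < best_val:
        best_val = val_loss
        torch.save(model.state_dict(), 'best_model.pt')
        print('Saving model for best loss')
```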
1-Layer LSTM Model
In [30]:
Out [30]:
====================1====================
Epoch: 1, train loss: 0.90311
Epoch: 1, val loss: 0.78379
Saving model for best loss
====================2====================
Epoch: 2, train loss: 0.77797
Epoch: 2, val loss: 0.73462
Saving model for best loss
====================3====================
Epoch: 3, train loss: 0.74900
Epoch: 3, val loss: 0.70943
Saving model for best loss
====================4====================
Epoch: 4, train loss: 0.73103
Epoch: 4, val loss: 0.69784
Saving model for best loss
====================5====================
Epoch: 5, train loss: 0.71739
Epoch: 5, val loss: 0.68612
Saving model for best loss
====================6====================
Epoch: 6, train loss: 0.70665
Epoch: 6, val loss: 0.67945
Saving model for best loss
====================7====================
Epoch: 7, train loss: 0.69868
Epoch: 7, val loss: 0.67938
Saving model for best loss
====================8====================
Epoch: 8, train loss: 0.69284
Epoch: 8, val loss: 0.67938
====================9====================
Epoch: 9, train loss: 0.68811
Epoch: 9, val loss: 0.66802
Saving model for best loss
====================10====================
Epoch: 10, train loss: 0.68397
Epoch: 10, val loss: 0.67317
In [31]:
Out [31]:
Test
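Evaluation on the held-out windows presumably reloads the best checkpoint and computes the same MSE on the test set (X_test and y_test are hypothetical names, continuing the sketches above):

```python
# Reload the best checkpoint and score it on the test windows.
model.load_state_dict(torch.load('best_model.pt'))
test_loader = DataLoader(TensorDataset(torch.tensor(X_test), torch.tensor(y_test)),
                         batch_size=256)
print(f'Test loss: {run_epoch(model, test_loader, criterion):.5f}')
```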
In [32]:
In [33]:
2-Layer LSTM
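With the predictor class sketched earlier, the two-layer variant only changes a constructor argument (hypothetical, using nn.LSTM's num_layers parameter):

```python
# Stack two LSTM layers; the second layer consumes the hidden states of the first.
model_2l = LSTMPredictor(num_layers=2).to(device)
```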
In [34]:
Out [34]:
====================1====================
Epoch: 1, train loss: 0.89874
Epoch: 1, val loss: 0.77490
Saving model for best loss
====================2====================
Epoch: 2, train loss: 0.77985
Epoch: 2, val loss: 0.73741
Saving model for best loss
====================3====================
Epoch: 3, train loss: 0.75368
Epoch: 3, val loss: 0.71219
Saving model for best loss
====================4====================
Epoch: 4, train loss: 0.73283
Epoch: 4, val loss: 0.69299
Saving model for best loss
====================5====================
Epoch: 5, train loss: 0.71830
Epoch: 5, val loss: 0.68409
Saving model for best loss
====================6====================
Epoch: 6, train loss: 0.70771
Epoch: 6, val loss: 0.67732
Saving model for best loss
====================7====================
Epoch: 7, train loss: 0.69788
Epoch: 7, val loss: 0.67322
Saving model for best loss
====================8====================
Epoch: 8, train loss: 0.69050
Epoch: 8, val loss: 0.67194
Saving model for best loss
====================9====================
Epoch: 9, train loss: 0.68391
Epoch: 9, val loss: 0.66734
Saving model for best loss
====================10====================
Epoch: 10, train loss: 0.67748
Epoch: 10, val loss: 0.66578
Saving model for best loss
In [35]:
Out [35]:
Test
In [36]:
In [37]:
Out [37]:
Bidirectional 1-Layer LSTM
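The bidirectional variant processes each window forwards and backwards, doubling the feature dimension seen by the output layer (hypothetical, using nn.LSTM's bidirectional flag):

```python
# Bidirectional single-layer LSTM; the output head receives 2 * hidden_size features.
model_bi = LSTMPredictor(bidirectional=True).to(device)
```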
In [38]:
Out [38]:
====================1====================
Epoch: 1, train loss: 0.92054
Epoch: 1, val loss: 0.78858
Saving model for best loss
====================2====================
Epoch: 2, train loss: 0.77843
Epoch: 2, val loss: 0.73070
Saving model for best loss
====================3====================
Epoch: 3, train loss: 0.75058
Epoch: 3, val loss: 0.71063
Saving model for best loss
====================4====================
Epoch: 4, train loss: 0.73297
Epoch: 4, val loss: 0.69804
Saving model for best loss
====================5====================
Epoch: 5, train loss: 0.71834
Epoch: 5, val loss: 0.68801
Saving model for best loss
====================6====================
Epoch: 6, train loss: 0.70628
Epoch: 6, val loss: 0.67993
Saving model for best loss
====================7====================
Epoch: 7, train loss: 0.69642
Epoch: 7, val loss: 0.67463
Saving model for best loss
====================8====================
Epoch: 8, train loss: 0.68780
Epoch: 8, val loss: 0.67683
====================9====================
Epoch: 9, train loss: 0.68003
Epoch: 9, val loss: 0.67213
Saving model for best loss
====================10====================
Epoch: 10, train loss: 0.67364
Epoch: 10, val loss: 0.66894
Saving model for best loss
In [39]:
Out [39]:
Test
In [40]:
In [41]:
In [42]:
Experimental results on validation data
Model | Best MSE loss |
---|---|
1-layer LSTM | 0.66802 |
2-layer LSTM | 0.66578 |
Bidirectional 1-layer LSTM | 0.66894 |
Summary
At first glance, all three LSTM-based models showed approximately the same results, so you can choose the single-layer LSTM without loss of quality. In cases where computational efficiency is crucial, you may choose a GRU model, which has fewer parameters to optimize than an LSTM. Ways to improve an existing deep learning model:
- work on data cleanliness and look for essential features that are strongly related to the target variable;
- manual and automatic feature engineering;
- optimization of hyperparameters (length of the hidden state vector, batch size, number of epochs, learning rate, number of layers);
- try different optimization algorithms;
- try a different loss function that is differentiable and appropriate for the task you’re solving;
- use ensembles of prediction vectors.