Time Series Forecasting: LSTM-based model for metro traffic prediction (PyTorch)

This tutorial contains brief overview of statistical and machine learning methods for time series forecasting, experiments and comparative analysis of Long short-term memory (LSTM) based architectures for solving above mentioned problem. Single layer, two layer and bidirectional single layer LSTM cases are considered. Metro Interstate Traffic Volume Data Set from UCI Machine Learning Repository and PyTorch deep learning framework are used for analysis.

• • •

Time series forecasting methods overview

A time series is a set of observations, each one is being recorded at the specific time $t$ . It can be weather observations, for example, a records of the temperature for a month, it can be observations of currency quotes during the day or any other process aggregated by time. Time series forecasting can be determened as the act of predicting the future by understanding the past. The model for forecasting can rely on one variable and this is a univariate case or when more than one variable taken into consideration it will be multivariate case.

Stochastic Linear Models:

Autoregressive (AR);
Moving Average (MA);
Autoregressive Moving Average (ARMA);
Autoregressive Integrated Moving Average (ARIMA);
Seasonal ARIMA (SARIMA).

For above family of models, the stationarity condition must be satisfied. Loosely speaking, a stochastic process is stationary, if its statistical properties do not change with time.

Stochastic Non-Linear Models:

nonlinear autoregressive exogenous models (NARX);
autoregressive conditional heteroskedasticity (ARCH);
generalized autoregressive conditional heteroskedasticity (GARCH).

Machine Learning Models

Hidden Markov Model;
Least-square SVM (LS-SVM);
Dynamic Least-square SVM (DLS-SVM);
Feed Forward Network (FNN);
Time Lagged Neural Network (TLNN);
Seasonal Artificial Neural Network (SANN);
Recurrent Neural Networks (RNN).

Dataset

We will use Metro Interstate Traffic Volume Data Set from UC Irvine Machine Learning Repository, which contains large number of datasets for various tasks in machine learning. We will investigate how weather and holiday features influence the metro traffic in US.

Attribute Information:

holiday — Categorical US National holidays plus regional holiday, Minnesota State;
temp — Numeric, average temp in kelvin;
rain_1h — Numeric, amount in mm of rain that occurred in the hour;
snow_1h — Numeric, amount in mm of snow that occurred in the hour;
clouds_all — Numeric, percentage of cloud cover;
weather_main — Categorical, short textual description of the current weather;
weather_description — Categorical, longer textual description of the current weather;
date_time — DateTime, hour of the data collected in local CST time;
traffic_volume — Numeric, hourly I-94 ATR 301 reported westbound traffic volume.

Our target variable will be traffic_volume. Here we will consider multivariate multi-step prediction case with LSTM-based recurrent neural network architecture.

Metro Traffic Prediction using LSTM-based recurrent neural network

I used Google Colab Notebooks to calculate experiments. Here, for convenience, I mounted my Google Drive where I stored the files.

In [1]:python

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Out [1]:

Mounted at /content/drive

In [2]:python

import os
import copy
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.autograd import Variable

import warnings
warnings.filterwarnings('ignore')

In [3]:python

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed(0)
torch.backends.cudnn.deterministic = True

In [4]:python

data = pd.read_csv(
  '/content/drive/My Drive/time series prediction/Metro_Interstate_Traffic_Volume.csv',
  parse_dates=True
)

Exploratory Data Analysis (EDA) and Scaling

In [5]:python

data.head()

Out [6]:

	holiday	temp	clouds_all	weather_main	weather_description	date_time	traffic_volume
0	None	288.28	40	Clouds	scattered clouds	2012-10-02 09:00:00	5545
1	None	289.36	75	Clouds	broken clouds	2012-10-02 10:00:00	4516
2	None	289.58	90	Clouds	overcast clouds	2012-10-02 11:00:00	4767
3	None	290.13	90	Clouds	overcast clouds	2012-10-02 12:00:00	5026
4	None	291.14	75	Clouds	broken clouds	2012-10-02 13:00:00	4918

Сategorical features: holiday, weather_main, weather_description.
Continious features: temp, rain_1h, show_1h, clouds_all.
Target variable: traffic_volume

In [7]:python

data.info()

Out [7]:

  <class 'pandas.core.frame.DataFrame'>
  RangeIndex: 48204 entries, 0 to 48203
  Data columns (total 9 columns):
   #   Column               Non-Null Count  Dtype
  ---  ------               --------------  -----
   0   holiday              48204 non-null  object
   1   temp                 48204 non-null  float64
   2   rain_1h              48204 non-null  float64
   3   snow_1h              48204 non-null  float64
   4   clouds_all           48204 non-null  int64
   5   weather_main         48204 non-null  object
   6   weather_description  48204 non-null  object
   7   date_time            48204 non-null  object
   8   traffic_volume       48204 non-null  int64
  dtypes: float64(3), int64(2), object(4)
  memory usage: 3.3+ MB

Checking for Nan values

In [8]:python

for column in data.columns:
  print(column, sum(data[column].isna()))

Out [8]:

  holiday 0
  temp 0
  rain_1h 0
  snow_1h 0
  clouds_all 0
  weather_main 0
  weather_description 0
  date_time 0
  traffic_volume 0

Here I take into consideration two categorical variables, such as holiday and weather_main and then one-hot encode them.

In [9]:python

data = pd.get_dummies(data, columns = ['holiday', 'weather_main'], drop_first=True)

In [10]:python

data.head()

Out [10]:

	temp	clouds_all	weather_description	date_time	traffic_volume	holiday_None	weather_main_Clouds
0	288.28	40	scattered clouds	2012-10-02 09:00:00	5545	1	1
1	289.36	75	broken clouds	2012-10-02 10:00:00	4516	1	1
2	289.58	90	overcast clouds	2012-10-02 11:00:00	4767	1	1
3	290.13	90	overcast clouds	2012-10-02 12:00:00	5026	1	1
4	291.14	75	broken clouds	2012-10-02 13:00:00	4918	1	1

In [11]:python

data.shape

Out [11]:


    (48204, 28)

Here we can see outliers in temp variable, lets filter outliers with Interquartile range (IQR) method

In [12]:python

data['temp'].plot()
plt.show()
data['traffic_volume'].plot()
plt.show()

Out [12]:

In [13]:python

Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Out [13]:

temp                                   19.646
rain_1h                                 0.000
snow_1h                                 0.000
clouds_all                             89.000
traffic_volume                       3740.000
holiday_Columbus Day                    0.000
holiday_Independence Day                0.000
holiday_Labor Day                       0.000
holiday_Martin Luther King Jr Day       0.000
holiday_Memorial Day                    0.000
holiday_New Years Day                   0.000
holiday_None                            0.000
holiday_State Fair                      0.000
holiday_Thanksgiving Day                0.000
holiday_Veterans Day                    0.000
holiday_Washingtons Birthday            0.000
weather_main_Clouds                     1.000
weather_main_Drizzle                    0.000
weather_main_Fog                        0.000
weather_main_Haze                       0.000
weather_main_Mist                       0.000
weather_main_Rain                       0.000
weather_main_Smoke                      0.000
weather_main_Snow                       0.000
weather_main_Squall                     0.000
weather_main_Thunderstorm               0.000
dtype: float64

In [14]:python

data = data[~((data['temp'] < (Q1['temp'] - 1.5 * IQR['temp'])) |(data['temp'] > (Q3['temp'] + 1.5 * IQR['temp'])))]

In [15]:python

data['temp'].plot()
plt.show()
data['traffic_volume'].plot()
plt.show()

Out [15]:

Traffic volume plot after outlier removal

In [16]:python

data.drop(columns=['rain_1h', 'snow_1h', 'weather_description', 'date_time'], inplace=True)

Normalizing the features

In [17]:python

features_to_norm = ['temp', 'clouds_all', 'traffic_volume']

In [18]:python

TRAIN_SPLIT = 40000
STEP = 6

past_history = 720
future_target = 32
target_index = 2

tmp = data[features_to_norm].values
data_mean = tmp[:TRAIN_SPLIT].mean(axis=0)
data_std = tmp[:TRAIN_SPLIT].std(axis=0)

In [19]:python

data[features_to_norm] = (data[features_to_norm]-data_mean)/data_std

In [20]:python

data

Out [20]:

Index	temp	clouds_all	traffic_volume	holiday_Columbus Day	...
0	0.581321	-0.261692	1.145383	0	...
1	0.668789	0.638749	0.628729	0	...
...	...	...	...	...	...
48203	0.082432	1.024652	-1.159730	0	...

48194 rows × 24 columns

Preparing training dataset and Visualizations

Here we will consider multiple future points prediction case given a past history.

In [21]:python

def multivariate_data(dataset, target, start_index, end_index, history_size,
                    target_size, step, single_step=False):
  data = []
  labels = []

  start_index = start_index + history_size
  if end_index is None:
      end_index = len(dataset) - target_size

  for i in range(start_index, end_index):
      indices = range(i-history_size, i, step)
      data.append(dataset[indices])

      if single_step:
          labels.append(target[i+target_size])
      else:
          labels.append(target[i:i+target_size])

  return np.array(data), np.array(labels)

In [22]:python

trainX, trainY = multivariate_data(data.values, data.iloc[:, target_index].values, 0,
                                 TRAIN_SPLIT, past_history,
                                 future_target, STEP)

valX, valY = multivariate_data(data.values, data.iloc[:, target_index].values,
                             TRAIN_SPLIT, None, past_history,
                             future_target, STEP)

In [23]:python

testX, testY = valX[:len(valX)//4], valY[:len(valY)//4]
valX, valY = valX[len(valX)//4:], valY[len(valY)//4:]

In [24]:python

print('Train input features shape : {}'.format(trainX.shape))
print('\nTrain output shape : {}'.format(trainY.shape))
print('\nValidation input features shape : {}'.format(valX.shape))
print('\nValidation output shape : {}'.format(valY.shape))
print('\nTest input features shape : {}'.format(testX.shape))
print('\nTest output shape : {}'.format(testY.shape))

Out [24]:

Train input features shape : (39280, 120, 24)

Train output shape : (39280, 32)

Validation input features shape : (5582, 120, 24)

Validation output shape : (5582, 32)

Test input features shape : (1860, 120, 24)

Test output shape : (1860, 32)

In [25]:python

def create_time_steps(length):
  return list(range(-length, 0))

def multi_step_plot(history, true_future, prediction, pic_name):
  plt.figure(figsize=(12, 6))
  num_in = create_time_steps(len(history))
  num_out = len(true_future)

  plt.plot(num_in, np.array(history), label='History', linewidth=2.5)
  plt.plot(np.arange(num_out)/STEP, np.array(true_future), 'b', label='True Future', linewidth=2.5)
  if prediction.any():
      plt.plot(np.arange(num_out)/STEP, np.array(prediction), 'ro', label='Predicted Future', linewidth=2.5)
  plt.legend(loc='upper left')
  plt.title(pic_name)
  plt.show()

multi_step_plot(trainX[0,:,target_index], trainY[0,:], np.array([0]), 'Train case')
multi_step_plot(valX[0,:,target_index], valY[0,:], np.array([0]), 'Val case')
multi_step_plot(testX[0,:,target_index], testY[0,:], np.array([0]), 'Test case')

Out [25]:

LSTM Time Series Predictor Model

For more efficient storing of long sequences, here I chose the LSTM architecture. LSTM layer consists from cells which is shown below. Two vectors are inputs to the LSTM-cell: the new input vector from the data $x_t$ and the vector of the hidden state $h_{t-1}$ , which is obtained from the hidden state of this cell in the previous time step.

Where:

$i_t$ - input gate
$f_t$ - forget gate
$o_t$ - output gate
$\tilde{C_t}$ - new candidate cell state
$C_t$ - cell state
$h_t$ - block output

In [26]:python

class LSTMTimeSeriesPredictor(nn.Module):

  def __init__(self,
               num_classes,
               n_features,
               hidden_size,
               num_layers,
               bidirectional=False,
               dp_rate=0.5):

      super(LSTMTimeSeriesPredictor, self).__init__()

      self.num_layers = num_layers
      self.hidden_size = hidden_size
      self.bidirectional = bidirectional

      self.lstm = nn.LSTM(input_size=n_features,
                          hidden_size=hidden_size,
                          num_layers=num_layers,
                          dropout=dp_rate,
                          batch_first=True,
                          bidirectional=self.bidirectional)

      self.fc = nn.Linear(in_features=hidden_size, out_features=num_classes)

  def forward(self, x):
     # dim(x) = (batch, seq_len, input_size)
     # dim(h_0) = (num_layers *num_directions, batch, hidden_size)
      if self.bidirectional:
          h_0 = Variable(torch.zeros(self.num_layers*2, x.size(0), self.hidden_size))
          c_0 = Variable(torch.zeros(self.num_layers*2, x.size(0), self.hidden_size))
      else:
          h_0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size))
          c_0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size))

      o, (h_out,_) = self.lstm(x, (h_0, c_0))

      if self.bidirectional:
          h_out = h_out.view(self.num_layers, 2, x.size(0), self.hidden_size)
          # taking last hidden state
          # dim(h_out) = (num_directions, batch, hidden_size)
          h_out = h_out[-1]

      # in bidectional case we sum vectors over num_directions
      # in the case of multi-layer LSTM we sum over num_layers direction
      # dim(h_out) = (batch, hidden_size)
      h_out = h_out.sum(dim=0)

      # dim(out) = (batch, num_classes)
      out = self.fc(h_out)
      return out

In [27]:python

trainX = torch.from_numpy(trainX).float()
trainY = torch.from_numpy(trainY).float()
valX = torch.from_numpy(valX).float()
valY = torch.from_numpy(valY).float()
testX = torch.from_numpy(testX).float()
testY = torch.from_numpy(testY).float()

In [28]:python

train_data = torch.utils.data.TensorDataset(trainX, trainY)
val_data = torch.utils.data.TensorDataset(valX, valY)
test_data =  torch.utils.data.TensorDataset(testX, testY)

Train and Evaluate Helping Functions

In [29]:python

def train_model(model, train_data, loss_criterion, optimizer, batch_size):

  running_loss = 0.
  model.train()

  train_loader = torch.utils.data.DataLoader(train_data,
                                            batch_size=batch_size,
                                            shuffle=True,
                                            num_workers=0)

  for batch_x, batch_y in train_loader:
      optimizer.zero_grad()

      outputs = model(batch_x)

      loss = loss_criterion(outputs, batch_y)

      loss.backward()

      optimizer.step()

      running_loss += loss.item()

  epoch_loss = running_loss / (len(train_data) // batch_size)

  print("Epoch: %d, train loss: %1.5f" % (epoch+1, epoch_loss))

  return model, epoch_loss

In [30]:python

def evaluate_model(model, val_data, loss_criterion, optimizer, batch_size):
  running_loss = 0.
  model.eval()

  val_loader = torch.utils.data.DataLoader(val_data,
                                           batch_size=batch_size,
                                           shuffle=True,
                                           num_workers=0)

  for batch_x, batch_y in val_loader:

      optimizer.zero_grad()

      with torch.set_grad_enabled(False):
          outputs = model(batch_x)
          loss = loss_criterion(outputs, batch_y)

      running_loss += loss.item()

  epoch_loss = running_loss / (len(val_data) // batch_size)

  print("Epoch: %d, val loss: %1.5f" % (epoch+1, epoch_loss))

  return model, epoch_loss, best_model_wts

1-Layer LSTM model

In [31]:python

learning_rate = 0.001
input_size = trainX.shape[2] # number of input features, multivariate case
hidden_size = 20
num_layers = 1
num_classes = trainY.shape[1] # future time window length
lstm_model = LSTMTimeSeriesPredictor(num_classes, input_size, hidden_size, num_layers, bidirectional=False)

config = {'batch_size': 128, 'num_epochs': 10, 'checkpoints_dir': '/content/drive/My Drive/time series prediction/', 'model_filename': '1-layer-lstm-best_model.pth.tar'}

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(lstm_model.parameters(), lr=learning_rate)

train_history, val_history = [], []

best_model_wts = copy.deepcopy(lstm_model.state_dict())

best_loss = 10e10  # for validation phase

if not os.path.exists(config['checkpoints_dir']):
  os.mkdir(config['checkpoints_dir'])

for epoch in range(config['num_epochs']):
  print("="*20 + str(epoch+1) + "="*20)
  _, train_loss = train_model(lstm_model, train_data, criterion, optimizer, config['batch_size'])
  train_history.append(train_loss)

  _, val_loss, best_model_wts = evaluate_model(lstm_model, val_data, criterion, optimizer, config['batch_size'])
  val_history.append(val_loss)

  if val_loss < best_loss:
      best_loss = val_loss
      print("Saving model for best loss")
      checkpoint = {
          'state_dict': best_model_wts
      }
      torch.save(checkpoint, config['checkpoints_dir'] + config['model_filename'])
      best_model_wts = copy.deepcopy(lstm_model.state_dict())

Out [31]:

====================1====================
Epoch: 1, train loss: 0.90311
Epoch: 1, val loss: 0.78379
Saving model for best loss
====================2====================
Epoch: 2, train loss: 0.77797
Epoch: 2, val loss: 0.73462
Saving model for best loss
====================3====================
Epoch: 3, train loss: 0.74900
Epoch: 3, val loss: 0.70943
Saving model for best loss
====================4====================
Epoch: 4, train loss: 0.73103
Epoch: 4, val loss: 0.69784
Saving model for best loss
====================5====================
Epoch: 5, train loss: 0.71739
Epoch: 5, val loss: 0.68612
Saving model for best loss
====================6====================
Epoch: 6, train loss: 0.70665
Epoch: 6, val loss: 0.67945
Saving model for best loss
====================7====================
Epoch: 7, train loss: 0.69868
Epoch: 7, val loss: 0.67938
Saving model for best loss
====================8====================
Epoch: 8, train loss: 0.69284
Epoch: 8, val loss: 0.67938
====================9====================
Epoch: 9, train loss: 0.68811
Epoch: 9, val loss: 0.66802
Saving model for best loss
====================10====================
Epoch: 10, train loss: 0.68397
Epoch: 10, val loss: 0.67317

In [32]:python

plt.plot(np.arange(config['num_epochs']), train_history, label='Train loss')
plt.plot(np.arange(config['num_epochs']), val_history, label='Val loss')
plt.xlabel("num of epochs")
plt.ylabel("train loss")
plt.show()

Out [32]:

Test

In [33]:python

lstm_model = LSTMTimeSeriesPredictor(num_classes, input_size, hidden_size, num_layers)
lstm_model.load_state_dict(torch.load(os.path.join(config['checkpoints_dir'], config['model_filename']))['state_dict'])
lstm_model.eval()
for param in lstm_model.parameters():
  param.requires_grad = False

In [34]:python

i = 0
while i < len(testX):
  test_x = testX[i].unsqueeze(0)
  test_y = testY[i].unsqueeze(0)
  multi_step_plot(test_x.numpy()[0, :, target_index], test_y.detach().numpy()[0, :], lstm_model(test_x).detach().numpy()[0, :], 'Test case')
  i += 100

Out [34]:

2-Layer LSTM

In [35]:python

learning_rate = 0.001
input_size = trainX.shape[2] # number of input features, multivariate case
hidden_size = 20
num_layers = 2
num_classes = trainY.shape[1] # future time window length
lstm_model = LSTMTimeSeriesPredictor(num_classes, input_size, hidden_size, num_layers, bidirectional=False)

config = {'batch_size': 128, 'num_epochs': 10, 'checkpoints_dir': '/content/drive/My Drive/time series prediction/', 'model_filename': '2-layer-lstm-best_model.pth.tar'}

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(lstm_model.parameters(), lr=learning_rate)

train_history, val_history = [], []

best_model_wts = copy.deepcopy(lstm_model.state_dict())

best_loss = 10e10

if not os.path.exists(config['checkpoints_dir']):
  os.mkdir(config['checkpoints_dir'])

for epoch in range(config['num_epochs']):
  print("="*20 + str(epoch+1) + "="*20)
  _, train_loss = train_model(lstm_model, train_data, criterion, optimizer, config['batch_size'])
  train_history.append(train_loss)

  _, val_loss, best_model_wts = evaluate_model(lstm_model, val_data, criterion, optimizer, config['batch_size'])
  val_history.append(val_loss)

  if val_loss < best_loss:
      best_loss = val_loss
      print("Saving model for best loss")
      checkpoint = {
          'state_dict': best_model_wts
      }
      torch.save(checkpoint, config['checkpoints_dir'] + config['model_filename'])
      best_model_wts = copy.deepcopy(lstm_model.state_dict())

Out [35]:

====================1====================
Epoch: 1, train loss: 0.89874
Epoch: 1, val loss: 0.77490
Saving model for best loss
====================2====================
Epoch: 2, train loss: 0.77985
Epoch: 2, val loss: 0.73741
Saving model for best loss
====================3====================
Epoch: 3, train loss: 0.75368
Epoch: 3, val loss: 0.71219
Saving model for best loss
====================4====================
Epoch: 4, train loss: 0.73283
Epoch: 4, val loss: 0.69299
Saving model for best loss
====================5====================
Epoch: 5, train loss: 0.71830
Epoch: 5, val loss: 0.68409
Saving model for best loss
====================6====================
Epoch: 6, train loss: 0.70771
Epoch: 6, val loss: 0.67732
Saving model for best loss
====================7====================
Epoch: 7, train loss: 0.69788
Epoch: 7, val loss: 0.67322
Saving model for best loss
====================8====================
Epoch: 8, train loss: 0.69050
Epoch: 8, val loss: 0.67194
Saving model for best loss
====================9====================
Epoch: 9, train loss: 0.68391
Epoch: 9, val loss: 0.66734
Saving model for best loss
====================10====================
Epoch: 10, train loss: 0.67748
Epoch: 10, val loss: 0.66578
Saving model for best loss

In [36]:python

plt.plot(np.arange(config['num_epochs']), train_history, label='Train loss')
plt.plot(np.arange(config['num_epochs']), val_history, label='Val loss')
plt.xlabel("num of epochs")
plt.ylabel("train loss")
plt.show()

Out [36]:

2-Layer LSTM training and validation loss

Test

In [37]:python

lstm_model = LSTMTimeSeriesPredictor(num_classes, input_size, hidden_size, num_layers)
lstm_model.load_state_dict(torch.load(os.path.join(config['checkpoints_dir'], config['model_filename']))['state_dict'])
lstm_model.eval()
for param in lstm_model.parameters():
  param.requires_grad = False

In [38]:python

i = 0
while i < len(testX):
  test_x = testX[i].unsqueeze(0)
  test_y = testY[i].unsqueeze(0)
  multi_step_plot(test_x.numpy()[0, :, target_index], test_y.detach().numpy()[0, :], lstm_model(test_x).detach().numpy()[0, :], 'Test case')
  i += 100

Out [38]:

Bidirectional 1-Layer LSTM

In [39]:python

learning_rate = 0.001
input_size = trainX.shape[2] # number of input features, multivariate case
hidden_size = 20
num_layers = 1
num_classes = trainY.shape[1] # future time window length
lstm_model = LSTMTimeSeriesPredictor(num_classes, input_size, hidden_size, num_layers, bidirectional=True)

config = {'batch_size': 128, 'num_epochs': 10, 'checkpoints_dir': '/content/drive/My Drive/time series prediction/', 'model_filename': 'bidir-1-layer-lstm-best_model.pth.tar'}

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(lstm_model.parameters(), lr=learning_rate)

train_history, val_history = [], []

best_model_wts = copy.deepcopy(lstm_model.state_dict())

best_loss = 10e10  # for validation phase

if not os.path.exists(config['checkpoints_dir']):
  os.mkdir(config['checkpoints_dir'])

for epoch in range(config['num_epochs']):
  print("="*20 + str(epoch+1) + "="*20)
  _, train_loss = train_model(lstm_model, train_data, criterion, optimizer, config['batch_size'])
  train_history.append(train_loss)

  _, val_loss, best_model_wts = evaluate_model(lstm_model, val_data, criterion, optimizer, config['batch_size'])
  val_history.append(val_loss)

  if val_loss < best_loss:
      best_loss = val_loss
      print("Saving model for best loss")
      checkpoint = {
          'state_dict': best_model_wts
      }
      torch.save(checkpoint, config['checkpoints_dir'] + config['model_filename'])
      best_model_wts = copy.deepcopy(lstm_model.state_dict())

Out [39]:

====================1====================
Epoch: 1, train loss: 0.92054
Epoch: 1, val loss: 0.78858
Saving model for best loss
====================2====================
Epoch: 2, train loss: 0.77843
Epoch: 2, val loss: 0.73070
Saving model for best loss
====================3====================
Epoch: 3, train loss: 0.75058
Epoch: 3, val loss: 0.71063
Saving model for best loss
====================4====================
Epoch: 4, train loss: 0.73297
Epoch: 4, val loss: 0.69804
Saving model for best loss
====================5====================
Epoch: 5, train loss: 0.71834
Epoch: 5, val loss: 0.68801
Saving model for best loss
====================6====================
Epoch: 6, train loss: 0.70628
Epoch: 6, val loss: 0.67993
Saving model for best loss
====================7====================
Epoch: 7, train loss: 0.69642
Epoch: 7, val loss: 0.67463
Saving model for best loss
====================8====================
Epoch: 8, train loss: 0.68780
Epoch: 8, val loss: 0.67683
====================9====================
Epoch: 9, train loss: 0.68003
Epoch: 9, val loss: 0.67213
Saving model for best loss
====================10====================
Epoch: 10, train loss: 0.67364
Epoch: 10, val loss: 0.66894
Saving model for best loss

In [40]:python

plt.plot(np.arange(config['num_epochs']), train_history, label='Train loss')
plt.plot(np.arange(config['num_epochs']), val_history, label='Val loss')
plt.xlabel("num of epochs")
plt.ylabel("train loss")
plt.show()

Out [40]:

Bidirectional LSTM training and validation loss

Test

In [41]:python

from collections import OrderedDict

state_dict = torch.load(os.path.join(config['checkpoints_dir'], config['model_filename']))['state_dict']

new_state_dict = OrderedDict()
for k,v in state_dict.items():
  if k not in ["lstm.weight_ih_l0_reverse", "lstm.weight_hh_l0_reverse", "lstm.bias_ih_l0_reverse", "lstm.bias_hh_l0_reverse"]:
      new_state_dict[k] = v

In [42]:python

lstm_model = LSTMTimeSeriesPredictor(num_classes, input_size, hidden_size, num_layers)
lstm_model.load_state_dict(new_state_dict)
lstm_model.eval()
for param in lstm_model.parameters():
  param.requires_grad = False

In [43]:python

i = 0
while i < len(testX):
  test_x = testX[i].unsqueeze(0)
  test_y = testY[i].unsqueeze(0)
  multi_step_plot(test_x.numpy()[0, :, target_index], test_y.detach().numpy()[0, :], lstm_model(test_x).detach().numpy()[0, :], 'Test case')
  i += 100

Out [43]:

Experimental results on val data

Model	Best MSE loss
1-layer LSTM	0.66802
2-layer LSTM	0.66578
Bidirectional 1-layer LSTM	0.66894

Summary

At the first glance, all three models based on LSTM showed approximately the same results, so you can choose a single-layer LSTM without loss of quality. In cases where computational efficiency is a crucial part you may chose GRU model, which has less parameters to optimize than LSTM model.

The ways you can improve existing deep learning model:

Work on data cleanliness, look for essential features, which are strongly related to the target variable
Manual and automatic feature engineering
Optimization of hyperparameters (length of hidden state vector, batch size, number of epochs, learning rate, number of layers)
Try different optimization algorithm
Try different loss function which will be differentiable and adequate to task you're solving
Use ensembles of prediction vectors