Deep Learning and Statistical Models for Time-Critical Pedestrian Behaviour Prediction
Abstract
The time it takes for a classifier to make an accurate prediction can be crucial in many behaviour recognition problems. For example, an autonomous vehicle should detect hazardous pedestrian behaviour early enough for it to take appropriate measures. In this context, we compare the switching linear dynamical system (SLDS) and a three-layered bi-directional long short-term memory (LSTM) neural network, which are applied to infer pedestrian behaviour from motion tracks. We show that, though the neural network model achieves an accuracy of 80%, it requires long sequences to achieve this (100 samples or more). The SLDS has a lower accuracy of 74%, but it achieves this result with short sequences (10 samples). To our knowledge, such a comparison on sequence length has not been considered in the literature before. The results provide key intuition into the suitability of the models in time-critical problems.
Keywords: Human Behaviour Prediction · Switching Linear Dynamical System · LSTM · RNN · Classification · Time.
1 Introduction
Many practical applications can be represented as a sequence of various behaviours. These include detecting bull and bear financial markets [1], gesture recognition [2], animal behaviour recognition [3], and aircraft manoeuvres in military applications [4]. Detection time in such applications is often crucial. When data arrives sequentially, detection time becomes a problem of how many sequential samples the model requires to make an accurate prediction.
In this study, we consider the problem of pedestrian behaviour prediction or intent estimation. A review on the prediction of pedestrian behaviour in urban scenarios is presented in [5]. A large portion of the literature has been devoted to tracking and path prediction. Many studies use the SLDS as a framework [6], [7], [8], and [9]. Recently, the recurrent neural network (RNN) has been shown to be a promising approach [10, 11, 12, 13]. Owing to the significant advancement of the state-of-the-art in pedestrian detection [14, 15, 16, 17], we assume that the trajectories of the pedestrians are known in this study. Given the pedestrian trajectories, we predict a particular behavioural class.
Several studies have considered the problem of pedestrian behaviour prediction. Probabilistic models such as the latent-dynamic conditional random field [18] and balanced Gaussian process dynamical models [19] have been applied. Various forms of the RNN have also been considered. Hoy et al. [20] propose a variational RNN which performs both tracking and behaviour prediction. Völz et al. [21] compare neural networks, a support vector machine, and the LSTM. Though both statistical and machine learning models have been applied to the problem and some studies consider time-to-event analyses, to our knowledge, no specific analyses between these model types in terms of time-to-detection have been considered in the literature.
Our contribution is a comparison between an SLDS and a multi-layered bi-directional LSTM neural network in the context of time-to-detection. This is performed by classifying various pedestrian behaviours from the raw motion tracks under varying sequence lengths. Through the comparison, we gain novel insight into a key difference between the models: though the neural network is more accurate than the SLDS overall, it requires 10 times as many sequential samples to achieve this accuracy. The SLDS is able to provide its most accurate classification within the first few samples of the sequence. This result is important in situations where early detection is imperative.
2 Switching Linear Dynamical System
The SLDS models a system that switches between various dynamical models. Each dynamical model is represented as a Linear Dynamic System (LDS), which is widely associated with the Kalman filter. The SLDS has been extended in various ways, such as introducing variables representing behavioural context information. Such models have been applied to maritime piracy applications [22, 23] and abalone poaching applications [24]. Linderman et al. [25] extend the SLDS by allowing the switching state to depend on the latent state and exogenous inputs through a logistic regression. The SLDS has also been extended to include multiple LDSs over several sequences under a single switching state [26, 27].
The graphical model representation of the SLDS is illustrated in Figure 1. The model comprises a switching state variable $s_t$, a hidden or latent variable $h_t$, and a visible or observable variable $v_t$ at time $t$. The latent variables $h_{1:T}$ and observable variables $v_{1:T}$ form an LDS, where the subscript $1{:}T$ denotes the joint random variable over all discrete time instances 1 to $T$. The switching state variable provides the means to switch between various dynamical models. The continuous dynamics of the system are represented by a linear-Gaussian state space model. The following equations describe the system [28, 29]:
$$h_t = A(s_t)\, h_{t-1} + \eta^h_t(s_t) \quad (1)$$
$$v_t = C(s_t)\, h_t + \eta^v_t(s_t) \quad (2)$$
Equation (1) describes the transition model and (2) describes the emission model. The matrix $A(s_t)$ is the state matrix and $C(s_t)$ is the measurement matrix. With the Gaussian assumption, the noise components are modelled as white noise such that $\eta^h_t(s_t) \sim \mathcal{N}(0, Q(s_t))$ and $\eta^v_t(s_t) \sim \mathcal{N}(0, R(s_t))$. All the LDS model parameters are conditionally dependent on the switching state $s_t$ at time $t$. This provides the means to define different dynamic models for each switching state.
The joint distribution describing the SLDS is given by:
$$p(v_{1:T}, h_{1:T}, s_{1:T}) = \prod_{t=1}^{T} p(v_t \mid h_t, s_t)\, p(h_t \mid h_{t-1}, s_t)\, p(s_t \mid s_{t-1}) \quad (3)$$
where the conventions $p(h_1 \mid h_0, s_1) = p(h_1 \mid s_1)$ and $p(s_1 \mid s_0) = p(s_1)$ are used for the initial time step.
The switching state transition probability $p(s_t \mid s_{t-1})$ is a discrete distribution. It describes how the model switches between the various states. The state transition distribution $p(h_t \mid h_{t-1}, s_t)$ and the emission distribution $p(v_t \mid h_t, s_t)$ are assumed to be Gaussian. These describe the dynamics of the system through the linear state space equations.
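To make the generative process described by (1) to (3) concrete, the following is a minimal NumPy sketch that draws a single trajectory from an SLDS. It is an illustration of the model rather than the implementation used in this study; the parameter containers (pi0, Pi, models, mu0, P0) and their layout are assumed conventions.

```python
import numpy as np

def sample_slds(T, pi0, Pi, models, mu0, P0, rng=None):
    """Draw one trajectory from an SLDS.

    pi0:    prior distribution over switching states, shape (K,)
    Pi:     switching transition matrix, Pi[i, j] = p(s_t = j | s_{t-1} = i)
    models: models[s] = (A, C, Q, R), the LDS parameters for switching state s
    mu0,P0: prior mean and covariance of the initial latent state
    """
    rng = np.random.default_rng() if rng is None else rng
    s = rng.choice(len(pi0), p=pi0)               # s_1 ~ p(s_1)
    h = rng.multivariate_normal(mu0, P0)          # h_1 ~ p(h_1 | s_1)
    states, latents, observations = [], [], []
    for t in range(T):
        A, C, Q, R = models[s]
        if t > 0:
            # Equation (1): h_t = A(s_t) h_{t-1} + noise with covariance Q(s_t)
            h = A @ h + rng.multivariate_normal(np.zeros(len(h)), Q)
        # Equation (2): v_t = C(s_t) h_t + noise with covariance R(s_t)
        v = C @ h + rng.multivariate_normal(np.zeros(C.shape[0]), R)
        states.append(s); latents.append(h); observations.append(v)
        s = rng.choice(len(pi0), p=Pi[s])         # switching transition p(s_{t+1} | s_t)
    return np.array(states), np.array(latents), np.array(observations)
```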
Inference in the SLDS involves inferring the latent variables $h_{1:T}$ and $s_{1:T}$ given the observations $v_{1:T}$. This is typically performed using filtering and smoothing methods. The filtering operation computes the filtered posterior $p(s_t, h_t \mid v_{1:t})$. The smoothing operation computes the smoothed posterior $p(s_t, h_t \mid v_{1:T})$. Exact inference in the SLDS is intractable [28, 29]. Approximate inference algorithms such as the Generalised Pseudo Bayesian (GPB) algorithm [30] and the Gaussian Sum Smoothing (GSS) algorithm [31] have been developed for the SLDS. In this study, the GPB algorithm is used.
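As a sketch of how a GPB-style filter can be realised, the following NumPy code performs one second-order GPB (GPB2) filtering step: a Kalman predict/update is run for every pair of previous and current switching states, each hypothesis is weighted by the switching transition probability and the observation likelihood, and the resulting hypotheses are collapsed back to one Gaussian per switching state by moment matching. The function and variable names are assumptions for illustration, not the implementation used in this study.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kalman_step(mu, P, v, A, C, Q, R):
    """One Kalman predict/update step for a single LDS. Returns the posterior
    mean and covariance, and the log-likelihood of the observation v."""
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    S = C @ P_pred @ C.T + R                       # innovation covariance
    G = P_pred @ C.T @ np.linalg.inv(S)            # Kalman gain
    resid = v - C @ mu_pred                        # innovation
    mu_post = mu_pred + G @ resid
    P_post = (np.eye(len(mu)) - G @ C) @ P_pred
    loglik = multivariate_normal.logpdf(resid, mean=np.zeros(len(v)), cov=S)
    return mu_post, P_post, loglik

def gpb2_step(w, mus, Ps, v, models, Pi):
    """One GPB2 filtering step. w[i], mus[i], Ps[i] approximate the filtered
    quantities conditioned on switching state i at the previous time step;
    models[j] = (A, C, Q, R); Pi[i, j] = p(s_t = j | s_{t-1} = i)."""
    n = len(w)
    log_wij = np.full((n, n), -np.inf)
    mu_ij, P_ij = {}, {}
    for i in range(n):
        for j in range(n):
            A, C, Q, R = models[j]
            mu, P, ll = kalman_step(mus[i], Ps[i], v, A, C, Q, R)
            mu_ij[i, j], P_ij[i, j] = mu, P
            log_wij[i, j] = np.log(w[i] + 1e-300) + np.log(Pi[i, j]) + ll
    wij = np.exp(log_wij - log_wij.max())
    wij /= wij.sum()                               # joint weight over (s_{t-1}, s_t)
    w_new = wij.sum(axis=0)                        # filtered switching posterior
    mus_new, Ps_new = [], []
    for j in range(n):
        cond = wij[:, j] / max(w_new[j], 1e-300)
        mu_j = sum(cond[i] * mu_ij[i, j] for i in range(n))
        P_j = sum(cond[i] * (P_ij[i, j]
                             + np.outer(mu_ij[i, j] - mu_j, mu_ij[i, j] - mu_j))
                  for i in range(n))
        mus_new.append(mu_j)
        Ps_new.append(P_j)
    return w_new, mus_new, Ps_new
```

At each time step, w_new approximates the filtered switching state posterior, which is the quantity used to classify the behaviour from the track.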
Parameter learning in the SLDS can be performed using the Expectation Maximisation (EM) algorithm [30]. In the expectation step, a smoothing algorithm such as GPB can be used. In the maximisation step, the parameters are estimated using maximum likelihood.
3 Multi-Layered Bidirectional LSTM
A three-layered bi-directional LSTM [32] RNN is constructed for comparison with the SLDS. The model is illustrated in Figure 2. Each LSTM layer comprises two sequences of LSTM cells: one propagating in the positive time direction and one in the negative time direction. Together, the forward and backward sequences form a bi-directional LSTM (BiLSTM). The bi-directional structure provides a means to make a prediction at time $t$ according to the full sequence of inputs from time 1 to $T$, where $t \le T$. This is in comparison to a uni-directional structure, which makes a prediction at time $t$ according only to the inputs up to time $t$. Three BiLSTMs are stacked to form three distinct layers. Multiple layers provide a deep structure which promotes higher-level feature extraction. Input data is provided to the first BiLSTM layer. For each sequence step, the outputs of the third BiLSTM layer are passed through a softmax layer. The softmax outputs the predicted class associated with the current input sample. For notational simplicity, this model is referred to as the RNN in the remainder of the discussion.
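A minimal Keras sketch of such a stacked BiLSTM with a per-timestep softmax is given below. The use of Keras, the input feature dimension of two (the tracked coordinates), and the default of 32 hidden units are assumptions for illustration rather than details of the original implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 2   # tracked coordinates per time step (assumed)
NUM_CLASSES = 4    # BendingIn, Crossing, Starting, Stopping

def build_bilstm(hidden_units=32):
    """Three stacked bi-directional LSTM layers followed by a per-timestep
    softmax; return_sequences=True keeps an output at every time step."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(None, NUM_FEATURES)),   # variable-length sequences
        layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True)),
        layers.Dense(NUM_CLASSES, activation="softmax"),  # class probabilities per step
    ])
```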
4 Dataset
The well-known Daimler Pedestrian Path Prediction Benchmark Dataset (GCPR’13) [6] is used in this study. The dataset comprises a collection of 68 pedestrian sequences with 4 different pedestrian behaviour types: crossing, stopping, starting to walk, and bending-in. Though the dataset seems relatively small, the LSTM has been shown to perform well in the path prediction application [11].
The dataset was acquired using stereo cameras. The stereo camera provides the means to produce three-dimensional Cartesian coordinates for tracking purposes. The ground truth of the dataset provides bounding boxes, disparity, and the lateral ($x$) and longitudinal ($z$) coordinates of each target.
The dataset was developed to test recursive Bayesian filters for pedestrian path prediction for different behaviours [6]. In this study, this problem is inverted. The behaviour of a pedestrian is inferred from the tracked path.
5 Methodology
The dataset is provided with a predefined training set and a test set. The training and test sets comprise 36 and 32 sequences respectively. The model parameters are estimated using the training dataset. The trained models are then applied to predict the behaviour class from the tracks provided in the test dataset. To measure the performance of the models, accuracy, precision and recall are used.
The models are tested on sequences of varying length. This is achieved by truncating the sequences in increments of 10 samples. That is, the models are tested on the first $N$ samples of each sequence in the test set, where $N$ is incremented in steps of 10. The classification results are stored for each sequence length. Limiting the number of time steps provides an indication of how well the method is able to predict a behaviour class in a short period of time. Furthermore, it provides some form of consistency over the varying sequence lengths in the dataset.
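The truncated-sequence evaluation can be sketched as follows. The helper classify_prefix is hypothetical and stands in for either model's per-sequence classifier; macro-averaged precision and recall are used here as a simplification of the per-class values reported in the figures.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate_truncated(test_tracks, test_labels, classify_prefix, step=10):
    """Evaluate a sequence classifier on the first N samples of every test
    track, for N = step, 2*step, ... (classify_prefix is a hypothetical helper
    that maps an (N, num_features) track prefix to a predicted class index)."""
    max_len = max(len(track) for track in test_tracks)
    results = {}
    for n in range(step, max_len + 1, step):
        preds = [classify_prefix(track[:n]) for track in test_tracks]
        results[n] = {
            "accuracy": accuracy_score(test_labels, preds),
            "precision": precision_score(test_labels, preds,
                                         average="macro", zero_division=0),
            "recall": recall_score(test_labels, preds,
                                   average="macro", zero_division=0),
        }
    return results
```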
The SLDS motion model is configured as a constant acceleration model. The tracked coordinates are provided as observations to the SLDS. The model parameters are learned using the EM algorithm. The switching state is defined to comprise the states BendingIn, Crossing, Starting, and Stopping. In the dataset, the pedestrians do not switch between behaviour classes over the sequences. The switching state transition distribution is therefore set with a high probability of remaining in the current switching state and a small probability of transitioning to one of the other three switching states. The prior switching state probability distribution is set to the uniform distribution.
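A possible parameterisation of the constant acceleration motion model and the switching state distributions is sketched below. The sampling period, the noise variances, and the self-transition probability of 0.99 are assumed illustrative values rather than the values used in this study; in practice the per-class LDS parameters are learned with the EM algorithm.

```python
import numpy as np

def constant_acceleration_lds(dt=1.0, q_var=1e-2, r_var=1e-2):
    """Constant acceleration LDS for one spatial axis: latent state
    [position, velocity, acceleration], observed position only.
    dt and the noise variances are illustrative values."""
    A1 = np.array([[1.0, dt, 0.5 * dt ** 2],
                   [0.0, 1.0, dt],
                   [0.0, 0.0, 1.0]])
    C1 = np.array([[1.0, 0.0, 0.0]])
    # Stack two independent axes (e.g. lateral x and longitudinal z) block-diagonally
    A = np.kron(np.eye(2), A1)          # 6 x 6 state matrix
    C = np.kron(np.eye(2), C1)          # 2 x 6 measurement matrix
    Q = q_var * np.eye(6)               # simple diagonal process noise (assumed)
    R = r_var * np.eye(2)               # measurement noise covariance (assumed)
    return A, C, Q, R

CLASSES = ["BendingIn", "Crossing", "Starting", "Stopping"]
K = len(CLASSES)
p_stay = 0.99                           # assumed self-transition probability
Pi = np.full((K, K), (1.0 - p_stay) / (K - 1))
np.fill_diagonal(Pi, p_stay)            # strong preference for remaining in a class
pi0 = np.full(K, 1.0 / K)               # uniform prior over switching states
```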
The RNN is configured with 32 hidden units in each LSTM cell. The ADAM algorithm [33] is used to minimise the cross entropy of the softmax outputs. The model is trained over 110 epochs with a learning rate of 0.0001 and a batch size of 1. The remaining ADAM parameters are set as recommended in [33]. The RNN is trained over the complete length of each sequence in the training set.
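A Keras sketch of this training configuration is given below, reusing the build_bilstm sketch from Section 3. Feeding one variable-length sequence at a time through a generator and repeating the class label at every time step are assumptions about the data handling, not details taken from the original implementation; train_tracks and train_class_ids are hypothetical containers of the training sequences and their class indices.

```python
import numpy as np
import tensorflow as tf

model = build_bilstm(hidden_units=32)
# ADAM with a learning rate of 1e-4; the remaining ADAM parameters are left
# at their defaults, as recommended in [33].
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy")

def one_sequence_at_a_time(tracks, class_ids):
    """Yield (1, T, num_features) inputs and (1, T) per-timestep labels,
    i.e. a batch size of 1 with no padding of variable-length sequences."""
    while True:
        for track, class_id in zip(tracks, class_ids):
            x = np.asarray(track, dtype=np.float32)[np.newaxis]
            y = np.full((1, len(track)), class_id, dtype=np.int32)
            yield x, y

model.fit(one_sequence_at_a_time(train_tracks, train_class_ids),
          steps_per_epoch=len(train_tracks), epochs=110)
```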
6 Results
[Figure 3: Classification accuracy of the SLDS and the RNN over increasing truncated sequence lengths.]
The accuracy over the set of truncated sequences is presented in Figure 3. The striking feature in the plot is that the RNN increases in accuracy with increasing sequence length, whereas the SLDS decreases in accuracy with increasing sequence length. The SLDS has its highest accuracy with a sequence of 10 samples. This implies that the SLDS is able to classify the sequence within the first 10 samples. The RNN's accuracy curve saturates around the 100-sample mark. This indicates that the RNN requires a sequence of at least 100 samples to achieve its high accuracy.
These results are consistent with the theoretical design of the models. The SLDS assumes a first-order Markov model in both the dynamics and the switching state. A first-order Markov model assumes that the current state is conditionally dependent only on the previous state. The result is that the SLDS is not designed to model long-term dependencies in the data. The SLDS thus performs better when provided with the shorter sequences. Furthermore, the SLDS performance decreases with sequence length as it is designed to switch between dynamics; it is more likely to switch behaviour class in a longer sequence. The LSTM cell in the RNN has been specifically designed to model both long and short-term dependencies in the data [32]. The result is that the RNN requires a longer sequence to achieve a higher accuracy. Another relevant difference between the models is that the SLDS is a structured model where the dynamics have been predefined. In the RNN, the dynamics are learned in a black-box approach, which often requires more data.
The precision and recall over the set of truncated sequences are presented in Figure 4 and Figure 5 respectively. Confusion matrices for the 10-sample-length and complete sequences are presented in Table 1. Precision is often viewed as a measure of the quality of the model. Recall describes the probability of correctly classifying the pedestrian behaviour.
As with the RNN accuracy, the RNN precision and recall values are only high for sequences with 100 samples or more. The precision and recall for the SLDS are highest for sequences of 10 samples.
The RNN generally has a higher precision and recall than the SLDS. The RNN however struggles to correctly predict the Starting behaviour class. Consider the confusion matrix for the complete sequence classification presented in Table 1. The majority of Starting samples are incorrectly associated with the BendingIn class. A possible reason for this is that the sequences for the Starting class are generally short in length. The poor results for the Starting class lower the overall accuracy of the RNN.
[Figure 4: Precision of the SLDS and the RNN over increasing truncated sequence lengths.]
[Figure 5: Recall of the SLDS and the RNN over increasing truncated sequence lengths.]
[Table 1: Confusion matrices of the SLDS and the RNN for 10-sample and complete-sequence classification.]
The lowest recall for the SLDS model is obtained for the BendingIn class. Considering the confusion matrix, a portion of these samples were misclassified as Starting behaviour. The model performs well on the Crossing and Starting classes. For longer sequences, the precision and recall for the Stopping class decrease significantly. As also indicated in Figure 5, the recall for the Crossing and Starting classes remains fairly constant.
For the RNN with 10-sample sequences, a portion of the BendingIn samples were incorrectly associated with the Crossing class, as indicated in Table 1. When provided with the complete sequence, this portion is reduced. Similarly, most of the Starting samples are incorrectly associated with the Crossing class for short sequences. When provided with the complete sequence, the incorrect classifications shift to the BendingIn class.
A plot of the track and the class predictions for test sequence 0 is presented in Figure 6. The track of the pedestrian performing the BendingIn activity is presented in Figure 6a. The horizontal axis represents the depth dimension $z$ with respect to the camera. The vertical axis represents the camera's horizontal axis $x$. Note that the time aspect of the track is not represented in this plot. The plot of the predicted switching state over time is presented in Figure 6b. Dark grey indicates a high probability that the pedestrian behaviour belongs to a particular class. Light grey indicates a low probability that the pedestrian behaviour belongs to the particular class. Both the SLDS and the RNN associate the behaviour with the BendingIn class for the first 160 time steps. The predictions subsequently transition to the Starting class. This may be explained by the fact that the pedestrian appears to back-track, as illustrated in Figure 6a.
[Figure 6: Pedestrian track (a) and predicted class probabilities over time (b) for test sequence 0 (BendingIn).]
Figure 7 illustrates an example of the Starting behaviour class. The SLDS successfully predicts the correct class for the entire sequence. The RNN incorrectly predicts the BendingIn class, but does associate some probability with the Starting class. This result corresponds to the complete-sequence confusion matrix presented in Table 1.
[Figure 7: Pedestrian track and predicted class probabilities for a Starting sequence.]
Figure 8 illustrates results for the Crossing behaviour class. The track in Figure 8b is approximately linear over the space. With such behaviour, both models generally perform well in this class.
[Figure 8: Pedestrian track and predicted class probabilities for a Crossing sequence.]
An example of the Stopping behaviour class is presented in Figure 9. The SLDS correctly classifies the Stopping behaviour at the start of the sequence and then transitions to the Crossing class. This result corresponds to the complete-sequence confusion matrix presented in Table 1. The RNN correctly classifies the Stopping class for the entire sequence. This corresponds to the high recall for this class as illustrated in Figure 5.
[Figure 9: Pedestrian track and predicted class probabilities for a Stopping sequence.]
7 Summary and Conclusion
In this study, an SLDS and a three-layered bi-directional LSTM RNN are applied to predict pedestrian behaviour from motion tracks in the Daimler Pedestrian Path Prediction Benchmark Dataset (GCPR'13). The key result is that the RNN model's accuracy increases with increasing sequence length, whereas the SLDS's accuracy decreases with increasing sequence length. The best results for the SLDS are obtained when the first 10 samples of the sequence are provided to the model. This is possibly due to the SLDS being designed to model short-term behaviour as well as having a predefined model of the dynamics. The RNN is designed to model both short and long-term dynamics with a black-box approach. The result is that the RNN is more accurate, but over longer sequences (100 samples or more). This suggests that in situations where a decision must be made quickly, the SLDS may be the preferred model.
There is potential for improvement of the results for both models. One approach would be to include contextual information. This can be achieved in the SLDS using methods such as those described in [22, 23, 24]. Contextual information could include road signs, proximity to crossing areas, and traffic congestion levels. Additional information relating to the urban environment could also be influential. For example, a street may be residential or commercial.
References
- [1] John M Maheu, Thomas H McCurdy, Yong Song, et al. Extracting bull and bear markets from stock returns. University of Toronto and CIRANO working paper, 2009. [Online]. Available: https://www.economics.utoronto.ca/public/workingPapers/tecipa-369.pdf
- [2] Kui Liu, Chen Chen, Roozbeh Jafari, and Nasser Kehtarnavaz. Fusion of inertial and depth sensor data for robust hand gesture recognition. IEEE Sensors Journal, 14(6):1898–1903, 2014.
- [3] Vianey Leos-Barajas, Theoni Photopoulou, Roland Langrock, Toby A Patterson, Yuuki Y Watanabe, Megan Murgatroyd, and Yannis P Papastamatiou. Analysis of animal accelerometer data using hidden markov models. Methods in Ecology and Evolution, 8(2):161–173, 2017.
- [4] Hoyeop Lee, Byeong Ju Choi, Chang Ouk Kim, Jin Soo Kim, and Ji Eun Kim. Threat evaluation of enemy air fighters via neural network-based markov chain modeling. Knowledge-Based Systems, 116:49–57, 2017.
- [5] D. Ridel, E. Rehder, M. Lauer, C. Stiller, and D. Wolf. A literature review on the prediction of pedestrian behavior in urban scenarios. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3105–3112, Nov 2018.
- [6] Nicolas Schneider and Dariu M. Gavrila. Pedestrian path prediction with recursive bayesian filters: A comparative study. In Joachim Weickert, Matthias Hein, and Bernt Schiele, editors, Pattern Recognition, pages 174–183, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
- [7] J. F. P. Kooij, N. Schneider, and D. M. Gavrila. Analysis of pedestrian dynamics from a vehicle perspective. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pages 1445–1450, June 2014.
- [8] Julian Francisco Pieter Kooij, Nicolas Schneider, Fabian Flohr, and Dariu M Gavrila. Context-based pedestrian path prediction. In European Conference on Computer Vision, pages 618–633. Springer, 2014.
- [9] J. F. P. Kooij, G. Englebienne, and D. M. Gavrila. Mixture of switching linear dynamics to discover behavior patterns in object tracks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):322–334, Feb 2016.
- [10] R. Hug, S. Becker, W. Hübner, and M. Arens. Particle-based pedestrian path prediction using lstm-mdl models. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2684–2691, Nov 2018.
- [11] K. Saleh, M. Hossny, and S. Nahavandi. Intent prediction of pedestrians via motion trajectories using stacked recurrent neural networks. IEEE Transactions on Intelligent Vehicles, 3(4):414–424, Dec 2018.
- [12] K. Saleh, M. Hossny, and S. Nahavandi. Long-term recurrent predictive model for intent prediction of pedestrians via inverse reinforcement learning. In 2018 Digital Image Computing: Techniques and Applications (DICTA), pages 1–8, Dec 2018.
- [13] B. Cheng, X. Xu, Y. Zeng, J. Ren, and S. Jung. Pedestrian trajectory prediction via the social-grid lstm model. The Journal of Engineering, 2018(16):1468–1474, 2018.
- [14] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan. Scale-aware fast r-cnn for pedestrian detection. IEEE Transactions on Multimedia, 20(4):985–996, April 2018.
- [15] X. Du, M. El-Khamy, J. Lee, and L. Davis. Fused dnn: A deep neural network fusion approach to fast and robust pedestrian detection. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 953–961, March 2017.
- [16] Jan Hosang, Mohamed Omran, Rodrigo Benenson, and Bernt Schiele. Taking a deeper look at pedestrians. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- [17] Zhaowei Cai, Mohammad Saberian, and Nuno Vasconcelos. Learning complexity-aware cascades for deep pedestrian detection. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
- [18] A. T. Schulz and R. Stiefelhagen. Pedestrian intention recognition using latent-dynamic conditional random fields. In 2015 IEEE Intelligent Vehicles Symposium (IV), pages 622–627, June 2015.
- [19] R. Quintero Minguez, I. Parra Alonso, D. Fernandez-Llorca, and M. A. Sotelo. Pedestrian path, pose, and intention prediction through gaussian process dynamical models and pedestrian activity recognition. IEEE Transactions on Intelligent Transportation Systems, pages 1–12, 2018.
- [20] M. Hoy, Z. Tu, K. Dang, and J. Dauwels. Learning to predict pedestrian intention via variational tracking networks. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3132–3137, Nov 2018.
- [21] B. Volz, K. Behrendt, H. Mielenz, I. Gilitschenski, R. Siegwart, and J. Nieto. A data-driven approach for pedestrian intention estimation. In 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pages 2607–2612, Nov 2016.
- [22] Joel Janek Dabrowski and Johan Pieter de Villiers. Maritime piracy situation modelling with dynamic bayesian networks. Information Fusion, 23:116 – 130, 2015.
- [23] Joel Janek Dabrowski and Johan Pieter de Villiers. A unified model for context-based behavioural modelling and classification. Expert Systems with Applications, 42(19):6738 – 6757, 2015.
- [24] Joel Janek Dabrowski, Johan Pieter de Villiers, and Conrad Beyers. Context-based behaviour modelling and classification of marine vessels in an abalone poaching situation. Engineering Applications of Artificial Intelligence, 64:95 – 111, 2017.
- [25] Scott Linderman, Matthew Johnson, Andrew Miller, Ryan Adams, David Blei, and Liam Paninski. Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 914–922, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
- [26] Joel Janek Dabrowski, Johan Pieter de Villiers, and Conrad Beyers. Naive bayes switching linear dynamical system: A model for dynamic system modelling, classification, and information fusion. Information Fusion, 42:75 – 101, 2018.
- [27] Joel Janek Dabrowski, Conrad Beyers, and Johan Pieter de Villiers. Systemic banking crisis early warning systems using dynamic bayesian networks. Expert Systems with Applications, 62:225 – 242, 2016.
- [28] David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
- [29] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
- [30] Kevin P Murphy. Switching kalman filters. Technical report, Department of Computer Science, UC Berkeley, 1998.
- [31] David Barber. Expectation correction for smoothed inference in switching linear dynamical systems. The Journal of Machine Learning Research, 7:2515–2540, 2006.
- [32] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [33] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.