Email: {mit2018075,pse2017002,rsi2017502,sonali}@iiita.ac.in
Enhanced Behavioral Cloning Based Self-Driving Car Using Transfer Learning
Abstract
With the rapid growth of artificial intelligence and autonomous learning, the self-driving car has become one of the most promising areas of research and a center of focus for the automobile industry. Behavioral cloning is the process of replicating human behavior via visuomotor policies by means of machine learning algorithms. In recent years, several deep learning-based behavioral cloning approaches have been developed in the context of self-driving cars, specifically based on the concept of transfer learning. In this direction, the present paper proposes a transfer learning approach using the VGG16 architecture, which is fine-tuned by retraining the last block while keeping the other blocks non-trainable. The performance of the proposed architecture is further compared with the existing NVIDIA architecture and its pruned variants (pruned by 22.2% and 33.85% using a 1×1 filter to decrease the total number of parameters). Experimental results show that VGG16 with transfer learning outperforms the other discussed approaches, with faster convergence.
Keywords:
Convolutional neural networks · Transfer learning · End-to-end learning · Self-driving cars · Behavioral cloning

1 Introduction
The end-to-end deep learning model is the most popular choice among researchers for dealing with large-volume data [1, 2, 3, 4]. Conventional approaches decompose the problem into several subproblems that are solved independently, with the outputs combined to reach a final decision; end-to-end models instead learn a direct mapping from input to output. Many automobile companies, such as Hyundai and Tesla, are trying to bring millions of self-driving or autonomous cars onto the road by utilizing deep learning approaches. In this frantic race toward fully safe self-driving cars, organizations such as NVIDIA follow the end-to-end approach [5], as shown in Fig. 1, whereas Google follows a mid-to-mid approach [6]. Following these notions, the main objective of the present research is to predict the steering angle of the car from front-facing camera images.

Behavioral cloning [7] is the process of reproducing human-performed tasks with a deep neural network; it is achieved by training the network on data of a human subject performing the task. In 1989, a self-driving car based on neural networks was developed by Pomerleau [8]. Yet for most of the automobile's 130-year history, manufacturers paid little attention to replacing the driver, who remains the most vulnerable part of the car. Automotive companies instead tried to make cars safer by adding safety features such as anti-lock braking systems, shatter-resistant glass, airbags, etc. [9]. However, they did not succeed in developing driver-less intelligence. Self-driving cars are among the most desirable revolutionary changes of the 21st century, promising a fully safe driving experience that changes the way we commute. According to the World Health Organization's "Global status report on road safety 2018", around 1.35 million people lose their lives every year due to road accidents [10]. Self-driving cars would bring this number down and also enable people with disabilities to commute easily. Convolutional neural networks (CNNs) have revolutionized pattern recognition [11] and, with their ability to process 2D images, are well suited to self-driving cars. The greatest advantage of a CNN is that it automatically extracts from images the important features needed to interpret the surrounding environment, which can be utilized to develop an intelligent driving system.
In the present research, to establish the importance of the transfer learning approach for self-driving cars, a novel end-to-end VGG16-based approach is proposed, fine-tuned to predict the steering angle based on the environmental constraints. The proposed approach is then compared with NVIDIA's architecture and its pruned variants. Due to the smaller number of parameters in the pruned architectures, their training time reduces significantly compared to the baseline architecture. Since transfer learning starts from a pre-trained model in which only part of the network is trained, significant computational time is saved without compromising performance. It has been observed that if tasks are similar, the weights of the initial few layers are similar and the last layers carry the task-relevant information [12], making transfer learning an effective way of saving training time.
The paper is organized in several sections. The related work section briefly discusses the available approaches applied to self-driving cars and highlights the drawbacks and advantages of the research carried out so far. The proposed approach section presents the techniques utilized in building a model that accurately drives the car. The dataset and pre-processing techniques are discussed in the subsequent sections. Finally, the experimental results are elaborated, followed by concluding remarks.
2 Related Work
The process of reconstructing a human subcognitive skill through a computer program is referred to as behavioral cloning. Actions of the human performing the skill are recorded along with the situation that gave rise to each action [13]. There are two popular methods for behavioral cloning. In the first, the skill is learned through a series of dialogues with an operator; in the case of an autonomous vehicle, the operator is expected to explain the full set of skills needed to control the vehicle. This method is challenging because a complete manual description of skills is not possible due to human limitations. Alternatively, the skill can be reconstructed from recorded actions maintained in a structured way, using learning algorithms over various manifestation traces [14, 15, 16, 17] to reproduce the skilled behavior.
The Defense Advanced Research Projects Agency (DARPA) initiated the DARPA autonomous vehicle (DAVE) project [18], in which a radio-controlled model truck fitted with sensors and lightweight video cameras was tested in a natural environment containing trees, heavy stones, lakes, etc. The test vehicle was trained with 225,000 frames of driving data; however, in test runs, DAVE crashed every 20 metres on average. In 1989, Pomerleau built the autonomous land vehicle in a neural network (ALVINN) model using the end-to-end learning methodology, observing that a vehicle can be steered by a deep neural network [8]. Inspired by the ALVINN and DARPA projects, NVIDIA started its research on self-driving with the motivation of creating an end-to-end model that steers the car without manual intervention [19, 5], recording human driving behavior along with the steering angle every second using the controller area network (CAN) bus. Based on NVIDIA's proposed PilotNet architecture (shown in Fig. 2), Texas Instruments released JacintoNet, an end-to-end neural network for embedded vehicles such as tiny humanoid robots [20]. In 2020, a group of researchers from Rutgers University proposed a feudal network based on hierarchical reinforcement learning that performs similarly to state-of-the-art models with a simpler and smaller architecture, reducing training time [21]. Jelena et al. [22] proposed a network that is 4 times smaller than the NVIDIA model and about 250 times smaller than AlexNet [23], developed specifically for embedded automotive platforms. To study the working of these end-to-end models, Kim et al. [24] investigated which regions of the image contribute to predicting the steering angle.
Learning to drive only from such recordings would not suffice for a self-driving car: the driving system must also learn how to return to the road if it drifts off by mistake, otherwise the vehicle will eventually leave the road altogether. Therefore, the images provided by the dataset are combined with additional images that visualize the vehicle in different fields of view, on and off the road. Datasets are usually augmented with new images generated by view transformations, such as flipping the images, to cover the maximum possible scenarios [5]. For the transformed images, the steering angle is adjusted so that the vehicle would return to the correct position and direction within 2 seconds. The NVIDIA model proved quite powerful, achieving 98% autonomy time on the road; still, the performance of a model consisting of only 5 convolution layers followed by 3 dense layers remains limited, which suggests that complex tasks require deeper neural network structures with more layers to achieve better performance.
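As a concrete illustration of the view-transformation augmentation described above, the following minimal sketch flips a frame horizontally and negates its steering angle, the usual adjustment for a mirror image; the function name and the use of OpenCV are illustrative assumptions, not the authors' published code.

```python
import cv2

def flip_augment(image, steering_angle):
    """Mirror a dashboard frame and negate its steering angle, so a
    left curve becomes an equivalent right curve."""
    flipped = cv2.flip(image, 1)  # 1 = flip around the vertical axis
    return flipped, -steering_angle
```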

3 Proposed approach
A competent human is required to control any intricate system such as a helicopter or a bike. This competency is learned through experience and develops within the subconscious capability of the brain. Such subcognitive skills are challenging to articulate and can only be described roughly and inconsistently. For frequently occurring actions, a system can acquire the competency by learning the recorded common patterns using deep learning techniques. Extracting and replicating such patterns from a human subject performing the task is called behavioral cloning [25].

Following the idea of behavioral cloning, a novel end-to-end transfer learning based VGG16 approach (shown in Fig. 3) is proposed to predict the appropriate steering angle. The proposed model is compared with the NVIDIA baseline model and its pruned variants, built by chopping the baseline NVIDIA model down by 22.2% and 33.85% using a 1×1 convolution filter. Fig. 4 presents the overall schematic representation of the proposed approach.

3.1 Network pruning using the 1×1 filter
3.1.1 The use of 1×1 convolution for network pruning
Pooling is used to downsample the contents of feature maps, reducing their height and width while retaining the salient features. The number of feature maps in a convolutional neural network increases with its depth [26], which leads to an increased number of parameters and hence longer training time. This problem can be solved using a 1×1 convolution layer that performs channel-wise pooling, also called a projection layer. This technique can be used for network pruning in CNNs [26, 27] and to increase the number of feature maps after classical pooling layers. The 1×1 convolution layer can be used in the following three ways:
• A linear projection of the feature maps can be created.
• Since the layer also works as channel-wise pooling, it can be used for network pruning.
• The projection created by the layer can also be used to increase the number of feature maps.
3.1.2 Downsampling with the 1×1 filter
A 1×1 filter has a single parameter (weight) for each channel of the input and produces a single output value per channel position. The filter acts as a single neuron taking input from the same position across all the feature maps. Applied from left to right and top to bottom, it produces a feature map of the same height and width as the input [28].

The idea of using the 1×1 filter to summarize the input feature maps is inspired by the Inception network proposed by Google [29]. The 1×1 filter allows effective control of the number of feature maps in the resulting output; it can therefore be used anywhere in the network to control the number of feature maps, which is why it is also called a channel pooling layer. In the two models shown in Fig. 5, the network size is pruned by 22.2% and 33.85% with the help of this downsampling.
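The following Keras sketch illustrates this channel pooling: a 1×1 convolution projects 64 feature maps down to 32 or up to 128 while the spatial dimensions stay unchanged. The input shape is an arbitrary example.

```python
from tensorflow.keras import layers

# A 1x1 convolution changes only the channel dimension: height and
# width of the feature maps are preserved.
inputs = layers.Input(shape=(66, 200, 64))
pruned = layers.Conv2D(32, (1, 1), activation='relu')(inputs)    # 64 -> 32
expanded = layers.Conv2D(128, (1, 1), activation='relu')(inputs) # 64 -> 128

print(pruned.shape)    # (None, 66, 200, 32)
print(expanded.shape)  # (None, 66, 200, 128)
```

Note that the 64-to-32 projection costs only 64 × 32 + 32 = 2,080 parameters, which is what makes the 1×1 layer attractive for pruning.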
3.2 Transfer learning
Training a deep neural network needs massive computational resources. To minimize this effort, transfer learning has been explored; it allows reuse of neural networks implemented by large companies with abundant resources, whose trained models can be used for academic research projects and startups [30].
Recent publications have highlighted the significance of transfer learning for image recognition, object detection, classification, etc. [31, 32, 33]. In the transfer learning approach, a pre-trained model is adopted and fine-tuned to solve the desired problem, i.e., by freezing some layers and training only a few. Studies show that models trained on huge datasets like ImageNet generally work well for other image recognition problems [34]. It has also been shown that initializing a model with pre-trained weights leads to faster convergence than initializing with random weights [12]. To implement the transfer learning mechanism, VGG16 is used and all blocks are frozen except the last block, which contains a max-pooling layer and 3 convolution layers, as highlighted in Fig. 6.
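A minimal Keras sketch of this freezing scheme is given below; only the layers of block 5 remain trainable. The ImageNet weights and layer-name prefixes are those shipped with Keras' VGG16, while the input shape is an assumption.

```python
from tensorflow.keras.applications import VGG16

# Load the convolutional base with ImageNet weights and freeze
# everything except block 5 (block5_conv1..3 and block5_pool).
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = layer.name.startswith('block5')
```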

When deep neural networks are trained on a huge set of images, the initial layer weights are similar regardless of the objective, whereas the final layers learn more problem-specific features. The initial layers of a CNN learn edges, patterns and textures [27]; they can therefore serve as generic feature extractors for identifying the desired patterns, helping to analyse the complex environment for an intelligent driver-less system.
3.2.1 VGG16 with transfer learning
VGG16 [35] is a state-of-the-art deep CNN model and the runner-up of the ILSVRC (ImageNet) competition in 2014 [36]. Compared to other models proposed in ILSVRC, such as ResNet50 [37] and Inception [29], VGG16 has a simple and consistent arrangement of convolution filters: 3×3 convolutions with stride 1, each group followed by 2×2 max pooling with stride 2, repeated throughout the network, with fully connected layers forming the decision stage; the full network aggregates to 138 million parameters. In the proposed approach, the initial 4 convolution blocks of VGG16 are frozen and the last convolution block, i.e., block 5, is fine-tuned to predict the appropriate steering angle based on the surrounding conditions acquired from the captured frames.
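The complete model could then be assembled as sketched below: the partially frozen VGG16 base followed by a small regression head ending in a single linear unit for the steering angle. The dense-layer sizes are assumptions, since the paper does not list them.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))
for layer in base.layers:                    # freeze blocks 1-4
    layer.trainable = layer.name.startswith('block5')

x = layers.Flatten()(base.output)
x = layers.Dense(100, activation='relu')(x)  # assumed head sizes
x = layers.Dense(50, activation='relu')(x)
steering = layers.Dense(1)(x)                # continuous steering angle

model = models.Model(base.input, steering)
model.summary()
```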
4 Dataset description and preprocessing
The dataset is a sequence of front dashboard camera images captured while driving in traffic around Rancho Palos Verdes and San Pedro, California [38]. It contains 45,400 images with the associated steering angles, as described in Table 1. In this research, 80% of the images are used for training and the remaining 20% for validation.
| Feature | Information |
|---|---|
| Image | The path of the image on the disk. |
| Steering Angle | A value in the range −90 to +90 indicating the steering angle. |
The steering angle ranges from −90 to +90, where +90 indicates that the steering is turned toward the right and −90 that it is turned toward the left. The data is preprocessed to get the images into a format suitable for the network to learn from and to help predict the appropriate steering angle. The original and preprocessed images are shown in Fig. 7. The images are preprocessed by performing the following steps (a minimal sketch follows the list):
• Remove unnecessary features by cropping the image.
• Convert the image to YUV format.
• Reduce the dimensions of the image to the network input size.
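A minimal sketch of these three steps is given below; the crop boundaries and the 200×66 target size are illustrative assumptions (the latter borrowed from the NVIDIA pipeline), as the paper does not state them.

```python
import cv2

def preprocess(image_bgr):
    """Crop, convert to YUV, and resize a dashboard frame."""
    cropped = image_bgr[60:135, :, :]               # drop sky and car hood
    yuv = cv2.cvtColor(cropped, cv2.COLOR_BGR2YUV)  # YUV colour space
    return cv2.resize(yuv, (200, 66))               # (width, height)
```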

5 Experimental results
A series of experiments has been carried out with the baseline NVIDIA model, its pruned variants and the proposed approach, as described below:
1. By decreasing the number of feature maps in the final convolution stage from 64 to 32 and from 64 to 16, keeping the height and width constant, the network is pruned by 22.2% and 33.85%, respectively (a sketch follows the list).
2. The transfer learning approach adopts the convolution blocks of VGG16 and trains only the last block (3 convolution layers and 1 MaxPooling2D layer).
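The pruned variant in point 1 could be reconstructed as below: the standard PilotNet convolution stack followed by an extra 1×1 convolution that projects the final 64 feature maps down to 32 or 16. With this layout the trainable-parameter counts come to 196,699 and 166,859, matching Table 2, which supports the reconstruction even though the exact implementation is not published.

```python
from tensorflow.keras import layers, models

def pruned_pilotnet(last_maps=32):
    """PilotNet-style network with a trailing 1x1 projection that
    reduces the final 64 feature maps to `last_maps` (32 or 16)."""
    return models.Sequential([
        layers.Input(shape=(66, 200, 3)),
        layers.Conv2D(24, (5, 5), strides=(2, 2), activation='relu'),
        layers.Conv2D(36, (5, 5), strides=(2, 2), activation='relu'),
        layers.Conv2D(48, (5, 5), strides=(2, 2), activation='relu'),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Conv2D(last_maps, (1, 1), activation='relu'),  # channel pooling
        layers.Flatten(),
        layers.Dense(100, activation='relu'),
        layers.Dense(50, activation='relu'),
        layers.Dense(10, activation='relu'),
        layers.Dense(1),                                      # steering angle
    ])

# pruned_pilotnet(32).count_params() -> 196,699
# pruned_pilotnet(16).count_params() -> 166,859
```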
The models are trained with the stochastic gradient descent (SGD) [39] approach, using Adam as the learning-rate optimizer [40]. For robust training, 4-fold cross-validation is applied along with early stopping to avoid overfitting [41]. The performance of the models is evaluated using the mean squared error (MSE) given in Eq. 1.
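The training protocol could be sketched as follows; the patience, batch size, and the random stand-in data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import EarlyStopping

X = np.random.rand(100, 66, 200, 3).astype('float32')  # stand-in frames
y = np.random.uniform(-90, 90, size=100)                # stand-in angles

early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)

for train_idx, val_idx in KFold(n_splits=4, shuffle=True).split(X):
    model = pruned_pilotnet(32)                  # any of the models above
    model.compile(optimizer='adam', loss='mse')  # Adam with MSE loss (Eq. 1)
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),
              epochs=100, batch_size=64, callbacks=[early_stop])
```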
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{1}$$

where $y_i$ stands for the actual steering angle and $\hat{y}_i$ stands for the predicted steering angle. A lower MSE indicates higher learning ability, whereas a higher MSE means the model is not learning from complex environments.
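As a worked example of Eq. 1 on three illustrative angles:

```python
import numpy as np

y_true = np.array([10.0, -5.0, 0.0])   # actual steering angles
y_pred = np.array([12.0, -4.0, 1.0])   # predicted steering angles

mse = np.mean((y_true - y_pred) ** 2)  # (2^2 + 1^2 + 1^2) / 3
print(mse)                             # 2.0
```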
The experimental results show that the pruned networks do not perform better than the baseline model. It is also observed that the proposed VGG16 model with transfer learning (training only the last convolution block) converges within 40 epochs, compared to the other models which are trained for 100 epochs. The experiments show that the VGG16 model with transfer learning works better than the NVIDIA models. As observed from Table 2, the transfer learning based approach achieved a better MSE score than NVIDIA's model and its pruned variants. Due to the deep nature of VGG16, the architecture is able to learn complex patterns, whereas the shallowness of the NVIDIA models restricts their ability to adapt to such complex environmental conditions.
| S.No. | Model | MSE | Trainable Parameters |
|---|---|---|---|
| 1 | NVIDIA model | 29.24848 | 252,219 |
| 2 | NVIDIA model pruned by 22.2% with 1×1 filter | 41.61325 | 196,699 |
| 3 | NVIDIA model pruned by 33.85% with 1×1 filter | 38.67840 | 166,859 |
| 4 | VGG16 with transfer learning | 23.97599 | 10,373,505 |
Fig. 8 highlights the training behavior of the models at each iteration; the proposed approach achieves comparatively minimal loss with the fewest training epochs. It is also observed that, owing to the adoption of pre-trained weights, the model starts with a lower initial loss and converges faster.

6 Conclusion
A novel transfer learning approach based on VGG16 is proposed, fine-tuned by retraining the last block while keeping all other layers non-trainable. The proposed model is compared with NVIDIA's architecture and its pruned variants developed by applying a 1×1 filter. Since the proposed transfer learning architecture starts with a lower initial loss and converges in just 40 epochs, compared to 100 epochs for NVIDIA's architecture, the experimental results show that the transfer learning based approach works better than NVIDIA's model and its pruned variants. Naturally, driving patterns also depend on several other environmental conditions such as weather and visibility. To adapt to these challenging conditions in the presence of a limited number of samples, generative adversarial networks (GANs) could be explored in future work to generate vivid weather conditions for more robust driver-less solutions.
References
- [1] R. Chopra, S.S. Roy, in Advanced Computing and Intelligent Engineering (Springer, 2020), pp. 53–61
- [2] M.j. Lee, Y.g. Ha, in 2020 IEEE International Conference on Big Data and Smart Computing (BigComp) (IEEE, 2020), pp. 470–473
- [3] Z. Chen, X. Huang, in 2017 IEEE Intelligent Vehicles Symposium (IV) (IEEE, 2017), pp. 1856–1860
- [4] T. Glasmachers, arXiv preprint arXiv:1704.08305 (2017)
- [5] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. Jackel, U. Muller, arXiv preprint arXiv:1704.07911 (2017)
- [6] M. Bansal, A. Krizhevsky, A. Ogale, arXiv preprint arXiv:1812.03079 (2018)
- [7] S. Sheel. Behaviour cloning (learning driving pattern), CarND. https://medium.com/@gruby/behaviour-cloning-learning-driving-pattern-c029962a0bbf (2017). Accessed on 2020-06-04
- [8] D.A. Pomerleau, in Advances in neural information processing systems (1989), pp. 305–313
- [9] Wikipedia contributors. Automotive safety. https://en.wikipedia.org/wiki/Automotive_safety (2020). Accessed on 2020-06-04
- [10] World Health Organization, Global status report on road safety 2018: Summary. Tech. rep., World Health Organization (2018)
- [11] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Neural computation 1(4), 541 (1989)
- [12] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, in Advances in neural information processing systems (2014), pp. 3320–3328
- [13] F. Torabi, G. Warnell, P. Stone, arXiv preprint arXiv:1805.01954 (2018)
- [14] C. Sammut, G.I. Webb, Encyclopedia of machine learning (Springer Science & Business Media, 2011)
- [15] D. Michie, R. Camacho, Machine intelligence 13 (1994)
- [16] R. Kulic, Z. Vukic, in IECON 2006-32nd Annual Conference on IEEE Industrial Electronics (IEEE, 2006), pp. 3939–3944
- [17] D. Michie, in Intelligent systems (Springer, 1993), pp. 1–19
- [18] Y. LeCun, E. Cosatto, J. Ben, U. Muller, B. Flepp, Courant Institute/CBLL, http://www.cs.nyu.edu/yann/research/dave/index.html, Tech. Rep. DARPA-IPTO Final Report (2004)
- [19] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L.D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., arXiv preprint arXiv:1604.07316 (2016)
- [20] P. Viswanath, S. Nagori, M. Mody, M. Mathew, P. Swami, in 2018 IEEE 8th International Conference on Consumer Electronics-Berlin (ICCE-Berlin) (IEEE, 2018), pp. 1–4
- [21] F. Johnson, K. Dana, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2020), pp. 1002–1003
- [22] J. Kocić, N. Jovičić, V. Drndarević, Sensors 19(9), 2064 (2019)
- [23] A. Krizhevsky, I. Sutskever, G.E. Hinton, in Advances in neural information processing systems (2012), pp. 1097–1105
- [24] J. Kim, J. Canny, in Proceedings of the IEEE international conference on computer vision (2017), pp. 2942–2950
- [25] M. Bain, C. Sammut, in Machine Intelligence 15 (1995), pp. 103–129
- [26] J. Brownlee. A gentle introduction to 1x1 convolutions to manage model complexity. https://machinelearningmastery.com/introduction-to-1x1-convolutions-to-reduce-the-complexity-of-convolutional-neural-networks/ (2019). Accessed on 2020-06-04
- [27] J. Brownlee. Transfer learning in keras with computer vision models. https://machinelearningmastery.com/transfer-learning-for-deep-learning/ (2019). Accessed on 2020-06-04
- [28] M. Lin, Q. Chen, S. Yan, arXiv preprint arXiv:1312.4400 (2013)
- [29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, in Proceedings of the IEEE conference on computer vision and pattern recognition (2015), pp. 1–9
- [30] J. Brownlee. A gentle introduction to transfer learning for deep learning. https://machinelearningmastery.com/transfer-learning-for-deep-learning/ (2020). Accessed on 2020-06-04
- [31] N.S. Punn, S. Agarwal, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16(1), 1 (2020)
- [32] N.S. Punn, S. Agarwal, in 2019 Twelfth International Conference on Contemporary Computing (IC3) (IEEE, 2019), pp. 1–6
- [33] N.S. Punn, S.K. Sonbhadra, S. Agarwal, arXiv preprint arXiv:2005.01385 (2020)
- [34] J. Brownlee. A gentle introduction to transfer learning for deep learning. https://machinelearningmastery.com/transfer-learning-for-deep-learning/ (2017). Accessed on 2020-06-04
- [35] K. Simonyan, A. Zisserman, arXiv preprint arXiv:1409.1556 (2014)
- [36] ImageNet. Large scale visual recognition challenge (ILSVRC). http://www.image-net.org/challenges/LSVRC/ (2009). Accessed on 2020-06-04
- [37] K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 770–778
- [38] S. Chen. Driving dataset. https://drive.google.com/file/d/0B-KJCaaF7elleG1RbzVPZWV4Tlk/view (2017). Accessed on 2020-06-04
- [39] S. Bubeck, arXiv preprint arXiv:1405.4980 (2014)
- [40] D.P. Kingma, J. Ba, arXiv preprint arXiv:1412.6980 (2014)
- [41] R. Caruana, S. Lawrence, C.L. Giles, in Advances in neural information processing systems (2001), pp. 402–408