Intelligent Monitoring of Stress Induced by Water Deficiency in Plants using Deep Learning

*Shiva Azimi \scalerel* |, *Rohan Wadhawan \scalerel* |, , and Tapan K. Gandhi, \scalerel* | Sh. Azimi, R. Wadhawan and T.K.Gandhi are with the Department of Electrical Engineering, Indian Institute of Technology-Delhi, New Delhi 110016, India. (* Sh. Azimi and R. Wadhawan are co-first authors, Corresponding author: T.K.Gandhi.)
E-mail: [email protected], [email protected] and [email protected] received July 24, 2021; revised August 21, 2021; accepted August 29, 2021.

Abstract

In the recent decade, high-throughput plant phenotyping techniques, which combine non-invasive image analysis and machine learning, have been successfully applied to identify and quantify plant health and diseases. However, these techniques usually do not consider the progressive nature of plant stress and often require images showing severe signs of stress to ensure high confidence detection, thereby reducing the feasibility for early detection and recovery of plants under stress. To overcome the problem mentioned above, we propose a deep learning pipeline for the temporal analysis of the visual changes induced in the plant due to stress and apply it to the specific water stress identification case in Chickpea plant shoot images. For this, we have considered an image dataset of two chickpea varieties JG-62 and Pusa-372, under three water stress conditions; control, young seedling, and before flowering, captured over five months. We have employed a variant of Convolutional Neural Network - Long Short Term Memory (CNN-LSTM) network to learn spatio-temporal patterns from the chickpea plant dataset and use them for water stress classification. Our model has achieved ceiling level classification performance of 98.52% on JG-62 and 97.78% on Pusa-372 chickpea plant data and has outperformed the best reported time-invariant technique by at least 14% for both JG-62 and Pusa-372 species, to the best of our knowledge. Furthermore, our CNN-LSTM model has demonstrated robustness to noisy input, with a less than 2.5 % dip in average model accuracy and a small standard deviation about the mean for both species. Lastly, we have performed an ablation study to analyze the performance of the CNN-LSTM model by decreasing the number of temporal session data used for training.

Index Terms:

Plant Phenotyping, Water Stress, Monitoring, Computer Vision, Spatiotemporal Analysis, Deep Learning, Neural Network, CNN, LSTM

I Introduction

It has been estimated that agricultural production should be doubled by 2050 in order to meet the demands of a growing world population. Achieving this goal poses a serious challenge to farming as the current agricultural production growth rate of $1.3\%$ per annum is below the population growth rate. To achieve the required agricultural growth rate, we require modern agricultural practices that focus more on precision, and automated farming [1]. In turn, this will employ a wide array of Internet of Things (IoT) sensors that measure soil conditions and imaging devices that keep track of specific traits such as color, size, and shape of the crops. Furthermore, we need to take a multidisciplinary approach that merges plant science, robotics, computer vision, and environmental sciences. Plant phenotyping is one such method that deals with the measurement of observable traits of a plant in reaction to genetic and environmental changes and has a large number of applications in plant science including plant breeding, quality assessments, and stress identification. Computer vision-based plant phenomics has a significant role in precision farming as it provides easy, fast, and highly automated methods for plant health and growth monitoring [2]. Additionally, it has been used for other tasks such as determining whether a plant is a crop or a weed and the soil’s chemical content using near-infrared and hyperspectral imaging.

Most manual plant phenotyping approaches are costly, time-consuming, destructive, and cumbersome, thereby necessitating the development and use of high-throughput, non-invasive, and image-based plant phenotyping techniques to identify the stress levels in plants. These methods are fast, highly automated, and more accurate. Further, image-based plant phenotyping can be conducted inside a laboratory, inside a controlled chamber room, or on the field [3]. These phenotyping techniques include two fundamental steps: the data acquisition step and the data analysis-inference step. With the recent developments in visible light, infrared and computational photography, capturing high-resolution images in both the visible and the hyperspectral has become straightforward and expedient. However, reliable and efficient data acquisition and processing methods often require expertise in biology, mathematics, and computer vision.

Moreover, phenotyping applications usually involve the processing and analysis of a huge amount of data. Machine Learning (ML) methods have been proven to be quite efficient in the analysis of big data in research areas such as health and economics [4]. However, the traditional ML techniques suffer from the limitation imposed by hand-crafted features. These hand-crafted features often lack generality and are unable to model complex features. This inherent limitation of the classical ML techniques has shifted the focus on Deep Learning (DL) based approaches to ML [5].

One such DL architecture commonly used in various computer vision tasks is the Convolutional Neural Network (CNN) [6]. CNN’s possess convolutional layers for detecting visual features from images [7]. Further, it has been applied in several computer vision applications such as life sciences, medicine, and farming [8]. It has been widely employed for classifying plants and leaves in farming [9]. It has also been used in related applications like counting the number of seeds per pod for soybeans[10], the number of wheat ears under field conditions [11], plant identification [12], identification of plant diseases [13], moisture measurement of sweetcorn [14], etc. Moreover, DL-based predictive methods have been applied in the farming domain, such as finding out future farming parameters - produce estimation [15], the soil moisture content in the field [16], prediction of the growth dynamics of plant leaves [1], and crop weather requirements [17].

This paper focuses on abiotic stresses that are caused due to external environmental factors and often adversely affect agricultural productivity. Water and nitrogen stresses are the two most crucial abiotic stresses in plants that can change plants’ physiological traits. The effect of water-induced stresses, which is a consequence of excessive or inadequate watering content in the soil, inhibits photosynthesis and plants’ growth. To better manage water stress and minimize crop loss resulting from it, we need to develop methods to quickly evaluate water stress without damaging the plants.

Even though CNN has been proven very promising for image-based stress detection and classification in plants [18], it applies a limiting assumption of treating plant images taken at different moments in time equivalently. We know that visual changes introduced due to stress do not become discernible immediately after stress; instead, this change is progressive. However, due to CNN’s time-invariant nature, it is unable to learn temporal patterns and consequently is unable to classify a stress condition with high confidence [19, 18]. Further, the time-invariant approach also requires images showing severe signs of stress to ensure high confidence detections, thereby reducing this approach’s feasibility for early detection and recovery of plants under stress. Therefore, there is a need for a technique that analyses this progressive visual change in stressed plants. This technique should classify stress with high confidence, even when available plant images do not show a sign of severe stress, as it can help us to identify stress in the plants at an early stage.

This paper proposes a deep learning-based temporal analysis pipeline for plant water stress (water deficiency) phenotyping and demonstrates its superiority over vanilla CNN technique, which is time-invariant and only spatial. We validate the proposed approach via a detailed study that analyses changes in Chickpea plant shoot images induced due to water stress.
Chickpea (Cicer arietinum L.) is one of the crucial crops among pulses and is an excellent source of key nutrients such as proteins, iron, carbohydrates, and folic acid [20]. Consumption of chickpea in India is the largest in the world, contributing to $75\%$ of the world’s production and consumption [21]. Due to the growing concerns over food security, the demand for chickpea has been increasing in India and other developing countries. However, climate change and global warming are inducing various abiotic stresses and negatively affecting agricultural production. Among the abiotic stresses which impact chickpea production, stress-induced due to lack of water is the most significant one, causing up to $50\%$ of crop losses [22]. Water deficiency leads to specific physiological changes in chickpea plants such as dryness, yellow leaves, early flowering, and the reduction of leaf size and biomass [23]. Owing to chickpea’s potential towards ensuring food security in developing countries like India, it is imperative to develop image-based analysis methods for easy and early detection of water-related stress.

With this objective in mind, we make the following contributions in this paper:

•

As there are no publicly available plant shoot image datasets of pulses that can be used to detect and classify moisture-related stress conditions, we have created a dataset of Chickpea plant shoot images for the experiments proposed in this article. The dataset comprises two varieties of chickpea plant species - JG-62 and Pusa-372.
•

We have proposed an end-to-end deep learning pipeline for identifying water stress in Chickpea plants. This pipeline employs a variant of Convolutional Neural Network - Long Short Term Memory (CNN-LSTM) to learn spatio-temporal patterns from the chickpea plant dataset and use them for water stress classification. The CNN-LSTM has achieved ceiling level classification performance of 98.52% on JG-62 and 97.78% on Pusa-372 and the chickpea plant data.
•

We have conducted a comparative analysis of the proposed temporal technique with CNN techniques to classify water stress in Chickpea plants. Our proposed technique outperforms the best reported CNN technique by at least 14% for both JG-62 and Pusa-372 species.
•

We tested the robustness of our CNN-LSTM model to noisy input. Across both species, the average model accuracy dipped by less than 2.5 %, with a small standard deviation. This ensures high and consistent classification capabilities even in noisy conditions.
•

We have performed an ablation study on the CNN-LSTM model by decreasing the number of temporal session data used for training.

The rest of this article is structured as follows. The dataset and DL techniques are presented in section 2. The results are presented in section 3. Discussion on the results and application scope is provided in section 4. Finally, conclusions are provided in section 5.

Refer to caption — Figure 1: Visualization of replicates in stress-tolerant Pusa-372 (left) and stress-sensitive JG-62 (right) varieties from our dataset.

II Materials and Methods

In this section, we describe the dataset and methodology that we use for water stress identification in chickpea plants. First, we explain our chickpea plant shoot dataset, and then we discuss the DL techniques used in this paper. Our deep learning water stress classification pipeline consists of four main stages: input, data augmentation, CNN-LSTM network, and classification output, as shown in Fig. 2. These four stages are described in detail in the following subsections.

TABLE I: Parameters of Chickpea plant shoot images dataset.

Chickpea Variety	Light used	Distance of camera	Camera	Image Type	Image size in pixels	Total Images	Condition	Image labelling	No of images
$Pusa-372$	Fluorescent Tubes	1.5 meter	Canon EOS 60D	RGB (JPEG)	5184*3456	3840	Before Flowering	BF	1280
							Young Seedling	YS	1280
							Control	C	1280
$JG-62$	Fluorescent Tubes	1.5 meter	Canon EOS 60D	RGB (JPEG)	5184*3456	3840	Before Flowering	BF	1280
							Young Seedling	YS	1280
							Control	C	1280

II-A Dataset

Most publicly available datasets for plant health analysis only contain images on plant leaves, which is significantly less informative than the entire plant shoot image. These datasets usually show plants under biotic stress, with very few covering plants under abiotic stress. Phenotyping using complete plant shoot images offers certain advantages. Firstly, plant shoot contains more information than individual plant organs, like leaves, branches, flowers, and provide a holistic view of the plant. Secondly, capturing shoot images of a plant is faster, more robust, and provides equal or more visual features than capturing images of individual plant organs of the same plant. Thirdly, temporal analysis of shoot images over time will require low complexity models compared to the integrated temporal analysis of various plant organs, making the former more viable for real-time use. Lastly, this technique is non-destructive and non-evasive, enabling us to make observations while the plant is growing. Thus, using complete shoot images for phenotyping applications is desirable. Furthermore, to the best of our knowledge, there are no publicly available plant shoot image datasets of pulses, especially chickpea, to detect and classify moisture-related abiotic stress conditions. To this end, we created a new dataset of chickpea plant shoot images in the visible spectrum of light.

Two varieties of chickpea strains, namely - stress-tolerant Pusa-372 and stress-sensitive JG-62 - were grown in individual plant pots in the control chamber room and observed over a period of five months for this experiment. From now on, we will refer to JG-62 as JG and Pusa-372 as Pusa. The experiment was conducted in collaboration with plant scientists at the National Institute of Plant Genome Research (NIPGR). For both the varieties, plants were subjected to three different watering conditions based on the water stress applied to them. The three watering conditions are Young Seedling (YS), in which a plant was not watered for 1 week after it was 2 weeks old; Before Flowering (BF), in which a plant was not watered for 1 week after it was 5 weeks old; Control (C), in which a plant was watered throughout. Water stress changes the physical structure of plants, such as shape and color. It also reduces plant height, plant biomass, and the number of branches, leaves, and fruits in chickpea plants. We had 15 pots per species and 5 per water stress category for our experiment. The plant shoot images were captured in regular sessions at a particular time once every three days. For each pot, we have 32 sessions of data. During each image capturing session, images were taken from eight different angles, at every $45^{o}$ . Further, the lighting condition, the camera distance, and other dataset parameters shown in the Table I below were kept the same for all plant pots (with acceptable and negligible margin of human error). Thus, in every session, we have captured 240 images across all the pots of both varieties. Overall, this dataset has a total of $7680$ images. The black pot and the white background were seen in the image. Segmentation can be applied to extract the plant shoot portion from the image but at an additional computational cost in terms of time and resources. As DL techniques are able to avoid such invariant features existing in the context, we do not apply plant shoot segmentation in our paper favoring real-time deployment over high classification accuracies. Fig. 1 shows two sample images from our dataset.

II-B Deep Learning Approach

Over the years, DL techniques like CNNs have become state of the art for image classification. In our paper [24], we employed a 23-layered custom CNN, and ResNet-18 [25] model for water stress classification from chickpea plant shoot images. The ResNet-18 classifier was able to achieve 84% and 86% accuracy on Pusa and JG, respectively. However, this approach enforced a simplifying assumption on the dataset by treating all images belonging to one class equivalent even if they were taken at different times. Furthermore, water stress is introduced after 2 weeks, due to which the images up to that point across all the three conditions are similar to one another, thereby adding noise to the dataset. Despite this noise, the CNN classifier is robust enough to analyze water-stressed plants’ patterns accurately. However, we hypothesize that time-series analysis of the plant shoots’ visual features will remove this noise and produce better results.

Recurrent Neural Network (RNN) has been employed for sequential learning tasks. Long Short Term Memory (LSTM) network is an improvement over the RNN architecture [26]. Unlike RNN, LSTM can learn long-term dependencies and preserve useful temporal information for an extended period. They have become a state-of-the-art technique for sequence learning problems like time series analysis [27]. Moreover, LSTM and CNN combined have also been successfully used in tasks requiring sequence learning of visual features [28], like video classification and activity recognition in videos[29, 30]. Our task shares similarities with activity classification in videos that predicts which activity is being performed by analyzing visual changes over time. Similarly, we need to ascertain temporal patterns resulting from visual changes induced in the chickpea plant’s shoot due to water stress. CNN-LSTM architecture combines LSTM and CNN for spatio-temporal learning. Thus, we introduce a variant of the CNN-LSTM to predict water stress in chickpea plants. In this architecture, CNN pre-trained on ImageNet data extracts visual features from the chickpea plant shoot images. Then, the LSTM analyses these features over time to predict the plant’s water stress condition. We also compare our previous time-invariant approach for water stress classification in chickpea plants [24] with our proposed temporal approach. Several CNN architectures have been developed over time. In this paper, we have used VGG16 [31] and Inception-V3 [32] architectures. Firstly, we fine-tune models of these architectures pre-trained on the ImageNet dataset [33] and use them for time-invariant classification of chickpea plants under water stress conditions. Secondly, we use these models as feature extractors for the CNN-LSTM models.

VGG16: VGG16 architecture has achieved state-of-art accuracy for image classification on the ImageNet dataset in the past. This architecture introduces the concept of stacking smaller convolutional kernels to produce an effective receptive field. This technique also decreased the total number of parameters to achieve the same receptive field and increased non-linearity due to activation across multiple stacked layers. This model was deeper and less wide than GoogLeNet (Inception-V1) [34] proposed around the same time. Although VGG16 performs better than GoogLeNet on the ImageNet dataset, it is more computationally complex and has a more significant computation, memory, and storage requirement.

Inception-V3: Inception-V3 architecture proposed as an improvement over its predecessors (Inception-V1 and Inception-V2) has achieved state-of-the-art accuracy for image classification on ImageNet dataset in the past. Some of the essential features of this model are: it is deeper, avoids representational bottlenecks, especially early in the network, maintains higher dimensional representation, spatial aggregation on the lower dimension, balance width, and depth of the network. In addition, it further reduces the computational complexity, both in terms of the number of parameters and cost of resources (memory and storage) compared to Inception-V1 and Inception-V2 architectures, and increases classification accuracy. As a result, Inception-V3 performs better than VGG16 on the ImageNet dataset.

We describe the architectures, input processing, and neural network training in the subsequent sections.

II-B1 CNN Architecture

This network performs a time-invariant classification analysis to identify water stress in chickpea plant shoot images. For this purpose, we use the convolutional base of the VGG16 and Inception-V3 network and remove the corresponding dense layers. Then, we perform Global Average Pooling [35] after the last Max Pooling layer. Global Average Pooling is preferred over fully connected layers for flattening the feature maps to a linear vector because it is more native to the convolution structure and enforces correspondences between feature maps and categories. Further, this layer has no parameters to optimize, which reduces the chances of over-fitting and is also more robust to the input’s spatial translation. Finally, we add two dense layers after global average pooling, the first one has 512 dimensions, and the following is the output layer with three dimensions, equal to the number of classes, as shown in Fig. 3. We initialize each dense layer using the Glorot uniform initializer [36] and use Softmax activation in the final output Dense layer (Equation 9).

II-B2 CNN-LSTM Architecture

Our CNN-LSTM architecture consists of two main parts: CNN image feature extractor and LSTM to predict water stress category from the extracted features. The architecture is shown in Fig. 4.

CNN feature extractor: We use VGG16 and Inception-V3 models pre-trained on the ImageNet dataset to extract visual features from chickpea plant images. We employ two different feature pre-trained extractors to determine if our approach is CNN architecture-dependent. In both the models, we remove the dense layers and apply Global Average Pooling after the final Max-Pooling layer to obtain 1D vectors of size 512 and 2048 for VGG16 and Inception-V3, respectively. We use the CNN network in time distributed form, that is, the same network is shared across all time steps of subsequent LSTM network. The unrolled version is shown in Fig. 4.

LSTM predictor: In our LSTM network, the number of sequentially connected cells is equal to the number of session data used for prediction, as shown in Fig. 4. This variable length of the LSTM network helps us analyze the effect of the number of data sessions on the prediction performance metrics - Accuracy, Macro Sensitivity, Macro Specificity, and Macro Precision. An ablation study on the same is reported in section III-D. The LSTM network output is fed into a Dense Layer of size 512 dimension, which is connected to the Dense output layer of size 3, equal to the number of water stress categories. We initialize each dense layer using the Glorot uniform initializer and use Softmax activation in the final output Dense layer (Equation 9).

Let us mathematically trace how our proposed network processes an input image sequence of plant shoot images. An image of an input sequence can be written as $i_{t}$ defined as an image at timestep t, $i_{t}\in R^{m\times m}$ , where the image is of dimension $m\times m,\ m=224$ . We define one entire image input sequence as

I=\{i_{t}|\;Image\;at\;timestep\;t,\;t\in\mathbb{N},\>1\leq t\leq 32,\>i_{t}\in R^{m\times m}\}

where $I\in R^{T\times m\times m}$ , T is the number of time steps. Then, we chose either VGG16 or Inception-V3 CNN to extract features from images. We apply the chosen CNN feature extractor to each image of a sequence in a time-distributed manner, such that its weights $W_{c}$ remain the same for all LSTM timesteps and obtain corresponding features. One feature of the output feature sequence can be written as $x_{t}$ feature at timestep t, $x_{t}\in R^{d}$ , where d is the size of the feature vector. We define one entire feature output sequence as

X=\{x_{t}|\;Feature\;at\;timestep\;t,\;t\in\mathbb{N},\;1\leq t\leq 32,\;x_{t}\in R^{d}\}

, where $F\in R^{T\times d}$ , T is the number of timesteps. Therefore, the convolutional feature extractor simulates a function

g\colon I\to X

\displaystyle\left(x_{t_{1}},\ldots,x_{t_{T}}\right)=\left(g\left(i_{t_{1}}\right),\ldots,g\left(i_{t_{T}}\right)\right)

(1)

Then, we feed the feature sequence to a sequence of LSTM units. An LSTM unit comprises a cell, an input gate, an output gate, and a forget gate, as shown in Fig. 5. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. The equations for an LSTM are defined as:

\displaystyle f_{t}={\sigma_{g}\left(W_{f}x_{t}+U_{f}h_{t-1}+b_{f}\right)}

(2)

\displaystyle i_{t}={\sigma_{g}\left(W_{i}x_{t}+U_{i}h_{t-1}+b_{i}\right)}

(3)

\displaystyle o_{t}={\sigma_{g}\left(W_{o}x_{t}+U_{o}h_{t-1}+b_{o}\right)}

(4)

\displaystyle\tilde{c}_{t}={\sigma_{c}\left(W_{c}x_{t}+U_{c}h_{t-1}+b_{c}\right)}

(5)

\displaystyle c_{t}={f_{t}\odot c_{t-1}+i_{t}\odot\tilde{c}_{t}}

(6)

\displaystyle h_{t}={o_{t}\odot\sigma_{h}(c_{t})}

(7)

$\!x_{t}\in\mathbb{R}^{d}:Input\ feature\ vector\ to\ the\ LSTM\ unit\\ f_{t}\in\mathbb{R}^{h}:Forget\ gate^{\prime}s\ activation\ vector,\\ i_{t}\in\mathbb{R}^{h}:Input\ gate^{\prime}s\ activation\ vector,\\ o_{t}\in\mathbb{R}^{h}:Output\ gate^{\prime}s\ activation\ vector,\\ h_{t}\in\mathbb{R}^{h}:Hidden\ state\ output\ vector\ of\ the\ LSTM\ unit,\\ \tilde{c}_{t}\in\mathbb{R}^{h}:Cell\ input\ activation\ vector,\\ c_{t}\in\mathbb{R}^{h}:Cell\ state\ vector,\\ W\in\mathbb{R}^{h\times d}:Weight\ Matrix\\ U\in\mathbb{R}^{h\times h}Weight\ Matrix\\ \ b\in\mathbb{R}^{h}:Bias\ Matrix\\ \sigma_{g}:Sigmoid\ function\\ \sigma_{h}:Hyperbolic\ tangent\ function\\$

Here, d and h denote input feature and hidden state dimensions, respectively. In addition, $\odot$ denotes element-wise multiplication. Weight and bias matrices are learned during training.

Then, we take the hidden vector, also known as output vector $h_{t}$ of the final LSTM unit $h_{t=T}$ and feed it as input to the classification block consisting of two dense layers.

For the classification block, the Input is $H=h_{t=T}$ , and output is $P\in R^{C}$ , where C is the number of classes, here $C=3$ , $Classes=\{Before\ Flowering,Control,Young\ Seedling\}$ . Then, the dense layer simulates a function

j\colon H\to P

\displaystyle\left(p_{1},p_{2},p_{3}\right)=j\left(h_{t_{T}}\right)

(8)

The proposed network’s output is equal to the classification block’s output, that is, P. As it is a case of multi-class classification, softmax activation is applied to the output of the final fully connected layer, also called the classification layer. It helps convert the output score corresponding to each class into a probability value between 0 and 1.

\displaystyle\textnormal{Softmax}\left(p_{i}\right)=\frac{\exp^{p_{i}}}{\sum\limits_{j=1}^{C}\exp^{p_{j}}}.

(9)

where $p_{i}$ is the predicted probability of a class represented as an element of the 3 dimensional output vector P.

II-B3 Input Processing

CNN-LSTM Network: This section describes input dataset preparation for the CNN-LSTM Network. Each input data sequence consists of 32 images of a plant pot, one from every photo session. To ensure the robustness of classification, we consider photographs at all angles, such that all images of one data sequence have been taken from the same angle. Thus, for both JG and Pusa, we have 120 data samples each, in which there are 40 samples for each water stress category. We use RGB input images of size (224,224,3) and perform CNN network-specific image preprocessing on them before feeding them to our CNN-LSTM. While training the models, we perform data augmentation like horizontal flipping, rotation, shear, and translation to increase the training data’s size on the fly. We ensure that a linear transformation is performed equivalently for each image in the image sequence. Besides linear transformations, we randomly introduce Gaussian noise perturbations to a few training samples. A typical image noise model is Gaussian, additive, independent at each pixel, and independent of the signal intensity. Further, Gaussian noise in digital images, usually a consequence of sensor noise, arises during acquisition. As data acquisition in real-world settings will often be accompanied by noise, we perform noise data augmentation to train a robust model. We perform this augmentation by sampling noise intensities from a Gaussian noise $N$ distribution, which has a mean $\mu=0$ and a standard deviation $\sigma=15\%$ of maximum pixel intensity of any image in our dataset.

N\sim\mathcal{N}(0,\,\sigma^{2})

The value of $\sigma$ has been empirically chosen to provide the best robustness capability for noisy shoot images without making noise, a relevant feature for the model to learn. JG and Pusa noisy input samples are shown in Fig. 7 and 7.

CNN Network: This section describes input data preparation for fine-tuning pre-trained VGG16 and Inception-V3 CNNs. In this case, we equivalently treat all images of a given plant taken at different points in time. Thus, we have 1280 images per category and 3840 in total for each species. We use RGB input images of size (224,224,3) and perform network-specific image preprocessing before feeding them to the corresponding CNN for fine-tuning. Similar to the CNN-LSTM, we perform data augmentation like horizontal flipping, rotation, shear, and translation to increase the training data’s size.

II-B4 Training

We optimize the CNN-LSTM network (temporal analysis) and CNN networks (time-invariant analysis) by minimizing the categorical cross-entropy loss for water stress classification.

\displaystyle Categorical\ Cross\ Entropy=-\sum_{i=1}^{C}y_{i}log(\hat{y_{i}})

(10)

Where C is the number of classes, $Classes=\{Before\ Flowering,Control,Young\ Seedling\}$ , $y_{i}$ is the true class, and $\hat{y_{i}}$ is predicted class, which is obtained after softmax activation, refer to Equation (9).

For our CNN-LSTM, we freeze the weights of the CNN and train the LSTM and the dense layer. To train this network, we backpropagate the loss and update the weights of the network using the Backpropagation Through Time [37] algorithm. The LSTM is trained on 32 sessions data and has about 2.4M and 35M trainable parameters with VGG16 and Inception-V3 feature extractors, respectively. On the other hand, to simulate time-invariant classification, we fine-tune the pre-trained VGG16 and Inception-V3 networks on our dataset using the Backpropagation technique [38].

The training is performed using a mini-batch size of 32 images and neural network weights are optimized using Adam optimizer [39] with learning rate $\alpha=0.0001$ and the other optimizer parameters being $\{\beta_{1}=0.9,\beta_{2}=0.999,\epsilon=10^{-7}\}$ . Further, we train each model for 200 epochs and use them for metric evaluation.

Training Environment We use Tensorflow and Keras DL framework to train our models and train them on a single Nvidia Tesla K80 GPU.

II-B5 Evaluation Protocol

In this paper, we perform 5-fold stratified cross-validation for each model type (plant-variety and CNN pair). In other words, we divide the entire dataset into 5 equivalent subsets and train a model on 4 out of 5 of them. Then, we test on the remaining subset such that each subset acts as a test set once. Finally, we report the average scores across all 5 models for each performance metric - Accuracy, Macro-Sensitivity, Macro-Specificity, and Macro-Precision. We also repeat the cross-validation process 10 times to ensure robustness of the reported scores.

II-C Performance Evaluation Metrics

The performance of the proposed model is evaluated using the performance metrics of Average Accuracy(Acc), Macro Sensitivity(Se), Macro Specificity(Sp) and Macro Precision(Pre). In Macro method, the average of the accuracy, sensitivity specificity and precision of the system on different subsets are taken, where each subset consists of all images of a specific class. Mathematically they are defined as;

\displaystyle Average\ Accuracy\ =\ \frac{\sum_{i}^{C}\frac{Tp_{i}\ +\ Tn_{i}}{Tp_{i}\ +\ Tn_{i}\ +\ Fp_{i}\ +\ Fn_{i}}}{\sum_{j}^{C}1}

(11)

\displaystyle Macro-Sensitivity\ =\ \frac{\sum_{i}^{C}\frac{Tp_{i}}{Tp_{i}\ +\ Fn_{i}}}{\sum_{j}^{C}1}

(12)

\displaystyle Macro-Specificity\ =\ \frac{\sum_{i}^{C}\frac{Tn_{i}}{Tn_{i}\ +\ Fp_{i}}}{\sum_{j}^{C}1}

(13)

\displaystyle Macro-Precision\ =\ \frac{\sum_{i}^{C}\frac{Tp_{i}}{Tp_{i}\ +\ Fp_{i}}}{\sum_{j}^{C}1}

(14)

Here, $Tp_{i}$ represents the true positives; $Tn_{i}$ represents the true negatives; $Fp_{i}$ represents the false positives; $Fn_{i}$ represents the false negatives with respect to the actual and predicted water stress class; such that $i,j\in Classes\ and\ Classes=\{Before\ Flowering,\ Control,\ Young\ Seedling\}$ , and C is the number of classes.

III Experimental Results

In this section, we describe the four experiments performed on the dataset. Firstly, we examine the water stress classification ability of fine-tuned VGG16 and Inception-V3 by performing time-invariant training and compare it with our previously used technique [24]. This experiment also acts as a baseline for temporal analysis, as we use the same CNNs in our CNN-LSTM models. Secondly, we train and evaluate CNN-LSTM models to investigate the effectiveness of temporal analysis of the visual features extracted from plant shoot images. Thirdly, we test the robustness of our model by evaluating it on perturbed sequences of shoot images, such that a certain percentage of images of a sequence undergo Gaussian Noise perturbations. Lastly, we perform an ablation study on the CNN-LSTM model’s effectiveness by uniformly decreasing the amount of session data used for training the models.

TABLE II: Performance metrics for time-invariant water stress classification using CNN models on JG and Pusa varieties of chickpea plants (Acc: Accuracy (in %), Se: Sensitivity, Sp: Specificity, Pre: Precision).

Chickpea Species	CNN Model	Acc	Se	Sp	Pre
$JG-62$	VGG16	72.14	0.7214	0.8734	0.7690
	Inception-V3	80.99	0.8099	0.9135	0.8111
	CNN [24]	78.00	0.7800	0.8900	0.7700
	ResNet-18 [24]	86.00	0.8600	0.9300	0.8600
$Pusa-372$	VGG16	70.96	0.7096	0.8737	0.7059
	Inception-V3	75.00	0.7500	0.8750	0.7950
	CNN [24]	76.00	0.7600	0.8800	0.7500
	ResNet-18 [24]	84.00	0.8400	0.9200	0.8400

III-A Time-Invariant Analysis

In the time-invariant analysis, we train and evaluate four CNN model types, which represent all possible combinations of the plant variety and CNN feature extractor used in this paper, and report the metric scores in Table II. VGG16 and Inception-V3 fine-tuned networks obtain classification accuracies of $72.14\%$ and $80.99\%$ for JG and $70.96\%$ and $75.00\%$ for Pusa variety, respectively, as shown in Table II.

TABLE III: Performance Metrics for temporal water stress classification using CNN-LSTM models on JG and Pusa varieties of chickpea plants using VGG16 and Inception-V3, CNN feature extractor (Acc: Accuracy (in %), Se: Sensitivity, Sp: Specificity, Pre: Precision).

Chickpea Species	CNN-LSTM Model	Acc	Se	Sp	Pre
$JG-62$	VGG16	98.32	0.9833	0.9916	0.9852
$JG-62$	Inception-V3	98.32	0.9833	0.9916	0.9852
$Pusa-372$	VGG16	97.50	0.9749	0.9874	0.9778
$Pusa-372$	Inception-V3	97.50	0.9749	0.9874	0.9778

III-B Temporal Analysis

In the temporal analysis, we train and evaluate four CNN-LSTM model types, which represent all possible combinations of the plant variety and CNN feature extractor used in this paper, and report the metric scores in Table III. We observe that the classification accuracy of VGG16 and Inception-V3 are $98.32\%$ for JG and $97.5\%$ for Pusa variety. The confusion matrices for each model are shown in Fig. 8, where each cell’s value represents the average probability across all the folds.

TABLE IV: Robustness Analysis of CNN-LSTM models on JG and Pusa varieties of chickpea plants using VGG16 and Inception-V3, CNN feature extractor. Mean(Standard deviation) of the following metrics (Acc: Accuracy (in %), Se: Sensitivity, Sp: Specificity, Pre: Precision) are reported in this table.

Chickpea Species	CNN-LSTM Model	Acc	Se	Sp	Pre
$JG-62$	VGG16	95.83	0.9583	0.9789	0.9631
		(1.79)	(0.0179)	(0.0091)	(0.0157)
	Inception-V3	95.83	0.9583	0.9789	0.9631
		(1.79)	(0.0179)	(0.0091)	(0.0157)
$PUSA-372$	VGG16	95.16	0.9516	0.9756	0.9572
		(1.57)	(0.0157)	(0.0080)	(0.0138)
	Inception-V3	95.16	0.9516	0.9756	0.9572
		(1.57)	(0.0157)	(0.0080)	(0.0138)

		Predicted
		BF	C	YS
Actual	BF	\collectcell 1 .0\endcollectcell	\collectcell 0 .0\endcollectcell	\collectcell 0 .0\endcollectcell
	C	\collectcell 0 .0\endcollectcell	\collectcell 1 .0\endcollectcell	\collectcell 0 .0\endcollectcell
	YS	\collectcell 0 .05\endcollectcell	\collectcell 0 .0\endcollectcell	\collectcell 0 .95\endcollectcell

(a) JG - VGG16

		Predicted
		BF	C	YS
Actual	BF	\collectcell 0 .95\endcollectcell	\collectcell 0 .0\endcollectcell	\collectcell 0 .05\endcollectcell
	C	\collectcell 0 .0\endcollectcell	\collectcell 1 .0\endcollectcell	\collectcell 0 .0\endcollectcell
	YS	\collectcell 0 .0\endcollectcell	\collectcell 0 .0\endcollectcell	\collectcell 1 .0\endcollectcell

(b) JG - Inception-V3

		Predicted
		BF	C	YS
Actual	BF	\collectcell 1 .0\endcollectcell	\collectcell 0 .0\endcollectcell	\collectcell 0 .0\endcollectcell
	C	\collectcell 0 .0\endcollectcell	\collectcell 1 .0\endcollectcell	\collectcell 0 .0\endcollectcell
	YS	\collectcell 0 .0\endcollectcell	\collectcell 0 .075\endcollectcell	\collectcell 0 .925\endcollectcell

		Predicted
		BF	C	YS
Actual	BF	\collectcell 0 .925\endcollectcell	\collectcell 0 .075\endcollectcell	\collectcell 0 .0\endcollectcell
	C	\collectcell 0 .0\endcollectcell	\collectcell 1 .0\endcollectcell	\collectcell 0 .0\endcollectcell
	YS	\collectcell 0 .0\endcollectcell	\collectcell 0 .0\endcollectcell	\collectcell 1 .0\endcollectcell

(d) Pusa - Inception-V3

Figure 8: Confusion matrix depicting the results for CNN-LSTM model with VGG16 and Inception-V3 as the CNN feature extractor trained on 32 sessions of JG and Pusa species. Here, (a), (b), (c), and (d) represent the confusion matrices of the different possible species - feature extractor models.

TABLE V: Ablation Study: Performance metrics for CNN-LSTM model on JG-62 and Pusa-372 varieties using VGG16 and Inception-V3 image feature extractor (Acc: Accuracy (in %), Se: Macro-Sensitivity, Sp: Macro-Specificity, Pre: Macro-Precision) on Sn session data where n represent images of dataset up to the nth session.

Chickpea Species	Feature Extractor	Metric	S4	S8	S12	S16	S20	S24	S28	S32
$JG-62$	VGG16	Acc	87.5	91.67	93.33	93.33	95.83	97.5	97.5	98.32
		Se	0.875	0.9167	0.9333	0.9333	0.9583	0.9749	0.9749	0.9833
		Sp	0.9375	0.9583	0.9662	0.9662	0.9791	0.9874	0.9874	0.9916
		Pre	0.8857	0.9267	0.9412	0.9557	0.963	0.9704	0.9778	0.9852
	Inception-V3	Acc	87.5	91.67	93.33	93.33	95.83	97.5	97.5	98.32
		Se	0.875	0.9167	0.9333	0.9333	0.9583	0.9749	0.9749	0.9833
		Sp	0.9375	0.9583	0.9662	0.9662	0.9791	0.9874	0.9874	0.9916
		Pre	0.8857	0.9267	0.9412	0.9557	0.963	0.9704	0.9778	0.9852
$Pusa-372$	VGG16	Acc	83.33	87.5	91.67	93.33	95.83	96.66	96.66	97.5
		Se	0.8333	0.875	0.9167	0.9333	0.9583	0.9666	0.9666	0.9749
		Se	0.9167	0.9583	0.9583	0.9666	0.9791	0.9833	0.9833	0.9874
		Pre	0.8426	0.933	0.933	0.945	0.963	0.9704	0.9704	0.9778
	Inception-V3	Acc	87.5	93.33	94.99	95.83	96.66	96.66	97.5	97.5
		Se	0.8751	0.9083	0.9249	0.9583	0.9666	0.9666	0.9749	0.9749
		Sp	0.9375	0.9666	0.9707	0.9791	0.9833	0.9833	0.9874	0.9874
		Pre	0.8857	0.945	0.951	0.963	0.9704	0.9704	0.9778	0.9778

III-C Robustness Analysis

In this experiment, we add Gaussian noise perturbations to the test sequences of each cross-validation fold. We start by selecting a certain percentage of images from each test sequence in random order and perturb them with noise. We often come across a range of image perturbation percentages rather than a fixed value in real-life scenarios. The minimum percentage is nearly 3% or 1 out of 32 images in a sequence. The maximum perturbation percentage is set to nearly one-third of images of an entire test sequence, approximately equal to 10 out of 32 images. For perturbations greater than one-third of the images in a test sequence, applying a noise removal preprocessing step before training the CNN-LSTM model will be computationally more efficient than training a larger and more complex neural model, which is inherently unaffected by noise. After selecting the images, we apply noise to them. The noise intensities are sampled from the distribution described in section II-B3. Then, we perform ten model evaluation cycles by increasing the number of perturbed images in a test sequence from 1 to 10, with an increment of one image per cycle. Finally, we report the mean accuracy and standard deviation across all cycles as shown in Table IV.

III-D Ablation Study

We perform an ablation study to determine the CNN-LSTM model’s performance on decreasing the session data used for training. In this study, we evaluate CNN-LSTM models corresponding to each chickpea plant species ${JG,Pusa}$ , and CNN feature extractor ${VGG16,Inception-V3}$ pair. We train 8 models for each pair, such that each model differs in the number of session data used. Starting from the 32nd session down to the 4th session, we reduce the number of sessions data by 4. A gap of 4 sessions was chosen as it provided the best solution for the trade-off between the available computational resources and time for training models vs. the change in the performance metrics’ value between two consecutive models. We report the results obtained in Table V and visualize the value of each performance metric vs. the number of session data in the graphs shown in Fig. 9.

III-E Computational Complexity

In this section, we report the time and space complexity of inference. To report the worst-case complexities, we utilize the models trained on 32 sessions of data, as these models have the maximum number of parameters. The inference time does not include the time to load the 32 session images, pre-process the image, and load the model. In other words, we measure the time taken for the feedforward propagation of the model. We calculate the inference time on the Nvidia Tesla K80 GPU and Intel(R) Xeon(R) CPU. Our CNN-LSTM models with VGG16 feature extractor and Inception-V3 feature extractors have nearly 17M and 56M parameters, respectively. Further, the model with the VGG16 feature extractor takes 29ms and 70ms to predict the plant’s water stress condition on the GPU and CPU, respectively. Whereas the model with the Inception-V3 feature extractor, which has more parameters, takes 59ms and 98ms to predict the plant’s water stress condition on the GPU and CPU, respectively.

IV Discussion

This section presents the discussion on the experimental results, application scope of this research, and its limitations.

IV-A Experimental Inference

This subsection provides a discussion on the results of the four experiments presented in this paper.

IV-A1 Time Invariant Analysis

Firstly, we observe that VGG16 and Inception-V3 models’ performance is similar to the performance of our previous techniques, that is, a custom-CNN architecture and ResNet-18 architecture (as shown in Table II). ResNet-18, with fewer parameters than Inception-V3 and VGG16 models, has better water stress classification performance due to a higher degree of overfitting in the latter two larger models. Secondly, we infer that VGG16 and Inception-V3 models’ performance on JG images is better than Pusa images, which is consistent with our previous results, as shown in Table II. This can be attributed to the water-resistant nature of Pusa species and the water-sensitive nature of JG species. In other words, the visual changes introduced due to water stress are more prominent in JG than Pusa, thus making it easier to classify JG images into the three water stress categories. Lastly, we observe that Inception-V3 models produced better results than VGG16 models across both chickpea species. This can be explained by relating these results with both these architectures’ classification results on the ImageNet dataset. Inception-V3 outperforms VGG16 in both the top 1 and top 5 error rates (%) because it can extract better visual features [32, 31]. This suggests that image-based classification by transfer learning from Inception-V3 should be better than VGG16, consistent with the results shown in Table II. Therefore, time-invariant water-stress classification is network-dependent.

IV-A2 Temporal Analysis

On comparing the results of temporal analysis of visual changes induced in the chickpea plant shoots due to water stress (shown in Table III) with the time-invariant analysis (shown in Table II), we observe that temporal analysis outperforms the best reported time-invariant scores by at least 14% for both chickpea varieties. Additionally, each chickpea variety’s results are consistent for both feature extractors, showing that the proposed CNN-LSTM technique is independent of the feature extractor network. We also observe that the CNN-LSTM performs better on JG than Pusa. This observation corresponds to the inherent water stress sensitivity characteristics of these two chickpea varieties. JG is water stress-sensitive, thus producing more noticeable visual changes over time than Pusa, which is water stress-tolerant. This can also be seen from the time-invariant analysis results in this paper, as shown in Table II.

IV-A3 Feasability analysis of CNN feature extractor using Grad-CAM

This section examines the feasibility of the CNN feature extractor considered in this paper. We employ Gradient-weighted Class Activation Mapping (Grad-CAM) [40]. Grad-CAM uses the gradient information flowing into the last convolutional layer of CNN to understand each neuron for a class label of interest. As we use the same CNN feature extractors for time-invariant and our proposed temporal analysis, Grad-CAM visualization of the CNN network used for time-invariant analysis can be used to approximately extrapolate the behavior of these extractors in the temporal context. Further, we use the CNN network with Inception-V3 feature extractor, the best-reported network by this work. While applying Grad-CAM, we obtain the class discriminative localization map of width u and height v for a water stress class c by first computing the gradient of the score for that class, that is, ${y}^{c}$ (before the softmax), for feature maps $\alpha_{k}$ of a convolutional layer. These gradients flowing back are global average-pooled over the width and height dimensions (indexed by i and j respectively) to obtain the neuron importance weights $\alpha_{k}^{c}$ .

\alpha_{k}^{c}=\overbrace{\frac{1}{Z}\sum_{i}\sum_{j}}^{\text{global\ average\ pooling }}\underbrace{\frac{\partial y^{c}}{\partial A_{ij}^{k}}}_{\text{gradients\ via\ backprop }}

(15)

After calculating $\alpha_{k}^{c}$ , we perform a weighted combination of the activation maps and follow it by a ReLU. Without it, the class activation map highlights more than required and achieves low localization performance.

L_{\mathrm{Grad}-\mathrm{CAM}}^{c}=\operatorname{ReLU}\underbrace{\left(\sum_{k}\alpha_{k}^{c}A^{k}\right)}_{\text{linear \ combination }}

(16)

Subsequently, we superimpose the activation map (heatmap) with the original image to coarsely visualize which region it focuses on to classify water stress. In the images shown in Fig. 11 and 11, the intensity of yellow is directly proportional to the intensity of neural activation with respect to the predicted class. In other words, CNN focuses on the yellow highlighted regions in the image to make its prediction. From the figures, we observe that the CNN focuses on the shoot of the chickpea plant to predict water stress for both varieties. Further, this area of focus varies with the size and shape of the shoot. These visualizations explain and validate the use of chickpea plant shoot images to detect water stress.

IV-A4 Robustness Analysis

On comparing the results in Table III and Table IV, we observe that the mean accuracy of our model on noisy test data is less than the accuracy on noise-free test data, by atmost 2.5%. This decrease is consistent for JG and Pusa varieties and VGG16 and Inception-V3 feature extractors. Even in the presence of noise, each chickpea variety’s results are consistent for both the feature extractors, thereby highlighting that the temporal technique is independent of the feature extractor. The model accuracy on the JG variety is greater than that of Pusa, which further validates the water-sensitive nature of JG over the Pusa variety. An interesting observation is the small standard deviation from the mean accuracy for all the CNN-LSTM models. A small standard deviation demonstrates that the model will not be adversely affected by noise, and its accuracy will remain reasonably consistent. Thus, the small decrease in classification accuracy and a fairly consistent average accuracy in noisy conditions makes this technique suitable for real-time deployment.

IV-A5 Ablation Study

We draw the following inferences from the result. Firstly, the graphs in Fig. 9 and Table V demonstrate that by decreasing the number of session data for training the model, we decrease its ability to discern water stress conditions. This observation is reasonable because a longer image sequence will learn better differentiating features, especially since water stress on the shoot is prominent in the later stages of growth. Secondly, the performance metric curves (as shown in Fig 9) for a given plant species are similar for both the feature extractors. This emphasizes that temporal analysis using CNN-LSTM models has negligible dependence on the CNN feature extractor used. On the contrary, the time-invariant classification of water stress depends on the CNN architectures employed, as shown in Table II. This observation further reinforces the merit of temporal analysis. Lastly, we also observe that these curves and final scores are similar across both species, thereby demonstrating that temporal analysis performs well across different chickpea plant species. Species invariance is another beneficial characteristic for the real-time deployment of this technique.

IV-B Application

The image and deep learning-based water stress classification methods can detect the lack of water in plants and its excess. This can help farmers optimize irrigation, which will, in turn, prevent unnecessary expenditure and promote optimum productivity by ensuring good soil health. Our deep learning pipeline focuses on solving water stress due to water deficiency crops may face during the growth period. In our dataset, we use a single chickpea plant per image and fluorescent lighting to simulate daylight conditions. Then, we train an CNN-LSTM model to learn visual spatial-temporal features that help classify water stress. To use our approach in real-time, we will require images of an individual crop from a field, taken over time. This will act as real-time test data. We can repeat this process for multiple crops in the field to get a general idea about the water stress situation.

IV-C Limitations

Our proposed deep learning pipeline has shown merits in the form of high water stress identification performance, robustness to noisy conditions, and independence from the type of visual feature extractor used. However, it does have a couple of limitations. Firstly, we train the CNN-LSTM model on our dataset that simulates daylight conditions during photo capturing sessions. Thus, the model may show a slight variation in performance when actually used in daylight. Secondly, our approach requires that one plant per frame for accurate analysis because it has been trained on a dataset with one plant per image. For real-time deployment, plant instance detection from an image followed by its extraction may be required, which can increase processing overheads.

V Conclusion

In this paper, a novel deep learning-based pipeline for plant water stress (water deficiency) phenotyping has been proposed and validated via a detailed study on water stress identification from Chickpea plant shoot images. The pipeline consists of four main stages - image sequence input, data augmentation (and input processing), Convolutional Neural Network - Long Short Term Memory (CNN-LSTM) network, and water stress prediction. There are no publicly available datasets of pulses plant shoot images, specifically Chickpea plant, so a new dataset of two varieties of chickpea shoot images under different water stress conditions has been considered. The authors ensured high-quality training data by taking adequate measures during data acquisition and applying data augmentation techniques like Gaussian noise augmentation before neural processing. The proposed pipeline employs a CNN-LSTM to learn visual spatio-temporal patterns and use them to classify water stress. This work demonstrates that temporal analysis of chickpea plant shoot images outperforms the best time-invariant (or only spatial) analysis by nearly $14\%$ for both chickpea varieties. Further, the experimental results show that the temporal approach is independent of the underlying CNN feature extractor. This study also illustrates the robustness of our proposed CNN-LSTM model to noise. Across both species, the average model accuracy dipped by less than 2.5%, with a small standard deviation, thereby ensuring high and consistent classification capabilities even in noisy conditions. Moreover, the Grad-CAM visualizations explain and validate the use of chickpea plant shoot images to detect water stress. The ablation study further reveals that the proposed CNN-LSTM model, consequently the proposed deep learning pipeline, performs equitably on both water stress-sensitive species JG-62 and stress-resistant Pusa-372. Finally, the results of all four experiments in this paper validate the stress-sensitive nature of JG-62 and the stress-tolerant nature of Pusa-372. The findings in this paper demonstrate the potential of the proposed technique for real-time applications like plant stress monitoring and intelligent irrigation. However, this technique is not without its caveats. The proposed method has been validated on a controlled dataset while ensuring a high degree of resemblance to real-world conditions and data. Techniques like noise removal and plant shoot segmentation may be required while dealing with real-world data, which may, in turn, increase computational overheads. Nevertheless, we believe that the proposed deep learning pipeline will form the basis for future work in this domain. We encourage researchers to validate our work on their datasets and build upon this pipeline. Our future works will also be focused on proposing new components for our deep learning pipeline that will make it more robust and take it closer to real-world deployment. We are also experimenting with lightweight models that will be less-compute intensive.

References

[1] D. Shadrin, A. Menshchikov, A. Somov, G. Bornemann, J. Hauslage, and M. Fedorov, “Enabling precision agriculture through embedded sensing with artificial intelligence,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 7, pp. 4103–4113, 2019.
[2] D. I. Patrício and R. Rieder, “Computer vision and artificial intelligence in precision agriculture for grain crops: A systematic review,” Computers and electronics in agriculture, vol. 153, pp. 69–81, 2018.
[3] G. Bai, Y. Ge, W. Hussain, P. S. Baenziger, and G. Graef, “A multi-sensor system for high throughput field phenotyping in soybean and wheat breeding,” Computers and Electronics in Agriculture, vol. 128, pp. 181–192, 2016.
[4] A. Singh, B. Ganapathysubramanian, A. K. Singh, and S. Sarkar, “Machine learning for high-throughput stress phenotyping in plants,” Trends in plant science, vol. 21, no. 2, pp. 110–124, 2016.
[5] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.
[6] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in Advances in neural information processing systems, 1990, pp. 396–404.
[7] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[8] A. Kamilaris and F. X. Prenafeta-Boldú, “Deep learning in agriculture: A survey,” Computers and electronics in agriculture, vol. 147, pp. 70–90, 2018.
[9] S. H. Lee, C. S. Chan, S. J. Mayo, and P. Remagnino, “How deep learning extracts and learns leaf features for plant classification,” Pattern Recognition, vol. 71, pp. 1–13, 2017.
[10] L. C. Uzal, G. L. Grinblat, R. Namías, M. G. Larese, J. Bianchi, E. Morandi, and P. M. Granitto, “Seed-per-pod estimation for plant breeding using deep learning,” Computers and electronics in agriculture, vol. 150, pp. 196–204, 2018.
[11] S. Madec, X. Jin, H. Lu, B. De Solan, S. Liu, F. Duyme, E. Heritier, and F. Baret, “Ear density estimation from high resolution rgb imagery using deep learning technique,” Agricultural and Forest Meteorology, vol. 264, pp. 225–234, 2019.
[12] Y. Sun, Y. Liu, G. Wang, and H. Zhang, “Deep learning for plant identification in natural environment,” Computational intelligence and neuroscience, vol. 2017, 2017.
[13] J. G. A. Barbedo, “Plant disease identification from individual lesions and spots using deep learning,” Biosystems Engineering, vol. 180, pp. 96–107, 2019.
[14] J. Zhang, D. Du, Y. Bao, J. Wang, and Z. Wei, “Development of multifrequency-swept microwave sensing system for moisture measurement of sweet corn with deep neural network,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 9, pp. 6446–6454, 2020.
[15] K. Kuwata and R. Shibasaki, “Estimating crop yields with deep learning and remotely sensed data,” in 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 2015, pp. 858–861.
[16] L. Song, S. Prince, B. Valliyodan, T. Joshi, J. V. M. dos Santos, J. Wang, L. Lin, J. Wan, Y. Wang, D. Xu et al., “Genome-wide transcriptome analysis of soybean primary root under varying water-deficit conditions,” BMC genomics, vol. 17, no. 1, p. 57, 2016.
[17] G. Sehgal, B. Gupta, K. Paneri, K. Singh, G. Sharma, and G. Shroff, “Crop planning using stochastic visual optimization,” pp. 47–51, 2017.
[18] S. Azimi, T. Kaur, and T. K. Gandhi, “A deep learning approach to measure stress level in plants due to nitrogen deficiency,” Measurement, vol. 173, p. 108650, 2021.
[19] Z. Gao, Z. Luo, W. Zhang, Z. Lv, and Y. Xu, “Deep learning application in plant stress imaging: a review,” AgriEngineering, vol. 2, no. 3, pp. 430–446, 2020.
[20] A. Kumar, S. Nath, A. Kumar, A. K. Yadav, and D. Kumar, “Combining ability analysis for yield and yield contributing traits in chickpea (cicer arietinum l.),” Journal of Pharmacognosy and Phytochemistry, vol. 7, no. 1, pp. 2522–2527, 2018.
[21] M. Kumar, M. A. Yusuf, and M. Nigam, “An update on genetic modification of chickpea for increased yield and stress tolerance,” Molecular biotechnology, vol. 60, no. 8, pp. 651–663, 2018.
[22] V. Devasirvatham and D. Tan, “Impact of high temperature and drought stresses on chickpea production,” Agronomy, vol. 8, no. 8, p. 145, 2018.
[23] S. D. Gupta, A. S. Manjri, and P. S. R. Kewat, “Effect of drought stress on carbohydrate content in drought tolerant and susceptible chickpea genotypes,” IJCS, vol. 6, no. 2, pp. 1674–1676, 2018.
[24] S. Azimi, T. Kaur, and T. K. Gandhi, “Water stress identification in chickpea plant shoot images using deep learning,” in 2020 IEEE 17th India Council International Conference (INDICON). IEEE, 2020, pp. 1–7.
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” pp. 770–778, 2016.
[26] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[27] J. Ma, H. Liu, C. Peng, and T. Qiu, “Unauthorized broadcasting identification: A deep lstm recurrent learning approach,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 9, pp. 5981–5983, 2020.
[28] T. Bao, S. A. R. Zaidi, S. Xie, P. Yang, and Z.-Q. Zhang, “A cnn-lstm hybrid model for wrist kinematics estimation using surface electromyography,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–9, 2020.
[29] Z. Wu, T. Yao, Y. Fu, and Y.-G. Jiang, “Deep learning for video classification and captioning,” in Frontiers of multimedia research, 2017, pp. 3–29.
[30] A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S. W. Baik, “Action recognition in video sequences using deep bi-directional lstm with cnn features,” IEEE Access, vol. 6, pp. 1155–1166, 2017.
[31] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[32] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[35] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[36] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.
[37] P. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[38] Y. LeCun, D. Touresky, G. Hinton, and T. Sejnowski, “A theoretical framework for back-propagation,” in Proceedings of the 1988 connectionist models summer school, vol. 1. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988, pp. 21–28.
[39] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[40] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.