
¹ Peking University, Beijing 100091, China
  [email protected], [email protected]
² University of Science and Technology Beijing, Beijing 100083, China
  [email protected], [email protected]

CountMamba: Exploring Multi-directional Selective State-Space Models for Plant Counting

Hulingxiao He¹, Yaqi Zhang², Jinglin Xu², Yuxin Peng (🖂)¹
Abstract

Plant counting is essential at every stage of agriculture, including seed breeding, germination, cultivation, fertilization, pollination, yield estimation, and harvesting. Inspired by the fact that humans count objects in high-resolution images by scanning them sequentially, we explore the potential of handling plant counting tasks via state space models (SSMs). In this paper, we propose a new counting approach named CountMamba that constructs multiple counting experts to scan from various directions simultaneously. Specifically, we design a Multi-directional State-Space Group to process image patch sequences in multiple orders, simulating different counting experts. We also design Global-Local Adaptive Fusion to adaptively aggregate, in a sample-wise manner, the global features extracted from multiple directions and the local features extracted from a CNN branch. Extensive experiments demonstrate that the proposed CountMamba performs competitively on various plant counting tasks, including maize tassel, wheat ear, and sorghum head counting.

Keywords:
Smart Agriculture · Plant Counting · State-Space Models

1 Introduction

Plant counting is indispensable at nearly every critical stage of agricultural production, spanning seed breeding [11], germination [26], cultivation [22, 33, 23], fertilization [2], pollination, yield estimation [1], and harvesting [14]. Moreover, it plays a vital role in characterizing plant growth and extracting typical plant traits such as the number of leaves [8], corn ears [22], and wheat spikes [33], which can be utilized for yield prediction [1], seed quality testing [6], and verifying the impact of novel genes [31].

However, traditional yield estimation is predominantly performed manually, a time-consuming and labor-intensive process influenced by numerous subjective factors. Given these drawbacks, there is an urgent need for automatic plant counting to reduce human error and provide more accurate and timely insights into agricultural productivity.

Agricultural practitioners have tried to automate plant counting tasks over the past decade. Early approaches distinguished plants by color contrast [21], but this becomes challenging when plants are occluded or share similar colors. Recently, deep learning-based methods have eased this with segmentation [6] or detection [19] pipelines built upon Faster R-CNN [27] and fully convolutional networks (FCN) [20]. Although existing methods boost the performance of plant counting, they still encounter two challenges: 1) Arbitrary Direction: Since the images for plant counting are usually captured from a top-down perspective, the plants can be distributed along an arbitrary direction, such as the horizontal, vertical, diagonal, or anti-diagonal direction. Extracting spatial features along the direction of plant distribution could facilitate counting. 2) High Resolution: Since high-resolution imagery is frequently used on modern plant phenotyping platforms such as UAVs (unmanned aerial vehicles), it is inefficient to model long-range dependencies over the resulting long sequences of image patches.

Following the breakthrough of Mamba [9], a state-of-the-art model built on selective scanning (S6), there has been a notable surge in leveraging State Space Models (SSMs) across diverse computer vision tasks [13, 36, 37]. Inspired by the fact that humans count objects in high-resolution images by scanning sequentially along the direction of the plant distribution, Mamba serves as a promising solution for counting plants efficiently and effectively.

To explore the potential of SSMs for plant counting, we propose CountMamba, a simple but effective model that adapts Mamba to this task. CountMamba is formulated with three principal components. Specifically, 1) the Multi-directional State-Space Group (MSSG) operates with four parallel branches to extract features from various directions, with the branches consisting of stacked Horizontal State-Space Blocks (HSSBs), Vertical State-Space Blocks (VSSBs), Diagonal State-Space Blocks (DSSBs), and Anti-diagonal State-Space Blocks (ASSBs), respectively. Then, 2) the Global-Local Adaptive Fusion aggregates the global features from the MSSG adaptively in a sample-wise manner and employs a CNN branch to complement the global features with local information. Finally, 3) the Counter and Normalizer are utilized to infer the number of plants based on the normalized count map. By adaptively integrating local relationships and global information along appropriate directions, our CountMamba achieves competitive counting results on several plant counting tasks, including maize tassel, wheat ear, and sorghum head counting.

Our contributions can be summarized as follows:

  • We propose CountMamba, a new model introducing the state space models to perform plant counting tasks effectively and efficiently.

  • We propose the Multi-directional State-Space Group that extracts features of patch sequences in various orders and boosts Mamba to adapt to plants distributed in any direction.

  • We propose the Global-Local Adaptive Fusion that adaptively aggregates local information and global information from appropriate directions to select informative features in a sample-wise manner.

  • Extensive experiments on various plant counting benchmarks demonstrate that our CountMamba can provide a powerful and promising backbone for plant counting.

2 Related Work

2.1 Plant Counting

The number of plants is a strong predictor of crop yield and other agronomic traits. Early approaches distinguished plants from the background by color [21], exploiting the strong color contrast present in agricultural fields. Nevertheless, such methods struggle when plants are occluded or share similar colors. To address this issue, many methods resort to phenotypic parameters such as appearance and texture features [5], although these were also difficult to distinguish reliably at the time. With the emergence of deep learning, this difficulty has been largely alleviated.

Deep learning-based methods typically construct segmentation [6] or detection [19, 24, 35] pipelines to locate plants, since both tasks enjoy well-known, powerful, and easy-to-use frameworks such as Faster R-CNN [27] and fully convolutional networks (FCN) [20]. Beyond traditional convolutional networks, recently emerged architectures such as the Transformer [30] and Mamba [9] can also be utilized to extract plant phenotypic characteristics. Harada et al. [28] develop a hybrid wheat head detection model that incorporates a CNN and a Transformer to model long-range dependencies. WheatNet [35] consists of two parts: a Transform Network, which reduces the effect of differences in color features, and a Detection Network, which is designed to improve detection capability. FlowerNet [16] is a framework for flower counting based on YOLACT++ [3], a real-time instance segmentation model. TasselNet [22] is the first local count network applied to plants, and TasselNetV2 [33] shows that weak context around plants is essential for counting.

Compared with the widely used segmentation and detection pipelines, regression-based plant counting is less common. Giuffrida et al. [8] introduce scale and rotation invariance by learning features in a log-polar representation, aiming to count leaves arranged in round shapes. Wu et al. [32] propose a combined network that integrates density map regression and image segmentation for rice seedling counting. Regression-based methods generally regress on global image features and ignore spatial information, so their performance often lags behind density-map-based methods built upon detection or segmentation pipelines.

2.2 State Space Models

State Space Models (SSMs) have shown a remarkable and efficient ability to model long sequences by using selective state spaces to capture relevant information. Unlike the Transformer [30], SSMs run in time linear in the sequence length, making them particularly suitable for handling very long sequences. Gu et al. [10] first introduce the Structured State-Space Sequence model (S4), engineered with a specific focus on capturing long-range dependencies while retaining linear complexity. Inspired by this, various models have been developed, such as the S5 layer [29], which introduces MIMO SSMs and an efficient parallel scan into the S4 layer. Fu et al. [7] propose a novel SSM layer, H3, which significantly narrows the performance gap between SSMs and Transformer attention in language modeling. Mehta et al. [25] enhance S4 with a Gated State Space layer that incorporates additional gating units, thereby boosting its expressiveness and performance. Lately, Gu et al. [9] have introduced a data-dependent SSM layer and developed Mamba, a universal language model backbone. Mamba outperforms Transformers across various model sizes when trained on large-scale real-world data while scaling linearly with sequence length. Building on this work, Huang et al. [13] address the limited local 2-D dependency modeling of the original Mamba by dividing the image into windows, effectively capturing local dependencies while retaining Mamba's [9] global modeling ability. RS-Mamba [36] is proposed to process high-resolution remote sensing images. Vision Mamba [37] combines bidirectional SSMs for data-dependent global visual context modeling with position embeddings for position-aware visual recognition.

3 Methodology

3.1 Preliminary

State Space Models (SSMs) and the various models based on Mamba are inspired by continuous systems, which map a 1-D function or sequence $x(t)\in\mathbb{R}\rightarrow y(t)\in\mathbb{R}$ through a hidden state $h(t)\in\mathbb{R}^{N}$. Formally, SSMs employ the following ordinary differential equation (ODE) to model the input data:

$h'(t)=\mathbf{A}h(t)+\mathbf{B}x(t)$ (1)
$y(t)=\mathbf{C}h(t)$

where $\mathbf{A}\in\mathbb{R}^{N\times N}$ represents the evolution matrix, and $\mathbf{B}\in\mathbb{R}^{N\times 1}$ and $\mathbf{C}\in\mathbb{R}^{1\times N}$ denote the projection matrices. SSMs approximate this continuous ODE through discretization techniques. S4 and Mamba are discrete versions of the continuous system, which use the timescale parameter $\bm{\Delta}$ to convert the continuous parameters $\mathbf{A}$ and $\mathbf{B}$ into discrete parameters $\overline{\mathbf{A}}$ and $\overline{\mathbf{B}}$. The commonly used transformation is the zero-order hold (ZOH), defined as follows:

$\overline{\mathbf{A}}=\exp(\bm{\Delta}\mathbf{A})$ (2)
$\overline{\mathbf{B}}=(\bm{\Delta}\mathbf{A})^{-1}(\exp(\bm{\Delta}\mathbf{A})-\mathbf{I})\cdot\bm{\Delta}\mathbf{B}$.

After the discretization of $\overline{\mathbf{A}}$ and $\overline{\mathbf{B}}$, the discretized version of Equation (1) using a step size $\bm{\Delta}$ can be rewritten as:

$h_{t}=\overline{\mathbf{A}}h_{t-1}+\overline{\mathbf{B}}x_{t}$ (3)
$y_{t}=\mathbf{C}h_{t}$.

To enhance computational efficiency, the iterative process described in Equation (3) can be accelerated using parallel computation techniques, leveraging a global convolution operation:

$\bm{y}=\bm{x}\otimes\overline{\bm{K}},\quad\overline{\bm{K}}=(\mathbf{C}\overline{\mathbf{B}},\,\mathbf{C}\overline{\mathbf{A}}\,\overline{\mathbf{B}},\,\ldots,\,\mathbf{C}\overline{\mathbf{A}}^{L-1}\overline{\mathbf{B}})$ (4)

where $\otimes$ is the convolution operation, $L$ is the length of the input sequence $\bm{x}$, and $\overline{\bm{K}}\in\mathbb{R}^{L}$ is the kernel of the SSM.
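To make the discretization and the two equivalent computation modes concrete, the following NumPy sketch instantiates Eqs. (2)–(4) for a toy diagonal system and verifies that the sequential recurrence of Eq. (3) and the global convolution of Eq. (4) produce the same output. All shapes and parameter values are illustrative and are not those of the actual Mamba layer.

```python
import numpy as np

N, L = 4, 16                          # state size, sequence length
rng = np.random.default_rng(0)
a = -rng.uniform(0.5, 1.5, N)         # diagonal of the continuous evolution matrix A
B = rng.standard_normal((N, 1))       # input projection B
C = rng.standard_normal((1, N))       # output projection C
delta = 0.1                           # timescale parameter Delta

# Eq. (2): zero-order-hold discretization (A is diagonal, so expm is element-wise).
A_bar = np.diag(np.exp(delta * a))
B_bar = np.diag((np.exp(delta * a) - 1.0) / (delta * a)) @ (delta * B)

x = rng.standard_normal(L)            # input sequence of length L

# Eq. (3): sequential recurrence.
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = (C @ h).item()

# Eq. (4): the same output expressed as a causal convolution with kernel K_bar.
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(L)])
y_conv = np.array([np.dot(K[:t + 1][::-1], x[:t + 1]) for t in range(L)])

assert np.allclose(y_rec, y_conv)     # recurrence and convolution views agree
```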

Figure 1: An overview of the proposed CountMamba. It contains a Multi-directional State-Space Group (MSSG) comprised of stacked Horizontal State-Space Blocks (HSSBs), Vertical State-Space Blocks (VSSBs), Diagonal State-Space Blocks (DSSBs), and Anti-diagonal State-Space Blocks (ASSBs) in parallel, followed by Global-Local Adaptive Fusion, Counter and Normalizer to achieve plant counting.

3.2 Overall Architecture

As shown in Fig. 1, CountMamba consists of three main components: the Multi-directional State-Space Group (MSSG), Global-Local Adaptive Fusion (GLAF), and the Counter and Normalizer. Given an input image $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$, we flatten it into a patch sequence and then project it to a feature sequence via a linear transformation. Subsequently, we employ the Multi-directional State-Space Group (MSSG) to acquire the down-sampled deep features $\bm{F}_{H}$, $\bm{F}_{V}$, $\bm{F}_{D}$, and $\bm{F}_{A}$ from stacked Horizontal State-Space Blocks (HSSBs), Vertical State-Space Blocks (VSSBs), Diagonal State-Space Blocks (DSSBs), and Anti-diagonal State-Space Blocks (ASSBs). $\bm{F}_{H}$, $\bm{F}_{V}$, $\bm{F}_{D}$, and $\bm{F}_{A}$ are then fused adaptively in a sample-wise manner and refined by local features extracted by the CNN branch. Finally, the aggregated features pass through the Counter and Normalizer to predict the normalized count map $\mathbf{C}_{n}\in\mathbb{R}^{H\times W}$, which is summed up to infer the overall number of plants $C$ in the image.
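To make the dataflow concrete, below is a shape-level PyTorch sketch of this pipeline. It is a structural sketch only: the four directional branches, the CNN branch, and the counter are placeholders, the adaptive fusion of Sec. 3.4 is replaced by a plain average, the normalizer of Sec. 3.5 is omitted, and patch embedding is applied directly at the output stride $s=8$ (the real model embeds $2\times 2$ patches and downsamples inside the MSSG).

```python
import torch
import torch.nn as nn

class CountMambaSketch(nn.Module):
    """Structural sketch of Fig. 1 with placeholder sub-modules."""
    def __init__(self, dim=96, stride=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=stride, stride=stride)
        # Stand-ins for the stacked HSSB / VSSB / DSSB / ASSB branches of the MSSG.
        self.branches = nn.ModuleList([nn.Identity() for _ in range(4)])
        self.cnn_branch = nn.Sequential(                      # local-feature branch
            nn.Conv2d(3, dim, 3, stride=stride, padding=1),
            nn.BatchNorm2d(dim), nn.ReLU())
        self.counter = nn.Conv2d(dim, 1, kernel_size=1)       # count per local region

    def forward(self, img):                                   # img: (B, 3, H, W)
        tokens = self.patch_embed(img)                        # (B, dim, H/s, W/s)
        f_h, f_v, f_d, f_a = [b(tokens) for b in self.branches]
        f_global = (f_h + f_v + f_d + f_a) / 4                # plain average stand-in for Eq. (5)
        f_fused = f_global + self.cnn_branch(img)             # Eq. (7) with beta = 1
        count_map = self.counter(f_fused).relu()              # redundant count map C_r
        return count_map.sum(dim=(1, 2, 3))                   # naive sum; normalizer omitted

# Example: CountMambaSketch()(torch.randn(2, 3, 256, 256)) returns a tensor of shape (2,).
```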

3.3 Multi-directional State-Space Group

The Multi-directional State-Space Group (MSSG) is designed to extract plant image features from various directions to adapt to different distributions. As shown in Fig. 1, MSSG is comprised of four parallel branches, consisting of stacked Horizontal State-Space Blocks (HSSBs), Vertical State-Space Blocks (VSSBs), Diagonal State-Space Blocks (DSSBs), and Anti-diagonal State-Space Blocks (ASSBs). HSSB, VSSB, DSSB, and ASSB share the same architecture except for the scan direction, possessing a Horizontal State-Space Module (HSSM), Vertical State-Space Module (VSSM), Diagonal State-Space Module (DSSM), and Anti-diagonal State-Space Module (ASSM), respectively. The details of HSSM, VSSM, DSSM, and ASSM are illustrated in Fig. 2.

Inspired by the design of [36], a layer normalization is adopted to standardize the input data for balancing computational efficiency and capability, followed by a linear layer for transformation. Then, a depth-wise convolution operates on each channel separately to extract local features. The features are fed into the State-Space Module of the corresponding direction, which performs selective scanning in both the forward and backward directions, simulating an expert counting in a preferred scanning order. The output is then linearly transformed and gated by a linear transformation of the normalized features. Finally, the features pass through another linear layer and are added to the inputs through a residual connection.
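As an illustration of the four scan orders, the sketch below builds index permutations that flatten an $h\times w$ grid of patch tokens along the horizontal, vertical, diagonal, and anti-diagonal directions; the exact traversal order within each diagonal used by the actual blocks is an assumption here.

```python
import torch

def scan_orders(h, w):
    """Index permutations (each of length h*w) for the four scan directions."""
    idx = torch.arange(h * w).reshape(h, w)
    horizontal = idx.reshape(-1)                       # row-major, left to right
    vertical = idx.t().reshape(-1)                     # column-major, top to bottom
    # Diagonal: traverse lines parallel to the main diagonal.
    diagonal = torch.cat([torch.diagonal(idx, offset=o) for o in range(-(h - 1), w)])
    # Anti-diagonal: the same traversal on the left-right flipped grid.
    anti_diagonal = torch.cat([torch.diagonal(torch.fliplr(idx), offset=o)
                               for o in range(-(h - 1), w)])
    return horizontal, vertical, diagonal, anti_diagonal

# Usage for one directional expert on a token sequence x of shape (B, h*w, C):
#   order = scan_orders(h, w)[2]         # e.g. the diagonal order
#   x_fwd = x[:, order]                  # forward scan
#   x_bwd = x_fwd.flip(dims=[1])         # backward scan
#   inverse = torch.argsort(order)       # restores the original spatial order afterwards
```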

3.4 Global-Local Adaptive Fusion

As plants in different images can be distributed in any direction, we design a sample-wise fusion mechanism to adaptively aggregate features from different directions, which can be defined as:

$\bm{F}_{\text{global}}=\alpha_{H}\circ\bm{F}_{H}+\alpha_{V}\circ\bm{F}_{V}+\alpha_{D}\circ\bm{F}_{D}+\alpha_{A}\circ\bm{F}_{A}$ (5)

where $\bm{F}_{\text{global}}$ represents the adaptively aggregated feature embedding, $\circ$ denotes element-wise multiplication, and $\bm{F}_{H},\bm{F}_{V},\bm{F}_{D},\bm{F}_{A}$ refer to the features extracted by HSSM, VSSM, DSSM, and ASSM, respectively. The adaptive fusion weights $\alpha_{H}$, $\alpha_{V}$, $\alpha_{D}$, and $\alpha_{A}$ are defined as:

$\alpha_{H},\alpha_{V},\alpha_{D},\alpha_{A}=\text{softmax}(\mathbf{W}\cdot\text{Concat}(\bm{F}_{H},\bm{F}_{V},\bm{F}_{D},\bm{F}_{A}))$ (6)

where $\mathbf{W}$ is a learnable linear transformation. Moreover, we add a lightweight CNN branch that contains stacked layers of convolution, batch normalization, non-linear activation, and max pooling to extract local features $\bm{F}_{\text{local}}$. The detailed architecture of the CNN branch is illustrated in Fig. 2. Finally, the aggregated features from multiple directions $\bm{F}_{\text{global}}$ are refined by $\bm{F}_{\text{local}}$ to obtain the final features $\bm{F}_{\text{fused}}$, defined as:

$\bm{F}_{\text{fused}}=\bm{F}_{\text{global}}+\beta\cdot\bm{F}_{\text{local}}$ (7)

where $\beta$ is a hyper-parameter controlling the weight of the local information.
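A minimal PyTorch sketch of Eqs. (5)–(7) is given below. It assumes one fusion weight per direction at each spatial location; the exact granularity of the sample-wise weights (e.g., per channel or per feature map) may differ in the actual implementation.

```python
import torch
import torch.nn as nn

class GlobalLocalAdaptiveFusion(nn.Module):
    """Sketch of Eqs. (5)-(7): adaptive direction weights plus CNN-branch refinement."""
    def __init__(self, dim, beta=1.0):
        super().__init__()
        self.w = nn.Conv2d(4 * dim, 4, kernel_size=1)   # learnable W of Eq. (6)
        self.beta = beta                                 # hyper-parameter of Eq. (7)

    def forward(self, f_h, f_v, f_d, f_a, f_local):
        feats = [f_h, f_v, f_d, f_a]                                         # each: (B, dim, h, w)
        alphas = torch.softmax(self.w(torch.cat(feats, dim=1)), dim=1)        # Eq. (6)
        f_global = sum(alphas[:, i:i + 1] * f for i, f in enumerate(feats))   # Eq. (5)
        return f_global + self.beta * f_local                                 # Eq. (7)

# Example with random stride-8 features of a 256 x 256 crop:
#   fuse = GlobalLocalAdaptiveFusion(dim=96)
#   f_h, f_v, f_d, f_a, f_local = [torch.randn(2, 96, 32, 32) for _ in range(5)]
#   out = fuse(f_h, f_v, f_d, f_a, f_local)   # (2, 96, 32, 32)
```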

Figure 2: Illustration of the structure of HSSM, VSSM, DSSM, ASSM, and CNN branch.

3.5 Counter and Normalizer

The Counter is used to predict the number of plants in a specified local region, while the Normalizer addresses the duplicate counting caused by overlapping regions during the counting process [33]. After the image features $\bm{F}_{\text{fused}}$ are extracted, they are sent to the Counter for count prediction. Following [33], we regress a count map $\mathbf{C}_{r}$, with each element representing the count of a local region of size $r\times r$. In this mapping, the local region size $r$ and the output stride $s$ are critical, since they determine whether the count map is redundant. Throughout the experiments, it is necessary to ensure that $r\geq s$. When $r>s$, every two adjacent local regions overlap by $\frac{r-s}{r}$, and the count map becomes redundant; the overlap vanishes only when $r=s$. In our paper, $r$ and $s$ are set to 64 and 8, respectively, so the resulting count map is redundant. A normalization step must therefore be performed to eliminate the redundancy and ensure that the sum of the normalized count map accurately reflects the plant count of the image. The base input size of the network is $r\times r$, which is tied only to the network architecture; owing to this design, the input image size can be arbitrary. For an input image $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$, CountMamba defines a transformation $f(\mathbf{I}):\mathbb{R}^{H\times W\times 3}\rightarrow\mathbb{R}^{\frac{H}{s}\times\frac{W}{s}}$, where $H,W\gg r$.
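The sketch below illustrates one plausible way such a normalizer can be realized: each redundant local count is spread uniformly over its $r\times r$ window, and every pixel is then divided by the number of windows covering it, so the resulting map sums approximately to the image-level count. This follows the normalization idea of [33] only at a high level; the exact normalizer used in the paper may differ in detail.

```python
import torch

def normalize_count_map(c_r, r=64, s=8):
    """c_r: (B, 1, H/s, W/s) redundant local counts -> (B, 1, H, W) normalized map C_n."""
    b, _, gh, gw = c_r.shape
    h, w = gh * s, gw * s
    acc = torch.zeros(b, 1, h, w)                 # accumulated per-pixel density
    cov = torch.zeros(1, 1, h, w)                 # number of r x r windows covering each pixel
    for i in range(gh):
        for j in range(gw):
            y0, x0 = i * s, j * s
            y1, x1 = min(y0 + r, h), min(x0 + r, w)   # windows are clipped at the border
            area = (y1 - y0) * (x1 - x0)
            acc[:, :, y0:y1, x0:x1] += c_r[:, :, i:i + 1, j:j + 1] / area
            cov[:, :, y0:y1, x0:x1] += 1.0
    return acc / cov.clamp(min=1.0)

# Eq. (8): the image-level count is the sum of the normalized map, e.g.
#   c_n = normalize_count_map(torch.rand(1, 1, 32, 32))   # a 256 x 256 image with s = 8
#   count = c_n.sum(dim=(1, 2, 3))
```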

Table 1: Performance on the MTC dataset. \dagger indicates our re-implementation.
Method MAE RMSE rMAE(%) rRMSE(%) R²
MCNN [34] 17.9 21.9 273.4 692.5 0.33
CSRNet [15] 6.9 11.5 77.8 190.3 0.82
SANet [4] 5.1 11.4 28.7 60.9 -
BCNet [18] 5.2 9.2 31.8 62.5 0.88
SFC2Net [17] 5.0 9.4 17.7 24.6 0.89
TasselNet [22] 6.6 9.9 44.8 89.9 0.87
TasselNetV2 [33] 5.4 9.2 31.9 69.5 0.89
TasselNetV2+ [33] 5.1 9.1 - - 0.89
TasselNetV2+ [33] 4.9 8.5 25.8 46.4 0.90
STEERER [12] 5.4 8.1 44.7 107.0 0.89
CountMamba (Ours) 4.6 7.9 26.2 49.9 0.92
Table 2: Performance on the WED dataset. \dagger indicates our re-implementation.
Method MAE RMSE rMAE(%) rRMSE(%) R²
MCNN [34] 11.5 15.6 8.4 11.0 0.38
CSRNet [15] 4.2 5.2 3.2 4.1 0.94
SANet [4] 4.9 6.2 3.9 5.0 -
BCNet [18] 4.1 4.9 3.1 3.8 0.94
TasselNet [22] 6.8 8.3 - 7.1 0.79
TasselNetV2 [33] 5.3 6.8 4.1 5.3 0.90
TasselNetV2+ [33] 4.9 6.1 - 4.6 0.91
TasselNetV2+ [33] 4.8 5.9 3.7 4.6 0.91
SFC2Net [17] 4.2 5.1 3.2 4.2 0.93
STEERER [12] 5.8 6.8 4.3 5.2 0.89
CountMamba (Ours) 5.3 6.5 4.0 4.9 0.89

As a redundant count map $\mathbf{C}_{r}\in\mathbb{R}^{\frac{H}{s}\times\frac{W}{s}}$ emerges after the operation $f$, a normalized count map $\mathbf{C}_{n}\in\mathbb{R}^{H\times W}$ is computed. Subsequently, the image-level count $C$ can be calculated by aggregating $\mathbf{C}_{n}$ as follows:

$C=\sum_{x=1}^{W}\sum_{y=1}^{H}\mathbf{C}_{n}(x,y)$ (8)

where $\mathbf{C}_{n}(x,y)$ represents the element of $\mathbf{C}_{n}$ indexed by $x$ and $y$.

3.6 Loss Function

Following previous works [22, 33], we optimize our CountMamba with the $L_{1}$ loss on the counting values, which can be formulated as:

$L=\frac{1}{B}\sum_{i=1}^{B}|C_{i}-C_{i}^{*}|$ (9)

where $B$ is the batch size, and $C_{i}$ and $C_{i}^{*}$ are the predicted number of plants and the ground truth in the $i$-th image, respectively.

4 Experiments

4.1 Dataset and Evaluation Metric

The MTC dataset is a maize tassel counting dataset first introduced by Lu et al. [22]. The maize tassels in the MTC dataset exhibit significant variations in scale and shape. The original image resolutions include $3648\times 2736$, $4272\times 2848$, and $3456\times 2304$ pixels. For the experiment, 186 images are randomly selected as the training set, while the remaining 175 images are allocated to the test set. Every maize tassel in the dataset is manually labeled with a bounding box, and these bounding-box annotations are converted into dot annotations by taking their central coordinates.

The WED dataset is a wheat ear dataset first introduced by Madec et al. [24]. The images have a resolution of $6000\times 4000$ pixels, with the number of ears varying from 80 to 170 per image. The dataset contains 236 images, with 165 allocated for training and 71 for testing. While bounding box annotations are provided, only the center point of each box is utilized to ensure a uniform comparison of different methods in this experiment.

The SHC dataset is a sorghum head counting dataset first introduced by Guo et al. [11]. It consists of two subsets with 52 and 40 images, respectively. All images are labeled with dot annotations. Our experiments are evaluated on dataset 1, in which 26 images are randomly selected for training and the remaining images are used for testing. We follow the same training configuration used in [33].

To evaluate the counting performance on the above datasets, we adopt MAE, RMSE, rMAE, rRMSE, and $\text{R}^{2}$ as the basic metrics. These metrics are calculated as follows:

$\text{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|G_{i}-P_{i}\right|$ (10)
$\text{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(G_{i}-P_{i})^{2}}$ (11)
$\text{rMAE}=\frac{1}{N}\sum_{i=1}^{N}\left|\frac{G_{i}-P_{i}}{G_{i}}\right|$ (12)
$\text{rRMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{G_{i}-P_{i}}{G_{i}}\right)^{2}}$ (13)
$\text{R}^{2}=1-\frac{\sum_{i=1}^{N}\left(P_{i}-G_{i}\right)^{2}}{\sum_{i=1}^{N}\left(P_{i}-\bar{G}\right)^{2}}$ (14)

where $G_{i}$ is the ground-truth plant count of the $i$-th image, $P_{i}$ is the predicted count of the $i$-th image, $N$ is the total number of images, and $\bar{G}$ is the mean ground-truth count.
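For reference, a minimal NumPy sketch of Eqs. (10)–(14), where g holds the ground-truth counts and p the predicted counts over all test images:

```python
import numpy as np

def counting_metrics(g, p):
    """Return MAE, RMSE, rMAE(%), rRMSE(%), and R^2 as defined in Eqs. (10)-(14)."""
    g, p = np.asarray(g, dtype=float), np.asarray(p, dtype=float)
    mae = np.mean(np.abs(g - p))
    rmse = np.sqrt(np.mean((g - p) ** 2))
    rmae = np.mean(np.abs((g - p) / g)) * 100              # reported in percent
    rrmse = np.sqrt(np.mean(((g - p) / g) ** 2)) * 100     # reported in percent
    r2 = 1.0 - np.sum((p - g) ** 2) / np.sum((p - g.mean()) ** 2)
    return mae, rmse, rmae, rrmse, r2

# Example: counting_metrics([120, 95, 60], [118, 101, 57])
```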

4.2 Implementation Details

All images in the MTC dataset are resized to 512 × 512, and the crop size is set to 256 × 256. For the WED dataset, the down-sampling ratio is set to $\frac{1}{8}$ and the crop size to 256 × 256. All images in the SHC dataset are randomly cropped to 256 × 1024. The training process employs the Adam optimizer with an initial learning rate of $1\times 10^{-4}$. The image patch size is $2\times 2$, and the hyper-parameter $\beta$ is set to 1. All experiments are carried out with the PyTorch framework on a single NVIDIA GeForce RTX 4090 GPU.
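A minimal sketch of this optimization setup (Adam with learning rate $1\times 10^{-4}$ and 256 × 256 training crops on resized MTC images); the model constructor is the placeholder sketch from Sec. 3.2, and the remaining training-loop details are omitted:

```python
import torch
from torchvision import transforms

# Training-time preprocessing for the MTC dataset as described above.
train_transform = transforms.Compose([
    transforms.Resize((512, 512)),       # resize MTC images to 512 x 512
    transforms.RandomCrop((256, 256)),   # training crop size 256 x 256
    transforms.ToTensor(),
])

model = CountMambaSketch()               # placeholder model from the sketch in Sec. 3.2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```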

5 Results and Analysis

5.1 Comparison with the State-of-the-Arts

Tabs. 1, 2, and 3 show the quantitative comparisons between CountMamba and state-of-the-art plant counting methods on the MTC, WED, and SHC datasets, respectively. Thanks to its adaptation to plant distributions along different directions, our proposed CountMamba achieves competitive performance on all three benchmark datasets. Notably, CountMamba achieves the best MAE of 4.6 and RMSE of 7.9 on the MTC dataset, and the best RMSE of 18.6 on the SHC dataset.

Table 3: Performance on the SHC dataset. \dagger indicates our re-implementation.
Method MAE RMSE rMAE(%) rRMSE(%) R²
TasselNetV2 [33] 18.0 21.3 - - 0.96
TasselNetV2+ [33] 17.5 20.6 - - 0.96
TasselNetV2+ [33] 21.4 23.7 4.4 4.9 0.94
STEERER [12] 15.6 19.2 3.2 3.9 0.95
CountMamba (Ours) 15.9 18.6 3.2 3.7 0.96
Table 4: Ablation study of different scan directions on MTC dataset.
HSSM VSSM DSSM ASSM MAE RMSE rMAE(%) rRMSE(%) R²
5.0 8.4 34.3 78.0 0.91
4.6 7.7 30.1 59.4 0.92
4.6 7.9 27.5 55.6 0.91
4.6 7.9 26.2 49.9 0.92
Table 5: Ablation study of expert numbers on the MTC dataset.
#Experts MAE RMSE rMAE(%) rRMSE(%) R²
One 4.9 8.3 33.6 73.9 0.91
Two 4.7 8.0 28.7 58.1 0.91
Four 4.6 7.9 26.2 49.9 0.92
Table 6: Ablation study of Global-Local Adaptive Fusion on MTC dataset.
Adaptive Fusion CNN Branch MAE RMSE rMAE(%) rRMSE(%) R²
6.5 10.9 45.4 91.1 0.85
4.9 8.2 33.7 69.4 0.91
4.6 7.9 26.2 49.9 0.92
Figure 3: Qualitative results on the MTC, WED, and SHC datasets. Manual indicates the ground-truth count and Inferred the predicted count. Red points are manual annotations.

5.2 Ablation Study

Effects of different scan directions. To allow Mamba to count plants distributed along various directions, we build four SSMs inspired by [36], which uses scans in eight directions to generate scanned sequences. Here, we ablate the scan directions to study their effects, with the results shown in Tab. 4. Adding anti-diagonal, diagonal, vertical, and horizontal scanning in succession yields progressively better results, demonstrating that counting in different orders benefits plants distributed along various directions.

Effects of expert numbers. We compare different numbers of counting experts in Tab. 5, including 1) One: one State-Space Block that scans in the horizontal, vertical, diagonal, and anti-diagonal directions simultaneously; 2) Two: one State-Space Block that scans in the horizontal and vertical directions and the other in the diagonal and anti-diagonal directions; 3) Four: using HSSB, VSSB, DSSB and ASSB. The results show that assigning feature extraction in the horizontal, vertical, diagonal, and anti-diagonal directions to four experts achieves the best counting results.

Effects of Global-Local Adaptive Fusion. We ablate two key designs in Global-Local Adaptive Fusion. The results, presented in Tab. 6, indicate that 1) introducing a CNN branch to complement the global relationships with local information effectively enhances fine-grained local details; and 2) without sample-wise adaptive fusion, i.e., directly averaging the features from HSSM, VSSM, DSSM, and ASSM for counting, only sub-optimal results are obtained, whereas sample-wise adaptive fusion flexibly adapts to plants with different spatial distributions.

5.3 Visualization

Qualitative results of our proposed CountMamba on the MTC, WED, and SHC datasets are shown in Fig. 3. We observe that: 1) CountMamba obtains strong responses on plant regions and weak responses on non-plant regions in the count map and thus infers accurate counting results; 2) CountMamba is robust to the direction of plant distribution and consistently infers accurate counts ranging from tens to hundreds.

6 Conclusion and Future Work

In this paper, we propose CountMamba to explore the power of recent advanced state space models for plant counting. Our method leverages a Multi-directional State-Space Group to scan image patches in multiple orders, adapting to any distribution of plants. Moreover, Global-Local Adaptive Fusion is utilized to adaptively aggregate, in a sample-wise manner, the global features extracted from various directions and the local features extracted by the CNN branch. Extensive experiments on multiple benchmark datasets demonstrate that our CountMamba serves as a simple but effective state-space model for plant counting. Future work will explore the scalability of our approach to more complex and diverse plant scenarios, as well as the potential integration of fine-grained feature extraction strategies.

Acknowledgments. This work was supported by the grants from the National Natural Science Foundation of China (U22B2048, 61925201, 62132001, 62373043).

References

  • [1] Bai, X., Liu, P., Cao, Z., Lu, H., Xiong, H., Yang, A., Cai, Z., Wang, J., Yao, J.: Rice plant counting, locating, and sizing method based on high-throughput uav rgb images. Plant Phenomics 5,  0020 (2023)
  • [2] Boissard, P., Martin, V., Moisan, S.: A cognitive vision approach to early pest detection in greenhouse crops. computers and electronics in agriculture 62(2), 81–93 (2008)
  • [3] Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: Yolact++: Better real-time instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(2), 1108–1121 (2022)
  • [4] Cao, X., Wang, Z., Zhao, Y., Su, F.: Scale aggregation network for accurate and efficient crowd counting. In: Proceedings of the European Conference on Computer Vision. pp. 734–750 (2018)
  • [5] Cointault, F., Guerin, D., Guillemin, J.P., Chopinet, B.: In-field triticum aestivum ear counting using colour-texture image analysis. New Zealand Journal of Crop and Horticultural Science 36(2), 117–130 (2008)
  • [6] Donapati, R.R., Cheruku, R., Kodali, P.: Real-time seed detection and germination analysis in precision agriculture: A fusion model with u-net and cnn on jetson nano. IEEE Transactions on AgriFood Electronics (2023)
  • [7] Fu, D.Y., Dao, T., Saab, K.K., Thomas, A.W., Rudra, A., Ré, C.: Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052 (2022)
  • [8] Giuffrida, M.V., Minervini, M., Tsaftaris, S.A.: Learning to count leaves in rosette plants (2016)
  • [9] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023)
  • [10] Gu, A., Goel, K., Ré, C.: Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 (2021)
  • [11] Guo, W., Zheng, B., Potgieter, A.B., Diot, J., Watanabe, K., Noshita, K., Jordan, D.R., Wang, X., Watson, J., Ninomiya, S., et al.: Aerial imagery analysis–quantifying appearance and number of sorghum heads for applications in breeding and agronomy. Frontiers in plant science 9,  1544 (2018)
  • [12] Han, T., Bai, L., Liu, L., Ouyang, W.: Steerer: Resolving scale variations for counting and localization via selective inheritance learning. In: Proceedings of the International Conference on Computer Vision. pp. 21848–21859 (2023)
  • [13] Huang, T., Pei, X., You, S., Wang, F., Qian, C., Xu, C.: Localmamba: Visual state space model with windowed selective scan. arXiv preprint arXiv:2403.09338 (2024)
  • [14] Jin, X., Madec, S., Dutartre, D., de Solan, B., Comar, A., Baret, F.: High-throughput measurements of stem characteristics to estimate ear density and above-ground biomass. Plant Phenomics (2019)
  • [15] Li, Y., Zhang, X., Chen, D.: Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1091–1100 (2018)
  • [16] Lin, J., Li, J., Ma, Z., Li, C., Huang, G., Lu, H.: A framework for single-panicle litchi flower counting by regression with multitask learning. Plant Phenomics 6,  0172 (2024)
  • [17] Liu, L., Lu, H., Li, Y., Cao, Z.: High-throughput rice density estimation from transplantation to tillering stages using deep networks. Plant Phenomics (2020)
  • [18] Liu, L., Lu, H., Xiong, H., Xian, K., Cao, Z., Shen, C.: Counting objects by blockwise classification. IEEE Transactions on Circuits and Systems for Video Technology 30(10), 3513–3527 (2019)
  • [19] Liu, W., Quijano, K., Crawford, M.M.: Yolov5-tassel: Detecting tassels in rgb uav imagery with improved yolov5 based on transfer learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 15, 8085–8094 (2022)
  • [20] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)
  • [21] Lu, H., Cao, Z., Xiao, Y., Li, Y., Zhu, Y.: Region-based colour modelling for joint crop and maize tassel segmentation. Biosystems Engineering 147, 139–150 (2016)
  • [22] Lu, H., Cao, Z., Xiao, Y., Zhuang, B., Shen, C.: Tasselnet: counting maize tassels in the wild via local counts regression network. Plant methods 13, 1–17 (2017)
  • [23] Lu, H., Liu, L., Li, Y.N., Zhao, X.M., Wang, X.Q., Cao, Z.G.: Tasselnetv3: Explainable plant counting with guided upsampling and background suppression. IEEE Transactions on Geoscience and Remote Sensing 60, 1–15 (2021)
  • [24] Madec, S., Jin, X., Lu, H., De Solan, B., Liu, S., Duyme, F., Heritier, E., Baret, F.: Ear density estimation from high resolution rgb imagery using deep learning technique. Agricultural and forest meteorology 264, 225–234 (2019)
  • [25] Mehta, H., Gupta, A., Cutkosky, A., Neyshabur, B.: Long range language modeling via gated state spaces. arXiv preprint arXiv:2206.13947 (2022)
  • [26] Primicerio, J., Caruso, G., Comba, L., Crisci, A., Gay, P., Guidoni, S., Genesio, L., Ricauda Aimonino, D., Vaccari, F.P.: Individual plant definition and missing plant characterization in vineyards from high-resolution uav imagery. European Journal of Remote Sensing 50(1), 179–186 (2017)
  • [27] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)
  • [28] Sho, H., Xian-Hua, H.: A hybrid wheat head detection model with incorporated cnn and transformer. IEICE Proceedings Series 78(P1-09) (2023)
  • [29] Smith, J.T., Warrington, A., Linderman, S.W.: Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933 (2022)
  • [30] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • [31] Wang, Y., Du, F., Wang, J., Wang, K., Tian, C., Qi, X., Lu, F., Liu, X., Ye, X., Jiao, Y.: Improving bread wheat yield through modulating an unselected ap2/erf gene. Nature Plants 8(8), 930–939 (2022)
  • [32] Wu, J., Yang, G., Yang, X., Xu, B., Han, L., Zhu, Y.: Automatic counting of in situ rice seedlings from uav images based on a deep fully convolutional neural network. Remote Sensing 11(6),  691 (2019)
  • [33] Xiong, H., Cao, Z., Lu, H., Madec, S., Liu, L., Shen, C.: Tasselnetv2: in-field counting of wheat spikes with context-augmented local regression networks. Plant methods 15(1),  150 (2019)
  • [34] Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-image crowd counting via multi-column convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 589–597 (2016)
  • [35] Zhao, J., Cai, Y., Wang, S., Yan, J., Qiu, X., Yao, X., Tian, Y., Zhu, Y., Cao, W., Zhang, X.: Small and oriented wheat spike detection at the filling and maturity stages based on wheatnet. Plant Phenomics 5,  0109 (2023)
  • [36] Zhao, S., Chen, H., Zhang, X., Xiao, P., Bai, L., Ouyang, W.: Rs-mamba for large remote sensing image dense prediction. arXiv preprint arXiv:2404.02668 (2024)
  • [37] Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., Wang, X.: Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024)