
Pheno-Robot: An Auto-Digital Modelling System for In-Situ Phenotyping in the Field

Yaoqiang Pan1,∗, Kewei Hu1,∗, Tianhao Liu2, Chao Chen2, and Hanwen Kang2,#

∗ Equal contribution.
1 K. Hu and Y. Pan are with the College of Engineering, South China Agricultural University, Guangzhou, China.
2 T. Liu, C. Chen, and H. Kang are with the Department of Mechanical and Aerospace Engineering, Monash University, Melbourne, Australia.
Abstract

Accurate reconstruction of plant models for phenotyping analysis is critical for optimising sustainable agricultural practices in precision agriculture. Traditional laboratory-based phenotyping, while valuable, falls short of explaining how plants grow under uncontrolled conditions. Robotic technologies offer a promising avenue for large-scale, direct phenotyping in real-world environments. This study explores the deployment of emerging robotic and digital technologies in plant phenotyping to improve performance and efficiency. Three critical functional modules, covering environmental understanding, robotic motion planning, and in-situ phenotyping, are introduced to automate the entire process. Experimental results demonstrate the effectiveness of the system in agricultural environments. The Pheno-Robot system autonomously collects high-quality data by navigating around plants, and the in-situ modelling method reconstructs high-quality plant models from the data collected by the robot. The developed robotic system shows high efficiency and robustness, demonstrating its potential to advance plant science in real-world agricultural environments.

I Introduction

Plant phenotyping plays a fundamental role in precision agriculture. It involves identifying and selecting genetics with advantageous input traits and complementary output traits [1]. While phenotyping in the laboratory plays a key role in identifying promising lines for crossbreeding, it is a surrogate for the primary goal of understanding how a crop will grow in real-world environments [2]. Uncontrolled 'non-laboratory' conditions present significant challenges, particularly in the analysis of the traits responsible for beneficial responses [3]. Therefore, the mass collection of phenotypic data in the field is essential for precision agriculture.

At the heart of plant phenotyping is the digital modelling of plant growth and traits, including both appearance and geometry [4]. This process involves monitoring the growth, development, and changes in plants, taking into account factors such as climate, soil properties, pests, and diseases [5]. Image-based plant modelling is a widely used approach, particularly for analysing observable traits [4]. However, analysing plants from a single viewpoint poses challenges, especially when plants overlap [6]. To overcome this limitation, RGB images acquired from multiple viewpoints are widely used. Recently, 3D reconstruction technology has become a prominent tool for plant analysis [7]. This new paradigm provides essential information about the geometric aspects of plants [8], including height [9], volume [10], and even mass [11]. These advances in modelling technology, coupled with the increasing availability of information-rich data, contribute significantly to the capture of detailed plant characteristics.

Figure 1: Pheno-Robot system operates in the greenhouse.
Figure 2: System Overview of the Pheno-Robot system.

Despite significant advances, current practice heavily relies on manual operation, which is highly labour-intensive and inefficient, making it impractical for large-scale farms. Robotics offers the potential for widespread sampling in the field under authentic agricultural conditions. However, the realisation of robotic automation for high-quality in-situ plant phenotyping in the field faces two main challenges. Firstly, there is a lack of effective methods for perceiving and autonomously navigating the intricacies of farm landscapes to perform large-scale, direct phenotyping. Secondly, there is a crucial gap in the availability of an accurate and efficient digital modelling method capable of generating high-fidelity, multi-modal plant models. Successfully overcoming these challenges holds the key to enabling repeated and detailed assessments of plants, potentially leading to a paradigm shift in the development of agri-genetics.

In this research, we present an innovative robotic system designed for autonomous in-situ phenotyping in the field, synthesising both robotic and digital technologies. The developed Pheno-Robot system comprises a comprehensive framework that includes a deep learning model and a motion planning method for robotic automation. It also incorporates a Neural Radiance Field (NeRF)-based modelling network, which enhances the system’s ability to perform detailed and accurate in-situ phenotyping. Our key contributions are:

  • We develop a novel environmental understanding and robotic navigation method tailored to farm environments.

  • We develop an improved NeRF method that addresses sparse-view inputs to achieve high-quality plant modelling.

  • We present a hierarchical mapping method and demonstrate the Pheno-Robot system in real agricultural settings.

The rest of this paper is organised as follows. Section II surveys related work, followed by the proposed methodologies in Section III. Experimental results are given in Section IV, and conclusions are drawn in Section V.

II Related Works

II-A Robotic Automation in Agriculture

Robots play a key role in precision agriculture, using smart sensing and intervention technologies to improve efficiency through automation. Precision agriculture addresses spatial and temporal variability in the environment and plant growth patterns, scaling from the traditional farm level [9] to sub-field precision [12]. Horticulture farms are more complex than crop farms, requiring robots to navigate semi-structured, challenging terrain in dynamic environments [13]. Firstly, sensor information is essential for detecting objects and potential risks in the field to ensure the safe operation of robotic vehicles. Secondly, machine learning is indispensable for farming robots to maximise locomotion flexibility (e.g., moving sideways or navigating narrow spaces between crops) and optimise input utilisation with greater efficacy. The agility of these robots, coupled with the ability to carry specialised sensors, holds the key to unlocking the full potential of precision agriculture. Despite notable progress, exemplified by robots designed for autonomous operations in orchard environments for selective harvesting [14] or monitoring applications [15], there is still uncharted territory. Robots capable of precise environmental recognition, robust and flexible motion, and accurate plant modelling remain an area for exploration.

II-B In-situ phenotyping

Phenomics encompasses the study of various phenotypic plant traits, including growth, yield, plant height, leaf area index and more. Traditional methods rely on imaging and 3D range sensors to measure various plant traits such as colour, shape, volume, and spatial structure [8]. For example, the growth rate of rosette plants such as Arabidopsis [16] and tobacco [6] is often analysed using images taken from a single viewpoint. On the other hand, 3D range sensors allow accurate measurement of plant geometry and characteristics such as height, width and volume [13]. Kang et al. [17] proposed a sensor fusion system for yield estimation in apple orchards, using deep learning-based panoptic segmentation algorithms. Wu et al. [2] utilised Multi-View Stereo (MVS) for the reconstruction of crop geometry in the field. However, 3D range sensors face limitations in agricultural environments due to strong occlusion, often resulting in a bleeding effect [18] near discontinuous geometry. In addition, their point resolution is sparse compared to image-based data. While MVS-based methods offer better quality with sufficient data, they require hours to process even simple plants [11], limiting their applicability in real-time operations. Furthermore, current technologies do not provide a multi-modal representation of instances, making them inefficient for analysing plant features when multiple data types (image/geometry) are required.

III Methods

The Pheno-Robot system comprises three subsystems: an Environmental Perception Module (EPM), a Motion Planning Module (MPM), and an In-situ Phenotyping Module (IPM), as illustrated in Figure 2.

III-A Environment Perception

III-A1 3D Detection Network

In this study, we utilise a novel neural network, the 3D Object Detection Network (3D-ODN), introduced in our previous study [19]. Specifically designed for processing point cloud maps in Bird's-Eye View (BEV), the 3D-ODN consists of three integral network branches: point cloud subdivision, feature coding, and a detection head. First, the entire point cloud map is uniformly segmented. The points within each segment are then projected onto the BEV perspective, where local semantic features are extracted. These features are fed into the main branch, forming a single-stage network architecture dedicated to processing the point cloud features from the BEV perspective. To improve the learning of multi-scale feature embeddings, a feature pyramid is used to fuse features at multiple scales. Finally, the recognition results are projected into 3D space based on the predicted height values. The identified instances are denoted as $\mathbf{T}=\{t_{1},t_{2},\cdots \mid t_{i}\in\mathbb{R}^{2}\}$, representing the instances detected by the 3D-ODN (see Fig. 3 (a)).
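To make the BEV projection step concrete, the snippet below bins a raw point cloud into a bird's-eye-view height grid. It is a minimal illustrative sketch, not the 3D-ODN itself; the grid extents and cell size are placeholder values.

```python
import numpy as np

def points_to_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.2):
    """Bin a point cloud (N, 3) into a BEV height grid (illustrative parameters only)."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((nx, ny), dtype=np.float32)          # max height per cell

    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    # keep the maximum observed height in each occupied cell
    np.maximum.at(bev, (ix[valid], iy[valid]), points[valid, 2])
    return bev
```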

Figure 3: Illustration of EPM principle.

III-A2 Graph Mapping

For each instance, eight nodes are designated to represent the four corners in two different directions of motion, as shown in Fig. 3 (b). This orientation consideration aims to mitigate turning movements in narrow passages, which could otherwise lead to a higher failure rate of movement due to becoming trapped by potential obstacles. Consequently, each instance $t_i$ is associated with eight nodes, denoted as $t_{i}=\{N^{i}_{1},N^{i}_{2},\cdots,N^{i}_{8} \mid N^{i}_{j}\in\mathbb{R}^{2}\}$ (see Fig. 3 (b)). To extract the line structure of farms, a line detection algorithm is used to group instances in the same row, denoted $\mathbf{g}$. Each row has two common access nodes at both ends, denoted $N_{l}$ and $N_{r}$ ($N_{l}, N_{r}\in\mathbb{R}^{2}$). These nodes serve to establish connections between all instances within the group and to form links with other groups (see Fig. 3 (a)). A group is thus characterised as $g_{i}=\{t_{1},t_{2},\cdots,N^{i}_{l},N^{i}_{r}\}$. The map is then represented as a set of groups, $\mathbf{G}=\{g_{1},g_{2},\cdots\}$.
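A possible in-memory layout for this graph map is sketched below in Python; the class and field names (Instance, Group, centre, nodes) are our own illustrative choices, not identifiers from the Pheno-Robot code.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass(eq=False)          # identity-based equality keeps list lookups simple
class Instance:
    """A detected plant t_i with its eight surrounding nodes N^i_1..N^i_8."""
    centre: np.ndarray        # (2,) planar position of the instance
    nodes: np.ndarray         # (8, 2): four corners x two approach directions

@dataclass(eq=False)
class Group:
    """A row g_i: its instances plus the two shared access nodes N_l and N_r."""
    instances: list = field(default_factory=list)
    node_left: np.ndarray = None    # N_l
    node_right: np.ndarray = None   # N_r

# The graph map G is simply a list of rows: G = [g_1, g_2, ...]
```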

III-B Trajectory Planning

Figure 4: Illustration of MPM principle.

III-B1 Global Path Planning

A Greedy-search-based path generation algorithm on the graph map is developed, as illustrated in Alg.1. The following are the relevant definitions:

  • $\mathbf{T}$ represents the goal set of instances that require phenotyping.

  • $\mathbf{V}$ represents the subgroups formed from the instances in $\mathbf{T}$.

  • $\mathbf{N_{start}}$ represents the robot position at the start.

  • $\mathbf{\Gamma}$ represents the global path, which is a sequence of nodes.

Algorithm 1 Global Path Generation
Input: $\mathbf{T}$, $\mathbf{G}$, $\mathbf{N_{start}}$
Output: $\mathbf{\Gamma}$
1:  for $t_{i}$ in $\mathbf{T}$ do
2:     $\mathbf{v}_{new}\leftarrow\mathbf{FindParent}(t_{i})$
3:     $\mathbf{V}\leftarrow\mathbf{V}\cup\mathbf{v}_{new}$
4:  end for
5:  while $\mathbf{V}\neq\emptyset$ do
6:     $\mathbf{v}_{j}\leftarrow\mathbf{FindNearestSubgroup}(\mathbf{\Gamma},\mathbf{V})$
7:     $\mathbf{\Gamma}_{new}\leftarrow\mathbf{PlanConnection}(\mathbf{\Gamma},\mathbf{v}_{j})$
8:     $\mathbf{\Gamma}\leftarrow\mathbf{\Gamma}\cup\mathbf{\Gamma}_{new}$
9:     Delete $\mathbf{v}_{j}$ from $\mathbf{V}$
10:  end while
11:  return $\mathbf{\Gamma}$

Two key sub-functions presented in Alg.1 are described as follows:

  • $\mathbf{FindParent}$: finds the group that an instance belongs to.

  • $\mathbf{FindNearestSubgroup}$: finds the subgroup that contains the instance closest to the current robot location.

The key function $\mathbf{PlanConnection}$ is described in Alg. 2.

Algorithm 2 $\mathbf{PlanConnection}$
Input: $\mathbf{\Gamma}$, $\mathbf{v}=\{t_{1},t_{2},\cdots,t_{n} \mid t_{i}\in\mathbf{g}_{i}\}$
Output: $\mathbf{\Gamma}_{new}$
1:  while $\mathbf{v}\neq\emptyset$ do
2:     $\mathbf{N}_{new},\ t_{i}\leftarrow\mathbf{FindNearestAndFeasible}(\mathbf{\Gamma},\mathbf{v})$
3:     $\mathbf{\Gamma}_{new}\leftarrow\mathbf{\Gamma}_{new}\cup\mathbf{N}_{new}$
4:     if $\mathbf{IFFullyCover}(\mathbf{\Gamma}_{new},t_{i})$ then
5:        Delete $t_{i}$ from $\mathbf{v}$
6:     end if
7:  end while
8:  return $\mathbf{\Gamma}_{new}$

The function $\mathbf{FindNearestAndFeasible}$ finds the nearest feasible node (see Fig. 4 (a)) of an instance in the subgroup. It evaluates feasibility from two perspectives, traversability and orientation, which are detailed in Sec. III-B3. The function $\mathbf{IFFullyCover}$ evaluates whether an instance has been fully covered by the generated path, which is achieved by checking the number of connected nodes of the instance, as shown in Fig. 4 (b).
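The following Python sketch mirrors Alg. 1 and Alg. 2 over the Instance/Group containers sketched in Sec. III-A2. It is an illustrative skeleton under our own naming; the feasibility and coverage checks are passed in as callables because their full definitions live in Sec. III-B3.

```python
import numpy as np

def generate_global_path(targets, groups, start, is_feasible, fully_covered):
    """Greedy global path over the graph map (a sketch of Alg. 1)."""
    # FindParent: collect the targets of each row into a subgroup (the set V)
    subgroups = [sg for g in groups
                 if (sg := [t for t in targets if t in g.instances])]

    path = [np.asarray(start, dtype=float)]                  # Gamma
    while subgroups:
        tail = path[-1]
        # FindNearestSubgroup: subgroup whose closest instance is nearest to the path end
        v = min(subgroups,
                key=lambda sg: min(np.linalg.norm(t.centre - tail) for t in sg))
        path.extend(plan_connection(path, v, is_feasible, fully_covered))
        subgroups.remove(v)
    return path

def plan_connection(path, subgroup, is_feasible, fully_covered):
    """Cover every instance of one row by hopping to nearest feasible nodes (Alg. 2)."""
    new_nodes, remaining = [], list(subgroup)
    while remaining:
        tail = new_nodes[-1] if new_nodes else path[-1]
        # FindNearestAndFeasible: nearest node of any remaining instance that passes
        # the orientation/traversability checks of Sec. III-B3
        candidates = [(n, t) for t in remaining for n in t.nodes
                      if is_feasible(tail, n)]
        node, t = min(candidates, key=lambda c: np.linalg.norm(c[0] - tail))
        new_nodes.append(node)
        if fully_covered(new_nodes, t):                      # IFFullyCover
            remaining.remove(t)
    return new_nodes
```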

III-B2 Local Trajectory Generation

This step computes the detailed path from the global path in three stages: initial path generation, trajectory optimisation, and B-spline interpolation.

Initial Path Generation: Given the global path $\Gamma$, the first step is to generate an initial collision-free path between two nodes. Two scenarios are considered, including path generation for phenotyping data acquisition (where two adjacent nodes belong to the same instance) and paths between nodes of different instances. In the first case, a sequence of collision-free preset sample positions is established and the RRT algorithm is used to connect these positions to form the sampling trajectory between nodes. In the second case, the A$^{*}$ algorithm is used to determine a collision-free path between nodes.

Trajectory Optimisation: The initial path $\Phi$ consists of a sequence of points $\{Q_{0},Q_{1},\cdots \mid Q\in\mathbb{R}^{2}\}$, and we conceptualise the trajectory as a continuous function $\Phi:[0,1]\rightarrow\mathbb{R}^{2}$ mapping time to robot states. The objective function incorporates three key aspects of the robot's motion: it penalises velocities to encourage smoothness, proximity to the environment to secure trajectories that maintain a certain distance from obstacles, and the distance between the state and the preset viewpoint. These terms are represented as $f_{s}$, $f_{c}$, and $f_{o}$, respectively. The objective function is:

$\Phi^{*}=\mathop{\mathbf{argmin}}_{\Phi}\ \alpha_{s}f_{s}(\Phi)+\alpha_{c}f_{c}(\Phi)+\alpha_{o}f_{o}(\Phi)$ (1)

The initial state $\Phi(0)=Q_{0}$ and final state $\Phi(1)=Q_{1}$ are fixed, and $\alpha_{s}$, $\alpha_{c}$, and $\alpha_{o}$ are the weights of the penalty terms. The detailed formulations of $f_{s}$, $f_{c}$, and $f_{o}$ are as follows:

$f_{s}(\Phi)=\int_{\Phi}\|\frac{d}{dt}\Phi(t)\|^{2}dt$ (2)
$f_{c}(\Phi)=\int_{\Phi}c(\Phi(t))\|\frac{d}{dt}\Phi(t)\|dt$ (3)
$f_{o}(\Phi)=\int_{\Phi}\|\Phi(t)-\Phi_{d}\|^{2}dt$ (4)

where $c(\cdot):\mathbb{R}^{2}\to\mathbb{R}$ is the function that penalises states near obstacles, and $\Phi_{d}$ is the trajectory that contains the desired view positions for data acquisition.

We update the trajectory by the functional gradient $\bar{\nabla}f(\Phi_{i})$ using $\Phi_{i+1}=\Phi_{i}-l_{r}\cdot\bar{\nabla}f(\Phi_{i})$, following [20], where $l_{r}$ is the learning rate. The functional gradients of the objectives in (2)-(4) are given by

$\bar{\nabla}f_{s}(\Phi)=-\frac{d^{2}}{dt^{2}}\Phi(t)$ (5)
$\bar{\nabla}f_{c}(\Phi)=\|\Phi^{\prime}(t)\|\cdot[(I-\Phi^{\prime}(t)\Phi^{\prime}(t)^{T})\nabla c-c\kappa]$ (6)
$\bar{\nabla}f_{o}(\Phi)=\Phi(t)-\Phi_{d}$ (7)

where $\nabla c$ is the derivative of the obstacle cost with respect to the control points $Q_{j}$, i.e., $\partial c/\partial Q_{j}$, and the curvature term $\kappa$ is defined as

$\kappa=\|\Phi^{\prime}(t)\|^{-2}\left((I-\Phi^{\prime}(t)\Phi^{\prime}(t)^{T})\Phi^{\prime\prime}(t)\right)$ (8)
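A minimal, discrete sketch of this gradient-descent refinement is given below, operating directly on the waypoints; the obstacle gradient is passed in as a simplified stand-in for the full expression in Eqs. (6) and (8), and all parameter values are illustrative.

```python
import numpy as np

def optimise_trajectory(Q, Q_desired, obstacle_grad,
                        a_s=1.0, a_c=1.0, a_o=0.5, lr=0.05, iters=200):
    """Waypoint-level gradient descent on Eq. (1); endpoints are held fixed.

    Q:             initial waypoints from RRT / A*, shape (n, 2)
    Q_desired:     desired view positions Phi_d, shape (n, 2)
    obstacle_grad: callable q -> (2,) gradient of the obstacle cost c at q
    """
    Q = np.asarray(Q, dtype=float).copy()
    for _ in range(iters):
        grad = np.zeros_like(Q)

        # smoothness term, Eq. (5): negative second finite difference
        grad[1:-1] += a_s * -(Q[2:] - 2.0 * Q[1:-1] + Q[:-2])

        # obstacle term (simplified Eq. (6)): descend the obstacle cost
        grad += a_c * np.array([obstacle_grad(q) for q in Q])

        # viewpoint term, Eq. (7): pull waypoints toward the preset view positions
        grad += a_o * (Q - Q_desired)

        grad[0] = grad[-1] = 0.0        # start and goal are boundary conditions
        Q -= lr * grad                  # Phi_{i+1} = Phi_i - l_r * grad
    return Q
```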

B-spline Interpolation: The optimised trajectory is then parameterised by a piece-wise B-spline into a uniform curve $\Phi_{B}$. Given the determined degree $m$ and a knot vector $\{k_{0},k_{1},\cdots,k_{M}\}$, where $M=n_{q}+2m$, the parameterised uniform curve $\zeta(t)$ can be formulated as:

$\zeta(t)=\frac{\sum_{i=1}^{n_{q}}R_{i,m}(t)w_{i}Q_{i}}{\sum_{i=1}^{n_{q}}R_{i,m}(t)w_{i}}$ (9)

where $w_{i}$ is the weight of each control point $Q_{i}$, $R_{i,m}$ is the degree-$m$ basis function for $Q_{i}$, and $t$ is the curve parameter. The TEB-planner [21] is then used to find a proper velocity profile to follow the trajectory.
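For reference, the weighted B-spline of Eq. (9) can be evaluated with the classic Cox-de Boor recursion; the sketch below is illustrative and assumes a clamped knot vector supplied by the caller.

```python
import numpy as np

def bspline_basis(i, m, t, knots):
    """Cox-de Boor recursion for the basis function R_{i,m}(t)."""
    if m == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + m] != knots[i]:
        left = (t - knots[i]) / (knots[i + m] - knots[i]) * bspline_basis(i, m - 1, t, knots)
    if knots[i + m + 1] != knots[i + 1]:
        right = (knots[i + m + 1] - t) / (knots[i + m + 1] - knots[i + 1]) \
            * bspline_basis(i + 1, m - 1, t, knots)
    return left + right

def weighted_bspline(t, Q, w, m, knots):
    """Evaluate the weighted (rational) B-spline curve of Eq. (9) at parameter t."""
    num, den = np.zeros(Q.shape[1]), 0.0
    for i in range(len(Q)):
        b = bspline_basis(i, m, t, knots) * w[i]
        num, den = num + b * Q[i], den + b
    return num / den if den > 0 else Q[-1]

# example: a quadratic (m = 2) curve over four control points, clamped knot vector
Q = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 2.0], [4.0, 0.0]])
samples = [weighted_bspline(t, Q, np.ones(len(Q)), 2, [0, 0, 0, 0.5, 1, 1, 1])
           for t in np.linspace(0.0, 0.999, 20)]
```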

III-B3 Feasibility checking

The $\mathbf{FindNearestAndFeasible}$ function in Alg. 2 evaluates the feasibility between two nodes from two aspects: orientation and traversability.

I. Orientation Checking: This term evaluates whether two nodes have similar directions, as the narrow tunnels between plants do not allow the robot to turn around. The largest allowed orientation difference between two nodes is $\pi/3$.

Figure 5: Illustration of terrain analysis, a) the points of robot surroundings, b) the risk area is converted into polygons.

II. Traversability Analysis: The point cloud between two nodes is projected onto a 2D grid map. The traversability of each grid cell is assessed from multiple risk factors, including:

(a) Collision risk: A risk factor quantified by the probability of a grid cell belonging to an obstacle, denoted $\theta\in[0,1]$. Terrain analysis [22] is utilised here.

(b) Slope risk: For each point on the terrain $T_{i}$, points on the map are selected by a cubic box with sides of length $l_{s}$, represented as $\Omega_{i}=\{(p^{j}_{i})_{j=1:N_{i}} \mid p_{i}^{j}\in\mathbb{R}^{3}\}$. SVD is used to fit a plane $P_{i}$ to $\Omega_{i}$ and obtain the normal vector $n_{z}\in\mathbb{R}^{3}$. The slope angle between the plane normal $n_{z}$ and the vertical direction $e_{z}$ is obtained by $\mathbf{s}=\mathbf{arccos}\frac{\|e_{z}\cdot n_{z}\|}{\|e_{z}\|\cdot\|n_{z}\|}$.

(c) Step risk: This term evaluates the height gap between adjacent grid cells. Negative obstacles can also be detected by checking for the lack of measurement points in a cell. The maximum height gap is denoted as $\lambda$.

The above three risks are combined by a weighted sum to obtain the weighted traversability value $\Upsilon$, as:

$\Upsilon=\theta+\alpha_{\mathbf{s}}\frac{\mathbf{s}}{\mathbf{s}_{crit}}+\alpha_{\lambda}\frac{\lambda}{\lambda_{crit}}$ (10)

where $\mathbf{s}_{crit}$ and $\lambda_{crit}$ are the maximum allowed slope angle and height gap, respectively.
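The slope estimate and the weighted combination of Eq. (10) can be sketched as follows; the critical values and weights used here are placeholders, not the tuned parameters of the Pheno-Robot.

```python
import numpy as np

def slope_from_patch(points):
    """Fit a plane to a local terrain patch Omega_i by SVD and return its slope angle (rad)."""
    centred = points - points.mean(axis=0)
    # the right-singular vector with the smallest singular value is the plane normal n_z
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    n_z = vt[-1]
    e_z = np.array([0.0, 0.0, 1.0])
    return np.arccos(abs(n_z @ e_z) / (np.linalg.norm(n_z) * np.linalg.norm(e_z)))

def traversability(theta, slope, step, a_s=0.5, a_l=0.5,
                   slope_crit=np.radians(25.0), step_crit=0.15):
    """Weighted traversability value of Eq. (10); theta is the collision risk in [0, 1]."""
    return theta + a_s * slope / slope_crit + a_l * step / step_crit
```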

III. Terrain Analysis for Trajectory Optimisation: During robot operation, the odometry and a KD-tree [23] are utilised to build the local terrain map (Fig. 5 (a)). The detected obstacle points (Fig. 5 (b)) are converted to polygons, denoted as $\{\mathcal{O}\subset\mathbb{R}^{2}\}$. Let $\phi(Q_{i},\mathcal{O})$ denote the minimal Euclidean distance between an obstacle and a control point. The minimum separation $\phi_{\mathrm{min}}$ between all obstacles and $Q_{i}$ is found by

$\phi_{\mathrm{min}}=\mathrm{min}[\phi(Q_{i},\mathcal{O}_{1}),\phi(Q_{i},\mathcal{O}_{2}),\cdots]$ (11)

The obstacle set $\mathcal{O}$ is updated online to handle dynamic environments.
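If the obstacle polygons are available as vertex lists, Eq. (11) reduces to a point-to-polygon distance query; a minimal sketch using the shapely library (our choice, not necessarily the one used on the robot) is:

```python
from shapely.geometry import Point, Polygon

def min_obstacle_distance(q, obstacle_polygons):
    """phi_min of Eq. (11): smallest distance from control point q to any obstacle polygon."""
    p = Point(q)
    return min(p.distance(Polygon(poly)) for poly in obstacle_polygons)
```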

III-C In-situ Phenotyping Model

III-C1 Neural Rendering

NeRF represents the 3D scene as a radiance field that describes the volume density $\sigma$ and view-dependent colour $c$ for every point $x$ and every viewing direction $d$ via an MLP [24]:

$\sigma,c=H_{\Theta}(x,d)$ (12)

Ray-tracing-based volume rendering is used to render the parameters of the 3D scene into the colour $\hat{C}$ of a ray $r$, expressed as:

$\hat{C}(r)=\sum_{i=1}^{N}T_{i}\left(1-\exp\left(-\sigma_{i}\delta_{i}\right)\right)c_{i},\quad T_{i}=\exp\Big(-\sum_{j=1}^{i-1}\sigma_{j}\delta_{j}\Big)$ (13)

where $T$ is the volume transmittance and $\delta$ is the step size of ray marching. For every pixel, the squared photometric error is used to optimise the MLP. Applied across the image, this loss $\mathcal{L}_{\mathrm{Rendering}}$ is represented as:

$\mathcal{L}_{\mathrm{Rendering}}=\sum_{r\in R}\|\hat{C}(r)-C(r)\|_{2}^{2}$ (14)

where $C(r)$ is the ground-truth colour.
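The compositing of Eq. (13) and the loss of Eq. (14) are sketched below for a single ray with NumPy; this is a didactic restatement of the standard formulation, not the training code of our system.

```python
import numpy as np

def render_ray(sigmas, colours, deltas):
    """Composite per-sample densities and colours along one ray (Eq. (13)).

    sigmas:  (N,)   densities sigma_i from the MLP
    colours: (N, 3) view-dependent colours c_i
    deltas:  (N,)   step sizes delta_i between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)               # opacity of each ray segment
    # transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j)
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas[:-1] * deltas[:-1])]))
    weights = trans * alphas
    return (weights[:, None] * colours).sum(axis=0)       # rendered ray colour C_hat(r)

def rendering_loss(pred, gt):
    """Squared photometric error of Eq. (14) over a batch of rays, shape (R, 3)."""
    return float(np.sum((pred - gt) ** 2))
```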

Figure 6: Few-shot learning from sparse input. a) Ray samples under limited training views. b) Occlusion regularization for sparse-view rendering.

III-C2 Few-shot learning from Sparse Views

The artefact of "white floaters" caused by rendering distortion is a common failure mode in NeRF learning. Autonomous data acquisition by robots often yields imperfect sampling density and sparse views with fewer overlapping regions (Fig. 6 (a)), which distorts the rendering of these regions. This occurs because, between sparse views, NeRF training lacks sufficient information to estimate the correct scene geometry, leading to significant and dense floaters in the region close to the camera. To reduce these artefacts, an optimisation term $\mathcal{L}_{occ}$, designed to regulate the learning behaviour of NeRF [25], is used to penalise dense fields near the camera via "occlusion" regularisation, which can be expressed as:

$\mathcal{L}_{occ}=\frac{\boldsymbol{\sigma}_{K}^{\top}\mathbf{m}_{K}}{K}=\frac{1}{K}\sum_{k=1}^{K}\sigma_{k}\cdot m_{k}$ (15)

where $\boldsymbol{\sigma}_{K}$ represents the density values of the $K$ points sampled along the ray, ordered by proximity to the ray origin, and $\mathbf{m}_{K}$ is the binary mask vector that determines whether a ray sector will be penalised.
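A per-ray sketch of this regulariser is given below; the choice of how many near-camera samples the mask penalises is a hyperparameter we leave symbolic here.

```python
import numpy as np

def occlusion_loss(sigmas, num_penalised):
    """Occlusion regularisation of Eq. (15) for one ray.

    sigmas:         (K,) densities of the K ray samples, nearest to farthest
    num_penalised:  number of near-camera samples the binary mask m_K penalises
    """
    K = len(sigmas)
    mask = np.zeros(K)
    mask[:num_penalised] = 1.0      # penalise only the sector closest to the camera
    return float(np.dot(sigmas, mask) / K)
```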

III-C3 Geometry Extraction from Field

Given a predefined 3D region of interest, a set of spatial points $P=\{p_{1},p_{2},\dots,p_{n}\}$ is generated via dense volumetric sampling. Each point $p_{i}\in P$ is evaluated through the NeRF model to obtain its density value, $\sigma(p_{i})=\mathrm{NeRF}_{\sigma}(p_{i})$, which forms the basis for surface extraction. The Marching Cubes algorithm identifies the iso-surface by thresholding the density values:

$M=\mathbf{MarchingCubes}(P,\sigma_{\text{threshold}})$ (16)

where $M$ is the resultant mesh and $\sigma_{\text{threshold}}$ is an optimal density value demarcating the object's boundary.

For each vertex $v_{j}$ in mesh $M$, a viewing ray $r_{j}$ is constructed and the NeRF is queried for $\mathbf{c}(v_{j})=\mathrm{NeRF}_{\mathbf{c}}(v_{j},r_{j})$. The colour values derived from the field are mapped onto mesh $M$, assigning a colour to each vertex:

$\operatorname{Color}(v_{j})=\mathbf{c}(v_{j})$ (17)
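The extraction pipeline of Eqs. (16)-(17) can be prototyped with scikit-image's marching cubes as below; the density and colour query functions, the region of interest, and the fixed viewing direction are illustrative stand-ins for the trained radiance field.

```python
import numpy as np
from skimage import measure

def extract_coloured_mesh(density_fn, colour_fn, bounds, res=64, sigma_threshold=10.0):
    """Marching-cubes mesh plus per-vertex colours from a radiance field (Eqs. (16)-(17)).

    density_fn: callable (M, 3) -> (M,)   densities, stand-in for NeRF_sigma
    colour_fn:  callable (M, 3), (M, 3) -> (M, 3) colours for points and view directions
    bounds:     ((xmin, ymin, zmin), (xmax, ymax, zmax)) region of interest
    """
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    axes = [np.linspace(lo[i], hi[i], res) for i in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)        # (res, res, res, 3)
    sigma = density_fn(grid.reshape(-1, 3)).reshape(res, res, res)

    # iso-surface of the sampled density volume at sigma_threshold
    verts, faces, _, _ = measure.marching_cubes(sigma, level=sigma_threshold)
    verts = lo + verts / (res - 1) * (hi - lo)                         # voxel -> world

    # one colour per vertex, queried along a fixed (downward) viewing direction
    dirs = np.tile([0.0, 0.0, -1.0], (len(verts), 1))
    return verts, faces, colour_fn(verts, dirs)
```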

For a 2D texture representation, the vertex-coloured mesh undergoes UV unwrapping. To minimise distortion, Least Squares Conformal Mapping (LSCM) [26] is used, which minimises the conformal energy:

$E(u,v)=\int_{\Omega}\left(|\nabla u|^{2}+|\nabla v|^{2}\right)dA$ (18)

where $(u,v)$ are the 2D texture coordinates of each vertex in $M$, $\Omega$ represents the object's surface, and $dA$ is a differential area element, indicating that the energy is computed by integrating over the surface of the mesh.

III-C4 NeRF Model

Instant-NGP [27], which uses a multi-resolution hash encoding to represent learned scene features with tiny MLPs, is utilised. In detail, Instant-NGP operates on the premise that the object to be reconstructed is enclosed within multi-resolution grids. For any point $\mathbf{x}\in\mathbb{R}^{3}$, it obtains at each resolution level a hash encoding $h^{i}(\mathbf{x})\in\mathbb{R}^{d}$ by tri-linear interpolation, where $d$ is the feature dimension and $i$ is the level index. The hash encodings of all levels are concatenated to form the multi-resolution feature $h(\mathbf{x})=\{h^{i}(\mathbf{x})\}_{i=1}^{L}\in\mathbb{R}^{L\times d}$.
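For intuition, a simplified (NumPy, single-point) version of the multi-resolution hash encoding is sketched below; table sizes, level counts, and the random feature tables are placeholders, whereas in Instant-NGP the tables are learned and the lookup runs on the GPU.

```python
import numpy as np

def hash_encode(x, tables, base_res=16, growth=1.5):
    """Concatenate tri-linearly interpolated hash-grid features h^i(x) over all levels.

    x:      point in [0, 1]^3
    tables: array (L, table_size, d) of per-level feature tables (learned in practice)
    """
    n_levels, table_size, d = tables.shape
    primes = np.array([1, 2654435761, 805459861], dtype=np.uint64)
    feats = []
    for lvl in range(n_levels):
        res = int(base_res * growth ** lvl)
        pos = np.asarray(x, dtype=float) * res
        cell = np.floor(pos).astype(np.uint64)
        frac = pos - cell
        f = np.zeros(d)
        for corner in range(8):                      # 8 corners of the enclosing voxel
            offs = np.array([(corner >> k) & 1 for k in range(3)], dtype=np.uint64)
            h = int(np.bitwise_xor.reduce((cell + offs) * primes) % np.uint64(table_size))
            w = np.prod(np.where(offs == 1, frac, 1.0 - frac))   # tri-linear weight
            f += w * tables[lvl][h]
        feats.append(f)
    return np.concatenate(feats)                     # h(x), shape (L * d,)

# usage: random tables for illustration only
rng = np.random.default_rng(0)
feature = hash_encode([0.3, 0.5, 0.7], rng.normal(scale=1e-2, size=(8, 2**14, 2)))
```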

Training: The primary goal of our study is to produce high-quality rendering models of instances. We utilise the rendering loss, denoted $\mathcal{L}_{color}$, as it quantifies the discrepancy between the rendered images and the input images. In addition, to improve MLP learning under sparse views, we also apply $\mathcal{L}_{occ}$ from (15). Therefore, given a set of posed images $I_{gt}$ and the predicted renderings from the network $I_{pred}$, the training loss $\mathcal{L}_{MLP}$ is defined as:

$\mathcal{L}_{MLP}=\mathcal{L}_{color}+\mathcal{L}_{occ}$ (19)

where $\mathcal{L}_{color}=(1/N)\cdot\sum_{i=1}^{N}\|I_{gt}^{(i)}-I_{pred}^{(i)}\|_{2}^{2}$ and $N$ is the total number of images in the dataset.

III-D Hierarchy Map Representation

A hierarchical scene representation that combines large-scale mapping with local high-fidelity rendering is introduced, comprising a structure level and an instance level.

I. Structure-level representation: The coarse-level map is designed to extract high-level information, focusing on the structure of the scene and the plants in the environment. In this context, we construct a map of the farm by creating a coloured point cloud map and its corresponding semantic map.

II. Instance-level representation: Each plant $o_{i}$ in the environment is meticulously represented and captured in fine detail through the robot's visual image stream. In the context of phenotyping, each detected plant is associated with a high-fidelity neural rendering model, accompanied by its corresponding 3D model.

IV Experiments

IV-A Pheno-Robot Hardware

The Pheno-Robot is equipped with a 32-line LiDAR, a Realsense D435 camera and a 9-axis IMU (see Fig.7 (a) and (b)) for navigation. Communication between the different modules is via the common message layer on the ROS server. The sensors are connected to the computer via ROS-Noetic on Ubuntu 20.04. A 4K GoPro Hero-11 is mounted on a gimbal stabilizer on the left side, and the image stream from the GoPro is transmitted to the computer through the GoPro-ROS-node. Plant modelling by NeRF training is performed remotely, with data transmission via the 5G wireless network.

Figure 7: (a) and (b): Pheno-Robot systems; (c) and (d): the demonstration of the EPM results in two test scenarios.

IV-B Evaluation on Pheno-Robot system

IV-B1 Evaluation on EPM

This section evaluates the performance of the EPM in typical agricultural environments. Two different settings, shown in Fig. 7 (a) and (b), were selected for the system evaluation, and their corresponding semantic maps are shown in Fig. 7 (c) and (d), respectively. The results show the precise instance-level understanding of the EPM of the point cloud map in agricultural scenarios, with an overall recall and precision for detection of 0.95 and 1.0, respectively. The average position error and bounding box error are measured to be 0.06 m and 0.02 m, respectively. Particularly in agricultural landscapes with row structures, our method demonstrates the ability to detect such features and generate a graph map for robot autonomy. Furthermore, the method is versatile and supports both online and offline modes depending on specific requirements. For this study, we perform the semantic extraction process offline.

IV-B2 Evaluation on MPM

This section evaluates the performance of the MPM for automated phenotyping by robots. We conducted tests in two different environments, shown in Fig. 8 and 9, corresponding to the environments shown in Fig. 7 (a) and (b), respectively.

Figure 8: Planning scene and motion planning results of the MPM in orchard-like environments.
Figure 9: Planning scene and motion planning results of the MPM in greenhouses.

In the experiment, target plants requiring phenotyping were randomly selected. The results show that the presented MPM achieves a high success rate and meets the robot's requirements in both scenarios, as shown in (a) of Figs. 8 and 9. Conventional A$^{*}$ or Dijkstra planners may generate inappropriate global trajectories that lead to turning movements in narrow channels and cause local motion planning failures for car-like robots; in comparison, our graph-based global planner performs better. Beyond the global planner, the developed local trajectory planner also shows strong robustness in field environments, as illustrated in (b) and (c) of Figs. 8 and 9, respectively. A significant factor affecting the performance of the local trajectory planner is the traversability analysis. Given the complex terrain in agricultural environments, traversability estimation may overestimate or underestimate risk areas during planning, leading to motion planning failures. In the experiment, the robot was fitted with an additional solid-state LiDAR at the front to improve its perception under complex conditions. In addition, a replanning mechanism is employed for local motion planning, with an update frequency of 5 Hz and a forward planning distance of 10 m. Overall, our system demonstrates robust performance in the field with appropriate parameter tuning. The maximum speed during sampling is 0.2 m/s, while the maximum speed under other conditions is 1 m/s.

Figure 10: Demonstration of hierarchy maps of two scenarios in orchard-like environments (A) and greenhouse (B). The a) of both A and B are global-level maps of the scenes, and b) are the detailed models of instances in the environments.

IV-B3 Evaluation on in-situ Phenotype

This section assesses the performance of the IPM. We compare the in-situ phenotyping models of both scenarios using handheld data acquisition and robot-automated acquisition. The quality of the results is evaluated by the Peak Signal-to-Noise Ratio (PSNR), as detailed in Table I.

TABLE I: Comparison of modelling quality

                              PSNR (dB)         Time (min)
Dataset       Method          HA      RA        HA      RA
Outdoor       Instant-NGP     24.8    22.5      2.4     3.2
              Ours            25.2    24.2      2.3     2.7
Greenhouse    Instant-NGP     23.4    21.5      3.1     5.2
              Ours            23.3    22.7      2.3     4.1

HA: handheld acquisition; RA: robot-automated acquisition.

In the initial experiments, both vanilla Instant-NGP and our model showed fast convergence, taking less than 2 minutes when using hand-collected samples, resulting in a PSNR above 24 dB and indicating high-quality results. However, when using robot-collected samples, Instant-NGP took over 4 minutes to converge and achieved a PSNR below 23 dB. In contrast, our method achieved a PSNR above 23.5 dB and converged in 3 minutes, demonstrating the effectiveness of occlusion regularisation in NeRF training with sparse-view inputs, which effectively mitigates geometric estimation errors in the renderings. Moving on to the greenhouse experiments, the quality of modelling with both Instant-NGP and our model decreased, with PSNR values of around 23 dB in approximately 3 minutes when using handheld collected samples. This reduction in quality is mainly due to the complex geometry of the canopy leading to occlusion. When modelling with robotically collected samples, both models showed a decrease in model quality along with a longer training convergence time. In comparison, our model performed better under these field conditions. The hierarchy maps for both scenarios are shown in Fig. 10, revealing global structure-aware maps of the farms and detailed models of individual plants. These results highlight the ability of the robotic system to achieve high-quality in-situ phenotyping in complex agricultural environments, demonstrating resilience despite certain limitations.

IV-C Demonstration of Digital-Modelling for Simulation

Our system can streamline the creation of virtual environments for robot development using Isaac Sim, an advanced simulator known for realistic physics, rich sensor simulation, and graphical rendering. The environment model, derived from the terrain and tree models of the hierarchy maps, is seamlessly integrated (see Fig. 11).

Figure 11: Isaac simulation environments of a) robot navigation and b) robot harvesting.

The first scenario simulates a car-like robot equipped with LiDAR operating in orchards (Fig. 11 (a)). The real-world terrain is converted into 2D images and meshes are constructed based on these images. Tree locations are imported directly from EPM predictions and tree models are generated by extracting meshes from the renderings. In the second scenario (Fig. 11 (b) and (c)), fruit models with fragile joints are added to the trees, creating a reinforcement learning environment for robotic harvesting tasks. The harvesting robot, equipped with a UR-5 manipulator, a soft gripper and a camera on the remote robot base, is trained and evaluated in this virtual environment.

V Conclusion

This study investigates the utilisation of robots in plant phenotyping to increase efficiency and reduce labour-intensive tasks. The proposed system consists of three key sub-systems that address environmental information processing, motion planning for data acquisition, and modelling using data collected by the robot. Experimental results demonstrate the effectiveness of the system, particularly in outdoor environments with mild terrains, where the robot can collect high-quality samples leading to superior plant phenotypic models. For undulating terrains, typically challenging for plant phenotyping, our enhancements to the NeRF model also exhibit promise in generating quality plant phenotypic models. Future research will focus on further refining phenotyping quality by exploiting the robotic system, incorporating the robotic arm, and making additional improvements to the NeRF model to minimise artefacts in challenging environments.

References

  • [1] L. Fu, F. Gao, J. Wu, R. Li, M. Karkee, and Q. Zhang, “Application of consumer rgb-d cameras for fruit detection and localization in field: A critical review,” Computers and Electronics in Agriculture, vol. 177, p. 105687, 2020.
  • [2] S. Wu, W. Wen, Y. Wang, J. Fan, C. Wang, W. Gou, and X. Guo, “Mvs-pheno: a portable and low-cost phenotyping platform for maize shoots using multiview stereo 3d reconstruction,” Plant Phenomics, 2020.
  • [3] D. M. Deery and H. G. Jones, “Field phenomics: Will it enable crop improvement?” Plant Phenomics, 2021.
  • [4] Y. Zhang and N. Zhang, “Imaging technologies for plant high-throughput phenotyping: a review,” Frontiers of Agricultural Science and Engineering, vol. 5, no. 4, pp. 406–419, 2018.
  • [5] R. Xu and C. Li, “A modular agricultural robotic system (mars) for precision farming: Concept and implementation,” Journal of Field Robotics, vol. 39, no. 4, pp. 387–409, 2022.
  • [6] B. Dellen, H. Scharr, and C. Torras, “Growth signatures of rosette plants from time-lapse video,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 6, pp. 1470–1478, 2015.
  • [7] L. Feng, S. Chen, C. Zhang, Y. Zhang, and Y. He, “A comprehensive review on recent applications of unmanned aerial vehicle remote sensing with various sensors for high-throughput plant phenotyping,” Computers and electronics in agriculture, vol. 182, p. 106033, 2021.
  • [8] S. Paulus, “Measuring crops in 3d: using geometry for plant phenotyping,” Plant methods, vol. 15, no. 1, pp. 1–13, 2019.
  • [9] N. Virlet, K. Sabermanesh, P. Sadeghi-Tehran, and M. J. Hawkesford, “Field scanalyzer: An automated robotic field phenotyping platform for detailed crop monitoring,” Functional Plant Biology, vol. 44, no. 1, pp. 143–153, 2016.
  • [10] S. Paulus, H. Schumann, H. Kuhlmann, and J. Léon, “High-precision laser scanning system for capturing 3d plant architecture and analysing growth of cereal plants,” Biosystems Engineering, vol. 121, pp. 1–11, 2014.
  • [11] D. Xiong, D. Wang, X. Liu, S. Peng, J. Huang, and Y. Li, “Leaf density explains variation in leaf mass per area in rice between cultivars and nitrogen treatments,” Annals of Botany, vol. 117, no. 6, pp. 963–971, 2016.
  • [12] M. S. A. Mahmud, M. S. Z. Abidin, A. A. Emmanuel, and H. S. Hasan, “Robotics and automation in agriculture: present and future applications,” Applications of Modelling and Simulation, vol. 4, pp. 130–140, 2020.
  • [13] T. Duckett, S. Pearson, S. Blackmore, B. Grieve, W.-H. Chen, G. Cielniak, J. Cleaversmith, J. Dai, S. Davis, C. Fox et al., “Agricultural robotics: the future of robotic agriculture,” arXiv preprint arXiv:1806.06762, 2018.
  • [14] W. Au, H. Zhou, T. Liu, E. Kok, X. Wang, M. Wang, and C. Chen, “The monash apple retrieving system: a review on system intelligence and apple harvesting performance,” Computers and Electronics in Agriculture, vol. 213, p. 108164, 2023.
  • [15] D. Albani, J. IJsselmuiden, R. Haken, and V. Trianni, “Monitoring and mapping with robot swarms for agricultural applications,” in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).   IEEE, 2017, pp. 1–6.
  • [16] M. Jansen, F. Gilmer, B. Biskup, K. A. Nagel, U. Rascher, A. Fischbach, S. Briem, G. Dreissen, S. Tittmann, S. Braun et al., “Simultaneous phenotyping of leaf growth and chlorophyll fluorescence via growscreen fluoro allows detection of stress tolerance in arabidopsis thaliana and other rosette plants,” Functional Plant Biology, vol. 36, no. 11, pp. 902–914, 2009.
  • [17] H. Kang, H. Zhou, X. Wang, and C. Chen, “Real-time fruit recognition and grasping estimation for robotic apple harvesting,” Sensors, vol. 20, no. 19, p. 5670, 2020.
  • [18] H. Kang, X. Wang, and C. Chen, “Accurate fruit localisation using high resolution lidar-camera fusion and instance segmentation,” Computers and Electronics in Agriculture, vol. 203, p. 107450, 2022.
  • [19] Y. Pan, H. Cao, K. Hu, H. Kang, and X. Wang, “A novel perception and semantic mapping method for robot autonomy in orchards,” arXiv e-prints, pp. arXiv–2308, 2023.
  • [20] M. Zucker, N. Ratliff, A. D. Dragan, M. Pivtoraiko, M. Klingensmith, C. M. Dellin, J. A. Bagnell, and S. S. Srinivasa, “Chomp: Covariant hamiltonian optimization for motion planning,” The International journal of robotics research, vol. 32, no. 9-10, pp. 1164–1193, 2013.
  • [21] C. Rösmann, F. Hoffmann, and T. Bertram, “Kinodynamic trajectory optimization and control for car-like robots,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 5681–5686.
  • [22] W. Zhang, J. Qi, P. Wan, H. Wang, D. Xie, X. Wang, and G. Yan, “An easy-to-use airborne lidar data filtering method based on cloth simulation,” Remote sensing, vol. 8, no. 6, p. 501, 2016.
  • [23] Y. Cai, W. Xu, and F. Zhang, “ikd-tree: An incremental kd tree for robotic applications,” arXiv preprint arXiv:2102.10808, 2021.
  • [24] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
  • [25] J. Yang, M. Pavone, and Y. Wang, “Freenerf: Improving few-shot neural rendering with free frequency regularization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8254–8263.
  • [26] B. Lévy, S. Petitjean, N. Ray, and J. Maillot, “Least squares conformal maps for automatic texture atlas generation,” in Seminal Graphics Papers: Pushing the Boundaries, Volume 2, 2023, pp. 193–202.
  • [27] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics (ToG), vol. 41, no. 4, pp. 1–15, 2022.