MoMa-Pos: An Efficient Object-Kinematic-Aware Base Placement
Optimization Framework for Mobile Manipulation
Abstract
In this work, we present MoMa-Pos, a framework that optimizes base placement for mobile manipulators, focusing on navigation-manipulation tasks in environments with both rigid and articulated objects. Base placement is particularly critical in such environments, where improper positioning can severely hinder task execution if the object’s kinematics are not adequately accounted for. MoMa-Pos selectively reconstructs the environment by prioritizing task-relevant key objects, enhancing computational efficiency and ensuring that only essential kinematic details are processed. The framework leverages a graph-based neural network to predict object importance, allowing for focused modeling while minimizing unnecessary computations. Additionally, MoMa-Pos integrates inverse reachability maps with environmental kinematic properties to identify feasible base positions tailored to the specific robot model. Extensive evaluations demonstrate that MoMa-Pos outperforms existing methods in both real and simulated environments, offering improved efficiency, precision, and adaptability across diverse settings and robot models. Supplementary material can be found at https://yding25.com/MoMa-Pos

I Introduction
Mobile manipulators, typically consisting of a mobile base and a robotic arm, are increasingly deployed in everyday environments like homes, where they are tasked with a wide range of navigation-manipulation tasks [1, 2]. These robots generally navigate to targets such as fridges and then execute manipulation tasks like opening fridge doors. It is well-recognized that the success of these tasks is heavily dependent on the robots’ base placement [3, 4, 5, 6]. Improper placement can significantly hinder the robots’ ability to execute fixed-base manipulation, where the mobile bases remain stationary while the robotic arms perform the manipulation tasks at hand.
However, determining feasible base placement remains challenging, particularly in household environments that contain both rigid and articulated objects. Physically interacting with articulated objects requires robots to be kinematically aware of these objects’ specific joint movements and constraints during manipulation. For instance, when opening a fridge door, the robot must account for the door’s rotational movement and position itself correctly to exert the necessary force along the appropriate kinematic trajectory. Therefore, kinematic modeling of these articulated objects is essential. While technologies in this area have advanced considerably [7], the associated computational costs remain significant. The cumulative modeling time of all objects in a given scenario can be substantial, further diminishing the overall efficiency of robotic systems. Additionally, a feasible base placement is highly dependent on the robot model. Robots with shorter arms, such as Stretch [8], need to place themselves closer to target objects due to their limited arm reach.
Existing approaches to computing base positions have several limitations. Basic methods [9, 10] set base placement areas empirically, without accounting for critical factors such as the kinematics of articulated objects or the specific robot model, which limits their applicability in complex scenarios. More advanced methods use inverse reachability maps [11], relying on precomputed kinematic solutions to guide base placement [12, 13, 11, 14, 15, 16]. However, these approaches overlook the kinematic complexities of articulated objects, whose motion and configuration dependencies impose constraints beyond what reachability maps capture. Moreover, these methods typically incur high computational costs, making them impractical for real-time applications. Learning-based approaches predict feasible base placements by considering the kinematic properties of objects, the environment, and specific robot models [4, 3, 17, 5, 18, 19]. However, they require extensive data collection and training, often tailored to predefined robot-environment pairs, which limits their generalizability to real-world applications. Given these limitations, there is a pressing need for an efficient and generalizable framework that incorporates object-kinematic awareness and adapts to various robot models.
This work introduces MoMa-Pos, a framework designed to determine optimal base positioning for mobile manipulators prior to executing navigation-manipulation tasks. A central aspect of MoMa-Pos is its ability to prioritize key objects in the scene, enabling efficient modeling without processing every object in the environment. By leveraging a graph-based neural network, our method predicts the importance of task-relevant objects, allowing for focused computation on critical elements. This selective modeling integrates advanced algorithms like URDFormer [7], facilitating accurate kinematic representations where necessary. The optimization of base positions in MoMa-Pos is distinguished by its integration of two critical factors: the use of inverse reachability maps to account for robot-specific constraints, and the incorporation of objects’ kinematic properties from the environment to identify feasible base placements. This dual strategy refines potential base position areas, balancing physical feasibility with task-specific efficiency, leading to enhanced precision and adaptability. Extensive experiments in both real and simulated environments demonstrate the superior performance of MoMa-Pos, consistently outperforming existing methods in terms of efficiency and accuracy across various scenarios, parameters, and robot models.

II Problem Statement
In this section, we define the problem of computing the optimal base placement in an environment, denoted as $\mathcal{E}$, which contains a set of objects represented as $\mathcal{O} = \{o_1, o_2, \dots, o_N\}$. Each object $o_i$ is characterized by its 3D position $p_i$ and 3D bounding box $b_i$. The target object within these is specified as $o_{tgt}$, and the mobile manipulator is tasked with manipulating object $o_{tgt}$. The robot model, consisting of a base, a body, and an arm, is denoted by the tuple $(s_{base}, h_{body}, l_{arm})$, where $s_{base}$ represents the base dimensions (width, length, height), $h_{body}$ is the body height, and $l_{arm}$ is the arm extension length. The robot's 2D base position is denoted by $p_{base}$, which is assumed to be vertically aligned with the arm base. The objective is to determine a feasible base placement $p_{base}$ for navigation-manipulation tasks. This position must meet two critical criteria: the arm must be able to compute collision-free trajectories for task execution, and the base must avoid collisions with any objects in $\mathcal{O}$.
To obtain the position $p_i$ and bounding box $b_i$ for each object $o_i$, the robot first scans the environment to collect 3D point cloud data and videos. The point cloud is processed by a pre-trained Mask3D model [20] to segment objects and extract their corresponding 3D bounding boxes. After removing outliers based on a distance threshold, the remaining points are used to compute an axis-aligned 3D bounding box, with $p_i$ defined as its geometric center. Additionally, the video data is used to select RGB images with optimal views for further kinematic modeling.
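To make the setup concrete, the sketch below shows one way the problem's data could be organized. The names (`SceneObject`, `RobotModel`, `aabb_from_points`) and the outlier threshold are our illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneObject:
    """One object o_i: 3D position p_i and axis-aligned bounding box b_i."""
    name: str
    position: np.ndarray   # (3,) geometric center of the bounding box
    bbox_min: np.ndarray   # (3,) lower corner of the AABB
    bbox_max: np.ndarray   # (3,) upper corner of the AABB

@dataclass
class RobotModel:
    """Robot tuple (s_base, h_body, l_arm) from Sec. II."""
    base_dims: tuple       # (width, length, height) of the mobile base
    body_height: float     # h_body
    arm_length: float      # l_arm, maximum arm extension

def aabb_from_points(points: np.ndarray, max_dist: float = 2.0) -> SceneObject:
    """Build an axis-aligned bounding box after simple outlier removal.

    Points farther than max_dist (an assumed threshold) from the segment's
    median are discarded, mirroring the distance-threshold filtering above.
    """
    center = np.median(points, axis=0)
    keep = np.linalg.norm(points - center, axis=1) < max_dist
    pts = points[keep]
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    return SceneObject("object", (lo + hi) / 2.0, lo, hi)
```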
III The MoMa-Pos Framework
The proposed MoMa-Pos framework consists of four phases: object importance prediction for kinematic modeling, kinematic modeling of key objects, base placement optimization, and navigation to a feasible base placement, as illustrated in Fig. 1.
III-A Object Importance Prediction for Kinematic Modeling
The objective of object importance prediction is to select a subset of objects, $\mathcal{O}_{key} \subseteq \mathcal{O}$, from the object set for kinematic modeling, enabling efficient computation without processing every object. Rather than directly selecting objects in an environment, we evaluate each object's importance and select those with an importance score exceeding a predefined threshold $\eta$. This approach allows for flexibility: if the initial selection is insufficient, reducing $\eta$ can include additional objects [21]. We develop a prediction model $f(o_i, o_{tgt})$, whose output $s_i$ represents the likelihood of including object $o_i$ in $\mathcal{O}_{key}$ based on its importance relative to the target object $o_{tgt}$. Due to the challenges of obtaining precise importance labels for all objects in diverse environments [21], we adopt an unsupervised learning approach, guided by the observation that an object's importance is often correlated with its spatial proximity to $o_{tgt}$ and its size $b_i$. This correlation enables us to infer importance without exact labels.
Our method structures the environment within a graph-based architecture $G = (V, E)$ to calculate the spatial proximity between objects $o_i$ and $o_{tgt}$ while considering object size $b_i$. In $G$, each object $o_i$ is represented as a node $v_i$, with attributes such as size $b_i$. The spatial relationships between objects are captured by edges $E$. An edge $e_{ij}$ exists if object $o_i$ is spatially related to $o_j$ (e.g., on, in, or inside). The direction of the edge indicates spatial hierarchy, and the edge weight is defined as $w_{ij} = d_{ij}$, where $d_{ij}$ is the horizontal distance between $o_i$ and $o_j$. To quantify the spatial proximity between nodes $v_i$ and $v_{tgt}$, we explored multiple algorithms, such as word2vec [22], which proved inadequate at capturing the spatial information. After a thorough evaluation, we selected the DeepWalk algorithm [23]. Node sequences are generated through biased random walks, with the transition probability weighted by a function $g(w_{ij}, b_j)$ of edge weight and object size, where $\lambda$ is a tuning parameter that balances the two factors. This bias ensures the walks reflect both spatial distance and object size, generating sequences that better capture object importance. Formally, the probability of stepping from node $v_i$ to an adjacent node $v_j$ is

$$P(v_j \mid v_i) = \frac{g(w_{ij}, b_j)}{\sum_{v_k \in N(v_i)} g(w_{ik}, b_k)},$$
where $N(v_i)$ represents the set of nodes adjacent to $v_i$. After obtaining node embeddings, we compute the importance score $s_i$ of each object $o_i$ by calculating the cosine similarity [24] between its embedding and that of the target object $o_{tgt}$. Objects with a score exceeding a threshold $\eta$, where $0 < \eta < 1$, are included in the set $\mathcal{O}_{key}$. The threshold is adjustable, starting at a higher value and decreasing if the system requires more objects to achieve optimal performance.
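As a rough illustration of this phase, the sketch below runs skip-gram over biased random walks and scores nodes by cosine similarity to the target. It assumes `G` is a `networkx`-style graph with `weight` edge attributes and `size` node attributes; the concrete bias form `exp(-lam * w_ij) * size_j` is our assumption for $g(\cdot)$, since the paper states only that the weighting combines distance and size via a tuning parameter:

```python
import numpy as np
from gensim.models import Word2Vec  # skip-gram backend for DeepWalk [23]

def biased_walk(G, start, length, lam=1.0, rng=None):
    """One random walk whose step probabilities favor close, large neighbors.

    Assumed bias: g(w_ij, b_j) = exp(-lam * w_ij) * size_j.
    """
    rng = rng or np.random.default_rng()
    walk = [start]
    for _ in range(length - 1):
        nbrs = list(G.neighbors(walk[-1]))
        if not nbrs:
            break
        scores = np.array([
            np.exp(-lam * G.edges[walk[-1], n]["weight"]) * G.nodes[n]["size"]
            for n in nbrs
        ])
        idx = rng.choice(len(nbrs), p=scores / scores.sum())
        walk.append(nbrs[idx])
    return walk

def importance_scores(G, target, n_walks=20, walk_len=8, dim=32):
    """DeepWalk-style embedding: skip-gram over biased walks, then the
    cosine similarity of every node's vector to the target's vector."""
    walks = [biased_walk(G, n, walk_len)
             for n in G.nodes for _ in range(n_walks)]
    model = Word2Vec([[str(v) for v in w] for w in walks],
                     vector_size=dim, window=4, min_count=1, sg=1)
    t = model.wv[str(target)]
    t = t / np.linalg.norm(t)
    return {n: float(np.dot(model.wv[str(n)] / np.linalg.norm(model.wv[str(n)]), t))
            for n in G.nodes}
```

Objects whose score exceeds the current $\eta$ would then form $\mathcal{O}_{key}$.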
III-B Kinematic Modeling of Key Objects
The purpose of this subsection is to conduct kinematic modeling of key objects within simulation environments. These models are used in the base placement computation for navigation planning; they also facilitate the prediction of manipulation waypoints [25] for articulated objects, enabling effective execution of real-world manipulation tasks. To achieve this, we employ the state-of-the-art URDFormer, an end-to-end model based on Vision Transformers (ViTs) that captures an object's kinematic structure by generating a URDF model from a single input image. However, URDFormer cannot directly ground the model to real-world objects, often leading to distortions in size and pose when rescaling the detected objects.
One reason, discovered through our exploration, is that the input image's viewpoint critically affects modeling accuracy, with frontal views producing the best results. To address this, we design an algorithm to select the optimal-viewpoint image of the target object from video streams, ensuring that the input to URDFormer yields more accurate kinematic modeling. This process uses Detectron2 [26] to detect the target object in each video frame, followed by sorting the frames based on how centered the object is in the image. We then fine-tune a ResNet50 model [27] on a dataset we created, comprising 1,000 labeled images (frontal and non-frontal) of over 20 everyday objects such as refrigerators and microwaves, to classify whether the object in each frame faces forward. The highest-ranked image classified as "frontal" is selected as the input for URDFormer to generate an initial URDF model with interactive structures.
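A condensed sketch of this frame-selection step follows, assuming frames arrive as PIL images with detection boxes already produced by a Detectron2-style detector; the helper names and the class-index convention (1 = frontal) are illustrative, and the ResNet50 head would still need fine-tuning on the frontal/non-frontal dataset described above:

```python
import torch
from torchvision import models, transforms

def centrality(frame_size, box):
    """Distance of the detection center from the image center (smaller = better)."""
    (W, H), (x0, y0, x1, y1) = frame_size, box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    return ((cx - W / 2) ** 2 + (cy - H / 2) ** 2) ** 0.5

def pick_frontal_frame(frames, boxes, classifier, device="cpu"):
    """Rank frames by how centered the target is, then return the first
    one that the binary classifier labels 'frontal' (class 1, assumed)."""
    prep = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    order = sorted(range(len(frames)),
                   key=lambda i: centrality(frames[i].size, boxes[i]))
    for i in order:
        x = prep(frames[i]).unsqueeze(0).to(device)
        with torch.no_grad():
            if classifier(x).argmax(dim=1).item() == 1:
                return frames[i]
    return frames[order[0]]  # fall back to the most centered frame

# Classifier sketch: ResNet50 [27] with a 2-way head (frontal / non-frontal).
classifier = models.resnet50(weights="IMAGENET1K_V2")
classifier.fc = torch.nn.Linear(classifier.fc.in_features, 2)
classifier.eval()
```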
Furthermore, to address the inherent limitations of single-image 2D input, which can cause the generated URDF model to misalign the simulation with the real-world 3D environment, we correct the model using precise 3D information from point clouds. Specifically, we use the 3D bounding boxes generated by Mask3D [20] to align the object's size and pose, ensuring that the URDF model accurately simulates an interactive environment mirroring the real world. In summary, given a recorded environment video, our algorithm selects the best frontal image of the target object and uses 3D bounding boxes to align the object's size and pose, generating an accurate URDF model. This allows us to precisely locate the actionable parts of articulated objects at any opening angle, ensuring alignment with real-world conditions; the continuous location information across angles forms the object's manipulation waypoints. Note that rigid objects do not require modeling with URDFormer: they can be approximated as bounding boxes, with dimensions corresponding to the object's bounding size. This strategy significantly improves efficiency.
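The size-and-pose correction can be pictured with the minimal sketch below, which rescales box geometries and joint origins in a URDF to match a Mask3D-style axis-aligned bounding box. The tag layout it assumes (box primitives, per-axis scaling of joint origins) is a simplification; URDFormer outputs containing meshes would need their mesh `scale` attributes updated in the same way:

```python
import numpy as np
import xml.etree.ElementTree as ET

def align_urdf_to_bbox(urdf_path, bbox_min, bbox_max, out_path):
    """Rescale a URDF so its extent matches a real-world 3D bounding box.

    A minimal sketch: assumes the URDF uses <box size="x y z"> primitives
    and derives a per-axis scale from the first box's nominal extent.
    """
    tree = ET.parse(urdf_path)
    target = np.asarray(bbox_max) - np.asarray(bbox_min)  # real-world extent

    boxes = tree.findall(".//geometry/box")
    nominal = np.array([float(s) for s in boxes[0].get("size").split()])
    scale = target / nominal  # per-axis correction factor

    for box in boxes:
        size = np.array([float(s) for s in box.get("size").split()])
        box.set("size", " ".join(f"{v:.4f}" for v in size * scale))

    # Joint origins must move with the geometry they connect, so the
    # rotation axes (e.g., a fridge door hinge) stay in the right place.
    for origin in tree.findall(".//joint/origin"):
        xyz = np.array([float(s) for s in origin.get("xyz", "0 0 0").split()])
        origin.set("xyz", " ".join(f"{v:.4f}" for v in xyz * scale))

    tree.write(out_path)
```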

III-C Base Placement Optimization
To optimize the base placement, we first calculate the potential area for base positions (denoted as $\mathcal{A}$) in which there are no collisions with surrounding objects. We then refine this area by considering both the robot's constraints and the objects' kinematics when determining base placements.
To obtain the robot's initial potential base position area $\mathcal{A}$, we consider both the robot's model and the target object's position to avoid any collisions. The potential area is calculated according to the following equation:

$$\mathcal{A} = \{\, p \mid \|p - p_{tgt}\|_{xy} \le d_{max} \ \wedge\ \neg\,\mathrm{overlap}(p, \mathcal{O}) \,\}$$
Here, $d_{max}$ represents the maximum allowable horizontal distance between a base position $p$ and the target $o_{tgt}$, illustrated as an orange line in Fig. 2. The overlap function checks for collisions between the robot's base and surrounding objects by comparing their positions and boundaries. More specifically, $d_{max}$ is calculated as $d_{max} = \sqrt{l_{arm}^2 - d_z^2}$, where $l_{arm}$ is the arm extension range (shown in green in the figure), derived from the robot's DH matrix. The z-axis distance between the arm base and $o_{tgt}$ is denoted by $d_z$, marked as a blue line in the figure.
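A minimal sketch of this computation, reusing the `RobotModel` and `SceneObject` fields assumed in Sec. II: it grid-samples positions around the target, keeps those within $d_{max}$, and rejects any whose base footprint overlaps an object's 2D bounding box. The grid resolution and the use of the body height as the arm-base height are assumptions:

```python
import numpy as np

def potential_area(robot, p_tgt, objects, resolution=0.05):
    """Grid-sample collision-free 2D base positions within reach of p_tgt."""
    d_z = p_tgt[2] - robot.body_height            # vertical offset to target
    if abs(d_z) >= robot.arm_length:
        return np.empty((0, 2))                   # target entirely out of reach
    d_max = np.sqrt(robot.arm_length**2 - d_z**2)

    w, l, _ = robot.base_dims
    half = np.array([w, l]) / 2.0
    xs = np.arange(p_tgt[0] - d_max, p_tgt[0] + d_max, resolution)
    ys = np.arange(p_tgt[1] - d_max, p_tgt[1] + d_max, resolution)

    def base_overlaps(p):
        # 2D AABB-vs-AABB test between base footprint and each object.
        for obj in objects:
            lo, hi = obj.bbox_min[:2], obj.bbox_max[:2]
            if np.all(p + half > lo) and np.all(p - half < hi):
                return True
        return False

    pts = [(x, y) for x in xs for y in ys
           if np.hypot(x - p_tgt[0], y - p_tgt[1]) <= d_max
           and not base_overlaps(np.array([x, y]))]
    return np.array(pts).reshape(-1, 2)
```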
To further enhance the base placement optimization process, we integrate an inverse reachability map to account for robot-specific constraints and combine it with the potential field method to avoid interference with furniture and obstacles along their possible kinematic trajectories. The inverse reachability map, calculated offline, tells us from which positions in $\mathcal{A}$ the robot can reach the target given its joint configurations, allowing us to precompute feasible base positions within the area. By selecting base positions directly from the map rather than through random sampling, we ensure that each position satisfies the robot's kinematic constraints. Inspired by the potential field method [28], we then assess the feasibility of each candidate position, treating the target object as exerting an attractive force. For each candidate base position $p$, we sample a 3D position on the target and connect it to $p$ with a line segment. If this segment collides with any furniture bounding box, the potential value from the potential field is set to zero, indicating the robot cannot reach the target from that position; otherwise, we compute an attraction value that grows as the distance to the target shrinks. The final potential value is the sum of the value from the inverse reachability map and the value from the potential field, and positions with higher final values are more feasible. Additionally, we incorporate trajectory planning by discretizing the manipulation trajectory into multiple waypoints and computing the potential for each; summing the potentials along the trajectory produces the final potential map, which guides the selection of the optimal base placement. Fig. 3 illustrates the potential values for sampled positions in a kitchen environment.
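The scoring step might look like the sketch below, where `irm_lookup` stands in for the offline inverse reachability map and the attraction term $1/(d + \epsilon)$ is one assumed way to make nearer targets score higher; the paper specifies only that blocked candidates receive zero field value and that reachability and field values are summed over the target and trajectory waypoints:

```python
import numpy as np

def score_candidates(candidates, p_tgt, waypoints, obstacles, irm_lookup,
                     eps=1e-3):
    """Combine inverse-reachability scores with a simple attractive field."""
    def segment_hits_aabb(a, b, lo, hi, steps=20):
        # Coarse collision check: sample points along the segment a -> b.
        for t in np.linspace(0.0, 1.0, steps):
            q = a + t * (b - a)
            if np.all(q >= lo) and np.all(q <= hi):
                return True
        return False

    scores = []
    for p in candidates:
        p3 = np.array([p[0], p[1], 0.0])          # base position on the floor
        total = irm_lookup(p)                     # robot-specific reachability
        for w in [p_tgt] + list(waypoints):       # target + trajectory waypoints
            blocked = any(segment_hits_aabb(p3, np.asarray(w),
                                            o.bbox_min, o.bbox_max)
                          for o in obstacles)
            if not blocked:                       # blocked waypoints add zero
                total += 1.0 / (np.linalg.norm(np.asarray(w) - p3) + eps)
        scores.append(total)
    return np.asarray(scores)
```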
III-D Navigation to Feasible Base Placement
Within area $\mathcal{A}$, we use the Latin Hypercube sampling [29] method to sample $n$ 2D candidate positions, with higher potential values suggesting better feasibility. Instead of naively selecting the position with the highest potential and checking its feasibility via a motion planner like RRT* [30], we aim to balance potential values with navigation costs. This leads to a problem resembling the Open Traveling Salesman Problem (TSP) [31], in which the robot does not return to its starting position.
Given the large number of sampled positions, we group them into smaller sets, each containing $k$ candidates. Within each set, the optimal visiting order is determined by solving the Open TSP, with the edge weight $w_{ij}$ computed as a weighted combination of the travel distance $d_{ij}$ and the potential value $\phi_j$, where $\alpha$ and $\beta$ are user-defined coefficients. Algorithms ranging from exact methods [32] to heuristic approaches [33] can be used, depending on the size of $k$. The search continues until a collision-free trajectory is found, and the corresponding base position is selected.
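Putting III-C and III-D together, a sketch of the sampling-and-ordering step: Latin Hypercube samples are drawn over $\mathcal{A}$, grouped by potential, and each group is ordered with a greedy nearest-neighbor open-tour heuristic [33]. The concrete edge weight $\alpha\, d_{ij} - \beta\, \phi_j$ is an assumed instantiation of the distance-plus-potential combination, and `potential_fn` is a stand-in for the potential map of Sec. III-C:

```python
import numpy as np
from scipy.stats import qmc

def order_candidates(area_lo, area_hi, potential_fn, n=64, k=8,
                     alpha=1.0, beta=1.0, seed=0):
    """Sample n base candidates (n assumed divisible by k), then order each
    group of k with a greedy nearest-neighbor open-TSP heuristic."""
    sampler = qmc.LatinHypercube(d=2, seed=seed)
    pts = qmc.scale(sampler.random(n), area_lo, area_hi)  # 2D positions in A
    phi = np.array([potential_fn(p) for p in pts])

    groups = np.argsort(-phi).reshape(-1, k)              # best groups first
    tours = []
    for g in groups:
        remaining, tour = list(g), []
        cur = remaining.pop(int(np.argmax(phi[g])))       # start at best candidate
        tour.append(cur)
        while remaining:
            weights = [alpha * np.linalg.norm(pts[j] - pts[cur]) - beta * phi[j]
                       for j in remaining]
            cur = remaining.pop(int(np.argmin(weights)))  # open tour: no return edge
            tour.append(cur)
        tours.append([pts[i] for i in tour])
    return tours  # try candidates in this order until PLAN succeeds
```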
Lemma 1 (MoMa-Pos is complete). Given any object importance scorer $f$, threshold $\eta$, and number of sampled positions $n \to \infty$, if the motion planner PLAN is complete, then MoMa-Pos is complete.
Proof:
Given that the codomain of $f$ excludes 0, an $\eta > 0$ exists such that every object is included in $\mathcal{O}_{key}$. With $n \to \infty$, a feasible position, if one exists, will eventually be sampled. Given PLAN's completeness, MoMa-Pos also achieves completeness. Hence, for any task and an infinite number of samples, MoMa-Pos will identify at least one solution through PLAN, assuming one exists. ∎
IV Experiments

In our experiments, we aim to address the following questions: Q1: How capable is MoMa-Pos at determining feasible base placements? Q2: How important is the object importance predictor for efficiency in kinematic modeling? Q3: How accurate is the kinematic modeling?
Experimental Setup: In the real-robot experiment, we construct a simplified home environment in a controlled laboratory setting, as illustrated in Fig. 4. The scene includes five rigid objects (e.g., table, apple, and cup) and two articulated objects (i.e., fridge and microwave); the rigid objects and the microwave are placed randomly on the table, whose position varies, while the fridge remains fixed. The mobile manipulator, shown in Fig. 2, consists of an AgileX-based mobile platform and a Realman 63-F robotic arm, equipped with a RealSense 435 camera for acquiring RGB images, depth images, and point cloud data. The robot is tasked with grasping rigid objects (i.e., the apple and cup) while avoiding obstacles, and with opening articulated items. For the fridge and microwave, we open them to 90 degrees during the experiments. The methods for determining base placements are evaluated across five trials per task, with each scene featuring distinct object configurations. The robot's initial position is kept consistent across all trials. For manipulation, the spatial positions of objects and the actionable parts (e.g., door handles) of articulated objects are determined using Mask3D [20] and Detectron2 [26]. We then employ the RRT* algorithm, implemented via the open-source OMPL library [34], to compute the corresponding manipulation trajectories. For navigation, we utilize the Dynamic Window Approach (DWA) [35], chosen for its responsiveness and stability in complex dynamic environments, ensuring high precision and safety. A UWB-based solution [36], relying on communication for global positioning, is employed, with an accuracy error of approximately 2 cm.
We also perform simulated robot experiments. Unlike the real-world setup, the simulation involves a larger number of everyday objects (30, including a fridge, microwave, and cup) and significantly more tasks, totaling 500. In the simulation, the 3D positions of objects and their actionable components are directly accessible. The simulation tool builds on the open-source BestMan_Pybullet project (https://github.com/AutonoBot-Lab/BestMan_Pybullet). Our experiments were conducted on a computer with an AMD EPYC 7452 processor (32 cores, 64 threads), an Nvidia RTX 2080 GPU, and 128 GB of RAM. All code is publicly available for reproducibility at https://yding25.com/MoMa-Pos
TABLE I: Real-robot results over five trials per task; each cell lists Time (s) / Cost (m) / SRate (%), with time and cost reported as mean ± std.

| Methods | Grasp Apple (rigid) | Grasp Cup (rigid) | Open Fridge (right-hinged) | Open Microwave (left-hinged) |
| MoMa-Pos (ours) | 13.5 ± 2.0 / 3.2 ± 1.2 / 100.0 | 13.2 ± 2.4 / 3.5 ± 1.4 / 100.0 | 17.4 ± 2.3 / 2.3 ± 0.1 / 100.0 | 17.2 ± 1.0 / 3.7 ± 1.6 / 100.0 |
| Habitat | 17.3 ± 0.2 / 3.0 ± 0.0 / 100.0 | 16.9 ± 2.7 / 3.2 ± 0.0 / 100.0 | 15.7 ± 1.4 / 2.2 ± 0.0 / 18.0 | 14.4 ± 0.4 / 3.2 ± 0.0 / 22.0 |
| M3* | 17.8 ± 2.3 / 3.2 ± 1.1 / 100.0 | 17.3 ± 3.2 / 3.3 ± 0.1 / 100.0 | 18.2 ± 8.4 / 2.2 ± 0.1 / 26.0 | 17.4 ± 1.4 / 3.5 ± 0.0 / 44.0 |
| Reuleaux | 19.2 ± 4.3 / 3.6 ± 1.6 / 100.0 | 18.7 ± 3.9 / 3.7 ± 1.1 / 100.0 | 17.7 ± 2.7 / 2.5 ± 0.1 / 70.0 | 18.0 ± 2.0 / 3.6 ± 1.4 / 82.0 |
Baselines: We compare our method against the following reimplemented baselines, inspired by the original approaches:
• Habitat [2]: For rigid objects such as tables, the base placement closest to the target is selected, ensuring it is navigable. For articulated objects like fridges, the base placement is determined from a fixed position relative to the object, guided by prior experience, while ensuring no collisions with the environment.
• M3* [3]: Following the original method, a designated area is divided into uniform cm-scale cells. For rigid objects, the closest navigable position to the target is selected, prioritizing navigation cost. For articulated objects, base positions are randomly sampled within the area until a feasible placement is found. Unlike the original, which employs reinforcement learning to optimize the initial base placement, this baseline lacks such optimization, potentially leading to higher navigation costs; we use * to denote this difference. Additionally, while the original method allows dynamic base repositioning during manipulation to enhance grasping, our implementation uses fixed-base manipulation, which introduces limitations and requires adjustments to accommodate our experimental setup.
• Reuleaux [11]: Similar to M3*, it starts by designating a specific area divided into uniform cm-scale cells. It then evaluates how easily the robot can reach the target without obstacles, choosing base placements randomly while prioritizing those with the highest reachability to ensure optimal accessibility.
Unlike MoMa-Pos, none of these three baselines accounts for object kinematics, which can lead to suboptimal manipulation performance, particularly when handling articulated objects.
Rating Criteria: To ensure a fair comparison, all baselines are evaluated within the same complete working environment. These methods do not account for the kinematic information of articulated objects, treating all objects as simple bounding boxes and thus saving time on kinematic modeling. We evaluate all approaches using the following metrics: task execution time (Time), measured in seconds; navigation cost (Cost), measured in meters; and success rate (SRate). For a more detailed breakdown of the time spent determining feasible positions, we measure the time consumed by each key step of the process, which comprises three stages: predicting object importance, performing kinematic modeling, and optimizing base placement. The time allocated for modeling includes not only the modeling process itself but also the integration of modeled objects into the simulator. To focus on the efficiency of the task's core elements, we exclude navigation time from task execution time, as robot movement speed can introduce significant variability unrelated to task completion efficiency. Instead, the navigation cost metric evaluates spatial movement efficiency, ensuring that it does not interfere with the direct assessment of task performance.
IV-A Results and Analysis
IV-A1 MoMa-Pos vs. Baselines
We aim to address Q1 here. The main results are presented in TABLE I. The table shows that our method consistently achieves a 100% success rate across all trials in the working environment. Compared to the baseline methods (Habitat, M3*, and Reuleaux), the success rate of MoMa-Pos is significantly higher, particularly with articulated objects such as fridges. This is due to our method's ability to account for the kinematic structure of objects, effectively avoiding obstructions. MoMa-Pos also excels in efficiency, outperforming the M3* and Reuleaux baselines. These methods rely on random sampling within a predefined area and often fail to quickly locate a viable position, leading to increased time costs and inconsistent performance, as evidenced by the high variance in their results. Although Habitat is sometimes more efficient, its lower success rate, especially with articulated objects, limits its practical application. Our method not only demonstrates significant advantages in success rate and time efficiency with articulated objects but also keeps the navigation cost of finding feasible solutions low. In our tests, MoMa-Pos consistently identifies a feasible solution within 18 seconds.
IV-A2 Object Importance Predictor for Efficiency in Kinematic Modeling
Q2 is answered here. We begin by setting the threshold $\eta$ to 0.45 within the object importance prediction module. The achieved success rate of 100% confirms that no crucial object is overlooked. Fig. 5 illustrates the modeling time for MoMa-Pos and a variant without the prediction module; incorporating prediction in MoMa-Pos markedly decreases the time cost. Additionally, while the URDFormer-based modeling time for the cup is significantly shorter, the overall time is prolonged by the increased time required for importing objects into the simulator, as the cup's complex surroundings demand more processing than the fridge's.
IV-A3 Kinematic Modeling Accuracy
Q3 is addressed here. TABLE II presents the accuracy of kinematic modeling. With vanilla URDFormer, the generated object model exhibits significant misalignment with the real-world object, resulting in a spatial error of approximately 0.38 m, as shown in the table. The spatial error is calculated from the actionable part (AP) position and the fridge dimensions, with the unit in meters. To improve alignment with real-world objects, we apply two methods to obtain the bounding box: Omni3D [37] and Mask3D. These methods differ in their inputs, with Omni3D utilizing RGB images and Mask3D leveraging depth data. Compared to Omni3D, Mask3D demonstrates significantly higher accuracy, reducing the modeling error to 0.04 m.
TABLE II: Kinematic modeling accuracy.

| Methods | Modeling Accuracy Error (m) |
| Vanilla URDFormer | 0.38 |
| URDFormer+Omni3D | 0.13 |
| URDFormer+Mask3D (ours) | 0.04 |
We also calculate the average time distribution across all tasks, including those involving both rigid and articulated objects, throughout the entire process. TABLE III presents the results. The majority of the time (over 98% of the total) is spent on kinematic modeling and on checking feasibility through a sampling-based motion planner such as RRT*. In contrast, the time spent on candidate selection, including solving the Open TSP, is minimal, accounting for less than 2% of the total time (under 0.3 seconds).

TABLE III: Average time distribution across all tasks.

| Process | Time Distribution (%) |
| Modeling (including importing time) | 58.0 |
| Checking feasibility of samples (OMPL) | 40.0 |
| Potential field computation | 0.9 |
| Importance prediction | 0.8 |
Our framework is further evaluated using three simulated robots: Fetch, Segway+UR5e, and Husky+UR5e. In all trials, MoMa-Pos achieves a 100% success rate, demonstrating its robustness across different robot models and highlighting the versatility of our method across diverse scenarios. Compared to the baselines, MoMa-Pos significantly reduces the effort needed to integrate kinematic models into the simulator. Note that the simulated experiments do not involve the kinematic modeling steps. Additional simulation results are provided in the appendix (https://yding25.com/MoMa-Pos/Appex).
V Conclusion
In this work, we introduce MoMa-Pos, a novel framework designed to optimize base placement for mobile manipulators, particularly in environments containing both rigid and articulated objects. MoMa-Pos addresses key challenges in navigation-manipulation tasks by selectively modeling task-relevant objects, leveraging a graph-based neural network to enhance computational efficiency. By integrating inverse reachability maps and environmental kinematic properties, the framework enables precise base placement tailored to specific robot models, ensuring adaptability and physical feasibility across diverse environments.
References
[1] S. Yenamandra, A. Ramachandran, K. Yadav, A. Wang, M. Khanna, T. Gervet, T.-Y. Yang, V. Jain, A. W. Clegg, J. Turner, et al., “HomeRobot: Open-vocabulary mobile manipulation,” arXiv preprint arXiv:2306.11565, 2023.
[2] A. Szot, K. Yadav, A. Clegg, V.-P. Berges, A. Gokaslan, A. Chang, M. Savva, Z. Kira, and D. Batra, “Habitat Rearrangement Challenge 2022,” 2022.
[3] J. Gu, D. S. Chaplot, H. Su, and J. Malik, “Multi-skill Mobile Manipulation for Object Rearrangement,” arXiv preprint arXiv:2209.02778, 2022.
[4] X. Zhang, Y. Zhu, Y. Ding, Y. Zhu, P. Stone, and S. Zhang, “Visually grounded task and motion planning for mobile manipulation,” arXiv preprint arXiv:2202.10667, 2022.
[5] H. Zhang, K. Mi, and Z. Zhang, “Base Placement Optimization for Coverage Mobile Manipulation Tasks,” arXiv preprint arXiv:2304.08246, 2023.
[6] T. Sandakalum, N. X. Yao, and M. H. Ang, “Inv-Reach Net: Deciding mobile platform placement for a given task,” in 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids). IEEE, 2022, pp. 127–133.
[7] Z. Chen, A. Walsman, M. Memmel, K. Mo, A. Fang, K. Vemuri, A. Wu, D. Fox, and A. Gupta, “URDFormer: A pipeline for constructing articulated simulation environments from real-world images,” 2024. [Online]. Available: https://arxiv.org/abs/2405.11656
[8] C. C. Kemp, A. Edsinger, H. M. Clever, and B. Matulevich, “The design of Stretch: A compact, lightweight mobile manipulator for indoor human environments,” in International Conference on Robotics and Automation (ICRA), 2022, pp. 3150–3157.
[9] A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. S. Chaplot, O. Maksymets, et al., “Habitat 2.0: Training home assistants to rearrange their habitat,” Advances in Neural Information Processing Systems, vol. 34, pp. 251–266, 2021.
[10] X. Puig, E. Undersander, A. Szot, M. D. Cote, T.-Y. Yang, R. Partsey, R. Desai, A. W. Clegg, M. Hlavac, S. Y. Min, et al., “Habitat 3.0: A co-habitat for humans, avatars and robots,” arXiv preprint arXiv:2310.13724, 2023.
[11] A. Makhal and A. K. Goins, “Reuleaux: Robot base placement by reachability analysis,” in IEEE International Conference on Robotic Computing (IRC), 2018, pp. 137–142.
[12] N. Vahrenkamp, T. Asfour, and R. Dillmann, “Robot placement based on reachability inversion,” in IEEE International Conference on Robotics and Automation, 2013, pp. 1970–1975.
[13] F. Zacharias, C. Borst, M. Beetz, and G. Hirzinger, “Positioning mobile manipulators to perform constrained linear trajectories,” in IROS, 2008, pp. 2578–2584.
[14] M. Attamimi, K. Ito, T. Nakamura, and T. Nagai, “A planning method for efficient mobile manipulation considering ambiguity,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 965–972.
[15] F. Zacharias, C. Borst, and G. Hirzinger, “Capturing robot workspace structure: Representing robot capabilities,” in IROS, 2007, pp. 3229–3236.
[16] F. Reister, M. Grotz, and T. Asfour, “Combining navigation and manipulation costs for time-efficient robot placement in mobile manipulation tasks,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 9913–9920, 2022.
[17] A. Gupta, M. E. Shepherd, and S. Gupta, “Predicting Motion Plans for Articulating Everyday Objects,” 2023. [Online]. Available: https://arxiv.org/abs/2303.01484
[18] F. Stulp, A. Fedrizzi, and M. Beetz, “Action-related place-based mobile manipulation,” in IROS, 2009, pp. 3115–3120.
[19] D. Honerkamp, T. Welschehold, and A. Valada, “Learning kinematic feasibility for mobile manipulation through deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 6289–6296, 2021.
[20] J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, and B. Leibe, “Mask3D: Mask Transformer for 3D Semantic Instance Segmentation,” 2023.
[21] T. Silver, R. Chitnis, A. Curtis, J. B. Tenenbaum, T. Lozano-Perez, and L. P. Kaelbling, “Planning with learned object importance in large problem instances using graph neural networks,” in AAAI, vol. 35, no. 13, 2021, pp. 11962–11971.
[22] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013. [Online]. Available: https://arxiv.org/abs/1301.3781
[23] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710.
[24] M. Vijaymeena and K. Kavitha, “A survey on similarity measures in text mining,” Machine Learning and Applications: An International Journal, vol. 3, no. 2, pp. 19–28, 2016.
[25] W. Xia, D. Wang, X. Pang, Z. Wang, B. Zhao, D. Hu, and X. Li, “Kinematic-aware prompting for generalizable articulated object manipulation with LLMs,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 2073–2080.
[26] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detectron2,” https://github.com/facebookresearch/detectron2, 2019.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016.
[28] S. S. Ge and Y. J. Cui, “Dynamic motion planning for mobile robots using potential field method,” Autonomous Robots, vol. 13, pp. 207–222, 2002.
[29] M. McKay, R. Beckman, and W. Conover, “A comparison of three methods for selecting values of input variables in the analysis of output from a computer code,” Technometrics, vol. 21, no. 2, pp. 239–245, 1979.
[30] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion planning,” The International Journal of Robotics Research, vol. 30, no. 7, pp. 846–894, 2011.
[31] H. H. Chieng and N. Wahid, “A performance comparison of genetic algorithm’s mutation operators in n-cities open loop travelling salesman problem,” in Recent Advances on Soft Computing and Data Mining: SCDM-2014. Springer, 2014, pp. 89–97.
[32] S. Boyd and J. Mattingley, “Branch and bound methods,” Notes for EE364b, Stanford University, 2007.
[33] G. Kizilateş and F. Nuriyeva, “On the nearest neighbor algorithms for the traveling salesman problem,” in Advances in Computational Science, Engineering and Information Technology: CCSEIT-2013. Springer, 2013, pp. 111–118.
[34] I. A. Sucan, M. Moll, and L. E. Kavraki, “The Open Motion Planning Library,” IEEE Robotics & Automation Magazine, vol. 19, no. 4, pp. 72–82, 2012.
[35] D. Fox, W. Burgard, and S. Thrun, “The dynamic window approach to collision avoidance,” IEEE Robotics & Automation Magazine, vol. 4, no. 1, pp. 23–33, 1997.
[36] S. Gezici et al., “Localization via ultra-wideband radios: A look at positioning aspects for future sensor networks,” IEEE Signal Processing Magazine, vol. 22, no. 4, pp. 70–84, 2005.
[37] G. Brazil, A. Kumar, J. Straub, N. Ravi, J. Johnson, and G. Gkioxari, “Omni3D: A large benchmark and model for 3D object detection in the wild,” 2023. [Online]. Available: https://arxiv.org/abs/2207.10660