Towards Object-level Multimodal Task Planning for Long-term Robotic Manipulation with Vision Language Model and Behavior Tree

1Nankai University, 2The Hong Kong Polytechnic University, 3Imperial College London
Teaser figure: Our proposed framework combines the Vision-Language Model (VLM) with the Behavior Tree (BT).

Abstract

Long-term robotic manipulation in open environments requires unifying multimodal understanding with reliable, geometry-aware execution. Classical robotic motion planning approaches demand extensive domain modeling and hand-crafted goal specifications, while emerging LLM/VLM pipelines propose semantically plausible plans yet lack feasibility guarantees and executable grounding. To address these limitations, we propose a hierarchical multimodal planning framework that combines VLM-based multimodal perception with behavior tree (BT) planning to bridge high-level semantic reasoning and low-level execution feasibility. Our framework integrates natural language instructions with open-set visual geometry to generate object-level representations and language-conditioned prototype plans. A prompting-to-compilation scheme then converts these plans into behavior trees with explicit controller–status pairs, force conditions, and geometric feasibility checks. The proposed framework is validated through real-world experiments on three long-term robotic manipulation tasks, achieving higher task success rates than VLM-only and BT-only baselines and demonstrating robust, fully autonomous execution without human intervention.
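
Below is a minimal sketch of the perception stage described above: fusing open-set detections with RGB-D geometry into object-level representations, then applying language-conditioned filtering to obtain sub-task goals. All names here (`Detection`, `filter_by_instruction`, `to_subtask_goals`, the dict-based goal format) are hypothetical illustrations, not the paper's actual interfaces, and the VLM/detector calls are abstracted away.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical object-level representation: an open-set detection fused with
# RGB-D geometry (a 3D centroid in the robot frame). All names are illustrative.
@dataclass
class Detection:
    label: str                             # open-vocabulary class name from the detector
    score: float                           # detection confidence
    centroid: Tuple[float, float, float]   # back-projected from the depth image

def filter_by_instruction(detections: List[Detection], instruction: str,
                          min_score: float = 0.5) -> List[Detection]:
    """Language-conditioned filtering: keep detections whose labels appear in
    the instruction and that clear a confidence threshold."""
    text = instruction.lower()
    return [d for d in detections if d.score >= min_score and d.label.lower() in text]

def to_subtask_goals(relevant: List[Detection]) -> List[dict]:
    """Turn filtered detections into instance-level sub-task goals. Here each goal
    is a simple dict; in the framework the verbs would come from the VLM's decomposition."""
    return [{"action": "pick", "target": d.label, "position": d.centroid}
            for d in relevant]

if __name__ == "__main__":
    instruction = "put the mug on the tray"
    detections = [
        Detection("mug", 0.92, (0.42, -0.10, 0.05)),
        Detection("tray", 0.88, (0.55, 0.20, 0.02)),
        Detection("bottle", 0.75, (0.30, 0.35, 0.07)),  # not mentioned -> filtered out
    ]
    print(to_subtask_goals(filter_by_instruction(detections, instruction)))
```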

Main Contribution

We propose a hierarchical multimodal planning framework for long-term robotic manipulation that combines the strengths and overcomes the limitations of LLMs/VLMs and traditional task and motion planning (TAMP) methods. The main contributions of our work are summarized as follows:

  • We integrate natural language instructions and open-set visual detection with RGB-D geometry to produce a list of sub-task goals via a VLM, including instance-level prototype plans obtained through language-conditioned filtering.
  • A prompting-and-compilation scheme is designed to convert LLM sub-task decompositions into a behavior tree (BT) model with explicit controller–status pairs, force conditions, and geometric feasibility checks (a minimal sketch follows this list).
  • The proposed framework is evaluated through real-world experiments on three representative long-term manipulation tasks, providing empirical validation of its effectiveness.
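
As referenced in the second bullet above, the following is a minimal sketch, under our own simplifying assumptions, of what such a prompting-to-compilation step could look like: each sub-task arrives as a simple dict (as in the earlier perception sketch) and is compiled into a BT branch that pairs a geometric feasibility check, a controller–status leaf, and a force condition. The node classes, blackboard keys (`workspace`, `robot`, `force_sensor`, `force_limit`), and the `robot.execute` interface are hypothetical stand-ins, not the paper's actual BT model.

```python
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Condition:
    """Leaf node that checks a predicate, e.g. geometric reachability or a force limit."""
    def __init__(self, name, predicate):
        self.name, self.predicate = name, predicate
    def tick(self, blackboard):
        return Status.SUCCESS if self.predicate(blackboard) else Status.FAILURE

class Action:
    """Leaf node pairing a low-level controller with the status it reports back."""
    def __init__(self, name, controller):
        self.name, self.controller = name, controller
    def tick(self, blackboard):
        return self.controller(blackboard)   # controller returns a Status

class Sequence:
    """Composite node: ticks children in order; stops at the first non-SUCCESS result."""
    def __init__(self, children):
        self.children = children
    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

def compile_subtask(subtask):
    """Compile one sub-task dict into a BT branch:
    geometric feasibility check -> controller -> force condition."""
    reachable = Condition(
        f"reachable({subtask['target']})",
        lambda bb, p=subtask["position"]: bb["workspace"].contains(p),
    )
    execute = Action(
        f"{subtask['action']}({subtask['target']})",
        lambda bb, s=subtask: bb["robot"].execute(s),   # hypothetical robot interface
    )
    force_ok = Condition(
        "grasp_force_in_range",
        lambda bb: bb["force_sensor"]() < bb["force_limit"],
    )
    return Sequence([reachable, execute, force_ok])

def compile_plan(subtasks):
    """Chain all sub-task branches under a root sequence and return the BT root."""
    return Sequence([compile_subtask(s) for s in subtasks])
```

In this sketch, executing a plan amounts to repeatedly ticking the root returned by `compile_plan` until it reports `SUCCESS` or `FAILURE`, with the blackboard supplying the robot, workspace model, and force feedback.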

Here we show a brief overview of the proposed hierarchical multimodal planning framework, with its VLM stack and BT model.

BibTeX (Coming soon)
