Towards Object-level Multimodal Task Planning for Long-term Robotic Manipulation with Vision Language Model and Behavior Tree

1Nankai University, 2The Hong Kong Polytechnic University, 3Imperial College London

Our proposed framework combines the Vision-Language Model (VLM) with the Behavior Tree (BT).

Abstract

Long-term robotic manipulation in open environments requires unifying multimodal understanding with reliable, geometry-aware execution. Classical robotic motion planning approaches demand extensive domain modeling and hand-crafted goal specifications, while emerging LLM/VLM pipelines propose semantically plausible plans yet lack feasibility guarantees and executable grounding. To address these limitations, we propose a hierarchical multimodal planning framework that combines VLM-based multimodal perception with behavior tree (BT) planning to bridge high-level semantic reasoning and low-level execution feasibility. Our framework integrates natural language instructions with open-set visual geometry to generate object-level representations and language-conditioned prototype plans. A prompting-to-compilation scheme then converts these plans into BT plans with explicit controller–status pairs, force conditions, and geometric feasibility checks. The proposed framework is validated through real-world experiments on three long-term robotic manipulation tasks, showing higher task success than VLM-only and BT-only baselines and demonstrating robust, fully autonomous execution without human intervention.
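As a concrete illustration of the object-level representation and language-conditioned prototype plans described above, the following Python sketch shows one possible data layout. All class names, fields, and the string-matching filter are illustrative assumptions, not the framework's actual interface (the real pipeline relies on VLM reasoning over open-set detections and RGB-D geometry).

```python
from dataclasses import dataclass, field

# Hypothetical object-level representation: each detected instance carries its
# open-set label, a 3D pose recovered from RGB-D geometry, and a bounding box.
@dataclass
class ObjectInstance:
    label: str        # open-set category from the visual detector
    pose_xyz: tuple   # (x, y, z) centroid in the robot frame
    bbox: tuple       # (w, h, d) extents from depth back-projection

# Hypothetical prototype plan element: a sub-task goal grounded to one instance
# selected by language-conditioned filtering.
@dataclass
class SubTaskGoal:
    action: str       # e.g. "pick", "place", "open"
    target: ObjectInstance
    params: dict = field(default_factory=dict)

def filter_by_instruction(instances, instruction):
    """Toy language-conditioned filter: keep instances whose label appears in
    the instruction (a stand-in for VLM-based grounding)."""
    return [obj for obj in instances if obj.label in instruction.lower()]

# Example usage with made-up detections.
scene = [
    ObjectInstance("mug", (0.42, -0.10, 0.03), (0.08, 0.08, 0.10)),
    ObjectInstance("book", (0.55, 0.20, 0.02), (0.15, 0.21, 0.03)),
]
relevant = filter_by_instruction(scene, "Put the mug on the shelf")
plan = [SubTaskGoal("pick", relevant[0]),
        SubTaskGoal("place", relevant[0], {"where": "shelf"})]
```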

Main Contribution

We propose a hierarchical multimodal planning framework for long-term robotic manipulation that combines the strengths and overcomes the limitations of LLMs/VLMs and traditional task and motion planning (TAMP) methods. The main contributions of our work are summarized as follows:

  • We integrate natural language instructions and open-set visual detection with RGB-D geometry to produce a list of sub-task goals with a VLM, including instance-level prototype plans with language-conditioned filtering.
  • A prompting-and-compilation scheme is designed to convert LLM sub-task decompositions into a behavior tree (BT) model with explicit controller–status pairs, force conditions, and geometric feasibility checks (see the sketch after this list).
  • The efficacy of the proposed framework is evaluated through real-world experiments on three typical long-term manipulation tasks, providing empirical validation of its effectiveness.
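To make the prompting-to-compilation idea concrete, here is a minimal Python sketch of how an LLM sub-task decomposition could be compiled into a BT branch whose action leaf is an explicit controller–status pair guarded by geometric feasibility and force conditions. The node classes, thresholds, and controller are illustrative assumptions under simplified tick semantics, not the paper's implementation.

```python
# Minimal behavior-tree sketch (assumed semantics: Sequence ticks children in
# order and fails fast; Condition wraps a boolean check; ControllerStatus pairs
# a low-level controller with the status predicate that must hold afterwards).
class Sequence:
    def __init__(self, children): self.children = children
    def tick(self, state):
        return all(child.tick(state) for child in self.children)

class Condition:
    def __init__(self, name, check): self.name, self.check = name, check
    def tick(self, state): return self.check(state)

class ControllerStatus:
    def __init__(self, controller, status): self.controller, self.status = controller, status
    def tick(self, state):
        self.controller(state)      # run the low-level controller
        return self.status(state)   # verify its post-condition (status)

def compile_subtask(goal):
    """Compile one sub-task goal (e.g. from an LLM decomposition) into a BT
    branch: geometric feasibility check -> controller/status pair -> force check."""
    return Sequence([
        Condition("reachable", lambda s: s["target_dist"] < 0.8),   # geometric feasibility (m, assumed)
        ControllerStatus(lambda s: s.update(grasped=True),          # hypothetical grasp controller
                         lambda s: s["grasped"]),                   # status predicate
        Condition("force_ok", lambda s: s["grip_force"] < 5.0),     # force condition (N, assumed)
    ])

# Example tick with a toy world state.
state = {"target_dist": 0.4, "grip_force": 2.1}
print(compile_subtask({"action": "pick"}).tick(state))  # -> True
```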

Here we show a brief overview of the proposed hierarchical multimodal planning framework with the VLM stack and the BT model.

Real-World Experiments

We build a desktop-level task suite with several long-term robotic manipulation tasks to validate the proposed framework in real-world settings. Here we show the hardware and environment configuration used throughout the experiments.

We consider three representative long-term desktop-level tasks with varying object and primitive action sets. For each task, we randomize the initial relative poses of visible objects within the current field of view to test robustness and repeatability.

For each of the three tasks, we show the initial state (START), key frames during execution (KEY FRAME), and the final state (END).

Example Results Analysis

Our approach improves reliability by combining multimodal planning via the VLM stack with controller–status-verified BT execution, yielding robust long-term manipulation without manual oversight.

Press the button below to see more videos ⬇

BibTeX

@inproceedings{luo2026towards,
  title={Towards Object-Level Multimodal Task Planning for Long-Term Robotic Manipulation with Vision Language Model and Behavior Tree},
  author={Luo, Hanqian and Liu, Zezhi and Cao, Jiannong and Qi, Xiuxiu and McCann, Julie A and Fang, Yongchun},
  booktitle={ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={19677--19681},
  year={2026},
  organization={IEEE}
}