A General One-Shot Multimodal Active Perception Framework for Robotic Manipulation: Learning to Predict Optimal Viewpoint

1Nankai University, 2The Hong Kong Polytechnic University
Teaser figure: the "Focus-then-Execute" active perception paradigm.

Abstract

Active perception in vision-based robotic manipulation aims to move the camera toward more informative observation viewpoints, thereby providing high-quality perceptual inputs for downstream tasks. Most existing active perception methods rely on iterative optimization, leading to high time and motion costs, and are tightly coupled with task-specific objectives, which limits their transferability. In this paper, we propose a general one-shot multimodal active perception framework for robotic manipulation. The framework enables direct inference of optimal viewpoints and comprises a data collection pipeline and an optimal viewpoint prediction network. Specifically, the framework decouples viewpoint quality evaluation from the overall architecture, supporting heterogeneous task requirements. Optimal viewpoints are defined through systematic sampling and evaluation of candidate viewpoints, after which large-scale training datasets are constructed via domain randomization. Moreover, a multimodal optimal viewpoint prediction network is developed, leveraging cross-attention to align and fuse multimodal features and directly predict camera pose adjustments. The proposed framework is instantiated in robotic grasping under viewpoint-constrained environments. Experimental results demonstrate that active perception guided by the framework significantly improves grasp success rates. Notably, real-world evaluations achieve nearly double the grasp success rate and enable seamless sim-to-real transfer without additional fine-tuning, demonstrating the effectiveness of the proposed framework.

Main Contribution

We propose a data-driven multimodal active perception framework that directly predicts the optimal observation viewpoint, enabling improved perception with only a single relook and generalizing across different tasks (a minimal deployment sketch is given after the contribution list below). The main contributions of our work are summarized as follows:

  • A general one-shot multimodal active perception framework is proposed, comprising a data collection pipeline and an optimal viewpoint prediction network. This framework enables the unified modeling of diverse task requirements, thereby extending its applicability to a broader range of task scenarios.
  • An optimal observation viewpoint data collection pipeline is established, in which optimal viewpoints are defined through task-specific viewpoint quality evaluation functions, and large-scale datasets are constructed via domain randomization.
  • An optimal observation viewpoint prediction network is developed. Utilizing the cross-attention mechanism, this network aligns and fuses multimodal features to predict the required camera pose adjustment.
  • The proposed framework is instantiated in robotic grasping under viewpoint-constrained environments, where data collection and network training are conducted, and its effectiveness and robustness are validated through extensive simulation and real-world experiments.
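To make the one-shot pipeline concrete, the following Python sketch shows how a trained viewpoint-prediction model could be deployed in a "Focus-then-Execute" loop. The model handle `mvpnet` and the robot/camera helpers (`observe`, `move_camera_to`, `grasp_from_observation`) are hypothetical placeholders used only to illustrate the control flow, not released code.

```python
# A minimal "Focus-then-Execute" deployment sketch. `mvpnet`, `robot.observe`,
# `robot.move_camera_to`, and `robot.grasp_from_observation` are hypothetical
# placeholders illustrating the one-shot control flow.

def focus_then_execute(mvpnet, robot, target_description):
    # Focus: observe once from the current (possibly constrained) viewpoint.
    rgb, depth, cam_pose = robot.observe()

    # One-shot prediction: a single forward pass yields the camera pose
    # adjustment; no iterative viewpoint optimization is performed.
    delta_pose = mvpnet.predict(rgb, depth, target_description)  # 4x4 transform (assumed)

    # Move the camera to the predicted optimal viewpoint and re-observe once.
    robot.move_camera_to(cam_pose @ delta_pose)
    rgb, depth, _ = robot.observe()

    # Execute: hand the improved observation to the downstream grasping task.
    return robot.grasp_from_observation(rgb, depth, target_description)
```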

Framework Overview

Overall framework of the proposed method, illustrated with robotic grasping in viewpoint-constrained environments: (a) sampling and evaluating candidate viewpoints to obtain the optimal viewpoint for each object, followed by dataset construction via domain randomization; (b) training the MVPNet based on the constructed dataset; and (c) deploying the trained network and conducting comparative evaluations.
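As a rough illustration of step (a), the sketch below shows how candidate viewpoints might be sampled on a view hemisphere around the target and scored with a task-specific quality function. The simulator hook `scene.render` and the scoring hook `quality_fn` are assumed interfaces, not the paper's actual implementation.

```python
import numpy as np

def sample_viewpoints(center, radius, n=64, rng=None):
    """Uniformly sample candidate camera positions on the upper view hemisphere."""
    rng = rng or np.random.default_rng()
    phi = rng.uniform(0.0, 2.0 * np.pi, n)          # azimuth
    theta = np.arccos(rng.uniform(0.0, 1.0, n))     # polar angle, upper hemisphere
    offsets = radius * np.stack([np.sin(theta) * np.cos(phi),
                                 np.sin(theta) * np.sin(phi),
                                 np.cos(theta)], axis=1)
    return center + offsets                          # (n, 3) candidate positions

def select_optimal_viewpoint(scene, target, candidates, quality_fn):
    """Render each candidate view and keep the one that maximizes the
    task-specific viewpoint quality function."""
    scores = []
    for cam_pos in candidates:
        obs = scene.render(cam_pos, look_at=target)   # hypothetical simulator call
        scores.append(quality_fn(obs, target))
    return candidates[int(np.argmax(scores))]
```

In the full pipeline, each initial observation paired with its selected optimal viewpoint would form one training sample, and domain randomization (e.g., over object poses, lighting, and clutter) is what scales this into a large dataset.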

Network Architecture

First, the current observation is obtained and preprocessed together with the natural language description of the target object. Subsequently, modality-specific encoders are employed to extract features, which are then aligned and fused using a Transformer. Finally, an MLP maps the fused representation to the camera pose adjustments.
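The sketch below shows one plausible PyTorch realization of this design: an RGB-D encoder and a text encoder produce modality-specific tokens, a cross-attention layer lets the language tokens attend to the visual tokens, and an MLP regresses the camera pose adjustment. All dimensions, the vocabulary size, and the encoder backbones are illustrative assumptions; the paper's MVPNet may differ.

```python
import torch
import torch.nn as nn

class MVPNetSketch(nn.Module):
    """Minimal sketch of a multimodal viewpoint-prediction network:
    modality-specific encoders, cross-attention fusion, and an MLP head."""

    def __init__(self, d_model=256, n_heads=8, pose_dim=6):
        super().__init__()
        # Visual encoder: 4-channel RGB-D observation -> grid of visual tokens.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(4, 64, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Language encoder stand-in: embeddings + small Transformer encoder
        # (vocabulary size assumed, BERT-like tokenizer).
        self.token_embed = nn.Embedding(30522, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2)
        # Cross-attention: language tokens query the visual tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # MLP head mapping the fused representation to the pose adjustment
        # (pose_dim=6 assumes a 6-DoF adjustment).
        self.head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(),
                                  nn.Linear(128, pose_dim))

    def forward(self, rgbd, token_ids):
        vis = self.visual_encoder(rgbd)                       # (B, C, H', W')
        vis = vis.flatten(2).transpose(1, 2)                  # (B, H'*W', C)
        txt = self.text_encoder(self.token_embed(token_ids))  # (B, L, C)
        fused, _ = self.cross_attn(query=txt, key=vis, value=vis)
        return self.head(fused.mean(dim=1))                   # (B, pose_dim)
```

Using the target description as the attention query is one natural way to ground the predicted pose adjustment in the object the language refers to.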

Synthetic Dataset Construction

Defining the Optimal Observation Viewpoint

Data Collection

Simulated Experiments

Real Robot Experiments

BibTeX (Coming soon)
