Key Contributions:
Articulated object manipulation is essential for real-world robotic tasks, yet generalizing across diverse objects remains challenging. The key is understanding functional parts (e.g., handles, knobs), which indicate where and how to manipulate objects across categories and shapes.
Previous approaches rely on 2D foundation features that face critical limitations when lifted to 3D: long runtimes, multi-view inconsistencies, and low spatial resolution with insufficient geometric information.
We propose Part-Aware 3D Feature Field (PA3FF), a novel dense 3D representation with part awareness for generalizable manipulation. PA3FF is trained via contrastive learning on 3D part proposals from large-scale datasets. Given point clouds as input, it predicts continuous 3D feature fields in a feedforward manner, where feature proximity reflects functional part relationships.
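A minimal sketch of the part-level contrastive objective described above, assuming a generic point-cloud backbone that outputs per-point features; the function name, the InfoNCE-style prototype formulation, and the hyperparameters are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def part_contrastive_loss(point_feats, part_ids, temperature=0.07):
    """Pull each point's feature toward its own part's prototype and away
    from other parts' prototypes (an InfoNCE-style stand-in for the paper's
    contrastive objective over 3D part proposals).

    point_feats: (N, D) per-point features from a feedforward encoder.
    part_ids:    (N,)   integer part-proposal label per point.
    """
    point_feats = F.normalize(point_feats, dim=-1)
    parts = part_ids.unique()  # sorted unique part labels
    # Prototype = mean feature of each part proposal.
    protos = torch.stack(
        [point_feats[part_ids == p].mean(dim=0) for p in parts]
    )
    protos = F.normalize(protos, dim=-1)
    logits = point_feats @ protos.t() / temperature   # (N, P) similarities
    targets = torch.searchsorted(parts, part_ids)     # each point's own part
    return F.cross_entropy(logits, targets)

# Toy usage with random stand-ins for the encoder output and part proposals.
N, D = 1024, 64
feats = torch.randn(N, D, requires_grad=True)
labels = torch.randint(0, 8, (N,))
part_contrastive_loss(feats, labels).backward()
```

Under this objective, feature proximity directly encodes part membership: points on the same functional part cluster together in feature space, regardless of object category.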
Building on PA3FF, we introduce the Part-Aware Diffusion Policy (PADP) for improved sample efficiency and generalization. PADP significantly outperforms policies built on existing 2D and 3D representations (e.g., CLIP, DINOv2, Grounded-SAM), achieving state-of-the-art performance on both simulated and real-world tasks.
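A hypothetical sketch of how PA3FF features could condition a diffusion policy in the spirit of DP3: per-point features are concatenated with XYZ coordinates and pooled into a compact observation embedding that conditions the action denoiser. The class name `PointEncoder` and all dimensions are assumptions for illustration, not the paper's API:

```python
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Pool (XYZ + part-aware feature) points into one observation vector."""
    def __init__(self, feat_dim, obs_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, 256), nn.ReLU(),
            nn.Linear(256, obs_dim),
        )

    def forward(self, xyz, feats):                     # (B, N, 3), (B, N, D)
        x = self.mlp(torch.cat([xyz, feats], dim=-1))  # (B, N, obs_dim)
        return x.max(dim=1).values                     # permutation-invariant pool

enc = PointEncoder(feat_dim=64)
xyz = torch.randn(2, 1024, 3)     # observed point cloud
feats = torch.randn(2, 1024, 64)  # stand-in for PA3FF per-point features
cond = enc(xyz, feats)            # (2, 256) conditioning vector for a
                                  # DDPM-style action denoiser, as in DP3
```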
Technical Innovation: Unlike methods that lift 2D features to 3D via multi-view fusion, which suffer from inconsistencies and limited resolution, PA3FF is 3D-native and predicts features in a single feedforward pass. This enables: (a) efficient inference, (b) consistent 3D feature fields, and (c) dense per-point features with accurate geometric cues.
Why 2D Feature Lifting Fails: Multi-view feature lifting suffers from: (1) inconsistent visibility across views, (2) missing thin or small parts in 2D renders, (3) low spatial resolution from patch-based processing (DINOv2's 14×14 patches shrink the feature map 14× per axis), and (4) computationally expensive fusion. See paper Appendix A.1 and Figure 7 for a detailed analysis.
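A back-of-envelope illustration of limitation (3), the patch-induced resolution gap; the input size of 518 is just a typical DINOv2 resolution chosen as a multiple of the 14-pixel patch size:

```python
# DINOv2 produces one feature per 14x14 pixel patch, so a 518x518 render
# yields only a 37x37 feature grid -- 1,369 features for 268,324 pixels.
H = W = 518
patch = 14
grid_h, grid_w = H // patch, W // patch
print((grid_h, grid_w), grid_h * grid_w, H * W)   # (37, 37) 1369 268324
```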
Manipulation Tasks:
Pull lid of pot
Open drawer
Close box
Close lid of laptop
Open microwave
Open bottle
Put lid on kettle
Press dispenser
Five-Level Generalization Protocol:
Key Findings (Table 5 in paper): Feature refinement via contrastive learning provides the largest performance gain (+16%, from 46% to 62%), demonstrating that part-aware learning is critical for manipulation. Simply combining Sonata with DP3 yields only a modest improvement (+2% over the DP3 baseline), confirming that our algorithmic contributions are essential.