Learning Part-Aware Dense 3D Feature Field for Generalizable Articulated Object Manipulation

Anonymous Authors
Paper under review at ICLR 2026

Overview


PA3FF Overview
PA3FF Framework: We propose a feedforward model that predicts part-aware 3D feature fields, enabling generalizable manipulation across unseen objects. Our Part-Aware Diffusion Policy (PADP) achieves significant performance improvements, with only a 6.25% performance drop when moving from seen to unseen objects. PA3FF features are consistent across shapes, enabling downstream applications including correspondence learning and segmentation.

Key Contributions:

  • We introduce PA3FF, a 3D-native representation that encodes dense, semantic, and functional part-aware features directly from point clouds
  • We develop PADP, a diffusion policy that leverages PA3FF for generalizable manipulation with strong sample efficiency
  • PA3FF further enables diverse downstream applications, including correspondence learning and segmentation, making it a versatile foundation for robotic manipulation
  • We validate our approach on 16 PartInstruct tasks and 8 real-world tasks, where it significantly outperforms prior 2D and 3D representations (CLIP, DINOv2, and Grounded-SAM), yielding 15% and 16.5% gains in success rate, respectively

Abstract


Articulated object manipulation is essential for real-world robotic tasks, yet generalizing across diverse objects remains challenging. The key lies in understanding functional parts (e.g., handles, knobs) that indicate where and how to manipulate across diverse categories and shapes.

Previous approaches using 2D foundation features face critical limitations when lifted to 3D: long runtimes, multi-view inconsistencies, and low spatial resolution with insufficient geometric information.

We propose Part-Aware 3D Feature Field (PA3FF), a novel dense 3D representation with part awareness for generalizable manipulation. PA3FF is trained via contrastive learning on 3D part proposals from large-scale datasets. Given point clouds as input, it predicts continuous 3D feature fields in a feedforward manner, where feature proximity reflects functional part relationships.

Building on PA3FF, we introduce Part-Aware Diffusion Policy (PADP) for enhanced sample efficiency and generalization. PADP significantly outperforms existing 2D and 3D representations (CLIP, DINOv2, Grounded-SAM), achieving state-of-the-art performance on both simulated and real-world tasks.

Video


Method: Part-Aware 3D Feature Field


Method Pipeline
Three-Stage Training Framework: Stage I - Leverage 3D geometric priors from large-scale datasets through self-distillation using PointTransformer V3. Stage II - Learn part-aware dense 3D feature fields via contrastive learning to enhance part-level consistency and distinctiveness. Stage III - Integrate refined features into a diffusion policy for generalizable action generation in robotic manipulation tasks. See paper Section 3 for technical details.
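
To make Stage II concrete, below is a minimal sketch of a part-level contrastive (InfoNCE) objective, assuming per-point features and part-proposal labels are already available; the mean-pooling and loss form are illustrative choices, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def part_contrastive_loss(point_feats, part_ids, temperature=0.07):
        # point_feats: (N, D) per-point features from the 3D backbone
        # part_ids:    (N,) integer part-proposal label for each point
        parts = part_ids.unique()                      # sorted unique part ids
        # Mean-pool point features into one embedding per part proposal.
        part_embs = torch.stack(
            [point_feats[part_ids == p].mean(dim=0) for p in parts]
        )
        part_embs = F.normalize(part_embs, dim=-1)     # (P, D)
        feats = F.normalize(point_feats, dim=-1)       # (N, D)
        # Each point's positive is its own part embedding; all other
        # parts in the batch act as negatives (InfoNCE over parts).
        logits = feats @ part_embs.t() / temperature   # (N, P)
        targets = torch.searchsorted(parts, part_ids)  # map ids -> 0..P-1
        return F.cross_entropy(logits, targets)

    # Toy usage: 1024 points, 64-dim features, 8 part proposals.
    loss = part_contrastive_loss(torch.randn(1024, 64),
                                 torch.randint(0, 8, (1024,)))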

Technical Innovation: Unlike methods that lift 2D features to 3D via multi-view fusion (suffering from inconsistencies and limited resolution), PA3FF is 3D-native and predicts features in a single feedforward pass. This enables: (a) efficient inference, (b) consistent 3D feature fields, and (c) dense per-point features with accurate geometric cues.
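
The single feedforward pass can be illustrated as follows; PA3FFEncoder and its per-point MLP are placeholders standing in for the PointTransformer V3-based backbone, not a released API.

    import torch

    # Hypothetical interface: the real backbone is PointTransformer V3-based;
    # the names and per-point MLP below are stand-ins, not the released code.
    class PA3FFEncoder(torch.nn.Module):
        def __init__(self, feat_dim=256):
            super().__init__()
            self.mlp = torch.nn.Sequential(
                torch.nn.Linear(3, 128), torch.nn.ReLU(),
                torch.nn.Linear(128, feat_dim),
            )

        def forward(self, xyz):           # xyz: (N, 3) point cloud
            return self.mlp(xyz)          # (N, feat_dim) dense feature field

    encoder = PA3FFEncoder().eval()
    points = torch.rand(4096, 3)          # one observed point cloud
    with torch.no_grad():
        feats = encoder(points)           # single pass: no rendering, no fusion
    print(feats.shape)                    # torch.Size([4096, 256])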

Feature Visualization Comparison


Feature Comparison
Qualitative Comparison: PA3FF generates smoother, less noisy feature fields compared to 2D methods (DINOv2, SigLIP), with better highlighting of key functional parts. Compared to Sonata (3D baseline), PA3FF provides more semantically meaningful and discriminative part-level representations. Note how 2D methods struggle with thin parts (e.g., refrigerator handles) and exhibit multi-view inconsistencies (e.g., faucet features).

Why 2D Feature Lifting Fails: Multi-view feature lifting suffers from: (1) inconsistent visibility across views, (2) missing thin/small parts in 2D renders, (3) low spatial resolution from patch-based processing (DINOv2's 14-pixel patches yield feature maps 14× coarser than the input image), and (4) computationally expensive fusion. See paper Appendix A.1 and Figure 7 for detailed analysis.
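
A quick back-of-the-envelope for point (3): with DINOv2's 14-pixel patches, a 518×518 render (an illustrative resolution) produces only a 37×37 grid of feature vectors, so any part thinner than roughly 14 pixels falls below the feature grid.

    # Patch-token resolution for a ViT with 14-pixel patches (DINOv2).
    image_size = 518                         # illustrative render resolution
    patch_size = 14                          # DINOv2 patch size
    tokens_per_side = image_size // patch_size
    print(tokens_per_side)                   # 37 -> a 37x37 feature grid
    print(image_size / tokens_per_side)      # 14.0 input pixels per feature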

Real-World Task Evaluation


Task Illustrations
Eight Diverse Real-World Tasks: Evaluation covers pulling lid of pot, opening drawer, closing box, closing laptop lid, opening microwave, opening bottle, putting lid on kettle, and pressing dispenser. These tasks require precise part-level interactions across varied manipulation scenarios with different object categories and functional parts.

Task Execution Videos

Pulling lid of pot

Opening drawer

Closing box

Closing laptop lid

Opening microwave

Opening bottle

Putting lid on kettle

Pressing dispenser

Experimental Results


58.75%
PADP Success Rate (Unseen)
35%
Best Baseline (Unseen)
+23.75%
Absolute Improvement
Real-World Results
Real-World Task Success Rates (Table 2 in paper): PADP significantly outperforms baselines across all eight tasks. Mean success rate on unseen test objects: PADP 58.75% vs. GenDP (best baseline) 35%, representing a 67.9% relative improvement. Each method evaluated with 10 trials per task under randomized initial conditions. Only 30 demonstrations per task were provided for training.
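
The relative-improvement figure follows directly from the two mean success rates:

    padp, baseline = 58.75, 35.0
    print(padp - baseline)               # 23.75 absolute percentage points
    print((padp / baseline - 1) * 100)   # ~67.9% relative improvement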
Simulation Results
PartInstruct Benchmark Results (Table 1 in paper): Performance across five generalization test sets - Object States (OS), Object Instances (OI), Task Parts (TP), Task Categories (TC), and Object Categories (OC). PADP achieves 28.79% average success rate, outperforming GenDP (19.36%) by 9.43 percentage points. Notably, PADP shows strong generalization to novel object categories (26.67% vs. 14.61%).

Five-Level Generalization Protocol:

  • Test 1 (OS): Novel object positions and rotations - 36.76% success
  • Test 2 (OI): Novel object instances within same category - 34.33% success
  • Test 3 (TP): Novel part combinations in same task type - 32.45% success
  • Test 4 (TC): Novel task categories - 13.75% success
  • Test 5 (OC): Novel object categories - 26.67% success

Component Analysis

62%
Full PADP Method
46%
w/o Feature Refinement
39%
Sonata + DP3

Key Findings (Table 5 in paper): Feature refinement via contrastive learning provides the largest performance gain (+16% from 46% to 62%), demonstrating that part-aware learning is critical for manipulation. Simply combining Sonata with DP3 yields only modest improvement (+2% over DP3 baseline), confirming that our algorithmic contributions are essential.

Downstream Applications


Downstream Applications
3D Shape Correspondences and Part Segmentation: PA3FF enables precise cross-shape correspondences using Functional Maps, even for shapes with significant topology/pose differences. The learned part hierarchy allows accurate segmentation via agglomerative clustering. PA3FF exhibits superior consistency compared to DINOv2 features.
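
As a hedged sketch of the clustering-based segmentation (the hyperparameters here are illustrative, not the paper's settings), per-point PA3FF features can be grouped into parts with off-the-shelf agglomerative clustering:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def segment_parts(point_feats, distance_threshold=0.5):
        # point_feats: (N, D) array of per-point PA3FF features.
        # Returns an (N,) array of part labels.
        clusterer = AgglomerativeClustering(
            n_clusters=None,                      # let the threshold decide
            distance_threshold=distance_threshold,
            linkage="average",
            metric="cosine",                      # compare features by angle
        )
        return clusterer.fit_predict(point_feats)

    # Toy usage: two well-separated feature clusters -> two parts.
    feats = np.vstack([np.random.randn(100, 32) + 5,
                       np.random.randn(100, 32) - 5])
    print(np.unique(segment_parts(feats)))        # [0 1]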
Semantic Heatmaps
Instruction-Conditioned Feature Attention: Since PA3FF features contain semantic information, computing similarity between different task instructions and point features allows focusing on task-relevant object parts. Heatmap visualization shows cosine similarity between encoded instructions and features, demonstrating the semantic richness of PA3FF representations.
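
A minimal sketch of the heatmap computation, assuming an instruction embedding already mapped into the same space as the point features (the text-encoder interface is not specified here, so text_emb is a placeholder):

    import torch
    import torch.nn.functional as F

    def instruction_heatmap(point_feats, text_emb):
        # point_feats: (N, D) per-point PA3FF features.
        # text_emb:    (D,) instruction embedding in the same space (assumed).
        # Returns (N,) cosine similarities in [-1, 1], one heat value per point.
        feats = F.normalize(point_feats, dim=-1)
        text = F.normalize(text_emb, dim=-1)
        return feats @ text

    # Toy usage with random tensors standing in for real features.
    heat = instruction_heatmap(torch.randn(2048, 256), torch.randn(256))
    print(heat.shape)                     # torch.Size([2048])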

Quantitative Segmentation Results

Segmentation Results
PartNetE Dataset Segmentation (Table 4 in paper): Category-wise mAP50 scores (%) across different object categories. PA3FF achieves 70.6% average mAP50, substantially outperforming PartSLIP (63.4%) and PartSLIP++ (62.6%). Particularly strong performance on bottles (94.6% vs. 78.5%), displays (86.5% vs. 74.1%), and storage furniture (49.6% vs. 36.7%).

Limitations of 2D Feature Lifting


Feature Lifting Limitations
Challenges in 2D-to-3D Feature Lifting: Although 3D priors enhance generalization, naively lifting 2D features introduces significant problems: (1) Multi-view inconsistency - features from frozen 2D networks have inconsistent visibility across views, (2) Missing thin parts - rendered 2D images fail to capture thin/small functional parts like handles or buttons due to limited resolution, (3) Computational cost - multi-view fusion is expensive and slow. PA3FF addresses these by being 3D-native from the start.