Key Contributions:
Articulated object manipulation is essential for real-world robotic tasks, yet generalizing across diverse objects remains challenging. The key is understanding functional parts (e.g., handles, knobs), which indicate where and how to manipulate objects across categories and shapes.
Previous approaches rely on 2D foundation features that face critical limitations when lifted to 3D: long runtimes, multi-view inconsistencies, and low spatial resolution with insufficient geometric information.
We propose Part-Aware 3D Feature Field (PA3FF), a novel dense 3D representation with part awareness for generalizable manipulation. PA3FF is trained via contrastive learning on 3D part proposals from large-scale datasets. Given point clouds as input, it predicts continuous 3D feature fields in a feedforward manner, where feature proximity reflects functional part relationships.
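A minimal sketch of the part-level contrastive objective described above, assuming a generic point-cloud backbone that outputs per-point features; the function name, the InfoNCE-style prototype formulation, and the hyperparameters are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def part_contrastive_loss(point_feats, part_ids, temperature=0.07):
    """Pull each point's feature toward its own part's prototype and away
    from other parts' prototypes (an InfoNCE-style stand-in for the paper's
    contrastive objective over 3D part proposals).

    point_feats: (N, D) per-point features from a feedforward encoder.
    part_ids:    (N,)   integer part-proposal label per point.
    """
    point_feats = F.normalize(point_feats, dim=-1)
    parts = part_ids.unique()  # sorted unique part labels
    # Prototype = mean feature of each part proposal.
    protos = torch.stack(
        [point_feats[part_ids == p].mean(dim=0) for p in parts]
    )
    protos = F.normalize(protos, dim=-1)
    logits = point_feats @ protos.t() / temperature   # (N, P) similarities
    targets = torch.searchsorted(parts, part_ids)     # each point's own part
    return F.cross_entropy(logits, targets)

# Toy usage with random stand-ins for the encoder output and part proposals.
N, D = 1024, 64
feats = torch.randn(N, D, requires_grad=True)
labels = torch.randint(0, 8, (N,))
part_contrastive_loss(feats, labels).backward()
```

Under this objective, feature proximity directly encodes part membership: points on the same functional part cluster together in feature space, regardless of object category.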
Building on PA3FF, we introduce the Part-Aware Diffusion Policy (PADP) for improved sample efficiency and generalization. PADP significantly outperforms policies built on existing 2D and 3D representations (e.g., CLIP, DINOv2, Grounded-SAM), achieving state-of-the-art performance on both simulated and real-world tasks.
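A hypothetical sketch of how PA3FF features could condition a diffusion policy in the spirit of DP3: per-point features are concatenated with XYZ coordinates and pooled into a compact observation embedding that conditions the action denoiser. The class name `PointEncoder` and all dimensions are assumptions for illustration, not the paper's API:

```python
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Pool (XYZ + part-aware feature) points into one observation vector."""
    def __init__(self, feat_dim, obs_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, 256), nn.ReLU(),
            nn.Linear(256, obs_dim),
        )

    def forward(self, xyz, feats):                     # (B, N, 3), (B, N, D)
        x = self.mlp(torch.cat([xyz, feats], dim=-1))  # (B, N, obs_dim)
        return x.max(dim=1).values                     # permutation-invariant pool

enc = PointEncoder(feat_dim=64)
xyz = torch.randn(2, 1024, 3)     # observed point cloud
feats = torch.randn(2, 1024, 64)  # stand-in for PA3FF per-point features
cond = enc(xyz, feats)            # (2, 256) conditioning vector for a
                                  # DDPM-style action denoiser, as in DP3
```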
Technical Innovation: Unlike methods that lift 2D features to 3D via multi-view fusion, which suffer from inconsistencies and limited resolution, PA3FF is 3D-native and predicts features in a single feedforward pass. This enables: (a) efficient inference, (b) consistent 3D feature fields, and (c) dense per-point features with accurate geometric cues.
Why 2D Feature Lifting Fails: Multi-view feature lifting suffers from: (1) inconsistent visibility across views, (2) missing thin or small parts in 2D renders, (3) low spatial resolution from patch-based processing (DINOv2's 14×14 patches shrink the feature map 14× per axis), and (4) computationally expensive fusion. See paper Appendix A.1 and Figure 7 for a detailed analysis.
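A back-of-envelope illustration of limitation (3), the patch-induced resolution gap; the input size of 518 is just a typical DINOv2 resolution chosen as a multiple of the 14-pixel patch size:

```python
# DINOv2 produces one feature per 14x14 pixel patch, so a 518x518 render
# yields only a 37x37 feature grid -- 1,369 features for 268,324 pixels.
H = W = 518
patch = 14
grid_h, grid_w = H // patch, W // patch
print((grid_h, grid_w), grid_h * grid_w, H * W)   # (37, 37) 1369 268324
```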
Manipulation Tasks:
Pull lid of pot
Open drawer
Close box
Close lid of laptop
Open microwave
Open bottle
Put lid on kettle
Press dispenser
Five-Level Generalization Protocol:
Key Findings (Table 5 in paper): Feature refinement via contrastive learning provides the largest performance gain (+16%, from 46% to 62%), demonstrating that part-aware learning is critical for manipulation. Simply combining Sonata with DP3 yields only a modest improvement (+2% over the DP3 baseline), confirming that our algorithmic contributions are essential.