May 14, 26
Slide overview
DL Reading Group materials
Geometry-aware 4D Video Generation for Robot Manipulation
Shu MORIKUNI, Matsuo-Iwasawa Lab
Bibliography Information
Title: Geometry-aware 4D Video Generation for Robot Manipulation
Authors: Zeyi Liu1∗, Shuang Li1, Eric Cousineau2, Siyuan Feng2, Benjamin Burchfiel2, Shuran Song1
Affiliations: 1 Stanford University, 2 Toyota Research Institute
Publication: ICLR 2026
arXiv: https://arxiv.org/abs/2507.01099
Project Page: https://robot4dgen.github.io/
GitHub: https://github.com/lzylucy/4dgen
Summary
1. Proposed a 4D video model that generates view-consistent RGB-D sequences from novel viewpoints via cross-view pointmap alignment during training.
2. Used an off-the-shelf 6-DoF tracker to extract robot trajectories from the generated videos, enabling manipulation policies that generalize to unseen viewpoints.
Quick Demo 1: Generated Consistent RGB-D Video
Task: Store Cereal Box Under the Shelf
[Video panels: Generated Unseen View, Ground Truth]
Quick Demo 2: Downstream Policy Rollout From Novel View
[Video panels: MV-DP, 3D-DP, Proposed]
Motivation
● Robots must anticipate scene dynamics such as object motion, occlusion, and contact responses.
● Even small viewpoint changes can significantly degrade manipulation policy performance.
● Collecting large-scale real-world robotic data is expensive.
Video generation models offer a way to learn this "intuitive physics": they generate plausible futures that a robot can plan against.
The Challenge & The Goal
Why is it difficult to leverage existing video generation models for robotics?
● Pixel-based video models (e.g., SVD) handle motion well but lack 3D structure, causing flickering, deformation, or object disappearance.
● 3D-aware methods enforce geometry but are limited to simple, static, single-object scenes.
The Goal: Maintain temporal coherence and 3D consistency across viewpoints for dynamic manipulation scenes.
Video Generation with Pose Tracking to Robot Execution
Pipeline: RGB-D observation → predicted future pointmaps & RGB videos → 6-DoF end-effector (EEF) pose tracking → downstream execution
4D Pointmaps Prediction
DUSt3R is adapted and used here. The trick is to train a diffusion model to jointly predict:
● A pointmap from the reference view Vn, expressed in Vn's own frame.
● A pointmap from a second view Vm, re-expressed in Vn's frame.
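To make "re-expressed in Vn's frame" concrete, here is a minimal geometric sketch. It assumes each camera has a known 4x4 camera-to-world extrinsic matrix; the names `T_world_from_n` and `T_world_from_m` and all helper functions are hypothetical and are not taken from the paper's code. A point observed in Vm is mapped into Vn's frame by first lifting it to the world frame and then applying the inverse of Vn's extrinsics.

```python
def invert_rigid(T):
    """Invert a 4x4 rigid transform [R | t; 0 0 0 1] given as nested lists."""
    R = [row[:3] for row in T[:3]]
    t = [row[3] for row in T[:3]]
    # The inverse rotation is the transpose; the inverse translation is -R^T t.
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def compose(A, B):
    """Matrix product A @ B for 4x4 nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def apply_transform(T, p):
    """Apply a 4x4 rigid transform to a 3D point."""
    return [sum(T[i][k] * p[k] for k in range(3)) + T[i][3] for i in range(3)]


def reexpress_pointmap(pointmap_m, T_world_from_m, T_world_from_n):
    """Map an H x W grid of 3D points from view m's frame into view n's frame.

    Assumption: both extrinsics are camera-to-world transforms.
    """
    T_n_from_m = compose(invert_rigid(T_world_from_n), T_world_from_m)
    return [[apply_transform(T_n_from_m, p) for p in row] for row in pointmap_m]
```

During training, a target built this way for Vm lives in the same coordinate frame as Vn's native pointmap, which is what lets the two predictions be compared directly.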
SVD Architecture - Inputs & Encoders
● Backbone: Stable Video Diffusion.
● RGB VAE: the frozen SVD VAE encodes each RGB frame to a latent.
● Pointmap VAE: a separate VAE, initialized from SVD's VAE and then fine-tuned on pointmap data.
● The two latents are concatenated along the channel axis, giving the U-Net a unified RGB + pointmap latent.
SVD Architecture - Decoders & Cross-attention
● Two U-Net decoders with identical architecture but independent weights.
● After each decoder block in Vm's branch, a cross-attention layer is inserted. This is the channel through which geometric information flows from the reference branch into the novel-view branch.
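The information flow described above can be illustrated with a toy cross-attention layer. This is a minimal sketch under stated assumptions, not the paper's layer: it assumes queries come from the novel-view (Vm) branch and keys/values from the reference (Vn) branch, uses a single head, omits the learned projection matrices, and adds the result back through a residual connection. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(novel_tokens, reference_tokens):
    """Single-head cross-attention without learned projections.

    Queries come from the novel-view branch; keys and values come from the
    reference branch (an assumed assignment). Returns the novel tokens plus
    the attended reference features (residual connection).
    """
    dim = len(novel_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in novel_tokens:
        # Scaled dot-product similarity between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in reference_tokens]
        weights = softmax(scores)
        # Weighted average of the reference tokens, one feature at a time.
        attended = [sum(w * v[d] for w, v in zip(weights, reference_tokens))
                    for d in range(dim)]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out
```

The output has one token per novel-view token, so the layer can be dropped in after a decoder block without changing feature shapes.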
Three Training Objectives + One Trick
● RGB diffusion loss on both views (standard SVD objective).
● 3D diffusion loss applied to both the native pointmap and the re-projected pointmap.
● 1{w_g(t′)=1} is an indicator function, which is always set to 1, effectively doubling the contribution of loss terms at gripper regions.
● Because downstream gripper-pose tracking is critical, the indicator function sharpens the model where it matters most for control.
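The slide's loss equation did not survive extraction, so the following is only a sketch of one way the "doubling" at gripper regions could be realized: each squared error is multiplied by (1 + mask), where the mask is 1 on gripper pixels and 0 elsewhere. The weight form, the function name, and the flat-list inputs are all assumptions for illustration, not the paper's exact formulation.

```python
def gripper_weighted_mse(pred, target, gripper_mask):
    """Mean squared error in which gripper pixels count twice.

    pred, target: flat lists of floats; gripper_mask: flat list of 0/1.
    The weight (1 + mask) is an assumed form of the doubling described
    on the slide, not the paper's exact loss.
    """
    if not (len(pred) == len(target) == len(gripper_mask)):
        raise ValueError("inputs must have the same length")
    total = 0.0
    for p, t, m in zip(pred, target, gripper_mask):
        total += (1 + m) * (p - t) ** 2
    return total / len(pred)
```

With an all-zero mask this reduces to the plain mean squared error, so the weighting only changes the gradient where the gripper is visible.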
Inference Steps
1. Mask the gripper in the first frame with SAM2.
2. Run FoundationPose (off-the-shelf 6-DoF pose tracker) on the generated RGB-D sequence, with camera intrinsics and the gripper's CAD model as input.
   a. Output: SE(3) gripper pose per frame + confidence.
3. Run on both views.
   a. Keep the result with the higher confidence.
   b. Transform into the global frame using the reference view's known extrinsics.
4. Gripper open/close:
   a. Project the two finger point clouds and measure the distance between their centroids, with thresholds.
5. Execute open-loop for the predicted horizon, then re-plan on the next observation.
Key Points:
● No policy is trained. The "policy" is a video model + pose tracker.
● Generalization comes from the video model's view-invariance, not from imitation-learned features.
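Steps 3 and 4 can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it does not call SAM2 or FoundationPose, it takes their outputs as plain 4x4 matrices with a confidence score, and it uses a single hypothetical threshold `open_threshold_m` where the slide mentions "thresholds" without giving values. All function names are made up for this example.

```python
import math


def compose(A, B):
    """Matrix product A @ B for 4x4 nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def select_pose(candidates):
    """Return the pose from the (pose, confidence) pair with highest confidence."""
    return max(candidates, key=lambda c: c[1])[0]


def to_global(T_world_from_ref, T_ref_from_gripper):
    """Express a gripper pose given in the reference camera frame in the world frame."""
    return compose(T_world_from_ref, T_ref_from_gripper)


def centroid(points):
    """Mean of a list of 3D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]


def gripper_is_open(left_finger_pts, right_finger_pts, open_threshold_m=0.04):
    """Classify the gripper as open when the finger centroids are far apart.

    open_threshold_m is a hypothetical value chosen for illustration.
    """
    gap = math.dist(centroid(left_finger_pts), centroid(right_finger_pts))
    return gap > open_threshold_m
```

A usage pattern would be to call `select_pose` once per generated frame with the two per-view tracker outputs, pass the winner through `to_global`, and append the result to the open-loop trajectory.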
Experimental Setup
Real-world
● Three table-top tasks.
  ○ AddOrangeSlicesToBowl
  ○ PutCupOnSaucer
  ○ TwistCapOffBottle
● 20 teleoperated demos each.
● Two FRAMOS D415e RGB-D cameras at different angles.
● Sim-trained model fine-tuned for 15k steps on real data.
Simulation (LBM/Drake-based)
● 4 tasks on Franka Panda dual-arm:
  ○ StoreCerealBoxUnderShelf
  ○ PutSpatulaOnTable
  ○ PlaceAppleFromBowlIntoBin
  ○ PutSpatulaOnTable
● 25 demos × 16 camera poses = 400 videos / task
  ○ 12 train + 4 test views.
● Cameras sampled from a half-sphere shell (radius 0.7–1.2 m) around the table.
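The simulated camera layout can be sketched as follows. This assumes the table center is the origin with the z-axis pointing up, and that directions are drawn uniformly over the upper hemisphere with radii drawn uniformly from the stated range; the slides do not specify the sampling distribution, so both choices, along with the function name and seed, are illustrative.

```python
import math
import random


def sample_camera_positions(n, r_min=0.7, r_max=1.2, seed=0):
    """Sample n camera positions on a half-sphere shell above the table.

    Assumptions: table center at the origin, z up, uniform direction over
    the upper hemisphere, uniform radius in [r_min, r_max] meters.
    """
    rng = random.Random(seed)
    positions = []
    for _ in range(n):
        radius = rng.uniform(r_min, r_max)
        # A uniform height on the unit sphere gives a uniform surface density.
        z_unit = rng.uniform(0.0, 1.0)
        azimuth = rng.uniform(0.0, 2.0 * math.pi)
        xy_unit = math.sqrt(1.0 - z_unit ** 2)
        positions.append((radius * xy_unit * math.cos(azimuth),
                          radius * xy_unit * math.sin(azimuth),
                          radius * z_unit))
    return positions


# 16 camera poses per task, split into 12 training views and 4 held-out test views.
cameras = sample_camera_positions(16)
train_views, test_views = cameras[:12], cameras[12:]
```

Rendering each of the 25 demos from all 16 poses gives the 400 videos per task stated above.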
Results - 4D Generation Quality (Quantitative)
Results - 4D Generation Quality (Qualitative)
Results - Robot Policy Success Rate in Simulation
● Dreamitate: no depth, no geometric consistency → view-inconsistent video → bad poses.
● Diffusion Policy: implicit features can't bridge viewpoints despite multi-view training data.
● DP3: helps on cereal-box grasping (depth matters), but small objects (spatula, apple) still fail.
In conclusion: explicit geometric supervision + off-the-shelf pose tracker on RGB-D = an action signal that stays accurate across views.
Compute Trade-off
● This method pays a 2× inference-time cost over SVD to predict pointmaps, and 15× over 4D Gaussian.
● But 4D Gaussian collapses on novel views (mIoU 0.00 on real-world). The authors argue that this cost buys generalization.
Summary
Limitations
● Data requirements: needs multi-view RGB-D with varying camera poses.
● Inference latency: 30 s for 10 frames, which blocks closed-loop reactivity.
● Single-task models: each task is trained separately.
● No real-world robot policy rollout evaluation.
Take-aways
● Pointmap alignment is a strong inductive bias for video models.
● Decoupling perception from policy works with zero policy learning.
Afterthought
● Trading compute cost for generalization can be worth it in some real-world scenarios.
● Possible future directions:
  ○ Faster generators, real-world depth from RGB, hierarchical inference, multi-task models.
  ○ Robotic "world models" with explicit geometry.