---
title: 【DL輪読会】Geometry-aware 4D Video Generation for Robot Manipulation
tags: 
author: [Deep Learning JP](https://www.docswell.com/user/DeepLearning2023)
site: [Docswell](https://www.docswell.com/)
thumbnail: https://bcdn.docswell.com/page/P7R9PPXYE9.jpg?width=480
description: 【DL輪読会】Geometry-aware 4D Video Generation for Robot Manipulation by Deep Learning JP
published: May 14, 26
canonical: https://www.docswell.com/s/DeepLearning2023/K7NEW8-2026-05-18-110837
---
# Page. 1

![Page Image](https://bcdn.docswell.com/page/P7R9PPXYE9.jpg)

Geometry-aware 4D Video Generation for Robot Manipulation
Shu MORIKUNI, Matsuo-Iwasawa Lab


# Page. 2

![Page Image](https://bcdn.docswell.com/page/PJXQ33647X.jpg)

Bibliography Information
Title
Geometry-aware 4D Video Generation for Robot Manipulation
Authors
Zeyi Liu1∗, Shuang Li1, Eric Cousineau2, Siyuan Feng2, Benjamin Burchfiel2, Shuran Song1
Affiliations
1 Stanford University, 2 Toyota Research Institute
Publication
ICLR 2026
arXiv
https://arxiv.org/abs/2507.01099
Project Page
https://robot4dgen.github.io/
GitHub
https://github.com/lzylucy/4dgen
Summary
1. Proposed a 4D video model that generates view-consistent RGB-D sequences from novel viewpoints via cross-view pointmap alignment during training.
2. Used an off-the-shelf 6-DoF tracker to extract robot trajectories from the generated videos, enabling manipulation policies that generalize to unseen viewpoints.


# Page. 3

![Page Image](https://bcdn.docswell.com/page/3JK9YY1PJD.jpg)

Quick Demo 1: Generated Consistent RGB-D Video
Task: Store Cereal Box Under the Shelf
Panel labels: Generated, Unseen View, Ground Truth


# Page. 4

![Page Image](https://bcdn.docswell.com/page/LE3W99Q4E5.jpg)

Quick Demo 2: Downstream Policy Rollout From Novel View
Panel labels: MV-DP, 3D-DP, Proposed


# Page. 5

![Page Image](https://bcdn.docswell.com/page/8EDKGG957G.jpg)

Motivation
● Robots must anticipate scene dynamics such as object motion, occlusion, and contact responses.
● Even small viewpoint changes can significantly degrade manipulation policy performance.
● Collecting large-scale real-world robotic data is expensive.
Video generation models offer a way to learn this “intuitive physics”: they generate plausible futures that a robot can plan against.


# Page. 6

![Page Image](https://bcdn.docswell.com/page/V7PK33ZDJ8.jpg)

The Challenge & The Goal
Why is it difficult to leverage existing video generation models for robotics?
● Pixel-based video models (e.g., SVD) handle motion well, but lack 3D structure, causing flickering, deformation, or object disappearance.
● 3D-aware methods enforce geometry but are limited to simple, static, single-object scenes.
The Goal: Maintain temporal coherence and 3D consistency across viewpoints for dynamic manipulation scenes.


# Page. 7

![Page Image](https://bcdn.docswell.com/page/2JVV44WGJQ.jpg)

Video Generation with Pose Tracking to Robot Execution
1. RGB-D observation → predict future pointmaps & RGB videos
2. 6-DoF EEF pose tracking
3. Downstream execution


# Page. 8

![Page Image](https://bcdn.docswell.com/page/5EGL11GDJL.jpg)



# Page. 9

![Page Image](https://bcdn.docswell.com/page/4JQYDDXX7P.jpg)

4D Pointmaps Prediction
● DUSt3R is adapted and used here.
● The trick is to train a diffusion model to jointly predict:
○ A pointmap from the reference view Vn, expressed in Vn’s own frame.
○ A pointmap from a second view Vm, re-expressed in Vn’s frame.
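To make "re-expressed in Vn’s frame" concrete, here is a minimal NumPy sketch of how a ground-truth training target of that kind could be built from a depth image and known camera poses. The function names, the pinhole intrinsics `K`, and the `T_world_from_*` extrinsics convention are illustrative assumptions, not taken from the paper or its code.

```python
import numpy as np

def depth_to_pointmap(depth, K):
    """Unproject an (H, W) depth image into an (H, W, 3) pointmap in the camera's own frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices, each (H, W)
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return np.stack([x, y, depth], axis=-1)

def reexpress_pointmap(pointmap_m, T_world_from_m, T_world_from_n):
    """Re-express a pointmap given in view Vm's camera frame in view Vn's camera frame."""
    T_n_from_m = np.linalg.inv(T_world_from_n) @ T_world_from_m
    R, t = T_n_from_m[:3, :3], T_n_from_m[:3, 3]
    return pointmap_m @ R.T + t  # rotate, then translate every 3D point

# Toy example (all values are made up for illustration).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.5],
              [0.0, 0.0, 1.0]])
depth_m = np.full((3, 4), 2.0)           # flat wall 2 m in front of Vm
T_world_from_n = np.eye(4)               # Vn sits at the world origin
T_world_from_m = np.eye(4)
T_world_from_m[:3, 3] = [0.5, 0.0, 0.0]  # Vm is shifted 0.5 m along x

X_m_in_m = depth_to_pointmap(depth_m, K)                                  # Vm's pointmap, own frame
X_m_in_n = reexpress_pointmap(X_m_in_m, T_world_from_m, T_world_from_n)   # same points, Vn's frame
```

Because both pointmaps end up in one coordinate frame, corresponding 3D points from the two views should coincide, which is the property the cross-view alignment relies on.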


# Page. 10

![Page Image](https://bcdn.docswell.com/page/K74WZZ62E1.jpg)

SVD Architecture - Inputs & Encoders
● Backbone: Stable Video Diffusion (SVD).
● RGB VAE: the frozen SVD VAE encodes each RGB frame to a latent.
● Pointmap VAE: a separate VAE, initialized from SVD’s VAE and then fine-tuned on pointmap data.
● The two latents are concatenated along the channel axis, giving the U-Net a unified RGB + pointmap latent.
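The channel-axis concatenation can be illustrated with placeholder arrays. The shapes below (10 frames, 4 latent channels per VAE, 32 × 40 spatial size) are assumptions chosen for the example, not values reported on the slide.

```python
import numpy as np

# Hypothetical latent shapes: (frames, channels, height, width).
num_frames, latent_channels, h, w = 10, 4, 32, 40

# Stand-ins for the two encoder outputs (real latents would come from the VAEs).
rgb_latent = np.zeros((num_frames, latent_channels, h, w), dtype=np.float32)
pointmap_latent = np.ones((num_frames, latent_channels, h, w), dtype=np.float32)

# Concatenate along the channel axis (axis=1) to form the unified U-Net input.
unified_latent = np.concatenate([rgb_latent, pointmap_latent], axis=1)
print(unified_latent.shape)  # (10, 8, 32, 40)
```

Frame count and spatial size are unchanged; only the channel dimension grows, so each spatial location carries both appearance and geometry features.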


# Page. 11

![Page Image](https://bcdn.docswell.com/page/LJ1YRRVKEG.jpg)

SVD Architecture - Decoders & Cross-attention
● Two U-Net decoders with identical architecture but independent weights.
● After each decoder block in Vm’s branch, a cross-attention layer is inserted.
This is the channel through which geometric information flows from the reference branch into the novel-view branch.
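As a rough illustration of that information flow, the sketch below implements single-head scaled dot-product cross-attention in NumPy, with Vm tokens as queries and Vn tokens as keys and values. The projection matrices, token counts, feature width, and the residual connection are all assumptions for the example; the actual layer configuration is not given on the slide.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(feat_m, feat_n, W_q, W_k, W_v):
    """Single-head cross-attention: Vm tokens (queries) attend to Vn tokens (keys/values)."""
    q = feat_m @ W_q                          # (N_m, d)
    k = feat_n @ W_k                          # (N_n, d)
    v = feat_n @ W_v                          # (N_n, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N_m, N_n) similarity scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return feat_m + weights @ v, weights      # residual connection (an assumption)

rng = np.random.default_rng(0)
d = 16
feat_n = rng.standard_normal((6, d))  # reference-view decoder features (6 tokens)
feat_m = rng.standard_normal((5, d))  # novel-view decoder features (5 tokens)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

out_m, attn = cross_attention(feat_m, feat_n, W_q, W_k, W_v)
```

The output keeps the shape of the Vm features, so such a layer can be dropped in after a decoder block without changing the rest of the branch.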


# Page. 12

![Page Image](https://bcdn.docswell.com/page/GJWG11LP72.jpg)

Three Training Objectives + One Trick
● RGB diffusion loss on both views (standard SVD objective).
● 3D diffusion loss applied to both the native pointmap and the re-projected pointmap.
● In the loss shown on the slide, 1{w_g(t′) = 1} is an indicator function that equals 1 on gripper regions, effectively doubling the contribution of loss terms there.
● Because downstream gripper-pose tracking is critical, the indicator sharpens the model where it matters most for control.
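The gripper weighting can be sketched as a per-pixel weight of 1 plus the indicator. This form is an assumption inferred from the "doubling" remark; the real objective is a diffusion loss whose exact normalization is not reproduced here, and `gripper_weighted_mse` is a hypothetical helper name.

```python
import numpy as np

def gripper_weighted_mse(pred, target, gripper_mask):
    """Mean squared error with per-pixel weight 1 + 1{gripper}, so gripper pixels count twice."""
    weight = 1.0 + gripper_mask.astype(np.float64)   # 2 on gripper pixels, 1 elsewhere
    sq_err = (pred - target) ** 2                    # (H, W, C)
    return float((weight[..., None] * sq_err).mean())

pred = np.ones((2, 2, 3))
target = np.zeros((2, 2, 3))
mask = np.array([[1, 0],
                 [0, 0]])  # one gripper pixel out of four

plain = gripper_weighted_mse(pred, target, np.zeros_like(mask))  # no gripper pixels
weighted = gripper_weighted_mse(pred, target, mask)              # gripper pixel weighted 2x
```

With a unit error everywhere, the unweighted loss is 1.0 and the weighted loss is 1.25, because one of the four pixels contributes twice.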


# Page. 13

![Page Image](https://bcdn.docswell.com/page/4EZLPPM673.jpg)

Inference Steps
1. Mask the gripper in the first frame with SAM2.
2. Run FoundationPose (off-the-shelf 6-DoF pose tracker) on the generated RGB-D sequence, with camera intrinsics and the gripper’s CAD model as input.
   a. Output: SE(3) gripper pose per frame + confidence.
3. Run on both views.
   a. Keep the result with the higher confidence.
   b. Transform into the global frame using the reference view’s known extrinsics.
4. Gripper open/close:
   a. Project the two finger point clouds and measure the distance between their centroids, with thresholds.
5. Execute in open-loop for the predicted horizon, then re-plan on the next observation.
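The view-selection and global-frame steps above can be sketched as follows. This assumes each tracker result is a 4 × 4 homogeneous pose already expressed in the reference camera's frame and that `T_world_from_ref` is the reference view's camera-to-world extrinsic; the function name, dictionary layout, and numbers are illustrative only and do not reflect the FoundationPose API.

```python
import numpy as np

def select_and_globalize(poses_in_ref, confidences, T_world_from_ref):
    """Keep the pose from the more confident view and map it into the global frame."""
    best_view = max(confidences, key=confidences.get)
    pose_world = T_world_from_ref @ poses_in_ref[best_view]
    return best_view, pose_world

# Toy per-frame tracker outputs for the two views (made-up values).
pose_n = np.eye(4)
pose_n[:3, 3] = [0.10, 0.00, 0.50]
pose_m = np.eye(4)
pose_m[:3, 3] = [0.10, 0.02, 0.50]

# Reference camera sits 1 m above the world origin (hypothetical extrinsics).
T_world_from_ref = np.eye(4)
T_world_from_ref[:3, 3] = [0.0, 0.0, 1.0]

view, pose_world = select_and_globalize(
    {"Vn": pose_n, "Vm": pose_m},
    {"Vn": 0.6, "Vm": 0.9},
    T_world_from_ref,
)
```

In this toy case the Vm estimate is kept because its confidence is higher, and the resulting pose is what would be sent to the robot controller for that frame.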
Key Points:
● No policy is trained. The “policy” is a video model + pose tracker.
● Generalization comes from the video model’s view-invariance, not from imitation-learned features.
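For the gripper open/close step, a minimal sketch is shown below. It assumes the two finger point clouds are already segmented, and it uses two thresholds with hysteresis; the threshold values, the hysteresis behaviour, and the function name are assumptions, since the slide only says "with thresholds".

```python
import numpy as np

def gripper_state(left_pts, right_pts, prev_state, close_below=0.03, open_above=0.06):
    """Classify the gripper as open or closed from the distance between finger centroids (metres)."""
    gap = float(np.linalg.norm(left_pts.mean(axis=0) - right_pts.mean(axis=0)))
    if gap < close_below:
        return "closed", gap
    if gap > open_above:
        return "open", gap
    return prev_state, gap  # between thresholds: keep the previous state

# Toy finger point clouds, 8 cm apart along x (made-up values).
left = np.array([[0.00, 0.00, 0.5],
                 [0.00, 0.01, 0.5]])
right = np.array([[0.08, 0.00, 0.5],
                  [0.08, 0.01, 0.5]])

state, gap = gripper_state(left, right, prev_state="closed")
```

Keeping the previous state between the two thresholds avoids rapid toggling when the measured gap hovers near a single cut-off.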


# Page. 14

![Page Image](https://bcdn.docswell.com/page/Y76WMMNL7V.jpg)

Experimental Setup
Simulation (LBM/Drake-based)
● Three table-top tasks:
○ StoreCerealBoxUnderShelf
○ PutSpatulaOnTable
○ PlaceAppleFromBowlIntoBin
● 25 demos × 16 camera poses = 400 videos / task
○ 12 train + 4 test views.
● Cameras sampled from a half-sphere shell (radius 0.7–1.2 m) around the table.
Real-world
● 4 tasks on Franka Panda dual-arm:
○ AddOrangeSlicesToBowl
○ PutCupOnSaucer
○ TwistCapOffBottle
○ PutSpatulaOnTable
● 20 teleoperated demos each.
● Two FRAMOS D415e RGB-D cameras at different angles.
● Sim-trained model fine-tuned for 15k steps on real data.
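The simulation camera setup can be sketched with the standard library alone. The sampling distribution (uniform radius, uniform direction over the upper hemisphere), the table-centred origin, and the way the 16 views are split are assumptions; the slide gives only the radius range and the 12 + 4 split.

```python
import math
import random

def sample_camera_position(rng, r_min=0.7, r_max=1.2):
    """Sample one camera position (metres) in the upper half-sphere shell around the table origin."""
    radius = rng.uniform(r_min, r_max)
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    cos_polar = rng.uniform(0.0, 1.0)  # uniform in cos(polar) gives uniform directions on the hemisphere
    sin_polar = math.sqrt(1.0 - cos_polar ** 2)
    return (radius * sin_polar * math.cos(azimuth),
            radius * sin_polar * math.sin(azimuth),
            radius * cos_polar)

rng = random.Random(0)
cameras = [sample_camera_position(rng) for _ in range(16)]
train_views, test_views = cameras[:12], cameras[12:]  # 12 train + 4 held-out views

demos_per_task = 25
videos_per_task = demos_per_task * len(cameras)
print(videos_per_task)  # 400
```

Holding out four of the sixteen poses is what lets the evaluation measure generalization to viewpoints the model never saw during training.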


# Page. 15

![Page Image](https://bcdn.docswell.com/page/G75MZZ5M74.jpg)

Results - 4D Generation Quality (Quantitative)


# Page. 16

![Page Image](https://bcdn.docswell.com/page/9J29RRNRER.jpg)

Results - 4D Generation Quality (Qualitative)


# Page. 17

![Page Image](https://bcdn.docswell.com/page/DEY4DDP5JM.jpg)

Results - Robot Policy Success Rate in Simulation
● Dreamitate: no depth, no geometric consistency → view-inconsistent video → bad poses.
● Diffusion Policy: implicit features can’t bridge viewpoints despite multi-view training data.
● DP3: helps on cereal-box grasping (depth matters), but small objects (spatula, apple) still fail.
In conclusion: explicit geometric supervision + off-the-shelf pose tracker on RGB-D = the action signal stays accurate across views.


# Page. 18

![Page Image](https://bcdn.docswell.com/page/VJNY665478.jpg)

Compute Trade-off
● This method pays a 2× inference-time cost over SVD to predict pointmaps, and 15× over 4D Gaussian.
● But 4D Gaussian collapses on novel views (mIoU 0.00 on real-world).
The authors claim that the cost buys generalization.


# Page. 19

![Page Image](https://bcdn.docswell.com/page/YE9PLLY4J3.jpg)

Summary
Limitations
● Data requirements: needs multi-view RGB-D with varying camera poses.
● Inference latency: 30 s for 10 frames, which blocks closed-loop reactivity.
● Single-task models: each task is trained separately.
● No real-world robot policy rollout evaluation.
Take-aways
● Pointmap alignment is a strong inductive bias for video models.
● Decoupling perception from policy works with zero policy learning.
Afterthought
● The cost-versus-generalization trade-off can be worth paying in some real-world scenarios.
● Possible future directions:
○ Faster generators, real-world depth from RGB, hierarchical inference, multi-task models.
○ Robotic “world models” with explicit geometry.


