[DL Reading Group] Improving Robotic Generalist Policies via Flow Reversal Steering

June 18, 26

Deep Learning JP

@DeepLearning2023

DL reading group materials

Text of each page

DEEP LEARNING JP [DL Papers]
Improving Robotic Generalist Policies via Flow Reversal Steering
Jeremy Siburian, Matsuo-Iwasawa Lab, M2
http://deeplearning.jp/

Paper Overview
Improving Robotic Generalist Policies via Flow Reversal Steering
Paper Details
• Authors: Andy Tang, William Chen, Andrew Wagenmaker, Chelsea Finn, Sergey Levine (Stanford University, UC Berkeley)
• arXiv preprint, 2026
• Links:
  – arXiv: https://arxiv.org/abs/2606.13675
  – Project Page: https://flow-reversal-steering.github.io/ (I recommend viewing the project page for more interactive visualizations!)
Disclaimer: All credits for images, figures, tables, and other contents belong to the original authors.

Introduction
• Robotic foundation models trained on large, diverse datasets can act as broad multi-task generalist policies.
• These models contain rich behavioral priors over useful robot skills, but they may still fail on new, long-horizon, or out-of-distribution tasks.
• How can we effectively access and steer these priors to solve new tasks?
Figure labels: Vision-Language-Action Models; Large Behavior Models

Introduction
• Policy steering guides a pretrained policy’s action generation toward task-relevant behaviors, without retraining the model.
• For diffusion/flow policies, steering either modifies the denoising process or chooses input noises that produce better actions.
• Challenge: good noises are hard to find, and existing methods often rely on hand-tuned noising, sample-and-rank, or expensive trial-and-error RL.
Figure labels: DSRL [Wagenmaker et al. 2025]; Inference-Time Policy Steering [Wang et al. 2025]

Introduction
Motivation: How can we use semantic knowledge to steer generalist policies towards sampling "reasonable" actions for new tasks?
Flow Reversal Steering (FRS) maps coarse reference actions to their noises by passing them through flow policies in reverse.
• Zero-shot steering with semantic guidance
• Efficient adaptation through noise-space BC or RL

Flow Reversal Steering (FRS)
Key Idea: Use a coarse reference action to find a corresponding latent noise by reversing the flow, then denoise that noise to produce a refined, in-distribution action from the generalist policy.
• Standard flow denoising samples from random noise.
• FRS starts from a reference action, reverses it to noise, then denoises it to a nearby in-distribution action mode.

Flow Reversal Steering (FRS)
3 Stages of FRS
1. Coarse Guidance: A human or VLM provides a rough reference action for the task.
2. Flow Reversal: FRS maps the reference action to noise, then denoises it into a refined generalist action.
3. Policy Improvement: The resulting actions/noises can be used for zero-shot execution, behavioral cloning, or RL.

FRS Stage 1: Coarse Guidance
On novel tasks the generalist often fails when commanded directly. FRS uses a reasoner to steer it toward the correct behaviors in its prior.
• Humans or VLMs can provide useful high-level guidance, but usually cannot produce precise low-level robot actions.
• In FRS, the reasoner outputs a simple directional action (e.g. a Cartesian direction for the end-effector), which is converted into a rough steering action chunk: ineffective to execute directly, but useful as a reference for steering.
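The conversion from a directional label to a reference action chunk can be sketched as below. This is only an illustration: the label set, axis convention, chunk length (`horizon`), and displacement per step (`step_size`) are assumptions made for the example and are not taken from the paper.

```python
# Hypothetical label-to-direction table (axis convention is assumed).
DIRECTIONS = {
    "forward": (1.0, 0.0, 0.0),
    "backward": (-1.0, 0.0, 0.0),
    "left": (0.0, 1.0, 0.0),
    "right": (0.0, -1.0, 0.0),
    "up": (0.0, 0.0, 1.0),
    "down": (0.0, 0.0, -1.0),
}


def direction_to_chunk(label, horizon=8, step_size=0.02):
    """Turn one coarse direction into a constant end-effector action chunk.

    Every step in the chunk is the same small Cartesian displacement, so the
    chunk is a rough reference rather than something to execute directly.
    """
    dx, dy, dz = DIRECTIONS[label]
    step = (dx * step_size, dy * step_size, dz * step_size)
    return [step for _ in range(horizon)]


# Example: the reasoner says "up"; build a 4-step reference chunk.
chunk = direction_to_chunk("up", horizon=4, step_size=0.05)
```

A chunk like this would then serve as the reference action handed to flow reversal in Stage 2.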

FRS Stage 2: Flow Reversal + Denoising
Flow denoising is deterministic, so the flow can be reversed to recover the noise behind any reference action, then re-denoised to project it onto the generalist's prior.
• The coarse reference action is passed backward through the flow policy to recover its latent noise.
• That noise is denoised forward into a refined action. With finite steps the round trip is lossy, and the error pulls the action toward the generalist's high-density modes.
• So the output stays roughly consistent with the reference, while being more fine-grained and in-distribution.
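The reverse-then-denoise round trip can be illustrated with a toy example. The sketch below assumes explicit Euler integration and a hand-written 1-D velocity field with a single mode at 1.0; a real flow policy uses a learned, observation-conditioned velocity network, and the paper's integrator may differ. For this particular linear field, the round trip shrinks the distance to the mode by a factor of (1 - (rate/steps)^2)^steps, so fewer steps give a stronger pull toward the mode.

```python
def velocity(x, t, mode=1.0, rate=3.0):
    """Toy velocity field pulling samples toward `mode` (assumed, not learned).

    The field ignores t; a real flow policy would condition on time
    and on the robot's observation.
    """
    return rate * (mode - x)


def denoise(noise, steps):
    """Explicit Euler integration forward in time, t=0 (noise) to t=1 (action)."""
    x, h = noise, 1.0 / steps
    for k in range(steps):
        x = x + h * velocity(x, k * h)
    return x


def reverse(action, steps):
    """Explicit Euler integration run backward in time, t=1 (action) to t=0 (noise)."""
    x, h = action, 1.0 / steps
    for k in range(steps, 0, -1):
        x = x - h * velocity(x, k * h)
    return x


def flow_reversal_steer(reference, steps):
    """Map a reference action to noise, then denoise it back to an action."""
    return denoise(reverse(reference, steps), steps)


reference = 0.4                                       # coarse action, away from the mode
few_steps = flow_reversal_steer(reference, steps=4)   # lossy round trip
many_steps = flow_reversal_steer(reference, steps=100)  # nearly exact round trip
```

In this toy setting, `few_steps` lands close to the mode at 1.0 while `many_steps` stays close to the reference, which mirrors the slide's point that the finite-step error is what moves the action toward high-density regions.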

FRS Stage 3: Policy Improvement
Once FRS can turn coarse guidance into useful actions, how can we use it to improve the policy?
1. Zero-shot online steering: use FRS directly during deployment to guide the policy step by step.
2. Supervised learning on FRS noises (DSBC): treat the recovered noises as targets for training a small noise policy.
3. Bootstrapping RL with FRS (DSRL + FRS): use FRS trajectories as prior data to make RL exploration more efficient.

#1: Zero-Shot Online Steering
Key Idea: Use FRS directly during deployment, without any additional training.
How It Works (every step):
1. Query a reasoner (e.g. human or VLM) for a coarse reference action
2. Pass it through flow reversal to find the corresponding noise
3. Denoise the noise to get the final action
Limitation: Requires querying the reasoner every step, which can be expensive, especially for human guidance.
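The per-step loop above can be sketched as follows. All callables here (`toy_reasoner`, `toy_reverse`, `toy_denoise`, `toy_env_step`) are hypothetical stand-ins on a 1-D toy problem, not the paper's VLM, flow policy, or environment; the sketch only shows the control flow and why the number of reasoner queries equals the number of steps.

```python
def frs_step(obs, reasoner, reverse_flow, denoise_flow):
    """One zero-shot FRS step: reference -> noise -> refined action."""
    reference = reasoner(obs)               # 1. coarse reference action
    noise = reverse_flow(reference, obs)    # 2. flow reversal
    return denoise_flow(noise, obs)         # 3. denoise to the final action


def rollout(obs, env_step, horizon, reasoner, reverse_flow, denoise_flow):
    """Run FRS for `horizon` steps and count reasoner queries."""
    queries = 0
    for _ in range(horizon):
        action = frs_step(obs, reasoner, reverse_flow, denoise_flow)
        queries += 1                        # one reasoner query per step
        obs = env_step(obs, action)
    return obs, queries


# Toy stand-ins (all hypothetical): move a 1-D position from 0.0 toward 1.0.
def toy_reasoner(obs):
    return 1.0 if obs < 1.0 else -1.0       # coarse direction only


def toy_reverse(action, obs):
    return 2.0 * action                     # pretend action-to-noise map


def toy_denoise(noise, obs):
    # Pretend noise-to-action map that only produces small, feasible steps.
    return max(-0.1, min(0.1, noise / 2.0))


def toy_env_step(obs, action):
    return obs + action


final_obs, queries = rollout(0.0, toy_env_step, 10,
                             toy_reasoner, toy_reverse, toy_denoise)
```

The `queries` counter makes the stated limitation concrete: the reasoner cost grows linearly with the rollout length.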

#1: Zero-Shot Online Steering (Results)
Takeaway #1: FRS refines coarse VLM guidance into better zero-shot actions, especially on hard tasks.
Baselines: base policy (no steering), direct VLM execution, partial noising, sample-and-rank. All share the same VLA, VLM, and prompt.
Results:
• Zero-shot FRS beats the base VLA across LIBERO. On hard tasks (base success ≤2%), 11 of 42 gain ≥10% absolute, vs only 3-4 for other steering baselines, showing that FRS helps most in the low-success regimes where improving a generalist is hardest.
• FRS beats direct VLM execution, showing flow reversal refines the coarse references rather than reconstructing them.

#2: Supervised Learning on FRS
Key Idea: Diffusion Steering via Behavioral Cloning (DSBC) distills FRS into a cheap fixed policy.
How It Works:
1. Train a small noise policy to imitate FRS-recovered noises (from successful rollouts or offline demos).
2. At test time it outputs noise that the frozen generalist denoises into an action.
• Why It Works: In OOD states, the generalist still maps the noise to a reasonable action, so DSBC resists compounding error where standard BC fails.
• Limitation: DSBC distills FRS but cannot surpass it. Offline reconstructions are also approximate.
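A minimal sketch of the DSBC idea, under strong simplifications: the "small noise policy" is replaced by a 1-D least-squares line, the frozen generalist is replaced by `tanh` (a stand-in that always returns a bounded action), and the training pairs are made up for illustration. None of these choices come from the paper.

```python
import math


def fit_linear_noise_policy(observations, noises):
    """Fit noise = w * obs + b by ordinary least squares (stand-in for a small network)."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_z = sum(noises) / n
    cov = sum((o - mean_o) * (z - mean_z) for o, z in zip(observations, noises))
    var = sum((o - mean_o) ** 2 for o in observations)
    w = cov / var
    b = mean_z - w * mean_o
    return lambda obs: w * obs + b


def frozen_generalist(noise, obs):
    """Hypothetical frozen flow policy: maps any noise to a bounded action."""
    return math.tanh(noise)


# Illustrative (obs, FRS-recovered noise) pairs from successful rollouts.
observations = [0.0, 0.5, 1.0, 1.5]
frs_noises = [1.0, 0.5, 0.0, -0.5]

noise_policy = fit_linear_noise_policy(observations, frs_noises)

# Test time: the noise policy proposes a noise, the generalist denoises it.
in_dist_action = frozen_generalist(noise_policy(0.5), 0.5)
ood_action = frozen_generalist(noise_policy(100.0), 100.0)
```

Even for the far out-of-distribution observation, the action stays inside the range the stand-in generalist can produce, which is the intuition behind DSBC's robustness to compounding error.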

#2: Supervised Learning on FRS (Results)
Takeaway #2: FRS trajectories distill into a cheap fixed noise policy that matches zero-shot FRS.
Baselines: base policy, zero-shot VLM FRS, and standard BC on FRS successes (same small architecture as DSBC).
Results:
• DSBC matches zero-shot FRS and beats the base VLA, distilling the gains into a fixed policy.
• DSBC beats standard BC, since the VLA maps bad noises back to reasonable actions in OOD states, giving built-in robustness to compounding error.
• Highly efficient: ~18 rollouts/task in sim, 10 on real robots, ~1 GB GPU, under a minute to train (vs hundreds of GBs to fine-tune a full VLA).

#3: Bootstrapping RL with FRS
Key Idea: FRS trajectories bootstrap DSRL (Diffusion Steering via Reinforcement Learning), skipping the trial-and-error search in RL.
How It Works:
1. Prefill DSRL's replay buffer with FRS trajectories for meaningful prior experience.
2. Add a DSBC auxiliary loss on successful rollouts, keeping the policy near useful FRS noises.
Why It Helps: DSRL + FRS biases exploration toward useful behaviors, improving even when the base VLA nearly always fails.
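The two ingredients above can be sketched as follows. The transition layout, the mean-squared-error form of the auxiliary loss, and the `bc_weight` coefficient are assumptions for illustration; the slides only state that the buffer is prefilled and that a DSBC auxiliary loss is added on successful rollouts.

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal FIFO replay buffer."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng):
        items = list(self.data)
        return rng.sample(items, min(batch_size, len(items)))


def prefill(buffer, frs_trajectories):
    """Step 1: seed the buffer with FRS transitions before RL starts."""
    for trajectory in frs_trajectories:
        for transition in trajectory:
            buffer.add(transition)


def dsbc_aux_loss(policy_noises, frs_noises):
    """Step 2: mean squared error between policy noises and FRS noises (assumed form)."""
    pairs = list(zip(policy_noises, frs_noises))
    return sum((p - z) ** 2 for p, z in pairs) / len(pairs)


def total_loss(rl_loss, policy_noises, frs_noises, bc_weight=1.0):
    """RL objective plus the weighted auxiliary term (weight is hypothetical)."""
    return rl_loss + bc_weight * dsbc_aux_loss(policy_noises, frs_noises)


# Illustrative data: one successful FRS trajectory with two transitions.
frs_trajectories = [[
    {"obs": 0.0, "noise": 0.3, "reward": 0.0, "next_obs": 0.1, "done": False},
    {"obs": 0.1, "noise": 0.2, "reward": 1.0, "next_obs": 0.2, "done": True},
]]

buffer = ReplayBuffer(capacity=1000)
prefill(buffer, frs_trajectories)
batch = buffer.sample(2, random.Random(0))
loss = total_loss(0.5, [0.0, 1.0], [0.0, 0.0], bc_weight=0.5)
```

In an actual DSRL setup the `rl_loss` would come from the SAC update mentioned in the results; here it is just a placeholder number.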

#3: Bootstrapping RL with FRS (Results)
Takeaway #3: FRS bootstraps RL to surpass fixed zero-shot FRS and DSBC, even from a single success.
• Settings: Two LIBERO-90 splits, prefilling DSRL with FRS rollouts. (1) 15 tasks where FRS works well (20 rollouts prefilled). (2) 10 hard tasks where the base VLA nearly always fails (just 1 FRS success prefilled, which can take 50+ trials to obtain).
• Baselines: standard DSRL, residual RL (PLD-like), and RoboMeter as a reward model. All control for VLA, policy architecture, and RL algorithm (SAC).
Results:
• DSRL + FRS is the most sample-efficient in both settings, learning faster and reaching higher final success than standard RL.
• On the hard split, naive DSRL plateaus around 30% due to the weak base policy, while DSRL + FRS uses FRS to find early successes and converges much higher.

Real-World Experiments
Takeaway #4: FRS transfers to real robots, with DSBC boosting performance from just 10 human rollouts.
Setup: A human steers via feedback (one Cartesian direction per chunk) instead of dense teleoperation. Successful rollouts then train DSBC.
Results:
• Across 6 tasks the base VLA struggles on, DSBC boosts average absolute performance by 60% from just 10 human rollouts per task.
• Standard BC completely fails in this regime, as it cannot fall back on the VLA prior.
• Same efficiency as sim: under a minute, ~1 GB GPU.
• DSBC can also bootstrap RL: on towel-hanging, 5% base → 50% DSBC → 80% post-RL.

Real-World Experiments
Takeaway #5: DSBC also learns from standard demos, no online steering needed.
Setup: 20 teleoperated episodes of "hang tape on stand," a task too imprecise for coarse steering. Flow reversal augments each action with its noise for DSBC to learn from.
Results:
• Offline DSBC beats the base VLA and standard BC, which can't learn precise, temporally-coherent behaviors in this low-data regime.
• Validates flow reversal as a simple recipe for turning ordinary demos into noise-action data, with no online execution needed.

Summary & Takeaways
Key Takeaways
• FRS maps coarse reference actions into the flow policy's noise space via reversal, bridging semantic guidance and low-level control.
• Flow reversal yields fine-grained, "in-distribution" actions that stay roughly consistent with the reference, letting the policy refine what humans or VLMs suggest.
• Beyond zero-shot steering, the recovered noises enable fast noise-space BC (DSBC) and bootstrap RL (DSRL + FRS).
Limitations / Future Work
• Depends on the prior: FRS elicits "reasonable" behaviors the generalist already roughly knows, rather than producing new ones.
• Guidance quality matters: coarse cardinal actions leave much of the prior untapped (VLM 11% vs oracle 93% on the hard split).
• Local steering, not full planning: long-horizon reasoning may still need higher-level planning.
• Future Work: combine FRS with stronger planners, value functions, or reward models for long-horizon adaptation.