---
title: 【DL Reading Group】Improving Robotic Generalist Policies via Flow Reversal Steering
tags: 
author: [Deep Learning JP](https://www.docswell.com/user/DeepLearning2023)
site: [Docswell](https://www.docswell.com/)
thumbnail: https://bcdn.docswell.com/page/K74WD6XME1.jpg?width=480
description: 【DL Reading Group】Improving Robotic Generalist Policies via Flow Reversal Steering by Deep Learning JP
published: June 18, 26
canonical: https://www.docswell.com/s/DeepLearning2023/KVJMQ2-2026-06-22-103815
---
# Page. 1

![Page Image](https://bcdn.docswell.com/page/K74WD6XME1.jpg)

DEEP LEARNING JP
[DL Papers]
Improving Robotic Generalist Policies via Flow Reversal Steering
Jeremy Siburian, Matsuo-Iwasawa Lab, M2
http://deeplearning.jp/
1


# Page. 2

![Page Image](https://bcdn.docswell.com/page/LJ1YZVNYEG.jpg)

Paper Overview
Improving Robotic Generalist Policies via Flow Reversal Steering
Paper Details
• Authors: Andy Tang1, William Chen1, Andrew Wagenmaker1, Chelsea Finn1, Sergey Levine1
(1Stanford University, 1UC Berkeley)
• arXiv preprint, 2026
• Links:
– arXiv: https://arxiv.org/abs/2606.13675
– Project Page: https://flow-reversal-steering.github.io/
(I recommend viewing the project page for more interactive visualizations!)
Disclaimer: All credits for images, figures, tables, and other contents belong to the original authors.
2


# Page. 3

![Page Image](https://bcdn.docswell.com/page/GJWG9L4172.jpg)

Introduction
• Robotic foundation models trained on large, diverse datasets can act as broad multi-task generalist policies.
• These models contain rich behavioral priors over useful robot skills, but they may still fail on new, long-horizon, or out-of-distribution tasks.
• How can we effectively access and steer these priors to solve new tasks?
Vision-Language-Action Models
Large Behavior Models
3


# Page. 4

![Page Image](https://bcdn.docswell.com/page/4EZL9M2X73.jpg)

Introduction
• Policy steering guides a pretrained policy’s action generation toward task-relevant behaviors, without retraining the model.
• For diffusion/flow policies, steering modifies the denoising process or chooses input noises that produce better actions.
• Challenge: good noises are hard to find, and existing methods often rely on hand-tuned noising, sample-and-rank, or expensive trial-and-error RL.
DSRL [Wagenmaker et al. 2025]
Inference-Time Policy Steering [Wang et al. 2025]
4


# Page. 5

![Page Image](https://bcdn.docswell.com/page/Y76WKNVP7V.jpg)

Introduction
Motivation: How can we use semantic knowledge to steer generalist policies toward sampling "reasonable" actions for new tasks?
Flow Reversal Steering (FRS) maps coarse reference actions to their noises by passing them through flow policies in reverse.
Zero-shot steering with semantic guidance
Efficient adaptation through noise-space BC or RL
5


# Page. 6

![Page Image](https://bcdn.docswell.com/page/G75MP5DP74.jpg)

Flow Reversal Steering (FRS)
Key Idea: Use a coarse reference action to find a corresponding latent noise by reversing the flow, then denoise that noise to produce a refined, in-distribution action from the generalist policy.
Standard flow denoising samples from random noise.
FRS starts from a reference action, reverses it to noise, then denoises it to a nearby in-distribution action mode.
6


# Page. 7

![Page Image](https://bcdn.docswell.com/page/9J296N8ZER.jpg)

Flow Reversal Steering (FRS)
3 Stages of FRS
1. Coarse Guidance: A human or VLM provides a rough reference action for the task.
2. Flow Reversal: FRS maps the reference action to noise, then denoises it into a refined generalist action.
3. Policy Improvement: The resulting actions/noises can be used for zero-shot execution, behavioral cloning, or RL.
7


# Page. 8

![Page Image](https://bcdn.docswell.com/page/DEY49P6NJM.jpg)

FRS Stage 1: Coarse Guidance
On novel tasks the generalist often fails when commanded directly. FRS uses a reasoner to steer it toward the correct behaviors in its prior.
• Humans or VLMs can provide useful high-level guidance, but usually cannot produce precise low-level robot actions.
• In FRS, the reasoner outputs a simple directional action (e.g. a Cartesian direction for the end-effector), which is converted into a rough steering action chunk: ineffective to execute directly, but useful as a reference for steering.
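The conversion from a directional answer to a rough action chunk can be sketched as follows. This is only an illustration, not the authors' code: the direction names, axis conventions, 7-D action layout, `horizon`, and `step` are all assumptions made here.

```python
# Hypothetical mapping from a reasoner's answer to a unit Cartesian direction (x, y, z).
DIRECTIONS = {
    "forward": (1.0, 0.0, 0.0),
    "backward": (-1.0, 0.0, 0.0),
    "left": (0.0, 1.0, 0.0),
    "right": (0.0, -1.0, 0.0),
    "up": (0.0, 0.0, 1.0),
    "down": (0.0, 0.0, -1.0),
}


def coarse_action_chunk(direction, horizon=8, step=0.02, gripper=0.0):
    """Repeat one constant end-effector delta for every step of the chunk."""
    dx, dy, dz = DIRECTIONS[direction]
    # Assumed action layout: (dx, dy, dz, droll, dpitch, dyaw, gripper).
    action = [dx * step, dy * step, dz * step, 0.0, 0.0, 0.0, gripper]
    return [list(action) for _ in range(horizon)]


chunk = coarse_action_chunk("up", horizon=4)
```

A chunk like this carries no fine detail (no rotation, no speed profile), which is why it serves as a reference for steering rather than something to execute directly.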
8


# Page. 9

![Page Image](https://bcdn.docswell.com/page/VJNYL5ZV78.jpg)

FRS Stage 2: Flow Reversal + Denoising
Flow denoising is deterministic, so the flow can be reversed to recover the noise behind any reference action, then re-denoised to project it onto the generalist's prior.
• The coarse reference action is passed backward through the flow policy to recover its latent noise.
• That noise is denoised forward into a refined action.
• With finite steps the round trip is lossy, and the error pulls the action toward the generalist's high-density modes. So the output stays roughly consistent with the reference, while being more fine-grained and in-distribution.
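The round trip can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a 1-D action whose prior is a single Gaussian N(MU, S^2) with made-up values, for which the velocity of the straight-line flow from standard normal noise has a closed form. The names `velocity`, `denoise`, `reverse`, and `frs` are illustrative, and Euler integration is an assumed choice of solver.

```python
MU, S = 1.0, 0.5  # hypothetical prior over a 1-D action: N(MU, S^2)


def velocity(x, t):
    """Closed-form velocity carrying N(0, 1) noise at t=0 to N(MU, S^2) actions at t=1."""
    coeff = (t * S * S - (1.0 - t)) / ((1.0 - t) ** 2 + (t * S) ** 2)
    return MU + coeff * (x - t * MU)


def denoise(noise, steps):
    """Euler-integrate forward in time: noise (t=0) -> action (t=1)."""
    x, h = noise, 1.0 / steps
    for k in range(steps):
        x = x + h * velocity(x, k * h)
    return x


def reverse(action, steps):
    """Euler-integrate backward in time: action (t=1) -> noise (t=0)."""
    x, h = action, 1.0 / steps
    for k in range(steps, 0, -1):
        x = x - h * velocity(x, k * h)
    return x


def frs(reference, steps):
    """Flow reversal followed by denoising, using the same number of steps each way."""
    return denoise(reverse(reference, steps), steps)


reference = 3.0  # coarse reference action, far from the prior mean
for steps in (1, 2, 10, 2000):
    print(steps, round(frs(reference, steps), 3))
```

In this toy, one Euler step each way lands exactly on the prior mean (1.0), two steps return about 1.32, and many steps come back close to the reference 3.0. Fewer steps therefore mean a stronger pull toward the mode, which matches the lossy round trip described above, at least for this single-Gaussian case.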
9


# Page. 10

![Page Image](https://bcdn.docswell.com/page/YE9P4YZVJ3.jpg)

FRS Stage 3: Policy Improvement
Once FRS can turn coarse guidance into useful actions, how can we use it to improve the policy?
1. Zero-shot online steering: use FRS directly during deployment to guide the policy step by step.
2. Supervised learning on FRS noises (DSBC): treat the recovered noises as targets for training a small noise policy.
3. Bootstrapping RL with FRS (DSRL + FRS): use FRS trajectories as prior data to make RL exploration more efficient.
10


# Page. 11

![Page Image](https://bcdn.docswell.com/page/GE8DQ614ED.jpg)

#1: Zero-Shot Online Steering
Key Idea:
Use FRS directly during deployment, without any additional training.
How It Works (every step):
1. Query a reasoner (e.g. human or VLM) for a coarse reference action
2. Pass it through flow reversal to find the corresponding noise
3. Denoise the noise to get the final action
Limitation: Requires querying the reasoner every step, which can be expensive, especially for human guidance.
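The per-step loop above might look like the sketch below. Only the control flow follows the slide; the 1-D environment, the `reasoner`, and the `reverse` and `denoise` placeholders are hypothetical stand-ins that do not model a real flow policy.

```python
def zero_shot_frs_episode(obs, reasoner, reverse, denoise, env_step, max_steps=20):
    """Run one episode, steering every step with a fresh reference action."""
    trajectory = []
    for _ in range(max_steps):
        reference = reasoner(obs)        # 1. coarse reference from a human or VLM
        noise = reverse(obs, reference)  # 2. flow reversal: reference action -> noise
        action = denoise(obs, noise)     # 3. denoise the noise into the final action
        trajectory.append((obs, noise, action))
        obs, done = env_step(obs, action)
        if done:
            break
    return trajectory


# Hypothetical 1-D stand-ins so the loop can run.
GOAL = 1.0


def reasoner(obs):
    """Coarse guidance: a large move toward the goal."""
    return 0.5 if obs < GOAL else -0.5


def reverse(obs, action):
    """Placeholder for flow reversal (identity here)."""
    return action


def denoise(obs, noise):
    """Placeholder prior that only produces small moves."""
    return max(-0.2, min(0.2, noise))


def env_step(obs, action):
    """Move along the line; the episode ends near the goal."""
    nxt = obs + action
    return nxt, abs(nxt - GOAL) < 0.05


trajectory = zero_shot_frs_episode(0.0, reasoner, reverse, denoise, env_step)
```

The reasoner is called once per loop iteration, which is where the cost noted in the limitation comes from. Logging the recovered noise alongside each action also yields the state-noise pairs that later stages can train on.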
11


# Page. 12

![Page Image](https://bcdn.docswell.com/page/LELMX9QV7R.jpg)

#1: Zero-Shot Online Steering (Results)
Takeaway #1: FRS refines coarse VLM guidance into better zero-shot actions, especially on hard tasks
Baselines: base policy (no steering), direct VLM execution, partial noising, sample-and-rank. All share the same VLA, VLM, and prompt.
Results:
• Zero-shot FRS beats the base VLA across LIBERO. On hard tasks (base ≤2%), 11 of 42 gain ≥10% absolute, vs only 3-4 for other steering
baselines, showing FRS wins in the low-success regimes hardest for generalist improvement.
• FRS beats direct VLM execution, showing flow reversal refines the coarse references rather than reconstructing them.
12


# Page. 13

![Page Image](https://bcdn.docswell.com/page/4JMYLZ1NJW.jpg)

#2: Supervised Learning on FRS
Key Idea:
Diffusion Steering via Behavioral Cloning (DSBC) distills FRS into a cheap fixed policy.
How It Works:
1. Train a small noise policy to imitate FRS-recovered noises (from successful rollouts or offline demos).
2. At test time it outputs noise that the frozen generalist denoises into an action.
• Why It Works: In OOD states, the generalist still maps the noise to a reasonable action, so DSBC resists compounding error where standard BC fails.
• Limitation: DSBC distills FRS but cannot surpass it. Offline reconstructions are also approximate.
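A minimal sketch of the DSBC recipe, under strong simplifying assumptions: states and noises are scalars, the "small noise policy" is a least-squares line rather than a network, and `frozen_generalist` is a made-up denoiser. The data values are invented for illustration.

```python
def fit_noise_policy(states, noises):
    """Least-squares fit of noise = w * state + b (a stand-in for a small network)."""
    n = len(states)
    mean_s = sum(states) / n
    mean_z = sum(noises) / n
    cov = sum((s - mean_s) * (z - mean_z) for s, z in zip(states, noises))
    var = sum((s - mean_s) ** 2 for s in states)
    w = cov / var
    b = mean_z - w * mean_s
    return lambda state: w * state + b


def frozen_generalist(state, noise):
    """Hypothetical frozen denoiser: maps noise to an action and is never updated."""
    return 1.0 + 0.5 * noise


# Hypothetical (state, recovered noise) pairs from successful FRS rollouts.
states = [0.0, 1.0, 2.0, 3.0]
noises = [0.5, 0.0, -0.5, -1.0]

noise_policy = fit_noise_policy(states, noises)

# Test time: the noise policy proposes a noise, the frozen generalist turns it into an action.
action = frozen_generalist(1.5, noise_policy(1.5))
```

Only the noise policy is trained. Because every output still passes through the frozen generalist, a poor noise prediction is mapped to some action the generalist considers plausible, which is the robustness argument made above.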
13


# Page. 14

![Page Image](https://bcdn.docswell.com/page/PJR9K6L479.jpg)

#2: Supervised Learning on FRS (Results)
Takeaway #2: FRS trajectories distill into a cheap fixed noise policy that matches zero-shot FRS
Baselines: base policy, zero-shot VLM FRS, and standard BC on FRS successes (same small architecture as DSBC).
Results:
• DSBC matches zero-shot FRS and beats the base VLA, distilling the gains into a fixed policy.
• DSBC beats standard BC, since the VLA maps bad noises back to reasonable actions in OOD states, giving built-in robustness to
compounding error.
• Highly efficient: ~18 rollouts/task in sim, 10 on real robots, ~1 GB GPU, under a minute to train (vs hundreds of GBs to fine-tune a full
VLA).
14


# Page. 15

![Page Image](https://bcdn.docswell.com/page/PEXQLMWZJX.jpg)

#3 Bootstrapping RL with FRS
Key Idea:
FRS trajectories bootstrap DSRL (Diffusion Steering via Reinforcement Learning), skipping the trial-and-error search in RL.
How It Works:
1. Prefill DSRL's replay buffer with FRS trajectories for meaningful prior experience.
2. Add a DSBC auxiliary loss on successful rollouts, keeping the policy near useful FRS noises.
Why It Helps: DSRL + FRS biases exploration toward useful behaviors, improving even when the base VLA nearly always fails.
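The two ingredients can be sketched as below. This is an assumption-laden illustration rather than the DSRL codebase: the transition layout, the mean-squared-error form of the auxiliary loss, and the `bc_weight` value are all choices made here, and the RL loss is passed in as a plain number.

```python
from collections import deque


def prefill_buffer(buffer, frs_trajectories):
    """Seed the replay buffer with FRS transitions before any RL exploration."""
    for trajectory in frs_trajectories:
        buffer.extend(trajectory)
    return buffer


def dsbc_aux_loss(predicted_noises, frs_noises):
    """Mean squared error between policy noises and FRS noises from successful rollouts."""
    pairs = list(zip(predicted_noises, frs_noises))
    return sum((p - z) ** 2 for p, z in pairs) / len(pairs)


def total_loss(rl_loss, predicted_noises, frs_noises, bc_weight=0.5):
    """RL objective plus a weighted pull toward useful FRS noises."""
    return rl_loss + bc_weight * dsbc_aux_loss(predicted_noises, frs_noises)


# Hypothetical transitions: (state, noise, reward, next_state, done).
frs_trajectories = [
    [(0.0, 0.5, 0.0, 0.2, False), (0.2, 0.4, 1.0, 0.4, True)],
    [(0.1, 0.3, 1.0, 0.3, True)],
]
buffer = prefill_buffer(deque(maxlen=10_000), frs_trajectories)
loss = total_loss(rl_loss=1.0, predicted_noises=[0.0, 1.0], frs_noises=[0.0, 0.0])
```

Here the stored "action" of each transition is the noise, since the RL policy acts in noise space. The weight on the auxiliary term controls how closely the policy stays to the FRS noises while RL keeps improving on them.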
15


# Page. 16

![Page Image](https://bcdn.docswell.com/page/3EK9LGVVED.jpg)

#3 Bootstrapping RL with FRS (Results)
Takeaway #3: FRS bootstraps RL to surpass fixed zero-shot FRS and DSBC, even from a single success.
• Settings: Two LIBERO-90 splits, prefilling DSRL with FRS rollouts. (1) 15 tasks where FRS works well (20 rollouts prefilled). (2) 10 hard tasks where the base VLA nearly always fails (just 1 FRS success, which can take 50+ trials).
• Baselines: standard DSRL, residual RL (PLD-like), and RoboMeter as a reward model. All control for VLA, policy architecture, and RL algorithm (SAC).
Results:
• DSRL + FRS is the most sample-efficient in both settings, learning faster and reaching higher final success than standard RL.
• On the hard split, naive DSRL plateaus around 30% due to the weak base policy, while DSRL + FRS uses FRS to find early successes and
converges much higher.
16


# Page. 17

![Page Image](https://bcdn.docswell.com/page/L73W35QQ75.jpg)

Real-World Experiments
Takeaway #4: FRS transfers to real robots, with DSBC boosting performance from just 10 human rollouts.
Setup: A human steers via feedback (one Cartesian direction per chunk) instead of dense teleoperation. Successful rollouts then train DSBC.
Results:
• Across 6 tasks the base VLA struggles on, DSBC boosts average absolute performance by 60% from just 10 human rollouts per task.
• Standard BC completely fails in this regime, as it cannot fall back on the VLA prior.
• Same efficiency as sim: under a minute, ~1 GB GPU.
• DSBC can also bootstrap RL: on towel-hanging, 5% base → 50% DSBC → 80% post-RL.
17


# Page. 18

![Page Image](https://bcdn.docswell.com/page/87DK4D9WJG.jpg)

Real-World Experiments
Takeaway #5: DSBC also learns from standard demos, no online steering needed.
Setup: 20 teleoperated episodes of "hang tape on stand," a task too imprecise for coarse steering. Flow reversal augments each action with its noise for DSBC to learn from.
Results:
• Offline DSBC beats the base VLA and standard BC, which can't learn precise, temporally-coherent behaviors in this low-data regime.
• Validates flow reversal as a simple recipe for turning ordinary demos into noise-action data, with no online execution needed.
18


# Page. 19

![Page Image](https://bcdn.docswell.com/page/VJPKMXZXE8.jpg)

Summary & Takeaways
Key Takeaways
• FRS maps coarse reference actions into the flow policy's noise space via reversal, bridging semantic guidance and low-level control.
• Flow reversal yields fine-grained, "in-distribution" actions that stay roughly consistent with the reference, letting the policy refine what humans or VLMs suggest.
• Beyond zero-shot steering, the recovered noises enable fast noise-space BC (DSBC) and bootstrap RL (DSRL + FRS).
Limitations / Future Work
• Depends on the prior: FRS elicits "reasonable" behaviors the generalist already roughly knows, rather than producing new ones.
• Guidance quality matters: coarse cardinal actions leave much of the prior untapped (VLM 11% vs oracle 93% on the hard split).
• Local steering, not full planning: long-horizon reasoning may still need higher-level planning.
• Future Work: combine FRS with stronger planners, value functions, or reward models for long-horizon adaptation.
19


