【DL輪読会】"RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models"


November 13, 2025

Text of each slide
1.

DEEP LEARNING JP [DL Papers] RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models Makoto Sato, Matsuo-Iwasawa Lab http://deeplearning.jp/ 1

2.

Bibliographic Information • Title: RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models • Authors: Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, Marco Pavone • Affiliations: Stanford University, UC Berkeley, NVIDIA Research • Conference: CoRL 2025 • Link: https://robomonkey-vla.github.io/ 2

3.

Overview • Test-Time Scaling Method for VLAs – They validate that test-time scaling is achievable by combining repeated sampling [1] (from LLMs) with a VLM-based action verifier. – *RoboMonkey significantly enhances VLA performance, achieving a 25% absolute improvement on real-world out-of-distribution tasks and 9% on in-distribution SIMPLER environments. – Power-law fit: assume e ≈ a·k^b (i.e., log e ≈ log a + b·log k) and estimate a, b from error vs. number of samples k. *Title inspired by https://en.m.wikipedia.org/wiki/Infinite_monkey_theorem. 3
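The power-law claim can be sanity-checked with a quick fit in log-log space. Below is a minimal sketch (not from the paper) that assumes we have already measured the action error e at several sample counts k; the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical measurements: number of sampled actions k and the observed
# action error e at each k (illustrative values, not results from the paper).
k = np.array([1, 2, 4, 8, 16, 32, 64])
e = np.array([0.30, 0.22, 0.17, 0.13, 0.10, 0.08, 0.065])

# Fit the power law e ≈ a * k**b by linear regression in log-log space:
# log e ≈ log a + b * log k.
b, log_a = np.polyfit(np.log(k), np.log(e), deg=1)
a = np.exp(log_a)

print(f"fitted power law: e ≈ {a:.3f} * k^{b:.3f}")
print(f"predicted error at k=128: {a * 128**b:.4f}")
```

A nearly straight line in the log-log plot (i.e., a good linear fit) is what the authors refer to as the test-time scaling law.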

4.

Background • Previous Works & Motivation – Vision-Language-Action (VLA) models show strong visuomotor skills, but their robustness and safety in unstructured real-world settings remain limited. – Prior work focuses on post-training, such as fine-tuning VLAs for multi-step reasoning with CoT and aligning VLAs with preferences. – In LLMs, spending more test-time compute consistently improves performance without parameter updates, but test-time compute for VLAs remains under-explored in robotics. “Can test-time scaling be applied to VLAs in robotics as well?” 4

5.

Contributions • Verify test-time scaling law: Efficient action sampling (incl. Gaussian perturbation); error vs. sample count follows an approximate power law across VLAs. • Propose verifier pipeline: Scalable synthetic preference generation and training of a VLM-based action verifier. • Demonstrate performance gains: RoboMonkey boosts VLAs by +25% OOD and +9% ID; joint fine-tuning VLA+VLM-verifier adds +7% on LIBERO-Long. 5

6.

Preliminaries • Problem setup: MDP M = (S, A, P, R) – State: S = 7-dimensional robot end-effector pose – Action: A = 7-dimensional end-effector command (translation, rotation, gripper) – Dynamics: P(s' | s, a) = non-deterministic transition dynamics – Reward: R(s, a) = reward for a state-action pair – Language-conditioned policy: π_θ(a | s, I) = samples actions given the state and instruction – Main (imitation-learning) objective: L(θ; D) = −E_{(s_t^j, a_t^j, I^j) ∼ D}[ log π_θ(a_t^j | s_t^j, I^j) ] 6
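As a concrete illustration of the objective above, here is a minimal PyTorch-style sketch of the behavior-cloning loss, assuming the VLA discretizes each of the 7 action dimensions into token bins (as OpenVLA-style models do); the `policy` interface and batch keys are hypothetical placeholders.

```python
import torch.nn.functional as F

def bc_loss(policy, batch):
    """Behavior-cloning NLL over a batch of (state, instruction, action) triples.

    Assumes policy(images, instructions) returns per-dimension logits of shape
    (B, action_dim, n_bins) and that ground-truth actions are discretized into
    bin indices of shape (B, action_dim); both are assumptions of this sketch.
    """
    logits = policy(batch["images"], batch["instructions"])   # (B, D, n_bins)
    targets = batch["action_bins"]                             # (B, D), long
    # -log pi_theta(a_t | s_t, I): cross-entropy summed over action dimensions.
    nll = F.cross_entropy(
        logits.flatten(0, 1),     # (B*D, n_bins)
        targets.flatten(),        # (B*D,)
        reduction="none",
    ).view(targets.shape).sum(dim=-1)                          # (B,)
    return nll.mean()
```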

7.

Proposed Approach: RoboMonkey (1 / 3) • RoboMonkey pipeline 7

8.

Proposed Approach: RoboMonkey (2 / 3) • Stage 1: Training the Action Verifier – Synthesize a preference dataset that ranks the quality of candidate actions – Train the action verifier R_φ with a Bradley-Terry loss [2] 8
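For concreteness, a minimal PyTorch sketch of a Bradley-Terry preference loss is given below; the `verifier` interface and batch keys are placeholders for illustration, not the paper's actual API.

```python
import torch.nn.functional as F

def bradley_terry_loss(verifier, batch):
    """Pairwise preference loss: the verifier R_phi should score the preferred
    action above the rejected one for the same state and instruction.

    Assumes verifier(images, instructions, actions) returns one scalar score
    per example; this interface is an assumption of the sketch.
    """
    r_chosen = verifier(batch["images"], batch["instructions"], batch["chosen_actions"])
    r_rejected = verifier(batch["images"], batch["instructions"], batch["rejected_actions"])
    # -log sigmoid(R(chosen) - R(rejected)): the Bradley-Terry negative log-likelihood.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```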

9.

Proposed Approach: RoboMonkey (3 / 3) • Stage 2: Scaling Test-Time Compute – Sample N̂ candidate actions from the VLA model π_θ(a | s_t, I; T) – Compute the gripper state g_t via majority voting – Fit a Gaussian distribution to the translation and rotation components (and draw additional candidates from it) – Select the best action with the trained action verifier 9
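A rough sketch of this test-time loop is shown below, under the assumption that the fitted Gaussian is used to draw the extra candidate actions; `vla_sample`, `verifier_score`, and the 7-DoF action layout (3 translation, 3 rotation, 1 gripper) are assumptions for illustration.

```python
import numpy as np

def robomonkey_action(vla_sample, verifier_score, image, instruction,
                      n_init=4, n_candidates=16, rng=np.random.default_rng()):
    """One control step of sampling + verification (sketch, not the paper's code)."""
    # 1) Draw an initial batch of candidate actions from the VLA policy.
    init = np.stack([vla_sample(image, instruction) for _ in range(n_init)])  # (n_init, 7)

    # 2) Decide the gripper state by majority vote over the binarized gripper dim.
    gripper = 1.0 if (init[:, 6] > 0.5).mean() >= 0.5 else 0.0

    # 3) Fit a Gaussian to translation + rotation and draw cheap extra candidates.
    mu = init[:, :6].mean(axis=0)
    sigma = init[:, :6].std(axis=0) + 1e-6
    cont = rng.normal(mu, sigma, size=(n_candidates, 6))
    candidates = np.concatenate([cont, np.full((n_candidates, 1), gripper)], axis=1)

    # 4) Let the trained verifier rank the candidates and return the best one.
    scores = np.array([verifier_score(image, instruction, a) for a in candidates])
    return candidates[int(np.argmax(scores))]
```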

10.

Experimental Setup • WidowX Robot & Franka Robot Environments – 10 SIMPLER tasks & 4 LIBERO-Long tasks – Use the Bridge V2 dataset (over 40,000 real-world trajectories) – OpenVLA as the base policy; LLaVA-7B fine-tuned as the action verifier – Compare with OpenVLA [3] and V-GPS [4] 10

11.

Result①: ID / OOD Performance Evaluation • Can RoboMonkey improve the precision of VLAs on in-distribution tasks? – RoboMonkey averages 47.5%, +9% vs OpenVLA; larger gains on Eggplant in Basket (+19%) and Stack Cube (+10%). V-GPS underperforms OpenVLA and RoboMonkey. • Can RoboMonkey improve the robustness of VLAs on out-of-distribution tasks? – Across four task suites (120 rollouts), RoboMonkey averages 60% vs 35% (OpenVLA) and 30% (V-GPS); strong on Stacking Cups, Hammer in Basket, and Banana in Basket. 11

12.

Result②: Latency Evaluation • How does RoboMonkey enable practical deployment for test-time scaling? – SGLang-based VLA serving + Gaussian perturbation; 16 candidates in ~650 ms (≈1.5 Hz), 41.3% lower latency than naive policy sampling. – Gaussian perturbation's latency scales mainly with verification, while naive policy sampling scales with sampling + verification as k grows. 12

13.

Result③: Synthetic Dataset Size Effects • How does scaling the synthetic training dataset impact downstream success rate? – Bigger synthetic preference sets lead to higher success. On SIMPLER, the average rises from 37.5% to 46.3%; with >10⁶ comparisons, RoboMonkey surpasses OpenVLA and V-GPS. 13

14.

Result④: Task Performance • Can the action verifier be fine-tuned effectively on a new robot setup and tasks? – Average success: OpenVLA 49.8% → RoboMonkey 56.5% across 10 tasks (the table shows per-task gains). – Joint fine-tuning helps: fine-tuning OpenVLA + verifier yields +6.7% over fine-tuning OpenVLA alone. 14

15.

Discussion and Limitations 1. Computational Overhead: multiple samples + a VLM verifier increase latency, making the method unsuitable for high-frequency control. 2. Scaling Synthetic Datasets: bigger synthetic datasets help; the current work is capped at 20M comparisons. 3. Evaluation Scope: tested only on WidowX-250S and Franka; broader embodiments and tasks are left for future work. 15

16.

Summary • They apply test-time scaling (well established in LLMs) to VLAs and confirm similar, consistent performance gains. • For compute-sensitive VLA deployment, Gaussian perturbation works well and is a practical, low-cost alternative. • Personal take: unlike LLMs, robotics needs real-time test-time scaling, so methods should prioritize low latency, in the spirit of AlphaGo's lightweight rollout policy. 16

17.

References
1. B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024.
2. H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
3. M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
4. M. Nakamoto, O. Mees, A. Kumar, and S. Levine. Steering your generalists: Improving robotic foundation models via value guidance, 2025. URL https://arxiv.org/abs/2410.13816. 17