---
title: 【DL Reading Group】SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation
tags: 
author: [Deep Learning JP](https://www.docswell.com/user/DeepLearning2023)
site: [Docswell](https://www.docswell.com/)
thumbnail: https://bcdn.docswell.com/page/KJ4WD6NM71.jpg?width=480
description: 【DL Reading Group】SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation by Deep Learning JP
published: June 18, 26
canonical: https://www.docswell.com/s/DeepLearning2023/KY8VY9-2026-06-22-102521
---
# Page. 1

![Page Image](https://bcdn.docswell.com/page/KJ4WD6NM71.jpg)

DEEP LEARNING JP
[DL Papers]
SARM2: Multi-Task Stage-Aware Reward Modeling
for Self-Improving Robotic Manipulation
Kohei Sendai, Matsuo Lab
http://deeplearning.jp/


# Page. 2

![Page Image](https://bcdn.docswell.com/page/LE1YZV5Y7G.jpg)

Bibliographic information
Title:
SARM2: Multi-Task Stage-Aware Reward Modeling for Self-Improving Robotic Manipulation
arXiv preprint, 2026
Authors: Qianzhong Chen, Hau Zheng, Justin Yu, Suning Huang, Jiankai Sun,
Ken Goldberg, Chuan Wen, Pieter Abbeel, Yide Shentu, Philipp Wu, M
Links:
arXiv: https://arxiv.org/abs/2606.10305
Project: https://qianzhong-chen.github.io/sarm2.github.io/
A follow-up to Stage-Aware Reward Modeling (SARM), which was released in September of last year


# Page. 3

![Page Image](https://bcdn.docswell.com/page/GEWG9L51J2.jpg)

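Overview
• Proposes SARM2, a multi-task stage-aware reward model
• A Stage Estimator that classifies action primitives (22 classes) and a multi-gate MoE value head together estimate dense rewards across tasks
• Extends SARM (single-task, requiring task-specific annotation) to a multi-task setting that needs no task-specific annotation
• Proposes SPIRAL, a self-improvement framework built on the reward model
• Dense rewards drive residual RL (DICE-RL / TD3), building a data flywheel that keeps improving the policy from autonomous rollouts
• Experimental results
• On a 10-task benchmark, value-estimation MSE is reduced by about 80% relative to the strongest baseline
• Success rates improve substantially: Folding Shorts 58%→100%, Cleaning Whiteboard 50%→90%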
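
The residual RL idea in the overview can be pictured with a minimal sketch. It shows only the generic composition of a frozen base action with a small learned correction; it is not DICE-RL or TD3, and the names `residual_action`, `scale`, and the clipping range are assumptions made for illustration, not details from the paper.

```python
def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def residual_action(base_action, residual, scale=0.1):
    """Add a bounded correction to the action of a frozen base policy.

    base_action: action proposed by the base (e.g. VLA) policy.
    residual:    raw output of the learned residual policy.
    scale:       hypothetical bound on how far the residual may move the action.
    """
    return [a + scale * clip(r, -1.0, 1.0) for a, r in zip(base_action, residual)]


# Hypothetical 2-D action: the first residual component is clipped to 1.0.
action = residual_action([0.5, -0.2], [10.0, -0.5])
```

Because the correction is bounded by `scale`, the combined policy stays close to the base policy, which is one way to read the "ceiling" noted later for residual RL.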


# Page. 4

![Page Image](https://bcdn.docswell.com/page/47ZL9MNXJ3.jpg)

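Background and challenges
• Long-horizon manipulation with VLA policies still depends heavily on imitation learning (BC)
• High-quality demonstrations are expensive, and the policy tends to stay close to the demonstration distribution
• A reward model can reduce this dependence by reweighting demonstrations and by providing a dense supervision signal for on-robot RL
• However, a reward model must be dense, accurate, and general at the same time
• Challenge: existing reward models cannot satisfy all three conditions at once
• Task-specific stage-aware models: accurate, but require annotation for every task
• VLM-based general models: widely applicable, but coarse-grained and insufficient for fine-grained progress estimation
• RL for VLAs requires DAgger-style human-in-the-loop supervision
• Improving from self-rollouts is difficult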


# Page. 5

![Page Image](https://bcdn.docswell.com/page/YJ6WKN8PJV.jpg)

Related work: task-specific models (SARM, etc.): dense, but single-task and annotation-dependent


# Page. 6

![Page Image](https://bcdn.docswell.com/page/GJ5MP59PJ4.jpg)

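Related work: VLM-based general models: generalize across tasks, but coarse-grained and noisy
• TopReward: poses a True/False question to the VLM and measures the logprob of "True"
• Robometer: a VLM-based reward model trained on the large-scale RBM-1M dataset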
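
The "logprob of True" score can be illustrated with a minimal sketch. The slide states only that a True/False question is asked and the logprob of "True" is measured; restricting the softmax to two answer tokens, the function name `true_logprob`, and the logit values are all assumptions for illustration.

```python
import math


def true_logprob(logit_true: float, logit_false: float) -> float:
    """Log-probability of answering "True" under a two-way softmax.

    Assumption: only the "True" and "False" token logits are compared;
    a real VLM would normalize over its full vocabulary.
    """
    m = max(logit_true, logit_false)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(logit_true - m) + math.exp(logit_false - m))
    return logit_true - log_z


# Hypothetical (True, False) logits for three successive frames of a rollout.
logits_per_frame = [(-1.0, 1.0), (0.0, 0.0), (2.0, -1.0)]
scores = [true_logprob(t, f) for t, f in logits_per_frame]
```

A score like this rises as the model becomes more confident that the task is complete, but it carries no explicit notion of stage, which is consistent with the slide's remark that such rewards are coarse.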


# Page. 7

![Page Image](https://bcdn.docswell.com/page/9E296N2Z7R.jpg)

Proposed method: SARM2 overview
1. Pass the Visual / Text / State inputs through encoders to obtain embeddings
2.1 Stage Model: 22 action primitives
2.2 Value Model: feeds the downstream MMoE
3. MMoE Regression Head:
takes the outputs of the Stage and Value Models as input
and produces the final output.
Uses an MoE.
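
The multi-gate MoE regression head can be sketched roughly as follows. This is a toy illustration, not the paper's architecture: linear experts, one gate per group, selecting the gate by a group index derived from the predicted stage, and the sigmoid squashing are all assumptions, as are the names `MMoEHead` and `forward`.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


class MMoEHead:
    """Shared experts mixed by a separate gate for each group (assumed design)."""

    def __init__(self, dim, n_experts, n_groups, seed=0):
        rng = random.Random(seed)
        init = lambda: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        self.experts = [init() for _ in range(n_experts)]  # linear experts
        # One gate per group; each gate holds one weight vector per expert.
        self.gates = [[init() for _ in range(n_experts)] for _ in range(n_groups)]

    def forward(self, feature, group):
        # The gate belonging to the given group decides how to mix the experts.
        weights = softmax([dot(g, feature) for g in self.gates[group]])
        outputs = [dot(e, feature) for e in self.experts]
        mixed = sum(w * o for w, o in zip(weights, outputs))
        # Squash to (0, 1) so the output reads as a progress-like value.
        return 1.0 / (1.0 + math.exp(-mixed)), weights


# n_groups=8 mirrors the M = 7 + 1 groups mentioned in the slides.
head = MMoEHead(dim=4, n_experts=3, n_groups=8)
value, gate_weights = head.forward([0.2, -0.1, 0.5, 1.0], group=2)
```

Here `feature` stands in for the Value Model embedding and `group` for the group of the primitive predicted by the Stage Model.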


# Page. 8

![Page Image](https://bcdn.docswell.com/page/D7Y49PYNEM.jpg)

Proposed method: SARM2
Action-Primitive Stage Estimator
• Uses action primitives, rather than subtasks, as a task-independent intermediate representation
e.g. pick / place / pull / push …
• 4-layer Causal Transformer + linear head
• Trained with a cross-entropy loss
Dataset
• 200 h of manipulation data
• 100 tasks
• Split into K = 21 action primitives.
These 21 primitives cover more than 90% of the data
• Balanced to 3 h per primitive: 66 h
• Each action primitive is further assigned to one of M = 7 + 1
groups (used by the MMoE)
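
The training signal of the Stage Estimator can be sketched as per-frame classification with a cross-entropy loss, plus the causal attention mask that stops a frame from attending to later frames. This is a minimal sketch: the four-primitive list uses only the examples named on the slide, and the logits and labels are made up for illustration.

```python
import math

# Only the primitives named on the slide; the full set has more classes.
PRIMITIVES = ["pick", "place", "pull", "push"]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under a softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def causal_mask(n):
    """mask[i][j] is True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Hypothetical per-frame logits and ground-truth primitive labels.
frame_logits = [[2.0, 0.1, -1.0, 0.0], [0.0, 3.0, 0.0, -2.0]]
labels = [PRIMITIVES.index("pick"), PRIMITIVES.index("place")]
loss = sum(cross_entropy(l, y) for l, y in zip(frame_logits, labels)) / len(labels)
mask = causal_mask(len(frame_logits))
```

The causal mask matters for online use: at rollout time the estimator sees only past and current frames, so training it the same way avoids a train/test mismatch.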


# Page. 9

![Page Image](https://bcdn.docswell.com/page/VENYL5PVJ8.jpg)

Target:


# Page. 10

![Page Image](https://bcdn.docswell.com/page/Y79P4Y3VE3.jpg)

Proposed method: SPIRAL


# Page. 11

![Page Image](https://bcdn.docswell.com/page/G78DQ6M47D.jpg)

Proposed method: SPIRAL
1. One-time adaptation by human annotation.
   1. About 100 episodes; 2-3 h of annotation cost.
   2. Done only once.
2. Mixed objective
   1. TD target from the reward model
      1. Good for long horizons.
   2. MC objective
      1. Prefers fast, successful episodes.
3. Iterate several times.
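
The mixed objective in step 2 can be sketched as a weighted sum of a TD error and a Monte Carlo error. The slide lists only the two ingredients; the squared-error form, the mixing weight `mc_weight`, the discount `gamma`, and all function names below are assumptions for illustration.

```python
def td_target(reward, next_value, gamma=0.99, done=False):
    """One-step bootstrapped target built from a reward-model reward."""
    return reward + (0.0 if done else gamma * next_value)


def mc_returns(rewards, gamma=0.99):
    """Discounted return from every step to the end of the episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def mixed_loss(value, td, mc, mc_weight=0.5):
    """Blend of TD and MC squared errors (mc_weight is hypothetical)."""
    return (1 - mc_weight) * (value - td) ** 2 + mc_weight * (value - mc) ** 2


# With a terminal success reward, a shorter episode has a larger initial return,
# which is why the MC term favours fast, successful episodes.
fast = mc_returns([0.0, 0.0, 1.0])[0]
slow = mc_returns([0.0] * 9 + [1.0])[0]
```

The TD term propagates dense reward-model signal step by step, which helps on long horizons, while the MC term anchors the value to what actually happened in the episode.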


# Page. 12

![Page Image](https://bcdn.docswell.com/page/L7LMX93VJR.jpg)

Experimental setup
Hardware
• Two 6-DOF YAM robot arms
manufactured by I2RT
• Three RealSense D405 cameras
• right_wrist, left_wrist, top
Data Collection
• GELLO
• 30 fps


# Page. 13

![Page Image](https://bcdn.docswell.com/page/4EMYLZDNEW.jpg)

Results 1: reward model
RW (ReWiND)
TR (TopReward)
RM (Robometer)
RM(FT) (Robometer with LoRA fine-tuning)
w/o SE (without Stage Estimator)
w/o MG (without multi-gate)
S1 var (trained on S1 tasks)
SARM2 (proposed)
S1: classic tasks
S2: unconventional tasks
Rollout classification
Success
Partial success
Fail
12 rollouts of each.
MoE density:
0: balanced
1: collapsed


# Page. 14

![Page Image](https://bcdn.docswell.com/page/PER9K644J9.jpg)

Results 1: reward model


# Page. 15

![Page Image](https://bcdn.docswell.com/page/P7XQLM2ZEX.jpg)

Results 2: policy evaluation
Rollout Sparse (human-recorded terminal success)
BC (behavior cloning)
Rollout RM(FT) (Robometer with LoRA fine-tuning)
RABC (Reward-Aligned BC)
RL-Sparse (RL with terminal reward)
SARM (proposed)
RL-Dense (RL with SPIRAL)


# Page. 16

![Page Image](https://bcdn.docswell.com/page/37K9LGMV7D.jpg)

Results 2: policy evaluation


# Page. 17

![Page Image](https://bcdn.docswell.com/page/LJ3W3N2QJ5.jpg)

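Limitations
• Embodiment scope
• Validated on a single robot only.
• Usable for multiple tasks, but not out of the box.
• Does not directly replace general reward models such as TopReward / Robometer.
• Not yet applied to mobile manipulation.
• Ceiling of residual RL
• The VLA itself is not fine-tuned directly, so the achievable performance gain is limited.
• Sample inefficiency
• More efficient than having humans perform the rollouts, but robot rollouts are still costly.
• Some human intervention is still required for environment resets and for safety.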


# Page. 18

![Page Image](https://bcdn.docswell.com/page/8JDK4WZWEG.jpg)

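Summary / impressions
• A natural, solid improvement over SARM
• When a large amount of data is available for a specific robot, the model can be used across multiple tasks, so it is applicable in many situations
• The road to self-improvement
• SPIRAL demonstrates policy self-improvement driven by a reward model.
• Once a strong general reward model, self-rollouts, and environment resets are all in place,
automatic policy self-improvement looks achievable.
• Because environment resets still require human intervention, how much intervention is needed compared with DAgger-style improvement
seems open to debate.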


# Page. 19

![Page Image](https://bcdn.docswell.com/page/VEPKM9DX78.jpg)

References
[1] SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
https://arxiv.org/abs/2509.25358
[2] Vision Language Models are In-Context Value Learners
https://arxiv.org/abs/2411.04549
[3] ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations
https://arxiv.org/abs/2505.10911
[4] https://huggingface.co/spaces/lerobot/robot-folding
[5] Robometer: https://arxiv.org/abs/2603.02115
[6] TopReward: https://arxiv.org/abs/2602.19313
[7] SARM2: https://arxiv.org/pdf/2606.10305v1
[8] From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning
https://arxiv.org/abs/2603.10263