---
title: 【DL Reading Group】Latent Visual Reasoning
tags: 
author: [Deep Learning JP](https://www.docswell.com/user/DeepLearning2023)
site: [Docswell](https://www.docswell.com/)
thumbnail: https://bcdn.docswell.com/page/4EZLLXD373.jpg?width=480
description: 【DL Reading Group】Latent Visual Reasoning by Deep Learning JP
published: April 30, 26
canonical: https://www.docswell.com/s/DeepLearning2023/5Q2M8N-2026-05-01-110027
---
# Page. 1

![Page Image](https://bcdn.docswell.com/page/4EZLLXD373.jpg)

DEEP LEARNING JP Latent Visual Reasoning [DL Papers] Hiroto Osaka, Matsuo Iwasawa Lab, M2 http://deeplearning.jp/

# Page. 2

![Page Image](https://bcdn.docswell.com/page/Y76WW41Z7V.jpg)

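Bibliographic information
■ Paper title
Latent Visual Reasoning
■ Venue
ICLR 2026 poster
■ Authors
Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu,
Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, Zicheng Liu
■ Summary
Proposes a framework that reasons autoregressively in the visual embedding space. By directly supervising the LLM's final-layer hidden states in the visual embedding space, the model learns visual intermediate thoughts without needing auxiliary images. It consistently outperforms both text-CoT and tool-use approaches.
MMVP +5.0 pp / V* D.A. +2.7 pp (vs. the Qwen2.5-VL 7B base)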
Published as a conference paper at ICLR 2026
LATENT VISUAL REASONING
Bangzheng Li¹*, Ximeng Sun², Jiang Liu², Ze Wang², Jialian Wu²,
Xiaodong Yu², Hao Chen², Emad Barsoum², Muhao Chen¹, Zicheng Liu²
¹ University of California, Davis; ² Advanced Micro Devices, Inc.
✉bzhli@ucdavis.edu
ABSTRACT
Multimodal large language models (MLLMs) have achieved notable gains in various tasks by incorporating chain-of-thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. The Code base and model weights will be released later.

# Page. 3

![Page Image](https://bcdn.docswell.com/page/G75MMQ4974.jpg)

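■ Background
What is latent reasoning?
■ Limits of discrete CoT
Standard CoT emits its intermediate thoughts as a sequence of discrete tokens
■ Expressiveness: every step is decoded into a finite vocabulary, so fine-grained intermediate values cannot be carried forward
■ Length cost: compute and latency are paid per token
■ Language bias: filler that is not essential to the answer gets mixed in
■ Latent Reasoning
Intermediate thoughts are not discretized; they propagate as continuous hidden states
■ A single hidden state carries several orders of magnitude more information than a discrete token
■ Research accelerated after Coconut [3] appeared in late 2024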
Figure of [2]: Latent reasoning versus explicit CoT. Explicit reasoning passes discrete tokens between steps (Step 1, Step 2, ..., Step n; bandwidth ≈ 15 bits), while latent reasoning passes hidden states between layers over time steps T=0, T=1, ..., T=n in a "horizontal" or "vertical" arrangement (bandwidth ≈ 40960 bits).
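Extensions to vision, audio, and other modalities have been advancing rapidly in recent years. Among them, LVR [1] is one of the most concise realizations of the extension to vision.
It sits in a direct line from Coconut [3] (2024-12) to LVR [1] (2025-09).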
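The two bandwidth figures in the diagram can be reproduced with a short calculation. This is only a sketch under assumptions that the slide does not state: a vocabulary of 32,768 tokens for the explicit case, and a 2560-dimensional hidden state stored as 16-bit floats for the latent case. Both values are chosen here because they reproduce the numbers shown.

```python
import math

# Assumed values (not given on the slide); picked because they reproduce
# the ~15-bit and ~40960-bit figures in the diagram.
VOCAB_SIZE = 32_768      # hypothetical vocabulary size
HIDDEN_DIM = 2_560       # hypothetical hidden-state width
BITS_PER_VALUE = 16      # FP16 storage


def token_bandwidth_bits(vocab_size: int) -> float:
    """Upper bound on the information carried by one discrete token."""
    return math.log2(vocab_size)


def hidden_state_bandwidth_bits(dim: int, bits_per_value: int) -> int:
    """Raw storage capacity of one continuous hidden state."""
    return dim * bits_per_value


explicit_bits = token_bandwidth_bits(VOCAB_SIZE)
latent_bits = hidden_state_bandwidth_bits(HIDDEN_DIM, BITS_PER_VALUE)
print(f"explicit: {explicit_bits:.0f} bits, latent: {latent_bits} bits")
print(f"ratio: about {latent_bits / explicit_bits:.0f}x")
```

The latent figure is a storage capacity, not a measured information content, so the ratio is best read as an upper bound.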

# Page. 4

![Page Image](https://bcdn.docswell.com/page/9J299PV5ER.jpg)

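■ Background
Limits of multimodal CoT
■ Think about Images
PAPO [4], Vision-R1 [5], etc.
■ CoT unfolds in text space, and visual information is treated as a static precondition
■ Performance degrades sharply on perception-heavy tasks (on V*, PAPO is 42.4 pp below the base model)
■ Think with Images
PixelReasoner [6], Argus [7], etc.
■ Calls image-editing tools (crop, zoom, OCR)
■ Confined to the set of operations the tools expose, so it cannot manipulate continuous visual semantics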
Figure 1 of [1]: For the query "What color is the hat of the golfer with sunglasses?", a comparison of (a) Thinking about Images (via CoT in text space), (b) Thinking with Images (via image cropping tools), and (c) Latent Visual Reasoning (via visual semantics reconstruction).
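In both families the final reasoning stays confined to text space, which leaves an essential gap between the visual information and text generation.
→ On perception-heavy tasks, text CoT can actually interfere (PAPO: about -42 pp on V*)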

# Page. 5

![Page Image](https://bcdn.docswell.com/page/DEY44536JM.jpg)

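■ Proposed method
The idea behind Latent Visual Reasoning
■ Starting observation
Inside an MLLM, visual tokens and text tokens are projected into a shared semantic space
→ It is therefore natural to reason over both kinds of tokens at once
■ Proposal: Latent Visual Reasoning
■ The special tokens <|lvr_start|> / <|lvr_end|> switch modes
■ Between them, $T_v$ latent tokens are generated autoregressively
■ The LLM's final-layer hidden state is used directly as the input embedding at the next position
■ The patch embeddings inside the ROI bounding box serve as ground truth and are approximated with an MSE loss
→ Autoregressive thinking in the visual embedding space
The authors' metaphor: "We think visually before we speak"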
Concept diagram of the joint visual ↔ text semantic space: the image (visual tokens) and the question (text tokens) live in one joint semantic space, and LVR reasons over both. If they live in the same space, why reason only on the text side?

# Page. 6

![Page Image](https://bcdn.docswell.com/page/VJNYYNK178.jpg)

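■ Proposed method
Latent Visual Reasoning Architecture
■ Backbone
■ Qwen2.5-VL 3B / 7B
■ The ViT and projector are frozen (preserving the existing modality alignment)
■ Only the LLM is trained, and no additional output head is introduced
■ Mode switching
■ Normal mode: text tokens are generated through the LM head
■ LVR mode (after <|lvr_start|>): the LM head is bypassed, and the final-layer hidden state is used directly as the input embedding at the next position
■ <|lvr_end|> returns the model to text mode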
Figure 2 of [1]: Training & inference pipeline. The input is [Image] and [Question] ("What does the tiger's body posture suggest about its behavior or mood?"), followed by <|lvr_start|> ... <|lvr_end|>. Legend: image embedding, text embedding, LVR embedding, special embedding.
Mode switching is handled entirely by the special tokens, and no new output head is introduced (reconfirmed in a later ablation)
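The mode switch can be sketched as a small decode loop. This is a toy illustration, not the authors' implementation: `ToyModel`, `embed`, and the scripted tokens are hypothetical stand-ins, and the loop assumes the fixed-length variant in which `<|lvr_end|>` is forced after a set number of latent steps.

```python
from typing import Iterable, List, Tuple

Vector = List[float]
LVR_START, LVR_END, EOS = "<|lvr_start|>", "<|lvr_end|>", "<eos>"


class ToyModel:
    """Hypothetical stand-in for an MLLM. forward() maps an input embedding
    to a 'final-layer hidden state'; lm_head() reads out a scripted token."""

    def __init__(self, script: Iterable[str]):
        self._script = iter(script)

    def forward(self, x: Vector) -> Vector:
        return [0.5 * v + 1.0 for v in x]

    def lm_head(self, hidden: Vector) -> str:
        return next(self._script)


def embed(token: str) -> Vector:
    """Hypothetical embedding lookup (deterministic toy values)."""
    return [float(len(token)), 1.0]


def decode(model: ToyModel, prompt: Vector, lvr_steps: int = 4,
           max_tokens: int = 16) -> Tuple[List[str], List[Vector]]:
    tokens: List[str] = []
    latents: List[Vector] = []
    x = prompt
    while len(tokens) < max_tokens:
        token = model.lm_head(model.forward(x))   # text mode: via the LM head
        tokens.append(token)
        if token == EOS:
            break
        x = embed(token)
        if token == LVR_START:
            for _ in range(lvr_steps):
                hidden = model.forward(x)         # LVR mode: LM head bypassed
                latents.append(hidden)
                x = hidden                        # hidden state is the next input
            model.forward(x)       # consume the last latent (a real model would cache it)
            tokens.append(LVR_END)                # forced exit after a fixed length
            x = embed(LVR_END)
    return tokens, latents


model = ToyModel([LVR_START, "red", EOS])
tokens, latents = decode(model, prompt=[0.0, 0.0], lvr_steps=4)
print(tokens)
print(len(latents), "latent steps")
```

The point of the sketch is that the same forward pass serves both modes; only what is fed back as the next input changes.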

# Page. 7

![Page Image](https://bcdn.docswell.com/page/YE9PPRVYJ3.jpg)

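■ Training step ①
SFT: reconstruct the ROI patches in the embedding space
■ Procedure
■ ROI (Region of Interest) bounding boxes are annotated in advance (VISUALCOT [8], 438k examples)
■ The indices $\{I_1, \dots, I_{T_v}\}$ of the patches inside the ROI are retrieved in O(1)
■ The corresponding patch embeddings $\{v_1, \dots, v_{T_v}\}$ are used as ground truth
■ The LLM autoregressively generates $\{h_1, \dots, h_{T_v}\}$, and each $h_t$ is pulled toward $v_t$ with MSE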
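■ Loss
$$\mathcal{L}_{\mathrm{LVR}} = \frac{1}{T_v} \sum_{t=1}^{T_v} \lVert h_t - v_t \rVert^2$$
$$\mathcal{L} = \mathcal{L}_{\mathrm{NTP}} + \lambda_{\mathrm{LVR}} \, \mathcal{L}_{\mathrm{LVR}}$$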
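As a worked example of the loss above (the mean over the $T_v$ latent steps of the squared distance between $h_t$ and $v_t$, added to the next-token-prediction loss with weight $\lambda_{\mathrm{LVR}}$), here is a minimal sketch on plain Python lists. The vectors, the NTP loss value, and $\lambda_{\mathrm{LVR}} = 0.5$ are made-up illustration values, not settings from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def lvr_mse_loss(hidden: List[Vector], targets: List[Vector]) -> float:
    """(1 / T_v) * sum_t ||h_t - v_t||^2 over the T_v latent steps."""
    if not hidden or len(hidden) != len(targets):
        raise ValueError("need the same non-zero number of h_t and v_t")
    total = 0.0
    for h_t, v_t in zip(hidden, targets):
        total += sum((h - v) ** 2 for h, v in zip(h_t, v_t))
    return total / len(hidden)


def total_loss(ntp_loss: float, lvr_loss: float, lambda_lvr: float) -> float:
    """L = L_NTP + lambda_LVR * L_LVR."""
    return ntp_loss + lambda_lvr * lvr_loss


# Two latent steps (T_v = 2) with 2-dimensional toy embeddings.
h = [[1.0, 2.0], [0.0, 0.0]]   # LLM hidden states h_1, h_2
v = [[1.0, 0.0], [3.0, 4.0]]   # ground-truth patch embeddings v_1, v_2

lvr = lvr_mse_loss(h, v)       # (4 + 25) / 2
loss = total_loss(ntp_loss=2.0, lvr_loss=lvr, lambda_lvr=0.5)
print(lvr, loss)
```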
Diagram of SFT: for an image with an ROI bbox, the ground-truth embeddings v1 ... v4 (from the frozen ViT) are matched against the autoregressively generated LLM hidden states in the sequence <lvr_start> h1 h2 h3 h4 <lvr_end> y1 ..., and MSE pulls each h_t toward v_t.
No decoder or head is used; the last hidden state itself is what reconstructs the patch.
★ The model does not predict patch indices; it autoregressively reconstructs the embeddings themselves
Deterministic teacher forcing based on the ROI bounding box: this is the stage where the model first acquires the basic LVR behavior
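One way to turn an ROI bounding box into ground-truth patch indices is sketched below. The conventions are assumptions made for illustration: a row-major token grid, one visual token per 28×28-pixel block, pixel-coordinate boxes `(x0, y0, x1, y1)` with an exclusive lower-right corner, and "any overlap counts". The paper's exact selection rule is not given on the slide.

```python
from typing import List, Tuple

BBox = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels, x1/y1 exclusive


def roi_patch_indices(bbox: BBox, image_w: int, image_h: int,
                      patch: int = 28) -> List[int]:
    """Row-major indices of all grid cells that overlap the bounding box."""
    x0, y0, x1, y1 = bbox
    cols, rows = image_w // patch, image_h // patch
    c0, r0 = max(0, x0 // patch), max(0, y0 // patch)
    c1 = min(cols - 1, (x1 - 1) // patch)
    r1 = min(rows - 1, (y1 - 1) // patch)
    return [r * cols + c
            for r in range(r0, r1 + 1)
            for c in range(c0, c1 + 1)]


# A 112x84 image gives a 4x3 token grid; the box covers a 2x2 block of cells.
indices = roi_patch_indices((28, 0, 84, 56), image_w=112, image_h=84)
print(indices)

# The targets v_1..v_Tv are then a plain gather from the frozen ViT output.
patch_embeddings = [[float(i), float(i) * 2.0] for i in range(12)]  # toy values
targets = [patch_embeddings[i] for i in indices]
print(targets)
```

Because the box bounds map directly to a row and column range, no search over patches is needed.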

# Page. 8

![Page Image](https://bcdn.docswell.com/page/GE8DDWPKED.jpg)

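■ Training step ②
RL: GRPO_latent
■ Why standard GRPO cannot be applied directly
■ GRPO uses the token probability ratio $r_t = \pi_\theta / \pi_{\theta_{\mathrm{old}}}$
■ The LVR outputs are continuous hidden states, for which no token distribution is defined
■ Solution
■ Record the LVR hidden states h_latent during rollout
■ When computing the loss, insert the same h_latent by teacher forcing into the forward passes of both the new and the old policy
■ Compute the importance ratio $r_{i,t}$ only at text-token positions and apply the policy-gradient update there
■ The LVR segment is held fixed as context and receives no direct gradient
■ Reward design
■ Format: 1 if both <|lvr_start|> and <|lvr_end|> are present
■ Accuracy: 1 if the answer is correct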
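The two rewards can be written as small functions. This is a sketch under assumptions: the slide does not say how the two terms are combined (a plain sum is assumed here), nor how the answer is extracted and matched (a case-insensitive exact match on an already-extracted string is assumed).

```python
LVR_START, LVR_END = "<|lvr_start|>", "<|lvr_end|>"


def format_reward(response: str) -> float:
    """1 if the response contains both LVR delimiters, else 0."""
    return 1.0 if LVR_START in response and LVR_END in response else 0.0


def accuracy_reward(predicted: str, gold: str) -> float:
    """1 if the answer matches (assumed rule: case-insensitive exact match)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def total_reward(response: str, predicted: str, gold: str) -> float:
    """Assumed combination: unweighted sum of the two terms."""
    return format_reward(response) + accuracy_reward(predicted, gold)


with_lvr = f"{LVR_START}{LVR_END} The hat is red."
text_only = "The hat is red."
print(total_reward(with_lvr, "Red", "red"))    # format + accuracy
print(total_reward(text_only, "Red", "red"))   # accuracy only
```

A text-only response can still earn the accuracy term, which is why dropping the format term leaves nothing that rewards entering LVR mode.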
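Diagram of GRPO_latent. Phase 1, rollout (sample generation): input q + I → LVR generates hidden states h_1, ..., h_L (saved) → text generation y_1, ..., y_T (saved). Phase 2, teacher-forced replay (loss computation): the same input q + I → the LVR slots are filled with the saved h_1, ..., h_L and held fixed → text probabilities are re-evaluated, and gradients flow only there.
★ The LVR segment is fixed as context → the importance ratio r_t is computed at text positions only → policy gradient
GRPO_latent: cache h_latent in Phase 1 → teacher-forced replay in Phase 2. Gradients flow at text positions only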
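The text-only importance ratio can be illustrated with a clipped surrogate that skips the LVR slots. This is a simplified sketch, not the authors' code: the per-position log-probabilities are toy numbers, the clip range `eps = 0.2` is an assumed value, and the KL penalty is left out.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """GRPO-style advantage: rewards normalized within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]


def clipped_objective(logp_new: Sequence[float], logp_old: Sequence[float],
                      is_text: Sequence[bool], advantage: float,
                      eps: float = 0.2) -> float:
    """Mean clipped surrogate over text positions; LVR slots are skipped."""
    terms = []
    for lp_new, lp_old, text in zip(logp_new, logp_old, is_text):
        if not text:
            continue   # replayed latent: no token distribution, no ratio
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms) if terms else 0.0


# One response: text, two LVR slots, text. The 0.0 entries at the LVR slots
# are placeholders and never enter the objective.
logp_new = [-1.0, 0.0, 0.0, -0.5]
logp_old = [-1.0, 0.0, 0.0, -1.0]
is_text = [True, False, False, True]

advantages = group_advantages([2.0, 1.0])   # two rollouts in the group
objective = clipped_objective(logp_new, logp_old, is_text, advantage=1.0)
print(advantages, objective)
```

In the example the second text position has ratio exp(0.5), which the clip caps at 1.2, so the objective averages 1.0 and 1.2.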

# Page. 9

![Page Image](https://bcdn.docswell.com/page/LELMMN1P7R.jpg)

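■ Proposed method
Decoding strategy: how to design the LVR termination criterion
■ Three strategies
■ ✓ Fixed Token
・The number of LVR steps is fixed to a constant (4 / 8 / 16)
・The most stable option, and the one the paper adopts
■ ✗ Latent End Token
・Termination is decided by the Euclidean or cosine distance to a learnable end vector $e_{\mathrm{end}}$
・Unstable for every distance metric and threshold tried; generation continues up to the maximum step count
■ ✗ Mode Switching Loss
・The LM head's <|lvr_end|> probability is reinforced with BCE
・Terminating immediately minimizes the loss, so the number of LVR steps collapses to 0
■ Implication
Variable-length control is left as an open problem in this paper
→ Hidden-state-level auxiliary losses (SIM-CoT [10], etc.) are a possible way forward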
| Strategy | V* | MMVP | IQ |
|---|---|---|---|
| ✓ Fixed Token (8) | 81.7 | 71.7 | 29.3 |
| ✗ Latent End Token | 39.8 | 19.0 | 6.7 |
| ✗ Mode Switching | collapse | - | - |

Table of [1]: 7B performance by decoding strategy
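Variable length looks more flexible at first glance, but in practice performance saturates at a fixed length of 4 to 8 tokens
→ This suggests that the visual information a query needs can be compressed into a small number of tokens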
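The Latent End Token rule described above (stop once the hidden state is close enough to a learnable end vector) can be sketched as follows. The function names, the threshold, and the toy vectors are hypothetical. The second call shows the reported failure pattern, where the rule never fires and generation runs to the maximum step count.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def lvr_length(hidden_states: List[Vector], e_end: Vector, threshold: float,
               max_steps: int,
               distance: Callable[[Vector, Vector], float] = euclidean) -> int:
    """Number of LVR steps taken before the stop rule fires
    (max_steps if it never fires)."""
    for t, h in enumerate(hidden_states[:max_steps], start=1):
        if distance(h, e_end) < threshold:
            return t
    return max_steps


e_end = [1.0, 0.0]   # hypothetical learned end vector
converging = [[4.0, 4.0], [2.0, 1.0], [1.1, 0.0], [1.0, 0.0]]
drifting = [[4.0, 4.0], [5.0, 3.0], [6.0, 2.0], [7.0, 1.0]]

stops_at = lvr_length(converging, e_end, threshold=0.5, max_steps=16)
never_stops = lvr_length(drifting, e_end, threshold=0.5, max_steps=16)
print(stops_at, never_stops)
```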

# Page. 10

![Page Image](https://bcdn.docswell.com/page/4JMYYXW2JW.jpg)

■ Experiments
Experimental setup
■ Backbone
■ Qwen2.5-VL 3B / 7B
■ Maximum visual resolution: 5120 × 28 × 28 pixels
■ ViT / projector are frozen
■ SFT
■ Data: VISUALCOT [8] (438k VQA examples with bboxes)
■ 7B: 4× AMD MI250 / 40 hours
■ About 2,500 steps
■ RL
■ Data: ViRL [9]
■ 3B only / 20 hours / 1,500 steps
■ temperature 0.9, KL β = 0.04
■ Evaluation benchmarks
■ V* Bench: D.A. / R.P.
■ MMVP: perception robustness
■ BLINK subsets
・Counting / IQ-Test / JigSaw
・Relative Reflectance / Spatial
■ Baselines
■ Qwen2.5-VL base
■ Think about (text-CoT)
・PAPO [4]
・Vision-R1 [5]
■ Think with (tool-use)
・PixelReasoner [6]
The ViT and projector are frozen: the design assumes that pretraining has already aligned the modality projection well enough and updates only the LLM-side parameters (this foreshadows a point checked in the later ablation)
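The resolution cap can be read as a visual-token budget. The sketch below assumes the Qwen2-VL family convention that one visual token covers a 28×28-pixel block (14×14 patches merged 2×2), and it ignores the resizing and rounding a real image processor performs. The example image size is arbitrary.

```python
TOKEN_SIDE = 28                      # assumed pixels per visual token side
MAX_PIXELS = 5120 * 28 * 28          # cap quoted in the setup


def max_visual_tokens(max_pixels: int, side: int = TOKEN_SIDE) -> int:
    """Largest number of visual tokens that fits under the pixel cap."""
    return max_pixels // (side * side)


def visual_tokens(width: int, height: int, side: int = TOKEN_SIDE) -> int:
    """Tokens for an image whose sides are already multiples of `side`."""
    return (width // side) * (height // side)


budget = max_visual_tokens(MAX_PIXELS)
example = visual_tokens(1120, 840)   # a 40 x 30 token grid
print(MAX_PIXELS, budget, example)
```

Under this reading, 4 to 16 LVR steps are tiny compared with an image budget of up to 5120 tokens.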

# Page. 11

![Page Image](https://bcdn.docswell.com/page/PJR99NY579.jpg)

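■ Experimental results
Main Results (7B): consistent gains on perception-intensive tasks
■ Main findings
■ With 8 steps, LVR improves over the base by V* +3.2 / V* D.A. +2.7 / V* R.P. +3.9 / MMVP +5.0 (MMVP +5.3 with 4 steps)
■ PAPO [4], from the Think about family, degrades markedly on V*, about 42 pp below the base
・Suggests that text CoT interferes with perception tasks
■ LVR also beats the tool-calling PixelReasoner [6], showing the advantage of a design that does not depend on external tools
■ Exception: only Relative Reflectance degrades
■ Only BLINK's Relative Reflectance falls below the base
■ The setting requires multiple images, and the drop is attributed to the SFT data consisting of single images only
→ Cross-image data augmentation is left as future work
With a concise design that adds neither extra heads nor auxiliary images, LVR consistently outperforms both text-CoT and tool-use approaches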
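| Method | V* | V* D.A. | V* R.P. | MMVP | Counting | IQ-Test | JigSaw | Relative Reflect | Spatial Relation |
|---|---|---|---|---|---|---|---|---|---|
| Closed-source models | | | | | | | | | |
| GPT-4o | 62.8 | - | - | - | 51.7 | 30.0 | 58.0 | 38.8 | 76.9 |
| Gemini2.5-Pro | 79.2 | - | - | - | - | - | - | - | - |
| ARGUS-X3 | 78.5 | - | - | 45.5 | - | - | - | - | - |
| Open models based on Qwen2.5-VL-7B | | | | | | | | | |
| Qwen2.5-VL | 78.5 | 81.7 | 73.7 | 66.7 | 66.7 | 26.0 | 52.0 | 38.8 | 87.4 |
| PAPO | 36.1 | 25.2 | 52.6 | 54.3 | 66.7 | 29.3 | 52.0 | 39.6 | 88.8 |
| Vision-R1 | 70.2 | 70.4 | 69.7 | 46.7 | 51.7 | 26.7 | 27.3 | 44.8 | 66.4 |
| PixelReasoner | 80.1 | 81.7 | 77.6 | 67.0 | 66.7 | 25.3 | 52.7 | 42.5 | 88.1 |
| SFT | 79.1 | 82.6 | 73.7 | 65.7 | 67.5 | 26.7 | 45.3 | 33.6 | 88.8 |
| LVR (4 Steps) | 81.2 | 84.4 | 76.3 | 72.0 | 69.2 | 28.7 | 52.7 | 42.5 | 89.5 |
| LVR (8 Steps) | 81.7 | 84.4 | 77.6 | 71.7 | 70.0 | 29.3 | 52.0 | 42.5 | 86.0 |
| LVR (16 Steps) | 80.6 | 81.7 | 79.0 | 71.7 | 70.8 | 27.3 | 52.7 | 41.8 | 87.4 |

Table 1 of [1]: comparison of 7B models on vision-centric benchmarks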
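The headline gains can be recomputed directly from the Table 1 rows for the Qwen2.5-VL base, PAPO, and LVR. The snippet below is only an arithmetic check on the numbers transcribed on this page; the dictionary layout is an illustration choice.

```python
METRICS = ("V*", "V* D.A.", "V* R.P.", "MMVP")

# Values copied from Table 1 (7B models).
base = dict(zip(METRICS, (78.5, 81.7, 73.7, 66.7)))
papo = dict(zip(METRICS, (36.1, 25.2, 52.6, 54.3)))
lvr = {
    4: dict(zip(METRICS, (81.2, 84.4, 76.3, 72.0))),
    8: dict(zip(METRICS, (81.7, 84.4, 77.6, 71.7))),
    16: dict(zip(METRICS, (80.6, 81.7, 79.0, 71.7))),
}


def gain_over_base(scores: dict) -> dict:
    """Difference to the base model in percentage points, one decimal."""
    return {m: round(scores[m] - base[m], 1) for m in METRICS}


gains_8 = gain_over_base(lvr[8])
gains_4 = gain_over_base(lvr[4])
papo_gap = gain_over_base(papo)
print("LVR (8 steps):", gains_8)
print("LVR (4 steps):", gains_4)
print("PAPO:", papo_gap)
```

The MMVP gain depends on the step count: 72.0 with 4 steps versus 71.7 with 8 steps.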

# Page. 12

![Page Image](https://bcdn.docswell.com/page/PEXQQNGXJX.jpg)

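■ Experimental results
Ablation 1: adding no extra head works best
■ Observations
■ The standard LVR with no additional head performs best
■ Adding an MLP or a GLU head degrades results consistently in both configurations
■ The configuration that also uses a Latent End Token drops sharply (V* 81.7 → 39.8)
■ Interpretation
Inside the backbone LLM, the visual and text semantic spaces are already aligned
→ An additional transformation layer instead introduces a semantic gap
hidden state = visual semantics holds → supports the joint embedding hypothesis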
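| Method | V* | V* D.A. | V* R.P. | MMVP | IQ-Test | JigSaw |
|---|---|---|---|---|---|---|
| LVR | 81.7 | 84.4 | 77.6 | 71.7 | 29.3 | 52.0 |
| LVR LatentEnd | 39.8 | 32.2 | 51.3 | 19.0 | 6.7 | 13.3 |
| LVR MLPhead | 74.4 | 76.5 | 71.1 | 69.7 | 23.3 | 50.0 |
| LVR GLUhead | 79.6 | 82.6 | 75.0 | 69.0 | 25.3 | 44.0 |

Table 3 of [1]: ablation of additional heads (7B)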

# Page. 13

![Page Image](https://bcdn.docswell.com/page/3EK99N49ED.jpg)

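■ Experimental results
Ablation 2: contribution of RL with GRPO_latent
■ Observations
■ Compared with the SFT-only configuration, RL yields gains of up to about +2.6 pp in several settings, while other cells are unchanged or slightly lower (e.g., V* R.P. at 4 and 8 steps: 60.5 → 59.2)
■ Examples: V* at 8 steps 65.5 → 67.0; MMVP at 16 steps 56.0 → 58.0
■ Interpretation
GRPO_latent can refine LVR indirectly, through the policy gradient at text positions
■ Failure mode
⚠ Without the format reward, training converges to a solution that generates text only
→ The special tokens <|lvr_start|> / <|lvr_end|> are no longer emitted
※ Verified on 3B only. The effect of RL at 7B is unconfirmed (the follow-up works Monet [13] / COVT [12] present 7B + RL)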
| Method | V* | V* D.A. | V* R.P. | MMVP | IQ-Test | JigSaw |
|---|---|---|---|---|---|---|
| PAPO | 31.94 | 22.61 | 46.05 | 50 | 31.33 | 46.67 |
| LVR (4 / 8 / 16 steps) | 64.9 / 65.5 / 66.5 | 69.6 / 71.3 / 71.3 | 60.5 / 60.5 / 56.6 | 54.7 / 56.0 / 56.0 | 29.3 / 30.7 / 30.0 | 52.7 / 52.7 / 52.0 |
| LVR RL (4 / 8 / 16 steps) | 65.5 / 67.0 / 66.5 | 69.6 / 72.2 / 71.3 | 59.2 / 59.2 / 59.2 | 55.3 / 55.3 / 58.0 | 30.7 / 32.0 / 30.0 | 52.7 / 52.7 / 50.7 |

Table 2 of [1]: contribution of RL with GRPO_latent (3B)
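The per-cell effect of RL can be tallied from the two LVR rows of Table 2. This is just arithmetic on the transcribed numbers; the tuple layout (4, 8, 16 steps) mirrors the table.

```python
STEPS = (4, 8, 16)

# Values copied from Table 2 (3B), ordered by step count 4 / 8 / 16.
sft = {
    "V*": (64.9, 65.5, 66.5), "V* D.A.": (69.6, 71.3, 71.3),
    "V* R.P.": (60.5, 60.5, 56.6), "MMVP": (54.7, 56.0, 56.0),
    "IQ-Test": (29.3, 30.7, 30.0), "JigSaw": (52.7, 52.7, 52.0),
}
rl = {
    "V*": (65.5, 67.0, 66.5), "V* D.A.": (69.6, 72.2, 71.3),
    "V* R.P.": (59.2, 59.2, 59.2), "MMVP": (55.3, 55.3, 58.0),
    "IQ-Test": (30.7, 32.0, 30.0), "JigSaw": (52.7, 52.7, 50.7),
}

# RL minus SFT in percentage points, rounded to one decimal.
deltas = {name: tuple(round(r - s, 1) for r, s in zip(rl[name], sft[name]))
          for name in sft}

flat = [d for row in deltas.values() for d in row]
improved = sum(d > 0 for d in flat)
unchanged = sum(d == 0 for d in flat)
worse = sum(d < 0 for d in flat)

for name, row in deltas.items():
    print(name, dict(zip(STEPS, row)))
print(improved, unchanged, worse, max(flat))
```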

# Page. 14

![Page Image](https://bcdn.docswell.com/page/L73WWVL975.jpg)

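■ Summary
■ Contributions of the paper
■ A concise form of visual latent reasoning that directly supervises the LLM's final-layer hidden states in the visual embedding space
■ Supervision is an MSE loss against the patch embeddings inside the bbox, so no auxiliary images are needed
■ Fixed length tends to beat variable length, suggesting that visual information can be compressed into a small number of tokens
■ GRPO_latent: an RL extension that computes the importance ratio at text positions only. The format reward is essential for training to work
■ Personal interests and open questions
■ Visualizing the latent thoughts
・No verification through projector inversion or nearest-neighbor ViT patches has been done, so what is actually reconstructed remains unconfirmed
■ Performance peaks at a fixed 4 to 8 steps
・Performance saturates even when the ROI contains far more than 4 patches, suggesting that the visual information is compressible
■ Achieving variable-length control
・Possibly solvable with hidden-state-level auxiliary losses (SIM-CoT [10], etc.)
■ RL validation on the 7B model
・Follow-up work (Monet [13], COVT [12]) has already run 7B + RL
■ The single-image constraint
・The Relative Reflectance degradation is expected to be resolved by cross-image data augmentation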

# Page. 15

![Page Image](https://bcdn.docswell.com/page/87DKK8N8JG.jpg)

■ References
[1] B. Li, X. Sun, J. Liu, et al. "Latent Visual Reasoning." ICLR 2026.
[2] R. Zhu, X. Wu, Q. Liu, et al. "A Survey on Latent Reasoning." 2025.
[3] S. Hao, S. Sukhbaatar, D. Su, et al. "Training Large Language Models to Reason in a Continuous Latent Space" (Coconut). 2024.
[4] Z. Wang, et al. "Perception-Aware Policy Optimization for Multimodal Reasoning" (PAPO). 2025.
[5] W. Huang, et al. "Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models." 2025.
[6] J. Su, et al. "Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning." 2025.
[7] Y. Man, et al. "Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought." CVPR 2025.
[8] H. Shao, S. Qian, H. Xiao, et al. "Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning" (VISUALCOT). NeurIPS 2024.
[9] H. Wang, et al. "VL-Rethinker (ViRL): Incentivizing Self-Reflection of VLMs with RL." 2025.
[10] X. Wei, et al. "SIM-CoT: Supervised Implicit Chain-of-Thought." 2025.
[11] J. Ma, et al. "Multimodal Reasoning via Latent Refocusing" (CoCoVa). 2025.
[12] Y. Qin, et al. "Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens" (COVT). 2025.
[13] Q. Wang, et al. "Monet: Reasoning in Latent Visual Space Beyond Images and Language." 2025.
[14] G. Sun, et al. "Latent Chain-of-Thought for Visual Reasoning." 2025.