---
title: 【DL Reading Group】The Topological Trouble With Transformers
tags: 
author: [Deep Learning JP](https://www.docswell.com/user/DeepLearning2023)
site: [Docswell](https://www.docswell.com/)
thumbnail: https://bcdn.docswell.com/page/8EDK8QVW7G.jpg?width=480
description: 【DL Reading Group】The Topological Trouble With Transformers by Deep Learning JP
published: June 04, 26
canonical: https://www.docswell.com/s/DeepLearning2023/K1QXD3-2026-06-04-171006
---
# Page. 1

![Page Image](https://bcdn.docswell.com/page/8EDK8QVW7G.jpg)

The Topological Trouble With Transformers
Kohsei Matsutani, Matsuo Lab


# Page. 2

![Page Image](https://bcdn.docswell.com/page/V7PK8LVXJ8.jpg)

Paper information
- The Topological Trouble With Transformers
  - Authors: Michael C. Mozer, Shoaib Ahmed Siddiqui, Rosanne Liu
  - Institution: Google DeepMind
  - arXiv: https://arxiv.org/abs/2604.17121
  - This is a position paper; it contains no experiments or theoretical results.


# Page. 3

![Page Image](https://bcdn.docswell.com/page/2JVVNQZ3JQ.jpg)

Overview
- The purely feedforward structure of the Transformer imposes a topological limit on "state tracking", that is, on updating a state step by step over time.
  - state tracking := the iterative updating of latent variables reflecting an evolving environment
- The paper gives a taxonomy of recurrent and continuous-thought Transformer architectures.
- Temporally extended cognition requires recurrent architectures, which shifts the focus from explicit CoT traces to implicit activation dynamics.
- The paper proposes promising research directions.


# Page. 4

![Page Image](https://bcdn.docswell.com/page/5EGLKW4YJL.jpg)

State Tracking
- state tracking := the iterative updating of latent variables reflecting an evolving environment
- state ≈ belief state, world state, sufficient summary of the knowledge an agent has about its environment.
- LLM state-tracking failure, example 1: the game of Twenty Questions
  - The user tries to guess a number between 1 and 100 that the LLM has in mind (Gemini 3).
    - higher: the answer is larger than your guess
    - lower: the answer is smaller than your guess
    - you got it: correct!
  - Even though the CoT trace explicitly states 42, two of the model's replies contradict it (marked "← contradiction" in the page image).
- cf. Laban et al., LLMs Get Lost in Multi-Turn Conversation, ICLR 2026 Outstanding Paper, 2026.
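
The contradiction in the Twenty Questions example can be checked mechanically. The following is a minimal sketch, not taken from the paper or the slides: `consistent_secrets` and the transcript in `history` are hypothetical, and the reply strings are the three labels listed on the slide. It returns every number in 1 to 100 that is compatible with all replies so far; an empty result means no fixed secret number could have produced the transcript.

```python
def consistent_secrets(history, low=1, high=100):
    """Return every secret in [low, high] consistent with all (guess, reply) pairs."""
    def reply_for(secret, guess):
        if secret > guess:
            return "higher"
        if secret < guess:
            return "lower"
        return "you got it"

    return [s for s in range(low, high + 1)
            if all(reply_for(s, g) == r for g, r in history)]


# Hypothetical transcript (not the one shown on the slide).
history = [(50, "lower"), (40, "higher"), (45, "lower"),
           (42, "higher"), (43, "lower")]

print(consistent_secrets(history[:3]))  # [41, 42, 43, 44]
print(consistent_secrets(history))      # [] -> the replies contradict each other
```

A model that tracked its state correctly would never produce a transcript for which this set becomes empty.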


# Page. 5

![Page Image](https://bcdn.docswell.com/page/4JQYN3G67P.jpg)

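State Tracking
- LLM state-tracking failure, example 2: polysemous words
  - Annotations on the page image: "Bank = River Bank", then "Bank = Financial Bank" (← contradiction)
- Maintaining and tracking a full probability distribution over every possible state of the environment is impossible for both AI and humans, because the dimensionality explodes.
  - Did Fred really go to the river bank?
  - It might have been a fishing pond.
  - It might have been a river bank near a bank.
  - It might have been a facility with an ATM.
  - The user might be asking an intentionally ambiguous question.
- Humans are said to (1) sample only a few candidates, (2) collapse a complex distribution into a typical one (fishing pole + bank -> river bank), and (3) build the mental model that best fits the premises (a concrete scene in which Fred takes a fishing pole to the river bank; the most plausible interpretation; the maximum a posteriori (MAP) estimate).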
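
As a toy illustration of the MAP estimate mentioned above, the sketch below applies Bayes' rule to the two senses of "bank". All probabilities are made up for illustration (they do not come from the paper or the slides), the helper `posterior` is hypothetical, and evidence words are assumed to be conditionally independent given the sense.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' rule over a small discrete set of interpretations."""
    unnorm = {}
    for sense, p in prior.items():
        for word in evidence:
            p *= likelihood[sense][word]  # assumes independent evidence words
        unnorm[sense] = p
    total = sum(unnorm.values())
    return {sense: p / total for sense, p in unnorm.items()}


# Illustrative numbers only.
prior = {"river bank": 0.3, "financial bank": 0.7}
likelihood = {
    "river bank": {"fishing pole": 0.20},
    "financial bank": {"fishing pole": 0.01},
}

post = posterior(prior, likelihood, ["fishing pole"])
map_sense = max(post, key=post.get)  # collapse the distribution to its mode
print(map_sense)  # river bank
```

Keeping only `map_sense` instead of the full `post` is the kind of collapse described in point (2): cheap to carry forward, but it discards the alternatives.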


# Page. 6

![Page Image](https://bcdn.docswell.com/page/K74WG15ME1.jpg)

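State Tracking
- However, Transformers fail even at deterministic state tracking with finite memory.
  - In a Transformer, activations flow from the lower layers to the upper layers.
  - Because the state representation is pushed upward, state tracking is upper-bounded by the number of layers.
- (Figure: State Progression)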


# Page. 7

![Page Image](https://bcdn.docswell.com/page/LJ1YDGWYEG.jpg)

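State Tracking
- State tracking in a Transformer is upper-bounded by the number of layers.
  - At every step, a Transformer looks at the entire past through self-attention.
  - RNNs and SSMs do not.
- Note: not every state-tracking problem requires depth that grows linearly with the sequence length.
  - Examples: recognizing regular languages up to length n, and graph connectivity on n vertices, need only log n layers.
  - Instead of processing the sequence from left to right, inputs can be combined in pairs, as in a tree.
    - With 8 inputs, for example, layer 1 merges neighboring inputs, layer 2 merges the pairs, and layer 3 merges everything, so the problem is solved in 3 (= log2 8) layers.
  - However, this is a statement about expressivity, not learnability.
  - See Merrill and Sabharwal (2025), among others.
- Belief state cascade
  - When the number of layers is limited, a representation obtained in a deep layer (bank = river bank) cannot be used by the shallow layers at the next step.

Merrill, W. and Sabharwal, A. (2025). A little depth goes a long way: The expressive power of log-depth transformers. https://arxiv.org/abs/2503.03961.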
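
The pairwise, tree-shaped combination described above can be sketched with bit parity. This is only an illustration of the computation pattern (it is not a Transformer, and the function names are hypothetical): the left-to-right scan needs one sequential step per input, while the pairwise reduction needs about log2 n rounds.

```python
def parity_sequential(bits):
    """Left-to-right scan: one state update per input, so n sequential steps."""
    state, steps = 0, 0
    for b in bits:
        state ^= b
        steps += 1
    return state, steps


def parity_tree(bits):
    """Pairwise reduction: each round merges neighbours, so ceil(log2 n) rounds."""
    level, rounds = list(bits) or [0], 0
    while len(level) > 1:
        if len(level) % 2:      # pad odd-length levels with the XOR identity
            level.append(0)
        level = [level[i] ^ level[i + 1] for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds


bits = [1, 0, 1, 1, 0, 0, 1, 0]
print(parity_sequential(bits))  # (0, 8)
print(parity_tree(bits))        # (0, 3)
```

Each round of `parity_tree` corresponds to one layer in the 8-input example. The slide's caveat still applies: this shows that such a solution can be expressed, not that training will find it.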


# Page. 8

![Page Image](https://bcdn.docswell.com/page/GJWGYK6172.jpg)

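State Tracking
- Nevertheless, Transformers (large parallel models such as LLMs) work reasonably well in practice thanks to GPUs, especially when they use CoT.
  - Why? They have recast the state-tracking problem as a working-memory problem.
- Transformers construct shortcut solutions.
  - In particular: lookback, associative scans, formal language understanding
  - Example: solving inherently sequential problems in parallel
    - For instance, the tasks on the previous slide (e.g., bit parity)
- Transformers can support state compositionality.
  - The state representation can be distributed across embeddings and updated asynchronously.
  - Information about Alice, information about Bob, event information, and so on can be kept in separate token/embedding representations.
  - In an RNN, information about different states has to be packed into a single state vector.
- Transformers need a recurrent architecture.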


# Page. 9

![Page Image](https://bcdn.docswell.com/page/4EZLXZYX73.jpg)

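Recurrence Taxonomy
- Recurrence in Transformers can be classified along three axes:
  - layer/depth: recurrence along the depth (layer) direction
  - autoregressive steps: recurrence that updates the state at each step and passes it on to the next step
  - input steps: the position/step of the input token
- Transformers, latent-thought models, and SSMs are classified from this viewpoint.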


# Page. 10

![Page Image](https://bcdn.docswell.com/page/Y76W4ZDP7V.jpg)

Recurrence in Transformer
- (b) looped transformer, universal transformer
- (c) block-recurrent transformer
- (d) none yet
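
To make the layer/depth axis concrete, the sketch below contrasts an ordinary feedforward stack with depth recurrence in the style of looped or universal transformers, where the same block is applied repeatedly with tied weights. It is a toy under stated assumptions: `block` is a hypothetical residual map standing in for a full Transformer layer (there is no attention), and all sizes are arbitrary.

```python
import numpy as np


def block(h, W):
    """Toy residual block standing in for a Transformer layer (no attention)."""
    return h + np.tanh(h @ W)


def feedforward_stack(h, weights):
    """Standard depth: one distinct weight matrix per layer."""
    for W in weights:
        h = block(h, W)
    return h


def looped(h, W, n_loops):
    """Depth recurrence: the same block (tied weights) applied n_loops times."""
    for _ in range(n_loops):
        h = block(h, W)
    return h


rng = np.random.default_rng(0)
d, depth = 4, 6
h0 = rng.normal(size=(3, d))  # 3 token positions, width d
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(depth)]

out_ff = feedforward_stack(h0, Ws)           # depth * d * d parameters
out_loop = looped(h0, Ws[0], n_loops=depth)  # d * d parameters, same depth
print(out_ff.shape, out_loop.shape)
```

Because the loop count is not tied to the parameter count, `n_loops` can in principle be chosen per input, which is what makes this axis relevant to state tracking.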


# Page. 11

![Page Image](https://bcdn.docswell.com/page/G75MQW3P74.jpg)

Recurrence in Latent Thought Model
COCONUT


# Page. 12

![Page Image](https://bcdn.docswell.com/page/9J29PQZZER.jpg)

Recurrence in SSM


# Page. 13

![Page Image](https://bcdn.docswell.com/page/DEY45WRNJM.jpg)

Promising Directions
1. Extend SSMs
   - Linear SSMs do not exceed the expressivity of Transformers, but models such as DeltaNet (linear attention + delta rule) may reach higher expressivity when hybridized with Transformers (Merrill et al., 2026).
   - (Figure: circuit complexity classes)
   - Merrill et al., Olmo Hybrid: From Theory to Practice and Back, arXiv:2604.03444, 2026.
2. Make feedforward Transformers approximate state tracking
   - BST and NextLat are good, but compositional state representations should also be taken into account.
   - (Figure: JEPA with a regularizer that makes the trajectory of hidden states linear)
   - Teoh et al., Next-Latent Prediction Transformers Learn Compact World Models, arXiv:2511.05963, 2025.
   - Huang et al., Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA, arXiv:2602.22617, 2026.
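
The difference between plain linear attention and the delta rule mentioned for DeltaNet can be sketched as two ways of writing into a matrix-valued memory. This is a minimal sketch under assumptions: the function names are hypothetical, the key is unit-norm, the write strength `beta` is fixed to 1, and no gating or normalization is modeled.

```python
import numpy as np


def linear_attention_write(S, k, v):
    """Additive fast-weight update: S <- S + v k^T."""
    return S + np.outer(v, k)


def delta_rule_write(S, k, v, beta=1.0):
    """Delta-rule update: S <- S + beta * (v - S k) k^T.

    The value currently stored under k is moved toward v.
    """
    return S + beta * np.outer(v - S @ k, k)


d_k, d_v = 4, 3
k = np.zeros(d_k)
k[0] = 1.0                                # unit-norm key
v_old = np.array([1.0, 0.0, 0.0])
v_new = np.array([0.0, 2.0, 0.0])

S_lin = np.zeros((d_v, d_k))
S_del = np.zeros((d_v, d_k))
for v in (v_old, v_new):                  # write twice under the same key
    S_lin = linear_attention_write(S_lin, k, v)
    S_del = delta_rule_write(S_del, k, v)

print(S_lin @ k)  # [1. 2. 0.] -> old and new values are superimposed
print(S_del @ k)  # [0. 2. 0.] -> the old value was overwritten
```

Being able to overwrite a stored value, rather than only add to it, is what lets the memory behave like an updatable state.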


# Page. 14

![Page Image](https://bcdn.docswell.com/page/VJNYN9DV78.jpg)

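Promising Directions
3. Coarse recurrence
   - Making the unit of recurrence coarser reduces the computational cost of the recurrence bottleneck.
     - Example: the Thought Gestalt Model, an architecture that compresses each sentence into a latent thought vector and predicts the next token while consulting those vectors as working memory, thereby modeling language as a sequence of thoughts rather than a sequence of tokens.
   - Borazjanizadeh and McClelland, Modeling Language as a Sequence of Thoughts, arXiv:2512.25026, 2025.
4. Exploit representation alignment
   - Thanks to residual connections, the representations at the different layers of a Transformer are aligned to some extent.
   - Can this be used for layer reuse, layer skipping, iterative inference, adaptive computation that varies the amount of compute per input, or reusing intermediate-layer representations elsewhere?
     - For example, canon layers: local, lateral residual connections that lightly mix the representations of nearby past tokens into each token's representation.
   - Allen-Zhu, Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers, arXiv:2512.17351, 2025.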
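
The one-line description of canon layers above can be sketched as a causal, local mixing step. This follows only the slide's description, not the details of Allen-Zhu (2025): the function name, the window of three past tokens, and the scalar mixing weights are all hypothetical, and a real layer would typically learn its weights.

```python
import numpy as np


def causal_local_mix(H, weights):
    """Mix nearby past tokens into each token, keeping the residual path.

    H: array of shape (seq_len, d). weights[j] scales the token j + 1
    positions back:
        out[t] = H[t] + sum_j weights[j] * H[t - 1 - j]
    Terms that would reach before position 0 are dropped.
    """
    out = H.copy()
    for j, w in enumerate(weights):
        shift = j + 1
        out[shift:] += w * H[:-shift]
    return out


H = np.arange(8, dtype=float).reshape(4, 2)  # 4 tokens, width 2
mixed = causal_local_mix(H, [0.1, 0.05, 0.025])
print(mixed)
```

The operation is lateral (across positions within one layer) and causal: position t only receives information from positions before t, so it can be added to an autoregressive model without leaking the future.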


# Page. 15

![Page Image](https://bcdn.docswell.com/page/YE9PR2GVJ3.jpg)

Promising Directions
5. Efficient training of recurrence
   - Unlike a feedforward Transformer, a model with recurrence cannot be trained in parallel. Possible remedies:
     - Train the model as a feedforward Transformer, then introduce recurrence during a later post-training stage.
     - Use truncated gradient methods: backpropagate the gradient only through the last step and detach the rest.
     - Use recurrent backpropagation for attractor dynamics.
     - Use implementations that raise arithmetic intensity to improve GPU utilization.
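
The truncated gradient idea (backpropagate through the last step only, detach everything earlier) can be illustrated on a scalar recurrence without any autograd library. This is a sketch under assumptions: the recurrence h_t = tanh(w * h_{t-1} + x_t), the function names, and the input values are all hypothetical, and gradients are written out by hand.

```python
import math


def run(w, xs, h0=0.0):
    """Scalar recurrence h_t = tanh(w * h_{t-1} + x_t); returns all states."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs


def full_grad(w, xs):
    """d h_T / d w through every step (forward-mode accumulation)."""
    h, dh = 0.0, 0.0
    for x in xs:
        h_new = math.tanh(w * h + x)
        dh = (1.0 - h_new ** 2) * (h + w * dh)  # chain rule through h_{t-1}
        h = h_new
    return dh


def truncated_grad(w, xs):
    """Gradient through the last step only; h_{T-1} is treated as a constant."""
    hs = run(w, xs)
    return (1.0 - hs[-1] ** 2) * hs[-2]


w, xs = 0.5, [0.3, -0.1, 0.8, 0.2]
eps = 1e-6
numeric = (run(w + eps, xs)[-1] - run(w - eps, xs)[-1]) / (2 * eps)
print(full_grad(w, xs), numeric, truncated_grad(w, xs))
```

`full_grad` matches the finite-difference estimate, while `truncated_grad` drops the term that flows through earlier states. The truncated version is biased, but its cost does not grow with the number of recurrent steps.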