[DL Reading Group] A New Model for Dynamic Scene Graph Generation That Accounts for Correlation Bias

December 01, 23

#dynamic scene graph #correlation bias #noisy labels #Transformer #MLN

Deep Learning JP

@DeepLearning2023

Text of each page

DEEP LEARNING JP [DL Papers] A New Model for Dynamic Scene Graph Generation That Accounts for Correlation Bias. Eito Takahama (高浜英人), Toyo University. http://deeplearning.jp/

Paper information
▪ Title: Correlation Debiasing for Unbiased Scene Graph Generation in Videos (a method for reducing correlation bias in imbalanced dynamic scene graphs)
▪ Author: Anant Khandelwal, Glance AI, Bangalore, India

Background and motivation

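What is a scene graph?
A scene graph identifies the objects in image (or video) data and then describes the relationships (predicates) between those objects in words. In the example figure, objects such as a man, a box, and a woman are shown together with the relationships between them, expressed in language. Image source: Fujitsu website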
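To make the definition concrete, here is a minimal sketch of a scene graph stored as subject, predicate, object triplets. The `Relation` class, the `objects_in` helper, and the specific predicates are hypothetical and chosen only to mirror the man / box / woman example; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One edge of a scene graph: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


def objects_in(graph):
    """Return the set of object nodes that appear in any relation."""
    nodes = set()
    for rel in graph:
        nodes.add(rel.subject)
        nodes.add(rel.obj)
    return nodes


# Hypothetical relations for the man / box / woman example on the slide.
scene_graph = [
    Relation("man", "holding", "box"),
    Relation("woman", "standing next to", "man"),
]

for rel in scene_graph:
    print(f"{rel.subject} --{rel.predicate}--> {rel.obj}")
```

For video, one such list per frame would give a dynamic scene graph in this representation.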

Applications of scene graphs
• Automating security cameras
• Autonomous driving
• Robots

Model details

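Challenges in dynamic scene graph generation (as stated in the paper)
Challenges shared with image SGG
• Long-tailed predicates: the predicates describing relationships between objects are imbalanced
→ "on" and "in" are used very often, while most other predicates are rarely used
• Noisy annotations: objects are labeled inappropriately
→ Training does not proceed properly
Challenges specific to video SGG
• Comprehensive understanding of the objects across a whole scene that is prone to change over time
• Modeling the temporal motion of, and interactions between, different objects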
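The long-tail problem can be seen by simply counting predicate labels. The sketch below uses invented toy counts (not statistics from any dataset) and a hypothetical helper `head_share` to show how much of the annotation mass a few frequent predicates can cover.

```python
from collections import Counter

# Toy predicate annotations, invented for illustration only.
predicates = ["on"] * 50 + ["in"] * 30 + ["holding"] * 5 + ["wiping"] * 1

counts = Counter(predicates)
total = sum(counts.values())


def head_share(counts, k):
    """Fraction of all annotations covered by the k most frequent predicates."""
    top = counts.most_common(k)
    return sum(n for _, n in top) / sum(counts.values())


for name, n in counts.most_common():
    print(f"{name:10s} {n:3d}  ({n / total:.1%})")

print(f"share of top-2 predicates: {head_share(counts, 2):.1%}")
```

A classifier trained on such data can score well by predicting only the head predicates, which is the bias the paper sets out to reduce.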

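Prior work
Spatial-temporal Transformer, Yuren Cong et al. (2021)
• Builds the dynamic relationships between objects and the temporal dependencies between frames
• Uses separate spatial and temporal encoders
• Feeds the output of the spatial encoder into the temporal encoder
Chenchen Liu et al. (2020)
• Adopts an object tracking mechanism to handle temporal movement across frames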

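Prior work: TEMPURA
• Synthesizes unbiased relationship representations
• Attenuates the predictive uncertainty of relationships
• Incorporates object-level temporal consistency
• Transformer-based sequence modeling
• Memory-guided training
• Gaussian mixture model (GMM)
Reference: Sayak Nag et al. (2023)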

Prior work: TEMPURA (limitations)
Notes added to the points on the previous slide:
• The object tracking mechanism is computationally expensive
• There is no explicit handling of long-tailed predicates
Reference: Sayak Nag et al. (2023)

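Model used
Labels in the architecture diagram: features between objects; masking in the Transformer encoder; correlation debiasing; Transformer encoder; union box features; object detection; MLN; Transformer decoder; object features; classification loss; contrastive loss.
Union box: a box covering multiple related objects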
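As a small illustration of the union box, the sketch below assumes the common convention that it is the smallest axis-aligned box enclosing the boxes of the related objects. The `(x_min, y_min, x_max, y_max)` layout and the sample coordinates are assumptions for the example, not details from the paper.

```python
def union_box(box_a, box_b):
    """Smallest axis-aligned box enclosing both inputs.

    Boxes are (x_min, y_min, x_max, y_max) tuples; this layout is an
    assumption made for the example.
    """
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


person = (10, 20, 50, 120)  # hypothetical subject box
cup = (40, 60, 70, 90)      # hypothetical object box
print(union_box(person, cup))
```

Features pooled from this region would describe the pair of objects jointly rather than each object alone.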

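Temporally aware object detection
TFoD (Temporal Flow-Aware Object Detection)
A transformer encoder with masked self-attention is used
→ It handles the labels over time while hiding information from future frames
Features are not passed to the encoder as-is; flow warping is applied to them first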
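The idea of hiding future information with masked self-attention can be sketched as follows. This is a simplified single-head version in plain Python that uses the raw frame features as queries, keys, and values (no learned projections); the function name and the toy inputs are hypothetical, and it is not the paper's implementation.

```python
import math


def causal_self_attention(frames):
    """Single-head self-attention over per-frame feature vectors.

    Frame t may only attend to frames 0..t, so future frames are hidden.
    Queries, keys and values are the raw features (no learned projections),
    which is a simplification for the example.
    Returns (outputs, attention_weights).
    """
    dim = len(frames[0])
    outputs, weights = [], []
    for t, query in enumerate(frames):
        # Scaled dot-product scores against visible (past and current) frames.
        scores = [
            sum(q * k for q, k in zip(query, frames[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(score - peak) for score in scores]
        norm = sum(exps)
        # Future positions receive exactly zero attention weight.
        row = [e / norm for e in exps] + [0.0] * (len(frames) - t - 1)
        weights.append(row)
        outputs.append([
            sum(row[s] * frames[s][d] for s in range(len(frames)))
            for d in range(dim)
        ])
    return outputs, weights


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy per-frame features
outputs, weights = causal_self_attention(frames)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower triangular: each row sums to one, and every entry to the right of the diagonal is zero.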

Illustration of flow warping
Flow warping: computes how far each pixel has moved and converts the result into a vector.
Reference: Go Fujita et al. (2017)
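A minimal sketch of warping a feature map with a flow field is given below. It assumes that each flow vector `(dx, dy)` tells the output position where to sample in the previous frame's features, and it uses nearest-neighbour sampling with border clamping; these choices and the name `warp_with_flow` are assumptions for illustration rather than the paper's procedure.

```python
def warp_with_flow(feature_map, flow):
    """Warp a 2-D feature map using a per-pixel flow field.

    feature_map[y][x] is a scalar feature; flow[y][x] is a (dx, dy)
    displacement telling where to sample in the source map for output
    position (x, y). Nearest-neighbour sampling with border clamping is
    used to keep the sketch short.
    """
    height, width = len(feature_map), len(feature_map[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            src_x = min(max(int(round(x + dx)), 0), width - 1)
            src_y = min(max(int(round(y + dy)), 0), height - 1)
            warped[y][x] = feature_map[src_y][src_x]
    return warped


# A strong feature sits at column 1 in the previous frame; every pixel's
# flow says "sample one pixel to the left".
prev_features = [[0.0, 9.0, 0.0]]
flow = [[(-1, 0)] * 3]
print(warp_with_flow(prev_features, flow))
```

In this toy case the feature ends up at column 2, aligned with where the content has moved in the current frame.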

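Predicate embedding
The relationships between objects in a video are determined by the following correlations:
a) Spatial correlation
b) Temporal correlation
c) Correlation between predicates and objects across video frames
→ These are modeled with a transformer
The predicates between objects are imbalanced
The correlation matrix is updated as a weighted average of the current correlation matrix and the previous one, with the weights set by a decay factor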
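The correlation matrix update described above can be sketched as an exponential moving average. Which of the two matrices the decay factor multiplies is not stated on the slide, so the form `decay * previous + (1 - decay) * current` is an assumption, as are the function name and the toy values.

```python
def update_correlation(previous, current, decay):
    """Blend the previous and current correlation matrices element-wise.

    updated = decay * previous + (1 - decay) * current
    Which matrix the decay factor multiplies is an assumption here.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [
        [decay * p + (1.0 - decay) * c for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


previous = [[0.8, 0.2], [0.4, 0.6]]  # toy values
current = [[0.4, 0.6], [0.0, 1.0]]   # toy values
updated = update_correlation(previous, current, decay=0.5)
print(updated)
```

With a decay close to 1 the matrix changes slowly and keeps more history; with a decay close to 0 it follows the most recent estimate.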

Predicate Classification
A classifier framework that can cope with noisy annotations is needed
MLN (mixture of logit networks)*
Separates label uncertainty into a predictable part and an unpredictable part
*The paper also writes this as "MLNs"

Evaluation

Action Genome

References
• Yuren Cong, Wentong Liao, Hanno Ackermann, Bodo Rosenhahn, and Michael Ying Yang. Spatial-temporal transformer for dynamic scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16372–16382, 2021.
• Sayak Nag, Kyle Min, Subarna Tripathi, and Amit K. Roy-Chowdhury. Unbiased scene graph generation in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22803–22813, 2023.
• Go Fujita and Motoki Tano. Drawing Complementation by optical flow for Handwritten Animations. 2017.
• Chenchen Liu, Yang Jin, Kehan Xu, Guoqiang Gong, and Yadong Mu. Beyond short-term snippet: Video relation detection with spatio-temporal global context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10840–10849, 2020.