[DL Reading Group] Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron (ICML 2018)

June 25, 21

#deep learning #speech synthesis #Google #speech research #Reference Encoder #prosody

Slide overview

2021/06/25
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Deep Learning JP

@DeepLearning2023


Text of each page

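Bibliographic information
● ICML 2018
  ○ http://proceedings.mlr.press/v80/skerry-ryan18a.html
● Organization: Google
● Authors: RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous
● List of Google's speech synthesis research
● Audio samples
● In a nutshell
  ○ The paper proposes a reference encoder that enables transfer of prosody, i.e. how something is spoken. The prosody of a reference utterance is compressed into an embedding vector, and an existing speech synthesis model is conditioned on that vector, which makes end-to-end training possible.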

Overview of speech synthesis
Image from https://engineer.dena.com/posts/2020.03/speech-synthesis-for-entertainment/

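Speech synthesis example
● Input text: "Hello, my name is Harada, and I am presenting at the DL reading group right now." (original Japanese: 「こんにちは、僕の名前は原田です、いまDL輪読会で発表しています。」)
● Colab notebook for trying speech synthesis (ESPnet)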

https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb

Application example of speech synthesis
https://twitter.com/Google/status/1400174577188225034

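Background
Making speech synthesis more realistic
● Factors that determine what the synthesized speech sounds like
  ○ The text
  ○ Plus intonation, stress, rhythm, loudness, and so on (collectively called prosody)
● Existing end-to-end speech synthesis models (such as Tacotron) do not model prosody
  ○ The speaking style cannot be controlled
    ■ Speech can only be synthesized in much the same tone every time
● The goal is to model prosody without explicit annotation
  ○ Labeling is laborious
  ○ It would be better to condition on audio, as in "speak like this"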

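Proposed method
Making speech synthesis more realistic
● An encoder that takes a speech waveform as input and outputs a low-dimensional embedding representing its prosody
● The proposed model is conditioned on this embedding vector when synthesizing speech
Image from https://ai.googleblog.com/2018/03/expressive-speech-synthesis-with.html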
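The slide only says that the synthesizer is "conditioned" on the embedding vector. One common way to do this, and as far as I understand the approach taken in the paper, is to repeat the single prosody embedding at every timestep of the text encoder output and concatenate it there. The following is a minimal pure-Python sketch of that broadcast-and-concatenate step. The function name, the toy dimensions, and the use of plain lists instead of tensors are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def condition_on_prosody(text_encodings: List[Vector],
                         prosody_embedding: Vector) -> List[Vector]:
    """Concatenate the same prosody embedding onto every text timestep.

    Hypothetical helper: it mirrors broadcast-and-concatenate conditioning,
    not the actual Tacotron implementation.
    """
    # `step + prosody_embedding` builds a new list, so inputs are not mutated.
    return [step + prosody_embedding for step in text_encodings]


# Toy example: 3 text timesteps with 4-dim encodings, 2-dim prosody embedding.
text_encodings = [[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8],
                  [0.9, 1.0, 1.1, 1.2]]
prosody_embedding = [-0.5, 0.5]

conditioned = condition_on_prosody(text_encodings, prosody_embedding)
print(len(conditioned), len(conditioned[0]))  # 3 timesteps, 4 + 2 = 6 dims each
```

Because the embedding is identical at every timestep, it can only carry utterance-level information. The attention-based decoder still has to work out how to apply that information to each character.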

Proposed method: overall architecture
Figure 1 of the original paper

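Proposed method: Reference Encoder
● Input: mel spectrogram
● The input passes through Conv2D layers and a GRU and is reduced to a single one-dimensional vector
● During training
  ○ The reference and the target speech are the same utterance
  ○ No additional loss is defined
    ■ The whole model is trained end-to-end with Tacotron's reconstruction loss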
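To see how a variable-length mel spectrogram shrinks on its way to a single vector, the shape arithmetic of a strided Conv2D stack can be traced by hand. The sketch below assumes six convolution layers with stride 2 in both time and frequency, "same" padding, filter counts of 32, 32, 64, 64, 128, 128, and an 80-channel mel input. None of these hyperparameters appear on the slide, so treat them as illustrative assumptions and check the paper for the actual values.

```python
import math
from typing import List, Sequence, Tuple


def same_padding_out(size: int, stride: int = 2) -> int:
    """Output length of a strided convolution with 'same' padding."""
    return math.ceil(size / stride)


def reference_encoder_shapes(
        num_frames: int,
        num_mels: int = 80,  # assumed mel channel count
        filters: Sequence[int] = (32, 32, 64, 64, 128, 128),  # assumed
) -> List[Tuple[int, int, int]]:
    """Return (time, freq, channels) after each assumed stride-2 Conv2D layer."""
    shapes = []
    t, f = num_frames, num_mels
    for channels in filters:
        t, f = same_padding_out(t), same_padding_out(f)
        shapes.append((t, f, channels))
    return shapes


shapes = reference_encoder_shapes(num_frames=400)
final_t, final_f, final_c = shapes[-1]

# The GRU consumes one feature vector per remaining time step:
# frequency and channel axes are flattened together.
gru_input_size = final_f * final_c

for layer, shape in enumerate(shapes, start=1):
    print(f"conv{layer}: time={shape[0]}, freq={shape[1]}, channels={shape[2]}")
print("GRU steps:", final_t, "GRU input size:", gru_input_size)
```

Under these assumptions a 400-frame reference is reduced to 7 GRU steps of 256 features each. The GRU's final state summarizes the sequence, which is how a clip of any length ends up as one fixed-length vector.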

Experiments: datasets
● Single-speaker dataset
  ○ Recordings of 49 audiobooks (147 hours)
● Multi-speaker dataset
  ○ 44 speakers, 296 hours in total
    ■ Includes Australian and British accents, among others

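Experiments: evaluation metrics
● Objective metrics
  ○ Mel Cepstral Distortion (MCD)
  ○ Gross Pitch Error (GPE)
  ○ Voicing Decision Error (VDE)
  ○ F0 Frame Error (FFE)
● Subjective metric
  ○ AXY discrimination test
    ■ Proposed by the authors
    ■ After listening to the reference (A), raters score which of X and Y is closer to it
      ● -3 if X is closer, 0 if they are the same, +3 if Y is closer
  ○ Raters were instructed to choose the closer sample in terms of intonation, stress, speaking rate, and pauses, and not to judge by differences in audio quality or pronunciation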
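The three pitch-based objective metrics can be computed from frame-aligned F0 tracks of the reference and the synthesized audio. The sketch below follows the definitions as they are commonly given: VDE is the fraction of frames whose voiced/unvoiced decision differs, GPE is the fraction of frames voiced in both signals whose pitch deviates by more than 20%, and FFE is the fraction of frames with either kind of error. The slide does not spell these out, so the 20% tolerance, the convention that an unvoiced frame has F0 = 0, and the function names are assumptions. MCD is left out because its scaling convention varies between papers.

```python
from typing import Sequence


def _check(ref_f0: Sequence[float], syn_f0: Sequence[float]) -> None:
    if len(ref_f0) != len(syn_f0) or len(ref_f0) == 0:
        raise ValueError("F0 tracks must be non-empty and frame-aligned")


def voicing_decision_error(ref_f0: Sequence[float],
                           syn_f0: Sequence[float]) -> float:
    """Fraction of frames whose voiced/unvoiced decision differs."""
    _check(ref_f0, syn_f0)
    mismatches = sum((r > 0) != (s > 0) for r, s in zip(ref_f0, syn_f0))
    return mismatches / len(ref_f0)


def gross_pitch_error(ref_f0: Sequence[float], syn_f0: Sequence[float],
                      tol: float = 0.2) -> float:
    """Among frames voiced in both tracks, fraction off by more than tol."""
    _check(ref_f0, syn_f0)
    both_voiced = [(r, s) for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if not both_voiced:
        return 0.0
    gross = sum(abs(s - r) > tol * r for r, s in both_voiced)
    return gross / len(both_voiced)


def f0_frame_error(ref_f0: Sequence[float], syn_f0: Sequence[float],
                   tol: float = 0.2) -> float:
    """Fraction of all frames with a voicing error or a gross pitch error."""
    _check(ref_f0, syn_f0)
    errors = 0
    for r, s in zip(ref_f0, syn_f0):
        if (r > 0) != (s > 0):
            errors += 1  # voicing decision error
        elif r > 0 and abs(s - r) > tol * r:
            errors += 1  # gross pitch error on a frame voiced in both
    return errors / len(ref_f0)


# Toy F0 tracks in Hz; 0 marks an unvoiced frame (assumed convention).
ref = [0.0, 100.0, 100.0, 200.0, 0.0]
syn = [0.0, 110.0, 130.0, 0.0, 50.0]

print("VDE:", voicing_decision_error(ref, syn))
print("GPE:", gross_pitch_error(ref, syn))
print("FFE:", f0_frame_error(ref, syn))
```

In the toy example two of five frames disagree on voicing, and one of the two frames voiced in both tracks is off by more than 20%, so FFE counts three erroneous frames out of five. In practice the two signals usually differ in length and need to be time-aligned before frame-wise comparison, a step this sketch omits.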

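Experiments: results
● The proposed method (TANH-128) performs best
● Prosody transfer also works between different speakers
  → This confirms that the model is not simply passing the reference encoder's input through to the output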

Experiments: results
● The proposed method (TANH-128) performs best

Experiments: synthesizing text that differs from the text spoken in the reference
Text and audio from https://ai.googleblog.com/2018/03/expressive-speech-synthesis-with.html

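Experiments: are the prosody embedding and the speaker embedding disentangled?
● A male-voice reference combined with a female-voice speaker embedding
  ○ Ideally, the output is a female voice imitating the male speaker's way of speaking
  ○ In practice, the female voice drifts toward a male-sounding voice
● Running speaker identification on the output audio
  ○ 61% of outputs are judged to be the reference speaker and 21% the target speaker
    ■ Ideally, 100% would be judged to be the target speaker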

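Summary
● Controlling prosody is an open challenge for speech synthesis
● The paper proposes a Reference Encoder to enable prosody transfer
  ○ It can be trained end-to-end together with an existing speech synthesis model
● Prosody can be transferred even from speakers that are not in the training data
● Disentangling prosody from the text and from the speaker embedding remains a challenge
  ○ When the text spoken in the reference differs greatly from the input text, prosody transfer is difficult
  ○ Male prosody + female target → the output voice ends up sounding male
● Points for discussion
  ○ In a problem setting with multiple modalities or inputs, how can we keep the information taken from each modality from overlapping?
    ■ If you have suggestions, such as a paper worth reading, please share them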