January 28, 25
Slide overview
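Extraction of Dynamic Topics from Text Series and Its Application to Volatility Forecasting
Yoshinori Kawasaki (The Institute of Statistical Mathematics)
Takayuki Morimoto (School of Science, Kwansei Gakuin University)
January 25, 2025
Institute of Statistical Mathematics Joint Research Meeting "Big Data Analysis and Reproducible Research"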
This talk is a continuation of our 2017 APFM paper:
• Morimoto, T. and Kawasaki, Y. (2017). Forecasting Financial Market Volatility Using a Dynamic Topic Model. Asia-Pacific Financial Markets, Vol. 24, pp. 149-167. DOI: 10.1007/s10690-017-9228-z
Motivation
• Counts of keywords sometimes help, e.g. the Google SVI (Search Volume Index), but one has to find good keywords first.
• From news (text) data, we want to extract topics (each defined by a distribution over words) that may affect market sentiment.
• Construct a topic score time series $SC_t$.
• Investigate whether $SC_t$ improves volatility forecasting.
• Seek a more effective specification of the multiscale dynamic topic model.
Illustration: topic score and realized volatility
[Figure: realized volatility estimated from high-frequency data, shown together with an estimated topic score (one of 20 scores)]
“Bag-of-Words” model
• We focus only on word frequencies and neglect other information (word order, dependency, and so on).
• (Ex.) The document $D$ = “It is fine today” can be expressed as $D$ = {“it”, “is”, “fine”, “today”}.
• Usually we exclude so-called “stop words” such as “a”, “the”, “for”, etc.
• In this research, after morphological analysis, we keep nouns only, and remove numerals, suffixes, non-independent words, pronouns and symbols.
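The bag-of-words representation above can be sketched in a few lines. This is only an illustration under simple assumptions: the function name `bag_of_words` and the whitespace tokenizer are hypothetical, and the stop-word set is just the three examples from the slide. The study itself works on Japanese text with morphological analysis, which this sketch does not attempt.

```python
from collections import Counter

# Stop words taken from the slide's examples; a real list would be much longer.
STOP_WORDS = {"a", "the", "for"}

def bag_of_words(text, stop_words=STOP_WORDS):
    """Count words in `text`, ignoring order, case and stop words."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    return Counter(tok for tok in tokens if tok not in stop_words)

bow = bag_of_words("It is fine today")
print(dict(bow))
```

Only the counts survive: any reordering of the same words yields the same bag.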
Latent Dirichlet Allocation Model
• A standard method for topic modeling
• Often abbreviated as LDA
• Word occurrences follow a multinomial distribution (this gives the likelihood)
• A Dirichlet distribution gives the prior on word frequencies
• The word distribution $\phi_z$ characterizes topic $z$, and each document $d$ is a mixture of topics with topic distribution $\theta_d$.
Typical MCMC cycle for LDA
This algorithm is for a single document. We do this day by day for Reuters news, and want to ensure some continuity of topics along the time axis.
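The diagram of the MCMC cycle is not reproduced in this text, so the following is a generic sketch rather than the authors' algorithm. It assumes collapsed Gibbs sampling, a common choice for LDA, with symmetric Dirichlet hyperparameters `alpha` and `beta`. The function name `lda_gibbs` and all default values are hypothetical.

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of lists of word ids in [0, n_vocab).
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = np.random.default_rng(seed)
    n_dz = np.zeros((len(docs), n_topics))  # topic counts per document
    n_zw = np.zeros((n_topics, n_vocab))    # word counts per topic
    n_z = np.zeros(n_topics)                # total tokens per topic

    def update(d, w, k, delta):
        n_dz[d, k] += delta
        n_zw[k, w] += delta
        n_z[k] += delta

    # Random initial topic assignment for every token.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            update(d, w, k, +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, z[d][i], -1)  # remove the current assignment
                # Full conditional of the token's topic, up to a constant.
                p = (n_dz[d] + alpha) * (n_zw[:, w] + beta) / (n_z + n_vocab * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                update(d, w, k, +1)

    theta = (n_dz + alpha) / (n_dz + alpha).sum(axis=1, keepdims=True)
    phi = (n_zw + beta) / (n_zw + beta).sum(axis=1, keepdims=True)
    return theta, phi

docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3]]
theta, phi = lda_gibbs(docs, n_topics=2, n_vocab=4, n_iter=30)
print(theta.round(2))
```

Running such a sampler independently for each day gives no link between the topics of consecutive days, which is the continuity problem the slide raises.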
Multiscale Dynamic Topic Model
• Proposed by Iwata et al. (2010)
• The parameter $\phi_{t,z}$ of the word distribution of topic $z$ at time $t$ has a time-dependent structure:
$$\phi_{t,z} \sim \mathrm{Dirichlet}\left( \sum_{s=0}^{S} \lambda_{t,z,s}\, \hat{\omega}^{(s)}_{t-1,z} \right)$$
• $\hat{\omega}^{(s)}_{t-1,z}$ : word distribution of topic $z$ at scale $s$ at time $t-1$
• $\lambda_{t,z,s}$ : weight for scale $s$ in topic $z$ at time $t$
Original specification in Iwata et al.
• $\hat{\omega}^{(s)}_{t-1,z}$ denotes the word distribution (w.d.) from epoch $t - 1 - 2^{s-1} + 1$ (that is, $t - 2^{s-1}$) through $t - 1$.
• If $S = 4$, $s$ runs through 0, 1, 2, 3, 4.
• $s = 4$ → w.d. comes from $t-8$ to $t-1$
• $s = 3$ → w.d. comes from $t-4$ to $t-1$
• $s = 2$ → w.d. comes from $t-2$ to $t-1$
• $s = 1$ → w.d. comes from $t-1$ only
• $s = 0$: assume a uniform distribution for $\hat{\omega}^{(0)}_{t-1,z}$
Illustration of Multiscale Word Distribution
Word distributions tend to become smoother as the time scale gets longer.
Iwata, T. et al. (2010). Proceedings of the 16th ACM SIGKDD, pp. 663-672.
HAR-like specification (key idea)
• $\hat{\omega}^{(s)}_{t-1,z}$ denotes the word distribution (w.d.), and we consider three different spans.
• $S = 3$, and $s$ runs through 0, 1, 2, 3.
• $s = 3$ → w.d. comes from $t-22$ to $t-1$
• $s = 2$ → w.d. comes from $t-5$ to $t-1$
• $s = 1$ → w.d. comes from $t-1$ only
• $s = 0$: assume a uniform distribution for $\hat{\omega}^{(0)}_{t-1,z}$
In this talk, we call this specification Heterogeneous MDTM.
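The two windowing schemes (powers of two in the original specification; 1, 5 and 22 epochs in the HAR-like one) differ only in the window lengths fed into the Dirichlet parameter. The sketch below makes that explicit. It assumes that each scale's word distribution is the empirical word frequency of the topic over its window, which is a simplification; the function names, the `counts` array and the fallback to a uniform distribution for empty windows are all hypothetical.

```python
import numpy as np

def window_lengths(kind, S):
    """Window length in epochs for scales s = 1..S."""
    if kind == "iwata":  # 1, 2, 4, ..., 2^(S-1)
        return [2 ** (s - 1) for s in range(1, S + 1)]
    if kind == "har":    # previous day, week (5 days), month (22 days)
        return [1, 5, 22][:S]
    raise ValueError(f"unknown specification: {kind}")

def dirichlet_parameter(counts, t, lam, lengths):
    """Return sum_s lam[s] * omega^(s)_{t-1} for a single topic.

    counts:  (T, V) array of word counts assigned to the topic per epoch.
    lam:     weights (lambda_0, ..., lambda_S).
    lengths: window length for each scale s >= 1.
    """
    V = counts.shape[1]
    uniform = np.full(V, 1.0 / V)
    param = lam[0] * uniform  # s = 0: uniform word distribution
    for s, length in enumerate(lengths, start=1):
        window = counts[max(t - length, 0):t]  # epochs t-length, ..., t-1
        total = window.sum()
        omega = window.sum(axis=0) / total if total > 0 else uniform
        param = param + lam[s] * omega
    return param

print(window_lengths("iwata", 4))  # window lengths for S = 4
print(window_lengths("har", 3))    # window lengths for the HAR-like case
```

Under this reading, switching from the original model to the heterogeneous one changes nothing in the sampler except the list returned by `window_lengths`.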
MCMC cycle for MDTM
The weights $\lambda_{t,z,s}$ and the hyperparameter $\alpha_{t-1,z}$ are estimated in an outer loop of this cycle, by the stochastic EM algorithm and a fixed-point iteration method.
Construction of topic scores
• Topic scores are built from the estimated topic proportions $\theta_{t,j,i}$ (the percentage of topic $i$ included in the $j$-th document at time $t$):
$$SC_t^i = \sum_{j=1}^{D_t} \theta_{t,j,i}$$
• $SC_t^i$ : score for topic $i$ at time $t$
• $D_t$ : number of documents at time $t$
• $\theta_{t,j,i}$ : $i$-th element of the topic distribution within the $j$-th document at time $t$
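Given estimated topic proportions, the score is a column sum over the documents of each day. A minimal sketch, assuming the proportions are available as one array per day (the name `topic_scores` and the input layout are hypothetical):

```python
import numpy as np

def topic_scores(theta_by_day):
    """Sum topic proportions over the documents of each day.

    theta_by_day: list over days t of (D_t, K) arrays, where row j holds the
                  topic proportions of the j-th document on that day.
    Returns a (T, K) array whose entry [t, i] is the score of topic i on day t.
    """
    return np.vstack([theta.sum(axis=0) for theta in theta_by_day])

theta_by_day = [
    np.array([[0.7, 0.3], [0.2, 0.8]]),  # day 1: two documents, two topics
    np.array([[0.5, 0.5]]),              # day 2: one document
]
print(topic_scores(theta_by_day))
```

Because the score is a sum rather than a mean, a day with more documents on a topic gets a larger score, so the series also reflects news volume.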
Word distribution (June 2, 2008)
• We consider 20 topics in all.
• Word distributions of Topic 1 and Topic 2 are shown.
• Only the top 10 words are shown.
Data
• High-frequency data of a stock index (TOPIX)
• January 7, 2008 to December 28, 2012, $T = 1223$
• Compute 1-minute returns and calculate daily realized volatility ($RV_t$) and realized quarticity ($RQ_t$)
• News data taken from the Reuters Japan website
• Language = Japanese
• 298,205 documents, 24,227 distinct words excluding stop words
Forecasting models
• Heterogeneous Autoregressive (HAR) model: the baseline model, Corsi (2009)
• HARQ model: adds realized quarticity ($RQ_{t-1}$) to the coefficient of $RV_{t-1}$, Bollerslev, Patton and Quaedvlieg (2016)
• HAR + topic score (HAR-SC)
• HARQ + topic score (HARQ-SC)
• In our 2017 paper we compared AR vs. AR-SC and ARQ vs. ARQ-SC; those comparisons are omitted here.
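Aggregation of high-frequency data
• Let $r_{t,i}$ be a return series at (for example) 1-minute intervals constructed from the price of some financial asset: the $i$-th return on day $t$. The raw data are recorded in seconds, but are aggregated to an equally spaced one-minute grid.
• Let $n_t$ denote the number of returns defined within day $t$.
• The realized volatility (RV) of day $t$ is defined by $RV_t = \sum_{i=1}^{n_t} r_{t,i}^2$.
• Similarly, the realized quarticity (RQ) is defined by $RQ_t = \frac{n_t}{3} \sum_{i=1}^{n_t} r_{t,i}^4$. That is, it is the sample fourth moment of the returns on day $t$, and it corresponds to the sample variance of $RV_t$.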
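The two definitions translate directly into code. The sketch below assumes log returns computed from equally spaced intraday prices, which the slide does not specify; the function name `realized_measures` is hypothetical.

```python
import numpy as np

def realized_measures(prices):
    """Realized volatility and realized quarticity for one trading day.

    prices: 1-D array of equally spaced (e.g. 1-minute) intraday prices.
    """
    r = np.diff(np.log(prices))    # intraday log returns (assumption)
    n = r.size                     # number of returns within the day
    rv = np.sum(r ** 2)            # RV_t = sum of squared returns
    rq = n / 3.0 * np.sum(r ** 4)  # RQ_t = (n_t / 3) * sum of fourth powers
    return rv, rq

# Toy day with three returns: 0.01, -0.01, 0.02
prices = np.exp(np.cumsum([0.0, 0.01, -0.01, 0.02]))
rv, rq = realized_measures(prices)
print(rv, rq)
```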
HAR vs. HAR-SC
• The HAR-SC model is defined by
$$RV_t = \beta_0 + \beta_1 RV_{t-1} + \beta_2 RV_{t-1|t-5} + \beta_3 RV_{t-1|t-22} + \gamma SC_{t-1} + u_t,$$
where $RV_{t-j|t-h} = \frac{1}{h+1-j} \sum_{i=j}^{h} RV_{t-i}$.
• Omitting $\gamma SC_{t-1}$ reduces it to the HAR model.
HARQ vs. HARQ-SC
• The HARQ-SC model is defined by
$$RV_t = \beta_0 + (\beta_1 + \beta_{1Q} RQ_{t-1}^{1/2}) RV_{t-1} + \beta_2 RV_{t-1|t-5} + \beta_3 RV_{t-1|t-22} + \gamma SC_{t-1} + u_t,$$
where $RV_{t-j|t-h} = \frac{1}{h+1-j} \sum_{i=j}^{h} RV_{t-i}$.
• Omitting $\gamma SC_{t-1}$ reduces it to the HARQ model.
HAR-HARSC: another complication
• The HAR-HARSC model is defined by
$$RV_t = \beta_0 + \beta_1 RV_{t-1} + \beta_2 RV_{t-1|t-5} + \beta_3 RV_{t-1|t-22} + \gamma_1 SC_{t-1} + \gamma_2 SC_{t-1|t-5} + \gamma_3 SC_{t-1|t-22} + u_t,$$
where $SC_{t-j|t-h} = \frac{1}{h+1-j} \sum_{i=j}^{h} SC_{t-i}$.
HARQ-HARSC: yet another complication
• The HARQ-HARSC model is defined by
$$RV_t = \beta_0 + (\beta_1 + \beta_{1Q} RQ_{t-1}^{1/2}) RV_{t-1} + \beta_2 RV_{t-1|t-5} + \beta_3 RV_{t-1|t-22} + \gamma_1 SC_{t-1} + \gamma_2 SC_{t-1|t-5} + \gamma_3 SC_{t-1|t-22} + u_t,$$
where $SC_{t-j|t-h} = \frac{1}{h+1-j} \sum_{i=j}^{h} SC_{t-i}$.
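The four regressions (HAR-SC, HARQ-SC, HAR-HARSC, HARQ-HARSC) share one design matrix with optional columns, so they can be sketched with a single least-squares routine. This is an illustrative sketch, not the authors' code. The names `lag_mean`, `design_row` and `fit` are hypothetical, estimation by ordinary least squares is an assumption, and the quarticity interaction is taken as $RQ_{t-1}^{1/2} RV_{t-1}$ following Bollerslev, Patton and Quaedvlieg (2016), which is also an assumption about the intended specification.

```python
import numpy as np

def lag_mean(x, t, j, h):
    """Average of x[t-h], ..., x[t-j], i.e. x_{t-j|t-h}."""
    return x[t - h:t - j + 1].mean()

def design_row(rv, rq, sc, t, quarticity=False, har_sc=False):
    """Regressors for day t; the flags switch on the optional terms."""
    row = [1.0, rv[t - 1], lag_mean(rv, t, 1, 5), lag_mean(rv, t, 1, 22)]
    if quarticity:  # HARQ term: sqrt(RQ_{t-1}) * RV_{t-1}
        row.append(np.sqrt(rq[t - 1]) * rv[t - 1])
    row.append(sc[t - 1])  # previous day's topic score
    if har_sc:      # weekly and monthly averages of the topic score
        row += [lag_mean(sc, t, 1, 5), lag_mean(sc, t, 1, 22)]
    return row

def fit(rv, rq, sc, quarticity=False, har_sc=False):
    """OLS fit; returns (coefficients, in-sample fitted values for t >= 22)."""
    X = np.array([design_row(rv, rq, sc, t, quarticity, har_sc)
                  for t in range(22, len(rv))])
    coef, *_ = np.linalg.lstsq(X, rv[22:], rcond=None)
    return coef, X @ coef

# Synthetic inputs, used only to show the call pattern.
rng = np.random.default_rng(0)
rv = rng.uniform(0.5, 1.5, 120)
rq = rng.uniform(0.1, 0.3, 120)
sc = rng.uniform(0.0, 10.0, 120)
coef, fitted = fit(rv, rq, sc, quarticity=True, har_sc=True)
print(coef.shape)
```

The first 22 observations are dropped because the monthly average needs 22 lags.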
In-Sample Forecasting Comparison
Nikkei Index 2008-2012

Loss (window)   Scale     Topic#   Model        Error
MSE (RW)        H-MDTM    19       HARQ-HARSC   0.692
MSE (IW)        H-MDTM    4        HARQ-HARSC   0.971
QLIKE (RW)      H-MDTM    19       HAR-HARSC    0.973
QLIKE (IW)      MDTM(4)   14       HAR-HARSC    0.990

The accumulated value of the error function is rescaled by that of the HAR model.
MSE: $L(RV_t, X_t) = (RV_t - X_t)^2$
QLIKE: $L(RV_t, X_t) = \frac{RV_t}{X_t} - \log \frac{RV_t}{X_t} - 1$
IW: increasing window in regression
RW: rolling regression with fixed window size
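The two loss functions and the rescaling by the HAR benchmark can be sketched as follows. The function names are hypothetical and the numbers in the example are made up purely to show the calls.

```python
import numpy as np

def mse_loss(rv, x):
    """Squared error between realized volatility rv and forecast x."""
    return (rv - x) ** 2

def qlike_loss(rv, x):
    """QLIKE loss; zero when the forecast equals the realization."""
    ratio = rv / x
    return ratio - np.log(ratio) - 1.0

def relative_error(loss, rv, forecast, benchmark):
    """Accumulated loss of a forecast divided by that of the benchmark."""
    return loss(rv, forecast).sum() / loss(rv, benchmark).sum()

rv = np.array([1.0, 2.0])
forecast = np.array([1.1, 1.9])   # made-up candidate forecast
benchmark = np.array([1.2, 1.8])  # made-up benchmark (HAR) forecast
print(relative_error(mse_loss, rv, forecast, benchmark))
print(relative_error(qlike_loss, rv, forecast, benchmark))
```

A value below 1 means the candidate accumulates less loss than the benchmark.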
In-Sample Forecasting Comparison
TSE Bank Sector Index 2008-2012

Loss (window)   Scale     Topic#   Model        Error
MSE (RW)        H-MDTM    19       HARQ-HARSC   0.631
MSE (IW)        MDTM(3)   13       HARQ-SC      0.790
QLIKE (RW)      H-MDTM    19       HAR-HARSC    0.886
QLIKE (IW)      H-MDTM    19       HAR-HARSC    0.958

The accumulated value of the error function is rescaled by that of the HAR model. The losses (MSE, QLIKE) and window schemes (IW, RW) are as defined on the preceding slide.
Quantitative Comparison of Forecasting
Model Confidence Set (MCS) by Hansen, Lunde and Nason (2011)
• $\mathcal{M}_0 = \{1, 2, \ldots, m_0\}$ : initial model set
• A large discrepancy in the values of the error function indicates a difference in forecasting ability.
• $d_{ij,t} = \mathcal{L}(v_t, \hat{v}_{it}) - \mathcal{L}(v_t, \hat{v}_{jt})$, $t = 1, \ldots, T$
• Test the null hypothesis $H_0 : \mathrm{E}[d_{ij,t}] = 0$, $\forall i, j \in \mathcal{M} \subset \mathcal{M}_0$, $i > j$
• Perform the joint test of “all are equal”. If it is rejected, we eliminate the most inferior model from $\mathcal{M}$.
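Eliminating models, constructing MCS
• Following Hansen et al. (2011), we employ the test statistic
$$\max_{i,j \in \mathcal{M}} \frac{|\bar{d}_{ij}|}{\sqrt{\mathrm{var}(\bar{d}_{ij})}}, \quad \text{where } \bar{d}_{ij} = \frac{1}{T} \sum_{t=1}^{T} d_{ij,t}.$$
• $\bar{d}_{ij}$ and $\mathrm{var}(\bar{d}_{ij})$ are calculated by block bootstrap.
• Elimination rule: set $|\mathcal{M}| = m$ and $\bar{d}_{i} = \sum_{j \in \mathcal{M}} \bar{d}_{ij} / (m - 1)$, and eliminate the model $i^*$ which satisfies
$$i^* = \operatorname*{argmax}_{i \in \mathcal{M}} \frac{\bar{d}_{i}}{\sqrt{\mathrm{var}(\bar{d}_{i})}}.$$
Then repeat the procedure after shrinking $\mathcal{M}$.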
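One round of the procedure can be sketched as below. This is a simplified illustration, not the authors' implementation or a full reproduction of Hansen et al. (2011). It assumes a circular block bootstrap, estimates variances as mean squared deviations of the bootstrap means, and takes the p-value from the recentred bootstrap statistics. The names `block_bootstrap_means` and `mcs_step`, the block length and the number of resamples are all hypothetical.

```python
import numpy as np

def block_bootstrap_means(d, block, n_boot, rng):
    """Means of circular-block-bootstrap resamples of d along axis 0."""
    T = d.shape[0]
    n_blocks = -(-T // block)  # ceil(T / block)
    means = []
    for _ in range(n_boot):
        starts = rng.integers(T, size=n_blocks)
        idx = (starts[:, None] + np.arange(block)[None, :]).ravel()[:T] % T
        means.append(d[idx].mean(axis=0))
    return np.array(means)

def mcs_step(losses, block=10, n_boot=500, seed=0):
    """One elimination round on a (T, m) matrix of losses.

    Returns (test statistic, bootstrap p-value, index of the model to drop).
    """
    rng = np.random.default_rng(seed)
    T, m = losses.shape
    d = losses[:, :, None] - losses[:, None, :]  # d_{ij,t}, shape (T, m, m)
    d_bar = d.mean(axis=0)
    boot = block_bootstrap_means(d, block, n_boot, rng)  # (n_boot, m, m)
    var_ij = ((boot - d_bar) ** 2).mean(axis=0)
    off = ~np.eye(m, dtype=bool)  # skip i == j, where d is identically zero
    t_ij = np.zeros((m, m))
    t_ij[off] = d_bar[off] / np.sqrt(var_ij[off])
    stat = np.abs(t_ij).max()
    # Bootstrap distribution of the statistic, recentred to mimic the null.
    boot_t = np.zeros_like(boot)
    boot_t[:, off] = (boot - d_bar)[:, off] / np.sqrt(var_ij[off])
    p_value = (np.abs(boot_t).max(axis=(1, 2)) >= stat).mean()
    # Elimination rule: largest standardized average loss differential.
    d_i = d_bar.sum(axis=1) / (m - 1)
    boot_i = boot.sum(axis=2) / (m - 1)
    var_i = ((boot_i - d_i) ** 2).mean(axis=0)
    worst = int(np.argmax(d_i / np.sqrt(var_i)))
    return stat, p_value, worst

rng = np.random.default_rng(1)
losses = rng.normal(size=(200, 3)) ** 2
losses[:, 2] += 1.0  # make the third model clearly worse
print(mcs_step(losses))
```

In a full procedure this step would be repeated on the surviving columns until the equality hypothesis is no longer rejected at the chosen level; the remaining models form the confidence set.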
Preliminary results of MCS
The most frequently chosen model is HARQ-HARSC.
Scale 6 means $S = 6$, so it covers the lags 1, 2, 4, 8, 16, 32. Taking larger lags deteriorates the forecasting performance, although Scale 9 sometimes performs best.