Modeless Japanese Input Method

June 16, 2015

Academic conference

Slide overview

Presentation about Modeless Japanese Input Method

Yukino Ikegami

@yukinoi

Text of each page

Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary
Yukino Ikegami, Setsuo Tsuruta
2014/01/20

Necessity of Japanese Input Method
• Japanese has many characters
  – Kana
    • Hiragana: 81 characters, e.g. いろはにほへと
    • Katakana: 81 characters, e.g. イロハニホヘト
  – Kanji (Chinese characters)
    • More than 6,000 characters, e.g. 以呂波仁保反止
• These characters cannot be typed directly on a keyboard
➢ A Japanese input method (converting alphabetic input into Japanese characters) is necessary

If every Japanese character were assigned its own key…
• Far too many keys!
• A Japanese input method is necessary

Japanese Input Method: Roman to Kana-Kanji Converter
• Flow
  1. Receive the romanized alphabetic input: ① nekodesu
  2. Convert the romanized input into Kana using a Roman-to-Kana table: ② ねこです
  3. Convert Kana into Kanji (if necessary): ③ 猫です
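Steps 1 and 2 of this flow can be sketched in Python as a greedy longest-match lookup. This is only an illustration: the table is a tiny hypothetical subset of a Roman-to-Kana table, just large enough for the slide's examples, and the name `roman_to_kana` is an assumption, not something from the slides. Kana-to-Kanji conversion (step 3) is left out.

```python
# Hypothetical, minimal Roman-to-Kana table; a real table has hundreds of entries.
ROMAN_TO_KANA = {
    "ne": "ね", "ko": "こ", "de": "で", "su": "す",
    "ha": "は", "re": "れ", "a": "あ",
}


def roman_to_kana(text, table=ROMAN_TO_KANA):
    """Convert romanized input to Kana by greedy longest match."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No table entry matched: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(roman_to_kana("nekodesu"))
```

With this toy table, `roman_to_kana("nekodesu")` yields ねこです, matching ② on the slide.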

Problems with Japanese Input Methods
• The user must switch input modes between Japanese and ASCII
  e.g. to input ‘あれは8Byteです’ (That is 8 bytes):
  areha [Return] [ASCII Mode] 8Byte [Japanese Mode] desu
  (two mode switches)
• Switching is cumbersome!

Adding Terms to the Dictionary to Address the Mode-Switching Problem
• Add terms from other languages to the dictionary of a conventional input method editor
• Shortcomings
  – New terms are coined continuously
  – Homograph problem

Related Work
• Modeless Pinyin-Chinese Input [Chen et al. 2000]
  – Converts alphabet (Pinyin) to Chinese
  – Uses only word-surface features for classification
• Type-Any [Ehara et al. 2009]
  – Converts alphabet to any language
  – Requires pressing a delimiter key when converting
  – Uses only word-surface features for classification

Approach: Modeless Japanese Input Method
• Automatically switch the input mode
  1. Generate a discriminative model with a Support Vector Machine (SVM)
     – The model uses multiple n-gram features
  2. Use the discriminative model to classify each segment of the alphabet sequence as Kana or not
     – e.g. nekohacatdesu → nekoha / cat / desu → ねこはcatです
       (Japanese / English / Japanese)
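The segmentation in step 2 can be illustrated with a short Python sketch. It assumes the classifier has already produced one label per input character; here the labels are written by hand rather than predicted by an SVM, and the label letters "J" (Japanese) and "E" (non-Japanese) as well as the function name `split_segments` are hypothetical.

```python
from itertools import groupby


def split_segments(text, labels):
    """Group consecutive characters that share the same label."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    segments = []
    for label, group in groupby(zip(text, labels), key=lambda pair: pair[1]):
        segments.append(("".join(ch for ch, _ in group), label))
    return segments


# Hand-written labels standing in for classifier output (an assumption).
text = "nekohacatdesu"
labels = "JJJJJJEEEJJJJ"
print(split_segments(text, labels))
```

Only the "J" segments would then be passed to Kana conversion, while the "E" segment is left as typed.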

Main flow of Modeless Japanese Input Method
[Flowchart. Elements, in extraction order: User input (alphabet sequence); Non-Japanese dictionary; Kana conversion; Discriminative model; loop over each character in the user input with the test "is the character still ASCII?" (True / False branches); Kana conversion; System response (Kana and alphabet sequence)]

Flow of Generating Discriminative Model
1. Load texts
   • 猫はcatです
2. Kanji to Kana
   • Using a Japanese morphological analyzer (MeCab)
   • ネコハcatデス
3. Kana to ASCII
   • Using a Kana-to-ASCII table (the one used by Google Japanese Input)
   • nekohacatdesu
4. ASCII to n-gram
   • Character-surface: ne, ek, nek, ko, eko, oh, koh, ha, oha...
   • Character-type: LL, LL, LLL, LL, LLL, LL, LLL...
   • History: KK, KK, KKK, KK, KKK, KKK...
5. n-gram to ID
   • 1, 3, 4, 13, 22...
6. Describe as binary model
   • 1:1, 3:1, 4:1, 13:1, 32:1...
7. Learning on SVM
   • 1.344, 0.691, 0.023, -1.398...
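The "ASCII to n-gram", "n-gram to ID" and binary-encoding steps can be sketched as follows. This is a minimal illustration under stated assumptions: it covers only character-surface 2-grams and 3-grams, assigns IDs in first-seen order (so the IDs will not match the ones on the slide), and writes the features in the sparse `id:1` notation shown on the slide. The function names are hypothetical and SVM training itself is not shown.

```python
def char_ngrams(text, n_values=(2, 3)):
    """List the n-grams ending at each position, shorter n first."""
    grams = []
    for end in range(1, len(text) + 1):
        for n in n_values:
            if end - n >= 0:
                grams.append(text[end - n:end])
    return grams


def to_sparse_line(grams, vocab):
    """Map n-grams to integer IDs and emit sorted binary 'id:1' features."""
    ids = set()
    for gram in grams:
        if gram not in vocab:
            vocab[gram] = len(vocab) + 1  # IDs start at 1 (assumed convention)
        ids.add(vocab[gram])
    return " ".join(f"{i}:1" for i in sorted(ids))


vocab = {}
grams = char_ngrams("nekoha")
print(grams)
print(to_sparse_line(grams, vocab))
```

For the prefix "nekoha", `char_ngrams` lists the surface n-grams in the same order as the slide: ne, ek, nek, ko, eko, oh, koh, ha, oha.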

n-gram Features
Text:  あれは8Byte
Input: areha8Byte
(with n-gram upper limit n = 2, window size m = 2, focus point x_i = the 2nd “a”)
• Character-Surface
  – Substrings before and after the focus point
  – e.g. -2/ha -1/a8 0/8B 1/By
• Character-Type
  – Upper-case (U), Lower-case (L), Number (N), and Symbol (S)
  – e.g. -2/LL -1/LN 0/NU 1/UL
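The example features above can be reproduced with a small Python sketch. The offset convention is inferred from the slide's example rather than stated in it: the bigram labelled with offset k is assumed to start at index focus + k + 1. Only bigrams are generated, and the function names are hypothetical.

```python
def char_type(c):
    """Map a character to U (upper), L (lower), N (number) or S (symbol)."""
    if c.isupper():
        return "U"
    if c.islower():
        return "L"
    if c.isdigit():
        return "N"
    return "S"


def window_bigram_features(text, focus, m=2):
    """Return (surface, type) bigram features in a window around `focus`."""
    surface, types = [], []
    for offset in range(-m, m):
        start = focus + offset + 1  # assumed alignment, inferred from the slide
        if start < 0 or start + 2 > len(text):
            continue  # skip bigrams that fall outside the input
        gram = text[start:start + 2]
        surface.append(f"{offset}/{gram}")
        types.append(f"{offset}/" + "".join(char_type(c) for c in gram))
    return surface, types


# The focus point is the 2nd "a" of "areha8Byte", i.e. index 4.
print(window_bigram_features("areha8Byte", focus=4))
```

Under that assumption the sketch returns the same surface and type features as the slide's example.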

Generating Non-Japanese Dictionary
• Words that never appear in Japanese-only text
  – More than 5 characters long
  – Contain a substring that cannot be converted to Kana
• Source
  – Corpus of Contemporary American English (COCA)
  – Japanese Wikipedia article title list

Comparison with Conventional IME
• Conventional method
  areha [Return] [Alphabet Mode] 8Byte [Japanese Mode] desu
  (two mode switches)
  Keystrokes: 17
• Modeless Japanese input method
  areha8Bytedesu
  Keystrokes: 14
• The number of keystrokes is reduced
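The two keystroke counts can be checked with a few lines of Python. The sketch assumes that each bracketed item ([Return] or a mode switch) costs one key press and that Shift is not counted separately, which is the reading that reproduces the slide's figures; the function name is hypothetical.

```python
def count_keystrokes(parts):
    """Count one press per bracketed special key and one per typed character."""
    return sum(1 if part.startswith("[") else len(part) for part in parts)


conventional = ["areha", "[Return]", "[Alphabet Mode]", "8Byte", "[Japanese Mode]", "desu"]
modeless = ["areha8Bytedesu"]

print(count_keystrokes(conventional))  # 5 + 1 + 1 + 5 + 1 + 4 = 17
print(count_keystrokes(modeless))      # 14
```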

Datasets Used in Evaluation Experiment
• Generating the model and evaluating the method
  – Balanced Corpus of Contemporary Written Japanese (BCCWJ)
    • Books, magazines, blogs, government documents, and others
• Non-Japanese dictionary source
  – COCA
  – Japanese Wikipedia article title list

Criteria

Results of Evaluation
Baseline: character-surface n-gram
Proposed method: character-{surface, type} n-gram and dictionary

                   Baseline   Proposed method
Kana Precision     .998       .999
ASCII Precision    .989       .996

Kana Recall, ASCII Recall, Kana F1-measure, ASCII F1-measure (values in extraction order): .993 .780 .998 .884 .953 .858 .968 .924

• The proposed method outperforms the baseline

User test
• 4 females and 7 males
• Input example sentences (chat, mail, technical text)

Person No.:        1 2 3 4 5 6 7 8 9
Conventional IME:  18.18 17.89 15.4 12.71 11.09 10.18 11.42 12.38 10.48
Proposed method:   12.23 6.03 13.34 14.68 9.88 7.00 … 11.03 11.37 10.30

• The proposed method outperforms the conventional method

Summary
• Switching input modes is cumbersome
• Hybrid Modeless Japanese Input Method
  – Automatically switches the input mode between Japanese and ASCII
  – Uses an n-gram feature model for discrimination
    • character-{surface, type}
  – Outperforms conventional methods
