Manga Scene Estimation by Quiz and Answer

September 08, 24

Slide overview

When following a serialized manga, readers often look back over the storyline. Services exist that support re-reading through quizzes, but locating the part of the comic that a quiz refers to takes time and complicates the review process. In this study, we examined whether the scenes related to a quiz can be estimated from the quiz question, its answer, and manga-specific features. To achieve this, we extracted key elements from the comic and proposed two estimation methods: a word-based CS (cosine similarity) method and a context-based GPT method. We also discussed which scenes in comics can be estimated and which are difficult to estimate. The results showed that the pages containing the answers could be estimated with an accuracy of 66.7%. Pages containing specific keywords or events were easier to estimate, while those requiring an understanding of the comic's overall timeline and context were more difficult. In addition, the accuracy varied greatly depending on whether the answer text was included, which suggests that content close to the topic of the quiz can be estimated when important keywords such as the answer text are available.

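Satoshi Nakamura Laboratory, Department of Frontier Media Science, School of Interdisciplinary Mathematical Sciences, Meiji University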

Text of each page
1.

Manga Scene Estimation by Quiz Question and Answer
Tsubasa Sakurai, Yume Tanaka, Yuto Sekiguchi and Satoshi Nakamura
Graduate School of Advanced Mathematical Sciences, Meiji University

2.

Introduction
ComiQA [Tanaka et al. 2024]
⚫ A quiz service for comics
⚫ Users can create quizzes, answer them, and share them with other users
(Session Code: k24-506) ComiQA: A Comic Question-Answer Sharing System that Helps Users to Recollect the Content of Previous Volumes. Yume Tanaka, Yuto Sekiguchi, Tsubasa Sakurai and Satoshi Nakamura.

3.

Background
Research Question: Can an LLM support answering quizzes about comics?
[Figure: quiz text ("Q. What happened at △△? A. ○○") and comics as inputs to an LLM; the slide marks this with ✗]

4.

Contribution
⚫ We proposed a method for estimating specific scenes in comics from quiz-style texts
⚫ We showed that an LLM can support page-finding
⚫ We clarified the characteristics of the comic scenes that could and could not be estimated from the quiz

5.

Background
Requirements for an LLM to support answering the quiz
⚫ Finding scenes relevant to the quiz
⚫ Understanding the story and context needed to answer the quiz
➔ The LLM needs to find scenes matching quiz answers from limited information
Example quiz: "What was the reason the protagonist started playing volleyball?"
© Haruichi Furudate, Haikyu!!

6.

Related Work
Video-Scene Retrieval
⚫ Used the MSR-VTT video dataset with captions to capture scene order and object relationships using the QIK+ system [Zachariah et al. 2023]
⚫ Proposed an online cross-modal scene retrieval framework for streaming image and text data [Qi et al. 2017]
➔ Few studies have targeted comics data or treated quiz sentences as queries

7.

Research Purpose
⚫ Finding relevant scenes from the questions and answers of a quiz
⚫ Considering the elements of comics in the process
➔ We investigate whether an LLM can support answering quizzes
[Figure: stages shown on the slide: finding pages that match the quiz text; LLM to support answering the quiz; automatic quiz generation from comics]

8.

Dataset Construction
Quizzes were created evenly across the comic pages of eight works (sports genre)
⚫ One question each was created from the beginning, middle, and end
⚫ 5 collaborators, 138 questions in total
[Figure: Distribution of Quizzes per Page; label: Number of Quizzes]

9.

Estimation Method
Comparison of the two methods
⚫ CS method: Estimate the page using cosine similarity between texts
⚫ GPT method: From the top 5 pages ranked by the CS method, GPT estimates the most likely candidate page
[Figure: the quiz text ("Q. What happened at ○○? A. △△") is scored against each page (p. 1, p. 2, p. 52, p. 53, p. 105); the CS method selects the top 5 pages with the highest similarity, and GPT performs the page estimation among them]
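The two-stage flow on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the tie-breaking rule, and the toy similarity scores are assumptions, and the `selector` callable is a hypothetical stand-in for the GPT call.

```python
def top_k_pages(similarities, k=5):
    """Return the k page numbers with the highest similarity.

    Ties are broken by the lower page number (an assumption; the slides
    do not say how ties are handled).
    """
    return sorted(similarities, key=lambda p: (-similarities[p], p))[:k]


def estimate_page(similarities, selector=None, k=5):
    """Estimate the answer page from per-page similarity scores.

    With no selector this behaves like the CS method (best-scoring page).
    With a selector it behaves like the GPT method: the top-k candidates
    are handed to the selector, which picks one of them. The selector is
    a hypothetical placeholder for a call to an LLM.
    """
    candidates = top_k_pages(similarities, k)
    if selector is None:
        return candidates[0]
    return selector(candidates)


# Toy scores, invented for illustration only.
toy_similarities = {1: 0.1, 2: 0.2, 52: 0.1, 53: 0.6, 105: 0.1, 106: 0.05}
cs_estimate = estimate_page(toy_similarities)
```

Keeping the second stage behind a plain callable makes it possible to compare the CS-only result with any re-ranking step on the same candidate list.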

10.

Estimation Method
Quiz Text (Question + Answer) ⇋ Text Data of each page
The similarity between the texts is estimated using the TF-IDF values of the words in the comic elements on each page.
The following information from each page of the comic is used:
⚫ Lines information
⚫ Information obtained from illustrations
⚫ Characters information
[Figure: similarity between the quiz text and the lines, images, and characters text of each page (p. 1, p. 2, p. 52, p. 53, p. 105)]
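A TF-IDF cosine similarity between the quiz text and each page's text can be sketched in plain Python as below. The slides do not specify the tokenizer or the TF-IDF weighting, so both are assumptions here: tokens are extracted with a simple regular expression (Japanese text would need a morphological analyzer instead), and the IDF uses a common smoothed variant. The page texts are invented toy data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Placeholder tokenizer: lowercase word characters only.
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: log((1 + n) / (1 + df)) + 1 (an assumed weighting).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def page_similarities(quiz_text, pages):
    """Map each page number to its similarity with the quiz text.

    `pages` maps page number -> concatenated text of that page
    (lines, captions, character names).
    """
    page_numbers = list(pages)
    vectors = tfidf_vectors([quiz_text] + [pages[p] for p in page_numbers])
    quiz_vector = vectors[0]
    return {p: cosine(quiz_vector, vec) for p, vec in zip(page_numbers, vectors[1:])}


toy_pages = {
    1: "players warm up before practice",
    52: "the captain blocks a pass",
    53: "sigma scores with the flying squirrel shoot",
}
toy_quiz = "Q. What is the name of the shoot Sigma scored with? A. flying squirrel shoot"
toy_scores = page_similarities(toy_quiz, toy_pages)
```

Including the answer text in the query adds the keywords "flying", "squirrel", and "shoot", which is what lets the toy page 53 stand out from the others.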

11.

Estimation Method
Point of concern (GPT method): Handling of Copyrighted Works When Using LLM Services
[Figure: comics are not passed to the LLM service directly (✗); lines are detected and images (comic frames) are converted to textual data by a local model, and the lines, images, and characters information is passed on as JSON data]

12.

Estimation Method
Line information ➔ Recognize and detect all text information using OCR (mokuro [1])
Illustration information ➔ Generate image captions for each panel (BLIP [2], comic-panel-detectors API [3])
Character information ➔ Whether they appear on each page
[Figure labels: Line, Caption Generation, Appearance or Not]
[1] kha-white, "Mokuro", URL: https://github.com/kha-white/mokuro, Oct. 2023.
[2] Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi, "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation", Computer Vision and Pattern Recognition, no. 39, vol. 162, pp. 12888-12900, Jan. 2022.
[3] roboflow, "comic-panel-detectors API", URL: https://universe.roboflow.com/personal-ov9jg/comic-panel-detectors/model/7, Oct. 2023.
© Fuwai and Kyu Sakazuki, Sokyu Boys

13.

Estimation Method
Prompts used to estimate the page for a quiz and answer (GPT method)
INPUT (prompt):
Estimate from which page of the comic the following quiz (Question: "What is Hinata's weapon that Kageyama is killing by his toss?", Answer: "His agility.") was created. Note that the contents of the comic should refer to the information in JSON format shown below. {explanation of the JSON structure}. ... In the output, please describe the estimated "page number" and the "reason for selecting" for that page. The following is JSON data. ...
{
  { "page": "102", "line": ***, "appearance": ***, "image2text": *** },
  { "page": "184", "line": ***, "appearance": ***, "image2text": *** },
  ...
}
OUTPUT (example of output results):
Estimate page number is "184". Reason for selection: In the question text, the keyword "swiftness" is present, and the content of page 184 includes…
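Assembling such a prompt from candidate page records can be sketched as follows. The field names `page`, `line`, `appearance`, and `image2text` come from the slide; everything else is an assumption for illustration: the value types (lists of strings), the sentence describing the JSON structure, the helper name `build_prompt`, and the toy records. No LLM service is called.

```python
import json

# Wording follows the slide; the sentence describing the JSON structure
# is a stand-in for the elided "{explanation of the JSON structure}".
PROMPT_TEMPLATE = (
    "Estimate from which page of the comic the following quiz "
    '(Question: "{question}", Answer: "{answer}") was created. '
    "Note that the contents of the comic should refer to the information "
    "in JSON format shown below. Each record describes one page. "
    'In the output, please describe the estimated "page number" and the '
    '"reason for selecting" for that page. The following is JSON data.\n'
    "{pages_json}"
)


def build_prompt(question, answer, candidate_pages):
    """Build the prompt text for a list of candidate page records."""
    records = [
        {
            "page": str(page["page"]),
            "line": page["line"],              # OCR text of the page
            "appearance": page["appearance"],  # characters appearing on the page
            "image2text": page["image2text"],  # generated panel captions
        }
        for page in candidate_pages
    ]
    pages_json = json.dumps(records, ensure_ascii=False, indent=2)
    return PROMPT_TEMPLATE.format(
        question=question, answer=answer, pages_json=pages_json
    )


# Toy candidate records, invented for illustration only.
toy_candidates = [
    {"page": 102, "line": ["..."], "appearance": ["..."], "image2text": ["..."]},
    {"page": 184, "line": ["..."], "appearance": ["..."], "image2text": ["..."]},
]
toy_prompt = build_prompt(
    "What is Hinata's weapon that Kageyama is killing by his toss?",
    "His agility.",
    toy_candidates,
)
```

Passing `ensure_ascii=False` keeps Japanese lines readable in the prompt instead of escaping them.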

14.

Estimation Results
Accuracy of the two methods for each set of features used for estimation
Features: lines (Lines), image (Caption Generation, e.g. "A dark-haired girl smiles with her hands outstretched."), appearing (Appearance or Not)
Feature set    CS method    GPT method
1              55.1%        66.7%
2              52.8%        60.3%
3              50.0%        65.3%
4              53.6%        62.0%
[The check marks indicating which features each row uses are not recoverable from the extracted slide text]
© Fuwai and Kyu Sakazuki, Sokyu Boys
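One way to compute an accuracy figure of this kind is sketched below, assuming that an estimate counts as correct only when the estimated page equals the answer page exactly. The slides do not state the matching criterion, so that rule, the function name, and the toy data are all assumptions.

```python
def accuracy(estimated_pages, answer_pages):
    """Fraction of quizzes whose estimated page equals the answer page.

    Both arguments map a quiz id to a page number. A quiz with no
    estimate counts as incorrect.
    """
    if not answer_pages:
        raise ValueError("answer_pages must not be empty")
    hits = sum(
        1
        for quiz_id, page in answer_pages.items()
        if estimated_pages.get(quiz_id) == page
    )
    return hits / len(answer_pages)


# Toy data, invented for illustration only.
toy_answers = {"q1": 53, "q2": 184, "q3": 24, "q4": 10}
toy_estimates = {"q1": 53, "q2": 184, "q3": 30, "q4": 10}
toy_accuracy = accuracy(toy_estimates, toy_answers)
```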

19.

Distribution of Cosine Similarity
Examples of peaks in cosine similarity beyond the point where the quiz was created (quiz that can be estimated)
[Figure: distribution of cosine similarity with the answer page marked; annotation: ≒ degree of relevance to the quiz]

20.

Distribution of Cosine Similarity
Quizzes that could lead to subsequent developments (quiz that can be estimated)
Q. "What is the name of the shoot that Sigma scored a point with against the captain?" (Sokyu Boys)
A. Prongeon shoot (flying squirrel shoot)
[Figure: distribution of cosine similarity with the answer page marked]
© Fuwai and Kyu Sakazuki, Sokyu Boys

21.

Distribution of Cosine Similarity
Examples where the similarity is high on average across pages and the answer page is only the second or third candidate (quiz that cannot be estimated)
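The situation described here, where the answer page is a close second or third candidate, can be checked by ranking the pages by similarity. The sketch below is an illustration only: the function name, the tie-breaking rule, and the toy scores are assumptions.

```python
def answer_rank(similarities, answer_page):
    """Return the 1-based rank of the answer page by similarity.

    Pages are ordered from highest to lowest similarity; ties are broken
    by the lower page number (an assumption).
    """
    ranked = sorted(similarities, key=lambda p: (-similarities[p], p))
    return ranked.index(answer_page) + 1


# Toy scores: uniformly high similarity with no clear peak.
flat_similarities = {10: 0.42, 11: 0.45, 12: 0.44, 13: 0.40}
rank_of_answer = answer_rank(flat_similarities, answer_page=12)
gap_to_top = max(flat_similarities.values()) - flat_similarities[12]
```

A rank of 2 or 3 with a small gap to the top score means the top-1 CS estimate misses while the page is still among the top 5 candidates.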

22.

Distribution of Cosine Similarity
Quizzes with answers on pages without lines (quiz that cannot be estimated)
Q. "Who scored the goal at the end of the game in a 7-on-7 mini-game in the first round of Tokyo City Esperion's Youth Entrance Test?" (Aoashi)
A. Akinori Kaneda
[Figure: distribution of cosine similarity with the estimated page and the answer page marked]
© Yugo Kobayashi, Aoashi

23.

Distribution of Cosine Similarity
Quizzes in which the topic at the beginning of the quiz is revealed at the end of the story (quiz that cannot be estimated)
Q. "What does Kageyama Tobio's nickname, 'King on the Court,' mean?" (Haikyu!!)
A. The King of Self-Centeredness, Tyrannical Dictator
[Figure: distribution of cosine similarity with the estimated page and the answer page marked]
© Haruichi Furudate, Haikyu!!

24.

Output Results from GPT
Examples of output results when incorrect estimation is performed using the GPT method
Output results:
The estimated page number is "24." The reason for the selection is that a line that appears on page 24 includes the origin of the name "king" and the part where it is expressed as "The king is bad, in addition to violent tossing and turning, blocking and serving, where is the tyrant's opponent?" This expression is consistent with the answer to the quiz, "Because he is self-centered and a tyrannical dictator," which is why Kageyama is called "king on the court," so we determined that this page is the source of the quiz.
Estimated page: Scene from earlier talking about Kageyama Tobio rumor
Answer page: Scene where origin of rumor is revealed
© Haruichi Furudate, Haikyu!!

25.

Discussion
Estimation from the quiz
⚫ GPT method provided an understanding of the context that cannot be determined by cosine similarity alone
⚫ Difficulties with using the middle or end of the page as the answer among the candidates
Issues
⚫ Challenges remain in understanding the overall storyline and selecting the correct page
⚫ Unclear whether it can be applied to quizzes in other genres or those after the first volume

27.

Prospects
⚫ Examine estimation from vague sentences outside of quiz formats
⚫ Investigate query-type keywords similar to search terms

28.

Summary
Background
Can an LLM support answering quizzes about comics? The LLM needs to find scenes from limited information.
Research purpose
Finding relevant scenes from the questions and answers of a quiz.
Dataset Construction
⚫ Quizzes were created for eight works (sports genre).
⚫ Total of 138 questions.
Estimation Method
⚫ CS method: Using cosine similarity between texts.
⚫ GPT method: GPT estimates the most likely page.
Estimation Results
Increased accuracy with GPT, 66.7%, and better contextual understanding.
Discussion and prospects
⚫ Difficulties to estimate the middle of the page and beyond as the answer.
⚫ Estimation from outside of quiz formats.