【DL輪読会】Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions: Quantitative Study

January 15, 2026


Text of each slide
1.

DEEP LEARNING JP [DL Papers]
Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions: Quantitative Study
Presenter: Animesh Harsh
http://deeplearning.jp/

2.

Bibliography
• Title: Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions: Quantitative Study
• Authors: Raphael Bentegeac, Bastien Le Guellec, Gregory Kuchcinski, Philippe Amouyel, Aghiles Hamroun
• Affiliation: Lille University, Lille University Hospital Center, Institut Pasteur de Lille, France
• Link: https://www.jmir.org/2025/1/e64348

3.

Overview
• Confidence estimation for medical LLMs
  – The authors validate that token probabilities significantly outperform self-reported confidence in predicting LLM response accuracy on medical questions.
  – Tested 9 LLMs on 12,000+ medical licensing exam questions across 5 countries/languages.
• Key results:
  – Self-reported confidence AUROC: 0.51–0.70
  – Token probability AUROC: 0.70–0.87
  – All models showed overconfidence (self-reported confidence >80% regardless of correctness)
• Practical impact:
  – GPT-3.5: false positive rate dropped from 91% to 37%
  – GPT-4o: 98% accuracy on high-confidence answers

4.

Background
Previous Works & Motivations
- LLMs achieving passing grades on medical board exams (USMLE, etc.) builds trust for clinical deployment.
- However, chatbots systematically express high confidence regardless of actual correctness, which is potentially misleading for patients and clinicians.
- Prior work (Krishna et al., 2024): ChatGPT rated its confidence at 8/10 or higher even when wrong (100% of the time for GPT-3.5, 77% for GPT-4).
- Token probabilities represent the model’s internal mathematical certainty, separate from its verbalized confidence.

5.

The Big Question
“Can token probabilities provide a more reliable measure of LLM uncertainty than self-reported confidence?”

6.

Experimental Setup
1. Comprehensive evaluation across 9 LLMs: commercial (GPT-3.5, GPT-4, GPT-4o) and open-source (Llama, Phi, Gemma) models with API access to log probabilities.
2. Multilingual medical benchmark: 12,619 questions from US, Chinese, Taiwanese, French, and Indian medical licensing exams.
3. Demonstrate token probability superiority: AUROC improves from 0.51–0.70 (expressed confidence) to 0.70–0.87 (token probability) across all models (P < .001).
4. Practical recommendation: token probabilities are an easily accessible alternative for detecting the “inner doubts” of LLMs in high-stakes settings.

7.

Preliminaries
Problem Setup
– Task: multiple-choice medical Q&A (single correct answer)
– Input: question + answer options + language instruction
– Output: selected answer letter + confidence rating (0–100)
Two Confidence Measures:
1. Expressed confidence: self-reported in the model output (“My confidence is 95%”)
2. Token probability: the internal probability assigned to the answer token (A, B, C, D, E), i.e. $P(\text{answer token} \mid \text{context}, \text{question})$
Evaluation Metrics: AUROC, Adaptive Calibration Error (ACE), Brier score
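The token-probability measure can be read directly from APIs that expose log probabilities. Below is a minimal sketch of doing so with the OpenAI Python SDK (v1+); the model name, prompt wording, and question are illustrative, not the paper's exact protocol.

```python
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical question and options, not taken from the paper's benchmarks.
question = "Which vitamin deficiency causes scurvy?"
options = "A. Vitamin A\nB. Vitamin B12\nC. Vitamin C\nD. Vitamin D"

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": (
        f"{question}\n{options}\n"
        "Answer with a single letter, then give your confidence from 0 to 100."
    )}],
    temperature=0,   # deterministic decoding, as in the study
    logprobs=True,   # return log probabilities of the generated tokens
    top_logprobs=5,  # also return the top alternatives for each position
)

# Under this prompt the first generated token should be the answer letter;
# exponentiating its log probability gives the "token probability" measure.
first = resp.choices[0].logprobs.content[0]
print(first.token, round(math.exp(first.logprob), 3))
```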

8.

Methodology (1/2)

Model         | Creator   | License      | Version     | Release date | Parameters  | MMLU
GPT-3.5 Turbo | OpenAI    | Commercial   | 0125        | 2024-01-25   | Unknown     | 70.0*
GPT-4         | OpenAI    | Commercial   | 0613        | 2023-06-13   | Unknown     | 86.4*
GPT-4o        | OpenAI    | Commercial   | 2024-11-20  | 2024-11-20   | Unknown     | 88.7*
Llama 3.1-8B  | Meta      | Open weights | Instruct    | 2024-07-23   | 8 billion   | 69.4
Llama 3.1-70B | Meta      | Open weights | Instruct    | 2024-07-23   | 70 billion  | 83.6
Phi-3 Mini    | Microsoft | Open weights | 4k-Instruct | 2024-06-27   | 3.8 billion | 70.9
Phi-3 Medium  | Microsoft | Open weights | 4k-Instruct | 2024-05-21   | 14 billion  | 78.0
Gemma 2-9B    | Google    | Open weights | Instruct    | 2024-06-27   | 9 billion   | 71.3
Gemma 2-27B   | Google    | Open weights | Instruct    | 2024-06-27   | 27 billion  | 75.2

Selection criteria: top MMLU performance + API access to log probabilities.
Note: Claude and Gemini were excluded — no log probability access at the time of the study.

9.

Methodology (2/2)
Datasets & Experimental Setup

- MedQA USMLE: USA (2020), English, 2,487 questions. 4 options (A/B/C/D), only one correct answer. Source: national medical board examination. Topics: medical knowledge, clinical problem-solving. Passing grade: 60%.
- MedQA TWMLE: Taiwan (2020), Traditional Chinese, 2,734 questions. 4 options (A/B/C/D), only one correct answer. Source: national medical board examination. Topics: medical knowledge, clinical problem-solving. Passing grade: 60%.
- MedQA MCMLE: China (2020), Simplified Chinese, 3,414 questions. 4 options (A/B/C/D), only one correct answer. Source: national medical board examination. Topics: medical knowledge, clinical problem-solving. Passing grade: 60%.
- MedMCQA: India (2022), English, 2,763 questions. 5 options (A/B/C/D/E), only one correct answer. Source: national-level postgraduate medical entrance exam. Topics: real-world medical exam questions (Medicine, Surgery, Radiology, and Biochemistry). Passing grade: 50%.
- French MedMCQA: France (2022), French, 1,076 questions. 5 options (A/B/C/D/E), only one correct answer. Source: national-level medical board examination in Pharmacy. Topics: diagnostic reasoning and treatment, pharmacology, psychology, biology, physical examination, general management strategies, medical knowledge. Passing grade: 50%.

Prompting: vanilla (no prompt engineering), temperature = 0
Sensitivity analyses: expert prompts, few-shot, confidence scaling (0–1), temperature = 0.5

10.

Result 1: Model Accuracy on USMLE
How well do LLMs perform on medical board exams?
• Best model: GPT-4o at 89% (near expert level, but below the 90% threshold)
• 7/9 models passed (60% threshold); the smallest models (Phi-3 Mini, Gemma 2-9B) failed

11.

Result 1
• The models scored up to 89% on these tests, well above the passing requirements (50–60%). Does this imply their answers can be trusted?

12.

Result 1
NO!

13.

Result 2: The Overconfidence Problem
Do LLMs accurately report their confidence?
- All models expressed extremely high confidence (80–100%) regardless of correctness
- Confidence was expressed in round numbers (multiples of 5), mimicking human patterns
Key finding: models expressed >80% confidence even when completely wrong.
- GPT-3.5: 100% of incorrect answers came with high confidence
- GPT-4: 77% of wrong answers had high confidence

14.

Result 3: Token Probability vs. Expressed Confidence
Which measure better predicts accuracy?
– Token probability consistently outperformed expressed confidence (all P < .001)
– Improvement: from barely better than random to genuinely useful prediction
AUROC interpretation: 0.50 = random, <0.70 = poor, 0.70–0.80 = acceptable, >0.80 = good
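Given per-question correctness labels and the two confidence scores, the AUROC comparison is a one-liner. A minimal sketch with scikit-learn, using hypothetical toy arrays chosen only to mirror the pattern above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

correct    = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # 1 = model answered correctly
expressed  = np.array([0.95, 0.90, 1.00, 0.95, 0.95, 0.90, 1.00, 0.95])
token_prob = np.array([0.98, 0.41, 0.93, 0.88, 0.35, 0.97, 0.52, 0.99])

# Expressed confidence is uniformly high, so it barely separates right from
# wrong; token probability drops on wrong answers, so it discriminates well.
print("expressed-confidence AUROC:", roc_auc_score(correct, expressed))
print("token-probability AUROC:  ", roc_auc_score(correct, token_prob))
```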

15.

Result 4: Practical Impact (False Positive Rates)
How often do high-confidence answers turn out wrong?
- Using token probability instead of expressed confidence dramatically reduces false positives
- GPT-3.5: the false positive rate dropped from 91% to 37%
FPR = false positive rate (share of incorrect answers above the confidence threshold)
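Here a “positive” is an answer whose confidence clears the trust threshold, so the FPR is the share of incorrect answers that still get through. A minimal sketch, reusing the same hypothetical toy arrays and an illustrative 0.9 threshold:

```python
import numpy as np

correct    = np.array([1, 0, 1, 1, 0, 1, 0, 1])
expressed  = np.array([0.95, 0.90, 1.00, 0.95, 0.95, 0.90, 1.00, 0.95])
token_prob = np.array([0.98, 0.41, 0.93, 0.88, 0.35, 0.97, 0.52, 0.99])

def fpr(correct, confidence, threshold=0.9):
    # share of *incorrect* answers whose confidence still clears the threshold
    wrong = correct == 0
    return float((confidence[wrong] >= threshold).mean())

print("expressed FPR: ", fpr(correct, expressed))   # 1.0 on this toy data
print("token-prob FPR:", fpr(correct, token_prob))  # 0.0 on this toy data
```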

16.

Result 5: Calibration Analysis
Are models well-calibrated?
- All models showed a tendency toward overconfidence
- Smaller models exhibited worse calibration (ACE >40% for Phi-3 Mini)
- Token probability improved Brier scores across all models (P < .05)
Brier score (lower = better):
- GPT-4o: 0.09 (token) vs. 0.10 (expressed)
- Phi-3 Mini: 0.25 vs. 0.42
Adaptive Calibration Error:
- Large models: <10%
- Small models: 20–40%+
- ACE >25% = poor calibration
Pattern: higher-performing models show better calibration on both metrics
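Both calibration metrics are straightforward to compute. A minimal sketch, assuming confidences rescaled to [0, 1] and binary correctness labels; the equal-mass binning follows the usual ACE definition and may differ in detail from the paper's implementation:

```python
import numpy as np

correct    = np.array([1, 0, 1, 1, 0, 1, 0, 1])
confidence = np.array([0.98, 0.41, 0.93, 0.88, 0.35, 0.97, 0.52, 0.99])

def brier_score(correct, confidence):
    # mean squared gap between stated confidence and the actual outcome
    return float(np.mean((confidence - correct) ** 2))

def adaptive_calibration_error(correct, confidence, n_bins=4):
    # equal-mass bins: sort by confidence, give each bin the same count,
    # then average the |accuracy - mean confidence| gap across bins
    order = np.argsort(confidence)
    gaps = [abs(correct[b].mean() - confidence[b].mean())
            for b in np.array_split(order, n_bins) if len(b) > 0]
    return float(np.mean(gaps))

print("Brier:", brier_score(correct, confidence))
print("ACE:  ", adaptive_calibration_error(correct, confidence))
```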

17.

Result 6: Sensitivity Analyses
Are results robust across conditions?
- Multilingual datasets: results were consistent across English, Chinese, and French
  - Exception: smaller models showed lower performance on the Mainland China dataset
- Prompting strategies: vanilla ≈ expert ≈ few-shot ≈ confidence scaling
  - Few-shot slightly improved GPT-3.5 and Llama 3.1-8B (AUC +0.03–0.04)
- Knowledge type: no difference between Step 1 (basic science) and Steps 2/3 (clinical)
- Alternative metrics:
  - Shannon entropy: comparable to token probability (sketched below)
  - Perplexity: consistently underperformed (P < .001)
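The Shannon-entropy alternative can be computed from the probabilities over the option tokens at the answer position (e.g., from the top_logprobs list in the sketch after slide 7). A minimal sketch, with hypothetical probability values:

```python
import math

# hypothetical probabilities the model assigned to each option token
option_probs = {"A": 0.62, "B": 0.21, "C": 0.10, "D": 0.07}

# Shannon entropy: 0 = fully certain; log(4) ~= 1.386 nats = uniform over 4 options
entropy = -sum(p * math.log(p) for p in option_probs.values() if p > 0)
print(f"entropy = {entropy:.3f} nats")
```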

18.

Discussion
Why Does This Happen?

19.

Discussion
“The model has accurate internal uncertainty — it just doesn’t share it.”

20.

Discussion
The “Human Mimicry” Explanation
- LLMs trained on human text learn to express confidence the way humans do:
  - Round numbers (80, 90, 95, 100)
  - Avoiding explicit uncertainty
  - Exaggerated expressions (“absolutely certain”)
- Key insight: self-reported confidence is just another form of generation, not introspection
- The model “plays a character”, mimicking confident humans rather than reporting its internal state

21.

Discussion and Limitations
1. Single-choice questions only:
   - Findings may not apply to free-text responses spanning multiple tokens
   - Methods for aggregating token probabilities are still under development
2. Limited prompt-engineering exploration:
   - The authors deliberately used a default configuration to mimic real-world usage
   - Other techniques could potentially affect the outcomes
3. Model coverage:
   - Claude and Gemini were not included (no log probability access)
   - Reasoning models (o1, etc.) involve multi-step decisions
4. Practical implementation:
   - Interface design for communicating uncertainty to clinicians is still needed
   - Integration into clinical workflows remains unexplored

22.

Summary
- LLMs perform well on medical exams (56.5–89%) but systematically overstate their confidence
- Self-reported confidence is nearly useless for predicting accuracy (AUROC: 0.51–0.70)
- Token probabilities significantly outperform expressed confidence (AUROC: 0.70–0.87)
- Token probabilities are already available via API; no special techniques are needed
- Recommendation: use token probability thresholds to flag responses that need expert review
Personal Take
- This connects to broader questions about LLM “self-awareness”: models have useful internal signals but don’t surface them naturally
- This is an interface problem, not a model problem: we need to redesign how we present AI outputs and their uncertainty