Clinical Applicability of Whisper-Based Automatic Transcription for Korean Speech Audiometry Sentence Tests in Older Adults with Hearing Loss
난청 노인의 Korean Speech Audiometry 문장 검사에 대한 Whisper 기반 자동 전사의 임상적 적용 가능성
Abstract
Presbycusis often disrupts sentence-level communication, while clinical speech audiometry still depends on labor-intensive scoring. This exploratory two-case study examined whether the large-scale automatic speech recognition (ASR) model Whisper large-v3 can reliably transcribe sentence recognition responses from elderly listeners with presbycusis and how age-related auditory and speech characteristics shape its error patterns. Two native Korean-speaking elderly listeners with symmetric sensorineural hearing loss (S-001: moderate steeply sloping; S-002: mild sloping high-frequency loss) completed the Korean Speech Audiometry (KSA) sentence test (80 sentences). Their repetitions in quiet were recorded and independently transcribed by four experienced audiologists. Expert transcriptions were compared with Whisper outputs generated under fixed decoding parameters and standardized text normalization, using sentence match rate, word error rate (WER), and character error rate (CER). Whisper showed relatively low CER (about 8%-15%) but substantially higher word- and sentence-level errors (WER 30% for S-001 vs. 18% for S-002; sentence match 38.5% vs. 71.2%). Errors clustered in high-frequency fricatives/affricates, final consonants, low-frequency and polysyllabic words, and longer syntactically complex sentences. Better clinical speech audiometry scores (KSA sentence/word recognition and word recognition score) were associated with higher ASR sentence match rates and lower WER/CER across the two cases. Generic ASR partially agreed with expert transcriptions, suggesting potential as a complementary tool, but ASR models and test designs tailored to elderly listeners with hearing loss are needed for reliable AI-based sentence recognition assessment.
Introduction
Presbycusis is characterized primarily by sensorineural hearing loss in the high-frequency range and is associated with reduced understanding of everyday conversation, social isolation, and an increased risk of cognitive decline [1,2]. Because real-world communication is more closely related to sentence-level comprehension than to the recognition of individual words, assessment of sentence recognition ability is clinically important [3,4].
However, the sentence test in the Korean Speech Audiometry (KSA), which is widely used in Korea, relies entirely on manual stimulus presentation, response transcription, and scoring by the examiner, resulting in substantial time and labor demands [5]. Maintaining consistent scoring across examiners can also be challenging. For these reasons, simpler word-based tests are often preferred over sentence tests in clinical practice, or sentence test scores are used only to a limited extent.
In recent years, deep learning-based automatic speech recognition (ASR) has rapidly expanded into a wide range of services, including smartphone voice assistants and real-time captioning, and its potential applications in medical and rehabilitation settings have also been actively discussed [6-8]. Whisper (OpenAI) is a large-scale multilingual ASR model first released in 2022, and large-v3 is a later version of its largest model. It was trained on extensive multilingual and multidomain data. Furthermore, it is reported to show strong transcription performance even in noisy environments and a measurable level of performance in Korean without additional training [8]. More recently, studies have evaluated ASR performance in vulnerable populations, including individuals with hearing loss, older adults, and residents of long-term care facilities, and have explored its integration into clinical and welfare services [9]. Accordingly, the potential role of this technology should also be considered in auditory rehabilitation [10].
Nevertheless, most ASR systems have been trained primarily on speech from adults with normal hearing, and performance degradation has been reported in out-of-distribution groups such as older speakers and speakers with hearing loss [8,9]. Furthermore, research and data involving Korean older adults with hearing loss remain very limited.
If ASR can reliably transcribe KSA sentence test utterances produced by older adults with hearing loss, it may help automate part of speech audiometry and may also serve as a tool for tracking changes in sentence recognition in home-based or remote settings [10]. At the same time, identifying which errors occur most frequently in high-frequency consonants, low-frequency or long-syllable words, and long or complex sentences may become a key task for developing ASR models tailored to older adults with hearing loss and for designing AI-based sentence tests [9]. Despite this potential, there have been no domestic reports in which speech from older adults with hearing loss was transcribed using a general-purpose ASR system based on the KSA sentence test and then quantitatively compared with clinical speech audiometric indices. In this case study, we administered the KSA sentence test to two patients with presbycusis and performed automatic transcription using Whisper large-v3. We aimed to explore the extent to which a general-purpose ASR system agrees with audiologist transcription and KSA indices, to describe the major error patterns observed in presbycusis, and to discuss the potential of Korean-language ASR transcription as an adjunctive clinical tool for future evaluation of sentence recognition in patients with presbycusis.
Case
The study participants were two native Korean-speaking patients with presbycusis who visited Hallym Speech-Hearing Center: S-001, an 87-year-old man, and S-002, a 78-year-old woman. Both participants showed bilateral sensorineural hearing loss on pure-tone audiometry and bilateral type A tympanograms on immittance testing, indicating normal middle-ear function.
Specifically, in S-001, the right ear air-conduction thresholds ranged from 50-85 dB HL across 0.25-8 kHz, with a four-frequency pure-tone average across 0.5-4 kHz of 61.25 dB HL. The left ear ranged from 45-85 dB HL, with a four-frequency average of 57.5 dB HL, corresponding to bilateral moderate-to-severe hearing loss. In S-002, the right ear air-conduction thresholds ranged from 20-60 dB HL, with a four-frequency average of 36.25 dB HL. The left ear ranged from 25-60 dB HL, with a four-frequency average of 40 dB HL, indicating bilateral moderate hearing loss.
On the Korean Mini-Mental State Examination [11], S-001 scored 25 and S-002 scored 24. Overall cognitive function was considered relatively preserved. Both participants had attained at least higher education, including experience studying abroad. Regarding hearing-assistive device history, S-001 had been wearing bilateral hearing aids for approximately 2 months, whereas S-002 reported no prior hearing-aid use. All procedures were approved by the Hallym University Institutional Review Board, and written informed consent was obtained from the participants before study participation. The auditory stimuli consisted of recordings by a standardized female speaker (announcer) from the KSA sentence test, a standard adult measure designed to balance sentence length, lexical difficulty, and phoneme distribution. A total of 80 sentences were used. Testing was conducted in a quiet laboratory with background noise below 30 dBA, and the presentation level was set at each participant’s most comfortable level.
Each sentence was presented only once, and the participants were instructed to repeat it immediately, “exactly as you hear it.” All utterances were recorded using a digital recorder at 44.1 kHz, 16-bit, mono.
ASR and transcription were performed using the Korean version of Whisper large-v3, a large-scale general-purpose ASR model released by OpenAI. Automatic language detection was enabled during model execution, although the likelihood of language identification error was considered low because all input speech was in Korean. Because speech produced by older adults with presbycusis may show lower articulatory accuracy and greater variability than that of adults with normal hearing, a conservative decoding strategy was adopted to ensure consistency of the transcription results.
Specifically, beam search (beam size=5) was combined with greedy decoding (temperature=0.0). The best_of value was set to 5 so that the sequence with the highest log-probability among the candidate transcriptions would be selected. In addition, to account for cases in which the confidence criterion was not met during the initial decoding stage, a temperature fallback strategy was used in which low temperature values (0.0, 0.2, 0.4) were applied sequentially. This approach was intended to yield relatively stable transcription results even in segments with substantial speech distortion or acoustic uncertainty [12].
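The decoding settings described above can be written down as a short configuration sketch. It assumes the open-source openai-whisper Python package, which the text does not name as the implementation actually used; the names DECODE_OPTIONS, load_asr_model, and transcribe_response are illustrative. In that package, beam_size applies when decoding at temperature 0.0 and best_of applies when sampling at temperatures above 0.0, so the two settings act at different steps of the fallback schedule.

```python
# Decoding settings taken from the description above; the names are illustrative.
DECODE_OPTIONS = {
    "language": None,                # None lets the model auto-detect the language
    "beam_size": 5,                  # beam width, used at temperature 0.0
    "best_of": 5,                    # candidates sampled at temperatures above 0.0
    "temperature": (0.0, 0.2, 0.4),  # fallback schedule, tried in this order
}


def load_asr_model(name: str = "large-v3"):
    """Load a Whisper checkpoint (assumes the openai-whisper package is installed)."""
    import whisper  # imported lazily so the settings can be inspected without it

    return whisper.load_model(name)


def transcribe_response(model, wav_path: str) -> str:
    """Transcribe one recorded sentence repetition and return the raw text."""
    result = model.transcribe(wav_path, **DECODE_OPTIONS)
    return result["text"].strip()
```

Loading the model once and reusing it for all recordings keeps the decoding conditions identical across sentences and participants.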
Four audiologists with MS- or PhD-level training independently transcribed all utterances. To evaluate transcription reliability, sentence-level agreement among the four transcribers was calculated as percentage agreement. At the sentence level, the most frequent string among the five transcriptions was defined as the majority transcription, and sentence agreement rates (calculated both with and without strict whitespace matching) were based on agreement with this transcription. At the word level, substitutions, deletions, and insertions were distinguished through token alignment, and the word error rate (WER) was calculated. Whisper performance was defined as the mean of the expert-specific WER values (AI-to-human mean WER). At the character level, the character error rate (CER) was calculated using Levenshtein distance after removing whitespace [12]. The relationships among the KSA sentence and word recognition scores, the speech reception threshold, the word recognition score (WRS), and the AI-derived metrics were compared descriptively at the case level.
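The scoring steps above can be illustrated with a minimal sketch. Several details are assumptions, because the text does not specify them: the normalization rule (Unicode NFC, punctuation removed, whitespace collapsed), the tie-breaking rule for the majority transcription (first string encountered), and the choice to pass only expert strings to majority_transcription. All function names are illustrative, and the Korean sentences are invented examples, not KSA items or study data.

```python
import re
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Assumed normalization: NFC, punctuation removed, whitespace collapsed."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def edit_counts(ref: list, hyp: list) -> tuple:
    """Return (substitutions, deletions, insertions) from a Levenshtein alignment."""
    # dp[i][j] = (total cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitution = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[-1][-1][1:]


def error_rate(ref: list, hyp: list) -> float:
    """Total edits divided by the number of reference units."""
    return sum(edit_counts(ref, hyp)) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    return error_rate(normalize(ref).split(), normalize(hyp).split())


def cer(ref: str, hyp: str) -> float:
    # Whitespace is removed first, so spacing differences do not count as errors.
    return error_rate(list(normalize(ref).replace(" ", "")),
                      list(normalize(hyp).replace(" ", "")))


def majority_transcription(transcripts: list) -> str:
    """Most frequent normalized string; ties go to the first one encountered."""
    return Counter(normalize(t) for t in transcripts).most_common(1)[0][0]


def sentence_match(ref: str, hyp: str, strict_whitespace: bool = False) -> bool:
    r, h = normalize(ref), normalize(hyp)
    if not strict_whitespace:
        r, h = r.replace(" ", ""), h.replace(" ", "")
    return r == h


def mean_wer_against_experts(expert_transcripts: list, hyp: str) -> float:
    """AI-to-human mean WER: average of the WER against each expert transcript."""
    return sum(wer(t, hyp) for t in expert_transcripts) / len(expert_transcripts)


# Invented example: four expert transcripts and one ASR output.
experts = ["오늘 날씨가 좋다", "오늘 날씨가 좋다", "오늘 날씨가 좋다", "오늘 날씨 가 좋다"]
asr_output = "오늘 날씨가 졸다"
reference = majority_transcription(experts)
scores = {
    "match": sentence_match(reference, asr_output),
    "wer": wer(reference, asr_output),
    "cer": cer(reference, asr_output),
    "mean_wer": mean_wer_against_experts(experts, asr_output),
}
```

In this sketch the CER operates on precomposed Hangul syllable blocks rather than on individual jamo, which is one possible reading of "character"; a jamo-level CER would give different values.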
At the sentence level, Whisper transcription showed a clear performance difference between the two participants (Table 1). For S-001, the sentence agreement rate was 38.5% when whitespace was ignored and 28.2% under strict whitespace matching, indicating that more than half of the presented sentences did not match the majority transcription. In contrast, S-002 showed corresponding values of 71.2% and 63.7%, indicating relatively stable sentence-level transcription performance for the same sentence set. Sentence-level agreement among the four audiologists exceeded 90% in both cases, demonstrating very high inter-transcriber agreement. This suggests that the expert transcriptions were highly reliable and that variability among human transcribers was minimal, whereas agreement with Whisper transcription depended strongly on speaker characteristics.
At the word level, a meaningful degree of error was observed in both cases, although the magnitude differed clearly. The WER for S-001 was 0.2985, meaning that the combined number of substitution, deletion, and insertion errors amounted to approximately 30% of the reference words, whereas the WER for S-002 was 0.1802, corresponding to approximately 18% (Table 2). In both cases, substitution errors accounted for most of the error profile, indicating that Whisper more often misrecognized words as other words than completely missed them or produced false insertions. Deletion and insertion errors were relatively less frequent, but they occurred more often in S-001 than in S-002. In the context of these two cases, this observation tentatively suggests that severe hearing loss and reduced speech clarity may also negatively affect word-level stability. In contrast, the WER of expert transcription was only 1%-2% in both cases, so that, for the same speech samples, Whisper had a WER roughly 9-30 times higher than that of the experts.
At the character level, Whisper transcription performed relatively well, although the remaining error rate still calls for caution in clinical application. The CER was 0.1486 for S-001 and 0.0811 for S-002, which were lower than the word-level metrics but still corresponded to errors in approximately 1 out of every 10 characters. This suggests that, although Whisper preserved recognition of individual phonemes or graphemes to some extent, additional errors accumulated at higher linguistic levels, such as word boundary detection, whitespace handling, and compound-word segmentation. Given that the CER of expert transcription was essentially close to 0, these character-level errors might also be viewed as an ASR-specific structural limitation.
A consistent pattern was also observed with speech audiometric measures (Table 3). S-001 showed relatively poor speech recognition performance, with 40% sentence recognition on the KSA, 60% word recognition, and a WRS of 68%, and this pattern was accompanied by a low sentence agreement rate and high WER and CER in Whisper transcription. In contrast, S-002 had 80% sentence recognition, 88% word recognition, and a WRS of 84%, and likewise showed a high sentence agreement rate and low WER and CER in Whisper transcription. Although no statistical testing was performed because of the small sample size, both cases showed a parallel pattern in which better clinical speech recognition performance was associated with better ASR transcription performance, cautiously suggesting that conventional KSA indices may serve as indirect indicators of variation in ASR performance in older adults with hearing loss.
In the detailed error-pattern analysis, Whisper showed vulnerability in both speakers to high-frequency fricatives and affricates (e.g., /ㅅ, ㅆ, ㅈ, ㅊ/) and to word-final consonants. In particular, for S-001, the error rate for high-frequency consonants was several times higher than the mean error rate of expert transcribers, and the WER increased prominently for low-frequency words, long words of three or more syllables, and words containing high-frequency consonants. At the sentence level, the error rate also increased for long sentences containing many words and for sentences with greater syntactic complexity, including conditional clauses and directive or imperative forms, suggesting a tendency for Whisper transcription of speech from older adults with hearing loss to become less stable as linguistic load increased.
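Word-level features of the kind used in this error-pattern analysis can be derived directly from Hangul code points. The sketch below is illustrative and is not the authors' procedure. It assumes that "high-frequency consonants" means the four onsets named in the text, that a "long word" has three or more syllables, and that a word-final consonant is a final consonant (batchim) on the last syllable; lexical frequency is not covered because it requires an external word list. The function and field names are hypothetical.

```python
# The 19 initial consonants in the order used by the Unicode Hangul syllable block.
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
HIGH_FREQ_ONSETS = set("ㅅㅆㅈㅊ")  # the fricatives and affricates named in the text


def decompose(syllable: str):
    """Return (initial consonant, has_final) for a precomposed syllable, else None."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 11172:          # outside the Hangul syllable block
        return None
    initial = CHOSEONG[index // 588]    # 588 = 21 vowels x 28 final slots
    has_final = index % 28 != 0         # slot 0 means no final consonant
    return initial, has_final


def word_features(word: str) -> dict:
    """Tag one word with the features discussed in the error-pattern analysis."""
    parts = [p for p in map(decompose, word) if p is not None]
    return {
        "syllables": len(parts),
        "long_word": len(parts) >= 3,
        "has_high_freq_onset": any(initial in HIGH_FREQ_ONSETS for initial, _ in parts),
        "has_final_consonant": any(has_final for _, has_final in parts),
        "ends_in_consonant": bool(parts) and parts[-1][1],
    }


# Invented example words, not KSA items.
features = {word: word_features(word) for word in ["학교", "선생님"]}
```

Grouping reference words by these tags and computing the error rate within each group would give subindices of the kind described above, under the stated assumptions.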
Discussion
In this study, we compared the KSA sentence test with automatic transcription by Whisper large-v3 in two speakers with presbycusis to explore the extent to which a general-purpose ASR system can support sentence recognition assessment in older adults with hearing loss. Although the CER was relatively favorable at around 10%, performance at the word and sentence levels was markedly poorer, particularly in S-001, who showed moderate-to-severe, steeply sloping hearing loss, with a WER of approximately 30% and a sentence agreement rate below 40%. This contrasted clearly with S-002, whose hearing was relatively better. These findings are consistent with previous reports showing that ASR recognition accuracy is lower in older adults and speakers with hearing loss than in typical adult speakers [9], and they suggest that Whisper also has structural limitations for older speakers with hearing loss.
The parallel pattern between conventional speech audiometric measures, including KSA sentence and word recognition and WRS, and Whisper-derived indices is clinically meaningful. In both cases, better speech recognition performance was accompanied by a higher sentence agreement rate and lower WER and CER in ASR transcription, whereas poorer speech recognition performance was accompanied by reduced AI transcription performance. This suggests that, in its current form, a general-purpose ASR system may be more useful as an auxiliary indicator that complements conventional speech audiometry than as an independent testing tool that replaces the KSA [12]. More specifically, it may help indirectly characterize speaker-specific speech clarity, vulnerability across frequency regions, and performance changes associated with increasing linguistic load [12]. For example, even among patients with the same KSA scores, comparison of ASR error patterns may allow more detailed identification of additional vulnerable areas, such as high-frequency consonants, low-frequency or long-syllable words, and complex sentences.
The detailed error patterns provide direct implications for the future design of ASR systems tailored to older adults with hearing loss and of AI-based sentence tests. Whisper showed concentrated errors in high-frequency fricatives and affricates, word-final consonants, low-frequency words, long words of three or more syllables, and long, syntactically complex sentences. This indicates that acoustic and articulatory characteristics typically observed in presbycusis, such as high-frequency hearing loss, reduced vocal intensity, and weakening of word-final consonants [13], may conflict with the assumptions of the acoustic and language models used in current general-purpose ASR systems [14]. Therefore, rather than applying the existing KSA sentence lists for older adults with hearing loss directly to AI-based evaluation, it may be necessary to reconfigure difficulty on the basis of lexical frequency, syllable structure, and syntactic complexity and to use words containing high-frequency consonants or word-final consonants as separate subindices [15]. Such structured, error-based indices may provide clinically richer information than simple accuracy rates.
This study has several limitations. First, the number of cases was very small, with only two participants. Accordingly, the present findings should be interpreted as exploratory observations at the hypothesis-generating stage rather than as findings generalizable to a broader population. In addition, because of the very limited sample size, we did not perform a statistical comparison between ASR transcription and expert transcription, and the interpretation of the results was based on exploratory descriptive comparisons. In particular, it is difficult to exclude the possibility that the difference in the degree of hearing loss between the two participants influenced the difference in Whisper’s transcription performance. Second, because the study was conducted under relatively idealized conditions, including a quiet test room, a standardized female speaker, and a single ASR model, performance is likely to decline further in real outpatient or home environments, where background noise, dialect or accent, hearing-aid or cochlear implant use, and changes in speaking rate may be present. Third, we did not apply domain-specific fine-tuning or additional pre- and post-processing to improve Whisper transcription performance. Exploring data augmentation strategies that incorporate speech from older adults or individuals with disabilities, along with domain-specific training, could be a potential direction for improving ASR performance. Future studies should therefore evaluate the clinical validity and reliability of an ASR model specifically designed for presbycusis by constructing corpora that include speech from older adults and speakers with hearing loss, fine-tuning models to reflect KSA-based sentence and lexical structures, and developing adaptive pre- and post-processing algorithms for changes in vocal intensity and speaking rate.
Future research will also require quantitative and statistical comparisons between ASR transcription and expert transcription in larger samples with a wider range of hearing characteristics.
This study provides preliminary insights by directly combining the KSA sentence test with a large-scale general-purpose ASR system. Specifically, it quantitatively presents Whisper’s performance and error patterns for actual clinical speech produced by speakers with presbycusis. Future studies should include larger samples, diverse hearing-loss configurations, hearing-assistive device use, noisy environments, and multiple ASR systems to systematically clarify the relationship between KSA indices and AI-derived transcription metrics. In addition, several steps are necessary to develop an ASR-based sentence test tailored to presbycusis that reflects the structure of the KSA. These include constructing corpora that reflect speech from older adults and speakers with hearing loss, performing model fine-tuning, and developing pre- and post-processing algorithms that account for changes in vocal intensity, speaking rate, and intonation. If these technical foundations are established, such a system may eventually serve as a practical tool for the automation of clinical speech audiometry and for remote auditory assessment.
Supplementary Materials
A Korean translation of this article is available in the Online-only Data Supplement at https://doi.org/10.3342/kjorl-hns.2026.00052.
Notes
Acknowledgments
None
Author Contribution
Conceptualization: Woojae Han. Data curation: Sunmi Ma and Sangmin Park. Formal analysis: Tae-Jin Yoon. Funding acquisition: Woojae Han and Tae-Jin Yoon. Methodology: all authors. Project administration: Woojae Han. Resources: Woojae Han. Writing—original draft: Woojae Han. Writing—review & editing: all authors.
