Korean Journal of Otorhinolaryngology-Head and Neck Surgery > Epub ahead of print
Clinical Applicability of Whisper-Based Automatic Transcription for Korean Speech Audiometry Sentence Tests in Older Adults with Hearing Loss

Abstract

Presbycusis often disrupts sentence-level communication, while clinical speech audiometry still depends on labor-intensive scoring. This exploratory two-case study examined whether the large-scale automatic speech recognition (ASR) model Whisper large-v3 can reliably transcribe the spoken sentence-recognition responses of elderly listeners with presbycusis and how age-related auditory and speech characteristics shape its error patterns. Two native Korean elderly listeners with symmetric sensorineural hearing loss (S-001: moderate steeply sloping; S-002: mild sloping high-frequency loss) completed the Korean Speech Audiometry (KSA) sentence test (80 sentences). Their repetitions in quiet were recorded and independently transcribed by four experienced audiologists. Expert transcriptions were compared with Whisper outputs generated under fixed decoding parameters and standardized text normalization, using sentence match rate, word error rate (WER), and character error rate (CER). Whisper showed relatively low CER (about 8%-15%) but substantially higher word- and sentence-level errors (WER 30% for S-001 vs. 18% for S-002; sentence match 38.5% vs. 71.2%). Errors clustered in the high-frequency fricatives/affricates, final consonants, low-frequency and polysyllabic words, and longer syntactically complex sentences. Better clinical speech audiometry scores (KSA sentence/word recognition and word recognition score) were associated with higher ASR sentence match rates and lower WER/CER across the two cases. Generic ASR partially agreed with expert transcriptions, suggesting potential as a complementary tool, but ASR models and test designs tailored to older adults with hearing loss are needed for reliable AI-based sentence recognition.

Introduction

Presbycusis is characterized primarily by sensorineural hearing loss in the high-frequency range and is associated with reduced understanding of everyday conversation, social isolation, and an increased risk of cognitive decline [1,2]. Because real-world communication is more closely related to sentence-level comprehension than to the recognition of individual words, assessment of sentence recognition ability is clinically important [3,4].
However, the sentence test in the Korean Speech Audiometry (KSA), which is widely used in Korea, relies entirely on manual stimulus presentation, response transcription, and scoring by the examiner, resulting in substantial time and labor demands [5]. Maintaining consistent scoring across examiners can also be challenging. For these reasons, simpler word-based tests are often preferred over sentence tests in clinical practice, or sentence test scores are used only to a limited extent.
In recent years, deep learning-based automatic speech recognition (ASR) has rapidly expanded into a wide range of services, including smartphone voice assistants and real-time captioning, and its potential applications in medical and rehabilitation settings have also been actively discussed [6-8]. Whisper (OpenAI) is a large-scale multilingual ASR model first released in 2022, and large-v3 is a later version of it. It was trained on extensive multilingual and multidomain data and is reported to show strong transcription performance even in noisy environments, with a measurable level of performance in Korean without additional training [8]. More recently, studies have evaluated ASR performance in vulnerable populations, including individuals with hearing loss, older adults, and residents of long-term care facilities, and have explored its integration into clinical and welfare services [9]. Accordingly, the potential role of this technology should also be considered in auditory rehabilitation [10].
Nevertheless, most ASR systems have been trained primarily on speech from adults with normal hearing, and performance degradation has been reported in out-of-distribution groups such as older speakers and speakers with hearing loss [8,9]. Furthermore, research and data involving Korean older adults with hearing loss remain very limited.
If ASR can reliably transcribe KSA sentence test utterances produced by older adults with hearing loss, it may help automate part of speech audiometry and may also serve as a tool for tracking changes in sentence recognition in home-based or remote settings [10]. At the same time, identifying which errors occur most frequently in high-frequency consonants, low-frequency or long-syllable words, and long or complex sentences may become a key task for developing ASR models tailored to older adults with hearing loss and for designing AI-based sentence tests [9]. Despite this potential, there have been no domestic reports in which speech from older adults with hearing loss was transcribed using a general-purpose ASR system based on the KSA sentence test and then quantitatively compared with clinical speech audiometric indices. In this case study, we administered the KSA sentence test to two patients with presbycusis and performed automatic transcription using Whisper large-v3. We aimed to explore the extent to which a general-purpose ASR system agrees with audiologist transcription and KSA indices, to describe the major error patterns observed in presbycusis, and to discuss the potential of Korean-language ASR transcription as an adjunctive clinical tool for future evaluation of sentence recognition in patients with presbycusis.

Case

The study participants were two native Korean-speaking patients with presbycusis who visited Hallym Speech-Hearing Center: S-001, an 87-year-old man, and S-002, a 78-year-old woman. Both participants showed bilateral sensorineural hearing loss on pure-tone audiometry and bilateral type A tympanograms on immittance testing, indicating normal middle-ear function.
Specifically, in S-001, the right ear air-conduction thresholds ranged from 50-85 dB HL across 0.25-8 kHz, with a four-frequency pure-tone average across 0.5-4 kHz of 61.25 dB HL. The left ear ranged from 45-85 dB HL, with a four-frequency average of 57.5 dB HL, corresponding to bilateral moderate-to-severe hearing loss. In S-002, the right ear air-conduction thresholds ranged from 20-60 dB HL, with a four-frequency average of 36.25 dB HL. The left ear ranged from 25-60 dB HL, with a four-frequency average of 40 dB HL, indicating bilateral moderate hearing loss.
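For reference, a four-frequency pure-tone average over 0.5-4 kHz is conventionally the arithmetic mean of the air-conduction thresholds at 0.5, 1, 2, and 4 kHz; that this is the set of frequencies used here is an assumption. The symbol T_f (threshold in dB HL at frequency f in kHz) is notation introduced for this note only. The per-frequency thresholds are not reported in the text, so no values are substituted.

```latex
\mathrm{PTA}_{4} = \frac{T_{0.5} + T_{1} + T_{2} + T_{4}}{4} \quad \text{(dB HL)}
```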
On the Korean Mini-Mental State Examination [11], S-001 scored 25 and S-002 scored 24. Overall cognitive function was considered relatively preserved. Both participants had attained at least higher education, including experience studying abroad. Regarding hearing-assistive device history, S-001 had been wearing bilateral hearing aids for approximately 2 months, whereas S-002 reported no prior hearing-aid use. All procedures were approved by the Hallym University Institutional Review Board, and written informed consent was obtained from the participants before study participation. The auditory stimuli consisted of recordings by a standardized female speaker (announcer) from the KSA sentence test, a standard adult measure designed to balance sentence length, lexical difficulty, and phoneme distribution. A total of 80 sentences were used. Testing was conducted in a quiet laboratory with background noise below 30 dBA, and the presentation level was set at each participant’s most comfortable level.
Each sentence was presented only once, and the participants were instructed to repeat it immediately, “exactly as you hear it.” All utterances were recorded using a digital recorder at 44.1 kHz, 16-bit, mono.
ASR and transcription were performed using the Korean version of Whisper large-v3, a large-scale general-purpose ASR model released by OpenAI. Automatic language detection was enabled during model execution, although the likelihood of language identification error was considered low because all input speech was in Korean. Because speech produced by older adults with presbycusis may show lower articulatory accuracy and greater variability than that of adults with normal hearing, a conservative decoding strategy was adopted to ensure consistency of the transcription results.
Specifically, beam search (beam size=5) was run at the initial temperature of 0.0, at which decoding is deterministic (no random sampling). The best_of value was set to 5 so that, whenever sampling was used at a nonzero temperature, the sequence with the highest log-probability among the candidate transcriptions would be selected. In addition, to account for cases in which the confidence criterion was not met during the initial decoding stage, a temperature fallback strategy was used in which low temperature values (0.0, 0.2, 0.4) were applied sequentially. This approach was intended to yield relatively stable transcription results even in segments with substantial speech distortion or acoustic uncertainty [12].
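The fallback logic described above can be sketched as follows. This is a minimal illustration, not the authors' code: the option names mirror the keyword arguments of the openai-whisper `transcribe` function (the toolkit actually used is not stated in the text), the two thresholds are that package's defaults rather than values reported here, and `fake_decode` is a hypothetical stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Settings stated in the text. Key names follow openai-whisper's `transcribe`
# keyword arguments (assumed toolkit, not confirmed by the source).
DECODE_OPTIONS = {
    "language": None,                 # None = automatic language detection
    "beam_size": 5,                   # used at temperature 0.0
    "best_of": 5,                     # used when sampling at temperature > 0
    "temperature": (0.0, 0.2, 0.4),   # fallback schedule
}


@dataclass
class Attempt:
    text: str
    temperature: float
    avg_logprob: float        # mean token log-probability of the output
    compression_ratio: float  # high values indicate repetitive output


def decode_with_fallback(
    decode_once: Callable[[float], Attempt],
    temperatures: Sequence[float] = DECODE_OPTIONS["temperature"],
    logprob_threshold: float = -1.0,           # assumed (package default)
    compression_ratio_threshold: float = 2.4,  # assumed (package default)
) -> Attempt:
    """Try each temperature in order; keep the first attempt passing both checks."""
    if not temperatures:
        raise ValueError("at least one temperature is required")
    attempt = None
    for t in temperatures:
        attempt = decode_once(t)
        too_uncertain = attempt.avg_logprob < logprob_threshold
        too_repetitive = attempt.compression_ratio > compression_ratio_threshold
        if not (too_uncertain or too_repetitive):
            break
    # If every temperature fails the checks, the last attempt is returned.
    return attempt


def fake_decode(t: float) -> Attempt:
    """Hypothetical decoder: low confidence at 0.0, acceptable afterwards."""
    if t == 0.0:
        return Attempt("", t, avg_logprob=-1.6, compression_ratio=1.1)
    return Attempt("문을 닫아 주세요", t, avg_logprob=-0.4, compression_ratio=1.2)


result = decode_with_fallback(fake_decode)
print(result.temperature, result.text)
```

With the stub above, the first attempt fails the log-probability check and the second temperature in the schedule is accepted.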
Four audiologists with MS- or PhD-level training independently transcribed all utterances. To evaluate transcription reliability, sentence-level agreement among the four transcribers was additionally calculated, and inter-transcriber agreement was assessed using percentage agreement. At the sentence level, the most frequent string among the five transcriptions was defined as the majority transcription, and sentence agreement rates (calculated both with and without strict whitespace matching) were based on agreement with this transcription. At the word level, substitutions, deletions, and insertions were distinguished through token alignment, and the word error rate (WER) was calculated. Whisper performance was defined as the mean of the expert-specific WER values (AI-to-human mean WER). At the character level, the character error rate (CER) was calculated using Levenshtein distance after removing whitespace [12]. The relationships among the KSA sentence and word recognition scores, the speech recognition threshold (SRT), the word recognition score (WRS), and the AI-derived metrics were compared descriptively at the case level.
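The word- and character-level metrics described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' pipeline: tokens are whitespace-separated words, characters are precomposed Hangul syllable blocks (the text does not say whether jamo-level units were used), each expert transcription serves as the reference, and the example sentences are invented.

```python
def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) from one optimal
    Levenshtein alignment of two sequences."""
    n, m = len(ref), len(hyp)
    # cell = (total cost, S, D, I) for ref[:i] versus hyp[:j]
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = table[i - 1][j - 1]
            diag = (t, s, d, k) if ref[i - 1] == hyp[j - 1] else (t + 1, s + 1, d, k)
            t, s, d, k = table[i - 1][j]
            deletion = (t + 1, s, d + 1, k)
            t, s, d, k = table[i][j - 1]
            insertion = (t + 1, s, d, k + 1)
            table[i][j] = min(diag, deletion, insertion, key=lambda c: c[0])
    _, s, d, k = table[n][m]
    return s, d, k


def wer(ref_text, hyp_text):
    """Word error rate: (S + D + I) / number of reference words."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return sum(edit_counts(ref, hyp)) / len(ref)


def cer(ref_text, hyp_text):
    """Character error rate after removing all whitespace."""
    ref = list("".join(ref_text.split()))
    hyp = list("".join(hyp_text.split()))
    return sum(edit_counts(ref, hyp)) / len(ref)


def mean_wer_against_experts(expert_texts, asr_text):
    """Mean of the expert-specific WER values for one ASR output."""
    return sum(wer(e, asr_text) for e in expert_texts) / len(expert_texts)


# Invented example: three experts agree, one drops a particle, as does the ASR.
experts = ["나는 학교에 갑니다"] * 3 + ["나는 학교 갑니다"]
asr = "나는 학교 갑니다"
print(wer(experts[0], asr), cer(experts[0], asr), mean_wer_against_experts(experts, asr))
```

In this toy case a single dropped particle yields one word substitution out of three words but only one character deletion out of eight, which illustrates how a low CER can coexist with a much higher WER.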
At the sentence level, Whisper transcription showed a clear performance difference between the two participants (Table 1). For S-001, the sentence agreement rate was 38.5% when whitespace was ignored and 28.2% under strict whitespace matching, indicating that more than half of the presented sentences did not match the majority transcription. In contrast, S-002 showed corresponding values of 71.2% and 63.7%, indicating relatively stable sentence-level transcription performance for the same sentence set. Sentence-level agreement among the four audiologists exceeded 90% in both cases, demonstrating very high inter-transcriber agreement. This suggests that the expert transcriptions were highly reliable and that variability among human transcribers was minimal, whereas agreement with Whisper transcription depended strongly on speaker characteristics.
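The two sentence agreement rates can be illustrated with a short sketch. It assumes that the majority transcription is the most frequent expert string for each sentence and that "ignoring whitespace" means comparing strings with all spaces removed; the function names and the example sentences are hypothetical.

```python
from collections import Counter


def majority_transcription(transcripts):
    """Most frequent string among the transcriptions of one utterance."""
    return Counter(transcripts).most_common(1)[0][0]


def is_match(hyp, ref, ignore_whitespace):
    """Exact string match, either without spaces or with spacing kept."""
    if ignore_whitespace:
        return "".join(hyp.split()) == "".join(ref.split())
    # Strict mode: spacing must agree (runs of spaces are collapsed first).
    return " ".join(hyp.split()) == " ".join(ref.split())


def sentence_match_rate(asr_outputs, expert_sets, ignore_whitespace):
    """Percentage of ASR outputs that match the majority transcription."""
    hits = sum(
        is_match(asr, majority_transcription(experts), ignore_whitespace)
        for asr, experts in zip(asr_outputs, expert_sets)
    )
    return 100.0 * hits / len(asr_outputs)


# Invented data: four utterances, four expert transcriptions each.
expert_sets = [
    ["문을 닫아 주세요"] * 3 + ["문을 닫아주세요"],
    ["물 한 잔 주세요"] * 4,
    ["내일 다시 오세요"] * 4,
    ["창문을 열어 주세요"] * 4,
]
asr_outputs = ["문을 닫아주세요", "물 한 잔 주세요", "내일 다시 보세요", "창문을 열어 주세요"]

loose = sentence_match_rate(asr_outputs, expert_sets, ignore_whitespace=True)
strict = sentence_match_rate(asr_outputs, expert_sets, ignore_whitespace=False)
print(loose, strict)
```

The first utterance differs from the majority transcription only in spacing, so it counts as a match when whitespace is ignored but not under strict matching, which is why the strict rate can only be equal to or lower than the whitespace-insensitive rate.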
At the word level, a meaningful degree of error was observed in both cases, although the magnitude differed clearly. The WER for S-001 was 0.2985, indicating that substitution, deletion, and insertion errors together amounted to approximately 30% of the reference word count, whereas the WER for S-002 was 0.1802, corresponding to approximately 18% (Table 2). In both cases, substitution errors accounted for most of the error profile, indicating that Whisper more often misrecognized words as other words than completely missed them or produced false insertions. Deletion and insertion errors were relatively less frequent, but they occurred more often in S-001 than in S-002. In the context of these two cases, this observation tentatively suggests that severe hearing loss and reduced speech clarity may also negatively affect word-level stability. In contrast, the WER of expert transcription was only 1%-2% in both cases, confirming that, for the same speech samples, Whisper had a WER roughly 10 or more times higher than that of the experts.
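As a consistency check, the WER values can be recomputed from the error counts reported in Table 2 using WER = (S + D + I) / N, where N is the reference word count. The counts below are taken from Table 2; the dictionary layout and function names are illustrative only.

```python
# Error counts and reference word counts as reported in Table 2.
TABLE_2 = {
    "S-001": {"ref_words": 335, "S": 65, "D": 16, "I": 19},
    "S-002": {"ref_words": 344, "S": 46, "D": 6, "I": 10},
}


def wer_from_counts(row):
    """WER = (substitutions + deletions + insertions) / reference words."""
    return (row["S"] + row["D"] + row["I"]) / row["ref_words"]


def substitution_share(row):
    """Fraction of all word errors that are substitutions."""
    return row["S"] / (row["S"] + row["D"] + row["I"])


for participant, row in TABLE_2.items():
    print(participant, round(wer_from_counts(row), 4), round(substitution_share(row), 2))
# S-001 0.2985 0.65
# S-002 0.1802 0.74
```

The recomputed values agree with the reported WER, and substitutions make up roughly two thirds to three quarters of all word errors in both cases.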
At the character level, Whisper transcription showed relatively good performance, but the remaining errors still warrant consideration for clinical application. The CER was 0.1486 for S-001 and 0.0811 for S-002, which were lower than the word-level metrics but still corresponded to errors in approximately 1 in every 7 characters for S-001 and 1 in every 12 characters for S-002. This suggests that, although Whisper preserved recognition of individual phonemes or graphemes to some extent, additional errors accumulated at higher linguistic levels, such as word boundary detection, whitespace handling, and compound-word segmentation. Given that the CER of expert transcription was essentially close to 0, these character-level errors might also be viewed as an ASR-specific structural limitation.
A consistent pattern was also observed with speech audiometric measures (Table 3). S-001 showed relatively poor speech recognition performance, with 40% sentence recognition on the KSA, 60% word recognition, and a WRS of 68%, and this pattern was accompanied by a low sentence agreement rate and high WER and CER in Whisper transcription. In contrast, S-002 had 80% sentence recognition, 88% word recognition, and a WRS of 84%, and likewise showed a high sentence agreement rate and low WER and CER in Whisper transcription. Although no statistical testing was performed because of the small sample size, both cases showed a parallel pattern in which better clinical speech recognition performance was associated with better ASR transcription performance, cautiously suggesting that conventional KSA indices may serve as indirect indicators of variation in ASR performance in older adults with hearing loss.
In the detailed error-pattern analysis, Whisper showed vulnerability in both speakers to high-frequency fricatives and affricates (e.g., /ㅅ, ㅆ, ㅈ, ㅊ/) and to word-final consonants. In particular, for S-001, the error rate for high-frequency consonants was several times higher than the mean error rate of expert transcribers, and the WER increased prominently for low-frequency words, long words of three or more syllables, and words containing high-frequency consonants. At the sentence level, the error rate also increased for long sentences containing many words and for sentences with greater syntactic complexity, including conditional clauses and directive or imperative forms, suggesting a tendency for Whisper transcription of speech from older adults with hearing loss to become less stable as linguistic load increased.
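One way to tag words for this kind of error-pattern analysis is to decompose each Hangul syllable block arithmetically, using the Unicode layout of precomposed syllables. The sketch below is an assumption about how such categories could be derived, not a description of the authors' procedure. It flags words with a syllable-initial /ㅅ, ㅆ, ㅈ, ㅊ/, words containing any final consonant, and words of three or more syllables; as a simplification it checks onsets only and ignores lexical frequency, which would require an external word list.

```python
# The 19 syllable-initial consonants in Unicode composition order.
ONSETS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
TARGET_ONSETS = set("ㅅㅆㅈㅊ")  # fricatives/affricates listed in the text

HANGUL_BASE = 0xAC00      # code point of the first precomposed syllable
SYLLABLE_COUNT = 11172    # 19 onsets * 21 vowels * 28 final slots


def decompose(char):
    """Return (onset, has_final) for a precomposed Hangul syllable, else None."""
    code = ord(char) - HANGUL_BASE
    if not 0 <= code < SYLLABLE_COUNT:
        return None
    onset = ONSETS[code // 588]   # 588 = 21 vowels * 28 final slots
    has_final = code % 28 != 0    # slot 0 means no final consonant
    return onset, has_final


def word_features(word):
    """Flag the word-level categories discussed in the error analysis."""
    parts = [p for p in map(decompose, word) if p is not None]
    return {
        "target_onset": any(onset in TARGET_ONSETS for onset, _ in parts),
        "final_consonant": any(has_final for _, has_final in parts),
        "three_plus_syllables": len(parts) >= 3,
    }


for w in ["사과", "학교", "자동차"]:
    print(w, word_features(w))
```

Per-category error rates could then be obtained by grouping aligned reference words by these flags and counting how often each group was substituted or deleted.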

Discussion

In this study, we compared the KSA sentence test with automatic transcription by Whisper large-v3 in two speakers with presbycusis to explore the extent to which a general-purpose ASR system can support sentence recognition assessment in older adults with hearing loss. Although the CER was relatively favorable at around 10%, performance at the word and sentence levels was markedly poorer, particularly in S-001, who showed moderate-to-severe, steeply sloping hearing loss, with a WER of approximately 30% and a sentence agreement rate below 40%. This contrasted clearly with S-002, whose hearing was relatively better. These findings are consistent with previous reports showing that ASR recognition accuracy is lower in older adults and speakers with hearing loss than in typical adult speakers [9], and they suggest that Whisper also has structural limitations for older speakers with hearing loss.
The parallel pattern between conventional speech audiometric measures, including KSA sentence and word recognition and WRS, and Whisper-derived indices is clinically meaningful. In both cases, better speech recognition performance was accompanied by a higher sentence agreement rate and lower WER and CER in ASR transcription, whereas poorer speech recognition performance was accompanied by reduced AI transcription performance. This suggests that, in its current form, a general-purpose ASR system may be more useful as an auxiliary indicator that complements conventional speech audiometry than as an independent testing tool that replaces the KSA [12]. More specifically, it may help indirectly characterize speaker-specific speech clarity, vulnerability across frequency regions, and performance changes associated with increasing linguistic load [12]. For example, even among patients with the same KSA scores, comparison of ASR error patterns may allow more detailed identification of additional vulnerable areas, such as high-frequency consonants, low-frequency or long-syllable words, and complex sentences.
The detailed error patterns provide direct implications for the future design of ASR systems tailored to older adults with hearing loss and of AI-based sentence tests. Whisper showed concentrated errors in high-frequency fricatives and affricates, word-final consonants, low-frequency words, long words of three or more syllables, and long, syntactically complex sentences. This indicates that acoustic and articulatory characteristics typically observed in presbycusis, such as high-frequency hearing loss, reduced vocal intensity, and weakening of word-final consonants [13], may conflict with the assumptions of the acoustic and language models used in current general-purpose ASR systems [14]. Therefore, rather than applying the existing KSA sentence lists for older adults with hearing loss directly to AI-based evaluation, it may be necessary to reconfigure difficulty on the basis of lexical frequency, syllable structure, and syntactic complexity and to use words containing high-frequency consonants or word-final consonants as separate subindices [15]. Such structured, error-based indices may provide clinically richer information than simple accuracy rates.
This study has several limitations. First, the number of cases was very small, with only two participants. Accordingly, the present findings should be interpreted as exploratory observations at the hypothesis-generating stage rather than as findings generalizable to a broader population. In addition, because of the very limited sample size, we did not perform a statistical comparison between ASR transcription and expert transcription, and the interpretation of the results was based on exploratory descriptive comparisons. In particular, it is difficult to exclude the possibility that the difference in the degree of hearing loss between the two participants influenced the difference in Whisper’s transcription performance. Second, because the study was conducted under relatively idealized conditions, including a quiet test room, a standardized female speaker, and a single ASR model, performance is likely to decline further in real outpatient or home environments, where background noise, dialect or accent, hearing-aid or cochlear implant use, and changes in speaking rate may be present. Third, we did not apply domain-specific fine-tuning or additional pre- and post-processing to improve Whisper transcription performance. Exploring data augmentation strategies that incorporate speech from older adults or individuals with disabilities, along with domain-specific training, could be a potential direction for improving ASR performance. Future studies should therefore establish the clinical validity and reliability of an ASR model specifically designed for presbycusis by constructing corpora that include speech from older adults and speakers with hearing loss, fine-tuning models to reflect KSA-based sentence and lexical structures, and developing adaptive pre- and post-processing algorithms for changes in vocal intensity and speaking rate.
Future research will also require quantitative and statistical comparisons between ASR transcription and expert transcription in larger samples with a wider range of hearing characteristics.
This study provides preliminary insights by directly combining the KSA sentence test with a large-scale general-purpose ASR system. Specifically, it quantitatively presents Whisper’s performance and error patterns for actual clinical speech produced by speakers with presbycusis. Future studies should include larger samples, diverse hearing-loss configurations, hearing-assistive device use, noisy environments, and multiple ASR systems to systematically clarify the relationship between KSA indices and AI-derived transcription metrics. In addition, several steps are necessary to develop an ASR-based sentence test tailored to presbycusis that reflects the structure of the KSA. These include constructing corpora that reflect speech from older adults and speakers with hearing loss, performing model fine-tuning, and developing pre- and post-processing algorithms that account for changes in vocal intensity, speaking rate, and intonation. If these technical foundations are established, such a system may eventually serve as a practical tool for the automation of clinical speech audiometry and for remote auditory assessment.

Supplementary Materials

Korean translation of this article is available with the Online-only Data Supplement at https://doi.org/10.3342/kjorl-hns.2026.00052.

Notes

Acknowledgments

None

Author Contribution

Conceptualization: Woojae Han. Data curation: Sunmi Ma and Sangmin Park. Formal analysis: Tae-Jin Yoon. Funding acquisition: Woojae Han and Tae-Jin Yoon. Methodology: all authors. Project administration: Woojae Han. Resources: Woojae Han. Writing—original draft: Woojae Han. Writing—review & editing: all authors.

Table 1.
Sentence-level transcription agreement for each participant

| Participant | Total number of sentences | AI sentence match rate (ignoring whitespace, %) | AI sentence match rate (strict whitespace, %) | Mean expert sentence match rate† (ignoring whitespace, %) | Mean expert sentence match rate† (strict whitespace, %) |
|---|---|---|---|---|---|
| S-001 | 78* | 38.5 | 28.2 | >90 | >85 |
| S-002 | 80 | 71.2 | 63.7 | >95 | >90 |

*Two utterances were absent because the participant did not respond to some sentences.

†Expert transcribers produced identical or whitespace-only-differing transcriptions for most sentences, and inter-transcriber sentence agreement was estimated to be very high (predominantly ≥90%).

Table 2.
Word error rate (WER) and error composition by participant

| Participant | Reference word count (n) | WER | Substitutions, S (n, %) | Deletions, D (n, %) | Insertions, I (n, %) |
|---|---|---|---|---|---|
| S-001 | 335 | 0.2985 | 65 (19.40) | 16 (4.80) | 19 (5.70) |
| S-002 | 344 | 0.1802 | 46 (13.40) | 6 (1.70) | 10 (2.90) |

Table 3.
Speech audiometry measures and Whisper transcription metrics in two elderly listeners

| Participant | KSA sentence recognition (%) | KSA word recognition (%) | SRT (dB HL) | WRS (%) | AI sentence match rate (ignoring whitespace, %) | AI sentence match rate (strict whitespace, %) | AI WER | AI CER |
|---|---|---|---|---|---|---|---|---|
| S-001 | 40 | 60 | 55 | 68 | 38.5 | 28.2 | 0.2985 | 0.1486 |
| S-002 | 80 | 88 | 45 | 84 | 71.2 | 63.7 | 0.1802 | 0.0811 |

KSA, Korean Speech Audiometry; SRT, speech recognition threshold; WRS, word recognition score; WER, word error rate; CER, character error rate.

REFERENCES

1. Lin FR, Yaffe K, Xia J, Xue QL, Harris TB, Purchase-Helzner E, et al. Hearing loss and cognitive decline in older adults. JAMA Intern Med 2013;173(4):293-9.
2. Bugannim Y, Roziner I, Kishon-Rabin L. Speech recognition in noise across the life span with cognition and hearing sensitivity as mediators of age effects. Sci Rep 2025;15(1):20575.
3. Kwak C, Han W. Age-related difficulty of listening effort in elderly. Int J Environ Res Public Health 2021;18(16):8845.
4. Allen JB. Consonant recognition and the articulation index. J Acoust Soc Am 2005;117(4 Pt 1):2212-23.
5. Jang H, Lee J, Lim D, Lee K, Jeon A, Jung E. [Development of Korean standard sentence lists for sentence recognition tests]. Audiology 2008;4(2):161-77, Korean.
6. Meyer BT, Kollmeier B, Ooster J. Autonomous measurement of speech intelligibility utilizing automatic speech recognition [online] 2015 [cited 2025 December 28]. Available from: URL: https://www.isca-archive.org/interspeech_2015/meyer15_interspeech.pdf.
7. Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I. Robust speech recognition via large-scale weak supervision. arXiv [Preprint] 2022;[cited 2025 December 28]. Available from: URL: https://doi.org/10.48550/arXiv.2212.04356.
8. Zhao R, Choi ASG, Koenecke A, Rameau A. Quantification of automatic speech recognition system performance on d/Deaf and hard of hearing speech. Laryngoscope 2025;135(1):191-7.
9. Chen L, Asgari M. Refining automatic speech recognition system for older adults. Proc IEEE Int Conf Acoust Speech Signal Process 2021;2021:7003-7.
10. Xu M, Shao J, Wang L. Effects of aging and age-related hearing loss on talker discrimination. In: Proceedings of the 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH 2021); 2021 Aug 30–Sep 3; Brno, Czech Republic. International Speech Communication Association; 2021:1728-32. https://doi.org/10.21437/Interspeech.2021-682.
11. Kang Y, Na DL, Hahn S. [A validity study on the Korean mini-mental state examination (K-MMSE) in dementia patients]. J Korean Neurol Assoc 1997;15(2):300-8, Korean.
12. Ney H, Haeb-Umbach R, Tran BH, Oerder M. Improvements in beam search for 10000-word continuous speech recognition [online] 1992 [cited 2026 January 2]. Available from: URL: http://doi.org/10.1109/ICASSP.1992.225985.
13. Lee SJ, Cho Y, Song JY, Lee D, Kim Y, Kim H. Aging effect on Korean female voice: acoustic and perceptual examinations of breathiness. Folia Phoniatr Logop 2015;67(6):300-7.
14. Hacking C, Verbeek H, Hamers JPH, Aarts S. The development of an automatic speech recognition model using interview data from long-term care for older adults. J Am Med Inform Assoc 2023;30(3):411-7.
15. Mallaband LJ. The agreement of phonetic transcriptions between paediatric speech and language therapists transcribing a disordered speech sample. Int J Lang Commun Disord 2024;59(5):1981-95.