A comparative study of deep End-to-End Automatic Speech Recognition models for doctor-patient conversations in Polish in a real-life acoustic environment

Authors

  • Karolina Pondel-Sycz, Warsaw University of Technology, Faculty of Electronics and Information Technology
  • Piotr Bilski, Warsaw University of Technology, Faculty of Electronics and Information Technology
  • Piotr Bobiński, Warsaw University of Technology, Faculty of Electronics and Information Technology
  • Leszek Morzyński, Central Institute for Labour Protection - National Research Institute
  • Marcin Lewandowski, Warsaw University of Technology, Faculty of Electronics and Information Technology
  • Emil Kozłowski, Central Institute for Labour Protection - National Research Institute
  • Grzegorz Szczepański, Central Institute for Labour Protection - National Research Institute
  • Maciej Jasiński, Warsaw University of Technology, Faculty of Electronics and Information Technology
  • Grzegorz Makarewicz, Warsaw University of Technology, Faculty of Electronics and Information Technology
  • Agnieszka Paula Pietrzak, Warsaw University of Technology, Faculty of Electronics and Information Technology
  • Andrzej Buchowicz, Warsaw University of Technology, Faculty of Electronics and Information Technology
  • Paweł Mazurek, Warsaw University of Technology, Faculty of Electronics and Information Technology
  • Adrian Bilski, Warsaw University of Life Sciences, Faculty of Applied Informatics and Mathematics
  • Jacek Olejnik, JAS Technologie Sp. z o.o.
  • Iwona Olejnik, JAS Technologie Sp. z o.o.

Abstract

This paper presents research on Automatic Speech Recognition (ASR) methods for building a system that automatically transcribes medical interviews conducted in Polish during a clinic visit. The performance of four ASR models based on Deep Neural Networks (DNN) was evaluated: XLSR-53 large, Quartznet15x5, FastConformer Hybrid Transducer-CTC and Whisper large. The study was conducted on a self-developed speech dataset. The models were evaluated using Word Error Rate (WER), Character Error Rate (CER), Match Error Rate (MER), Word Accuracy (WAcc), Word Information Preserved (WIP), Word Information Lost (WIL), Levenshtein distance, Jaro-Winkler similarity and Jaccard index. The results show that Whisper outperformed the other tested models in the vast majority of the tests. Whisper achieved WER = 20.84%, compared with 67.96% for XLSR-53, 76.25% for Quartznet15x5 and 46.30% for FastConformer. Even so, Whisper needs further adaptation to medical conversations, as the current volume of transcription errors is not acceptable in practice (too many mistakes in the description of the patient's health).
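
The word-level metrics named in the abstract all derive from the same minimum-edit alignment between a reference transcript and an ASR hypothesis. The following is a minimal sketch of that relationship, not the authors' evaluation code: it assumes the commonly used definitions (WER = (S+D+I)/N, MER = (S+D+I)/(H+S+D+I), WIP = (H/N_ref)(H/N_hyp), WIL = 1 - WIP, WAcc = 1 - WER), whitespace tokenization, and no text normalization. The function names and the sample utterance are hypothetical and chosen only for illustration.

```python
def align_counts(ref, hyp):
    """Count hits, substitutions, deletions and insertions
    along one minimum-edit (Levenshtein) alignment of two sequences."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the bottom-right corner to classify each step.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            if same:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def asr_metrics(reference, hypothesis):
    """Return word-level and character-level error measures as fractions.
    Assumes a non-empty reference; no case or punctuation normalization."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    h, s, d, i = align_counts(ref_words, hyp_words)
    errors = s + d + i
    wer = errors / len(ref_words)
    mer = errors / (h + s + d + i)
    wip = (h / len(ref_words)) * (h / len(hyp_words)) if hyp_words else 0.0
    # CER applies the same error formula to characters instead of words.
    _, cs, cd, ci = align_counts(list(reference), list(hypothesis))
    cer = (cs + cd + ci) / len(reference)
    return {"wer": wer, "cer": cer, "mer": mer,
            "wacc": 1.0 - wer, "wip": wip, "wil": 1.0 - wip}


# Invented example pair (not taken from the study's dataset):
# one misspelled word and one dropped word.
reference = "pacjent zgłasza ból głowy od trzech dni"
hypothesis = "pacjent zgłasza bul głowy od dni"
scores = asr_metrics(reference, hypothesis)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In practice these scores are usually aggregated over a whole test set rather than averaged per utterance, and the choice of text normalization (case, punctuation, numerals) can shift WER noticeably, so reported values are only comparable under the same preprocessing.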

Published

2025-07-09

Section

Acoustics