Domain adaptation for noisy speech recognition
Publish date: 2023-12-11
Report number: FOI-R--5523--SE
Pages: 46
Written in: Swedish
Keywords:
- speech-to-text
- ASR
- speech recognition
- speech processing
- wav2vec 2.0
- Whisper
- domain adaptation
- CycleGAN
- robust speech recognition
- robust ASR
Abstract
The Swedish Police Authority and the Swedish Security Service need to process large quantities of speech data, some of which is collected through secret surveillance measures. The standard procedure is to employ large pre-trained machine learning models for speech processing and fine-tune them for a particular task. However, the data sources in the application at hand tend to produce signals with higher levels of noise and distortion than are present in speech data collected under more controlled circumstances. The question thus arises of how to adapt the speech processing system to handle more challenging signals than those the publicly available models are typically trained on. The problem is therefore formulated as a form of domain adaptation between a domain of clean, uncontaminated speech and a noisy speech domain, under the assumption of limited computational resources for model training and a limited capacity to annotate data.

We study how adaptation training with different amounts of annotated versus unannotated data affects speech-to-text performance in the noisy domain. We also attempt to train a generative model that transforms data from the noisy to the clean domain, that is, removes noise. The experiments use a dataset containing hundreds of hours of conversational telephone speech in English to represent the domain that the models are adapted to.

When adapting the self-supervised model wav2vec 2.0, the best results are, as expected, achieved when the entire training set is labelled. However, comparable results are achieved with only a few hours of labelled data when the rest of the data is used for self-supervised training. The weakly supervised model Whisper outperforms wav2vec 2.0 on the test data even without fine-tuning, and fine-tuning improves its performance further. So-called parameter-efficient fine-tuning achieves similar results while more than halving training time and reducing the loss of performance on other datasets. The study of noise reduction using a generative model was discontinued after initial experiments did not yield promising results.
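The abstract mentions supervised adaptation of wav2vec 2.0 with limited labelled data. As a rough illustration of what such fine-tuning looks like in practice, the sketch below runs a single CTC training step with the Hugging Face transformers library; the checkpoint name, the frozen feature encoder, and the dummy one-second waveform are illustrative assumptions, not details taken from the report.

```python
"""Minimal sketch: one supervised CTC fine-tuning step for wav2vec 2.0.

Assumptions: Hugging Face transformers; the checkpoint and the
synthetic batch are for illustration only.
"""
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Processor pairs a feature extractor (raw waveform -> model input)
# with a character-level tokenizer for the CTC targets.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Freezing the convolutional feature encoder is common practice when
# only a few hours of labelled data are available.
model.freeze_feature_encoder()

# One illustrative step on a synthetic one-second waveform and a dummy
# transcript; a real run would loop over the labelled subset.
waveform = torch.randn(16000)  # 1 s of audio at 16 kHz (synthetic)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

outputs = model(input_values=inputs.input_values, labels=labels)
outputs.loss.backward()  # gradients flow only into unfrozen layers
```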
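The abstract does not specify which parameter-efficient fine-tuning method was used for Whisper. One common realisation of the idea is LoRA, sketched below with the Hugging Face peft library; the checkpoint size, rank, scaling factor, and target modules are assumptions chosen for illustration.

```python
"""Minimal sketch: parameter-efficient fine-tuning of Whisper via LoRA.

Assumptions: Hugging Face transformers and peft; all hyperparameters
are illustrative, not the report's settings.
"""
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load a pre-trained Whisper checkpoint (size chosen for illustration).
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# LoRA injects small trainable low-rank matrices into the attention
# projections while the base weights stay frozen, so far fewer
# parameters are updated and training is correspondingly cheaper.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically around 1% of all parameters
```

Because only the small adapter matrices are trained, the frozen base model is left intact, which is one plausible reason a parameter-efficient approach would reduce the loss of performance on other datasets, as reported.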
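The discontinued noise-reduction study used a generative model; CycleGAN appears among the keywords. The sketch below shows only the cycle-consistency term that lets such a model learn a noisy-to-clean mapping from unpaired data; the toy convolutional generators and synthetic batches are assumptions, not the report's architecture.

```python
"""Minimal sketch: the CycleGAN cycle-consistency objective for
noisy-to-clean speech mapping, on raw waveforms. Architectures and
data are illustrative assumptions."""
import torch
import torch.nn as nn

def make_generator() -> nn.Module:
    # Toy 1-D convolutional generator over raw waveforms (illustrative).
    return nn.Sequential(
        nn.Conv1d(1, 16, kernel_size=15, padding=7),
        nn.ReLU(),
        nn.Conv1d(16, 1, kernel_size=15, padding=7),
    )

G = make_generator()  # noisy -> clean
F = make_generator()  # clean -> noisy
l1 = nn.L1Loss()

noisy = torch.randn(4, 1, 16000)  # synthetic batch of noisy audio
clean = torch.randn(4, 1, 16000)  # synthetic batch of clean audio

# Cycle consistency: mapping to the other domain and back should
# reconstruct the input, which is what permits training on unpaired
# clean and noisy recordings.
cycle_loss = l1(F(G(noisy)), noisy) + l1(G(F(clean)), clean)
# A full CycleGAN adds adversarial losses from two discriminators.
cycle_loss.backward()
```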