Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

Title: Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

Authors: Alexandre R. Ferreira, Claudio E. C. Campelo

Abstract: Training transcription models that produce robust results requires a large and diverse labeled dataset. Finding data with the necessary characteristics is a challenging task, especially for languages less popular than English, and producing such data demands significant effort and, often, money. A strategy to mitigate this problem is the use of data augmentation techniques. In this work, we propose a framework for data augmentation based on deepfake audio. To validate the framework, experiments were conducted using existing deepfake and transcription models. A voice cloner and a dataset of English speech recorded by Indian speakers were selected, ensuring the presence of a single accent in the dataset. The augmented data was then used to train speech-to-text models in various scenarios.
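As an illustration of the augmentation idea described in the abstract, the sketch below clones existing transcripts into new voices and reuses the original text as the label for each synthetic utterance. The paper does not specify this exact pipeline here: the CSV manifest layout, the file names, and the choice of Coqui TTS's YourTTS model as the voice cloner are assumptions made purely for demonstration.

```python
"""Minimal sketch: augment an ASR dataset with voice-cloned (deepfake) audio.

Assumptions (not from the paper): the original data is a CSV manifest with
columns `audio_path,transcript`, and Coqui TTS's YourTTS model stands in for
whichever voice cloner the framework actually plugs in.
"""
import csv
from pathlib import Path

from TTS.api import TTS  # pip install TTS (Coqui TTS)


def augment_manifest(manifest_in: str, manifest_out: str,
                     speaker_refs: list[str],
                     out_dir: str = "augmented_audio") -> None:
    """For each (audio, transcript) pair, synthesize extra utterances in cloned voices."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cloner = TTS("tts_models/multilingual/multi-dataset/your_tts")

    with open(manifest_in, newline="") as fin, open(manifest_out, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=["audio_path", "transcript"])
        writer.writeheader()
        for row in reader:
            # Keep the original sample in the augmented manifest.
            writer.writerow({"audio_path": row["audio_path"],
                             "transcript": row["transcript"]})
            # Add one deepfake copy per reference speaker; the transcript is
            # reused unchanged as the label for the synthetic audio.
            for i, ref_wav in enumerate(speaker_refs):
                fake_path = Path(out_dir) / f"{Path(row['audio_path']).stem}_clone{i}.wav"
                cloner.tts_to_file(text=row["transcript"], speaker_wav=ref_wav,
                                   language="en", file_path=str(fake_path))
                writer.writerow({"audio_path": str(fake_path),
                                 "transcript": row["transcript"]})


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    augment_manifest("train_manifest.csv", "train_manifest_augmented.csv",
                     speaker_refs=["ref_speaker_01.wav", "ref_speaker_02.wav"])
```

The augmented manifest can then be fed to a speech-to-text training pipeline alongside, or in place of, the original data, which is the comparison the experiments in the paper explore.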

Keywords: data augmentation, deepfake audio, voice cloning, transcription models

Pages: 9

DOI: 10.21528/CBIC2023-169

Paper PDF: CBIC_2023_paper169.pdf

BibTeX file: CBIC_2023_169.bib