Meta-Learning-Based Few-Shot Speaker Adaptation for Neural Text-to-Speech Synthesis

Authors

  • R. Rudevdagva, Mongolian University of Science and Technology, Ulaanbaatar, Mongolia
  • Zakaria Rozman, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor 43600, Malaysia

DOI:

https://doi.org/10.17051/NJSAP/01.02.05

Keywords:

Few-shot learning, speaker adaptation, text-to-speech synthesis, meta-learning, neural vocoder, low-resource TTS.

Abstract

Speaker adaptation has become increasingly important in neural text-to-speech (TTS) synthesis for producing personalized, natural-sounding speech across a growing range of applications, such as voice assistants, audiobooks, and assistive technologies. However, most modern neural TTS systems depend on large quantities of high-quality target-speaker data and require retraining of the model, which is impractical in most low-resource settings. This paper presents a meta-learning-based few-shot speaker adaptation framework that enables high-fidelity cloning of a target speaker's voice from only a few seconds of target speech. The method uses the model-agnostic meta-learning (MAML) paradigm to train a universal multi-speaker TTS model that is explicitly optimized to adapt, within a small number of fine-tuning steps, to speakers never seen during training. The architecture combines a Transformer text encoder, duration and pitch predictors, a HiFi-GAN neural vocoder, and speaker conditioning via d-vector embeddings extracted from the target speaker's samples. The approach is evaluated extensively on the VCTK, LibriTTS, and AISHELL-3 datasets under several few-shot scenarios (5, 10, and 20 utterances) and compared against standard transfer-learning and speaker-embedding-based adaptation baselines. Experimental results show that the proposed method consistently outperforms existing approaches, achieving lower Mel Cepstral Distortion (MCD) and significantly higher Mean Opinion Scores (MOS), while reducing adaptation time by up to 60 percent. Subjective listening tests confirm higher speaker similarity and naturalness, even in severe low-resource settings. These results demonstrate the effectiveness of meta-learning in alleviating data scarcity in TTS speaker adaptation and highlight its potential for on-demand speech synthesis in data-sparse, resource-constrained environments.
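
To make the inner/outer adaptation loop described in the abstract concrete, the following is a minimal sketch of first-order MAML-style meta-training for few-shot speaker adaptation, written in Python with PyTorch. It is illustrative only: the names (inner_adapt, meta_train_step, loss_fn, the toy linear model standing in for the TTS network) are hypothetical placeholders, not the paper's actual implementation.

    # First-order MAML sketch for few-shot speaker adaptation (illustrative;
    # the model, loss, and task names are hypothetical stand-ins for the
    # paper's Transformer-based TTS system).
    import copy
    import torch

    def inner_adapt(model, support_batch, loss_fn, steps=3, lr=1e-3):
        """Clone the shared initialization and fine-tune it on one speaker's
        support set (the 5-20 few-shot utterances)."""
        adapted = copy.deepcopy(model)
        opt = torch.optim.SGD(adapted.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(adapted, support_batch).backward()
            opt.step()
        return adapted

    def meta_train_step(model, meta_opt, tasks, loss_fn,
                        inner_steps=3, inner_lr=1e-3):
        """One outer update: each task is a (support, query) pair from one
        training speaker; the query loss after inner adaptation drives the
        update of the shared initialization (first-order approximation)."""
        meta_opt.zero_grad()
        for support, query in tasks:
            adapted = inner_adapt(model, support, loss_fn, inner_steps, inner_lr)
            grads = torch.autograd.grad(loss_fn(adapted, query),
                                        adapted.parameters())
            # Apply the adapted model's query-set gradients to the shared
            # initialization instead of differentiating through the inner SGD.
            for p, g in zip(model.parameters(), grads):
                p.grad = g if p.grad is None else p.grad + g
        meta_opt.step()

    if __name__ == "__main__":
        # Toy demonstration: a linear layer stands in for the TTS network.
        model = torch.nn.Linear(8, 8)
        meta_opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        loss_fn = lambda m, b: torch.nn.functional.mse_loss(m(b[0]), b[1])
        tasks = [((torch.randn(5, 8), torch.randn(5, 8)),
                  (torch.randn(5, 8), torch.randn(5, 8))) for _ in range(4)]
        meta_train_step(model, meta_opt, tasks, loss_fn)

At deployment, only the inner loop runs: the meta-trained initialization is fine-tuned on the handful of target-speaker utterances, which is what keeps adaptation fast relative to retraining a multi-speaker model from scratch.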

Published

2025-03-17

Section

Articles

How to Cite

[1]
R. Rudevdagva and Z. Rozman, "Meta-Learning-Based Few-Shot Speaker Adaptation for Neural Text-to-Speech Synthesis", National Journal of Speech and Audio Processing, pp. 34–41, Mar. 2025, doi: 10.17051/NJSAP/01.02.05.