Zero-Shot Voice Conversion Using Diffusion Models and Cross-Speaker Embeddings
DOI:
https://doi.org/10.17051/NJSAP/01.03.05
Keywords:
Zero-Shot Voice Conversion; Diffusion Models; Cross-Speaker Embeddings; Denoising Diffusion Probabilistic Models (DDPM); Speaker Similarity; Voice Synthesis; Content Preservation; Speaker Identity; ASR Embeddings; Non-Parallel VC
Abstract
This paper proposes a novel zero-shot voice conversion (VC) model that combines denoising diffusion probabilistic models (DDPMs) with cross-speaker embeddings to achieve high-quality non-parallel voice conversion without any speaker-specific training. Conventional VC systems typically rely on parallel corpora or large volumes of speaker-specific data, limiting their scalability and their ability to transfer to unseen speakers. By contrast, our model leverages a strong pretrained speaker encoder to derive cross-speaker embeddings from only a few seconds of reference audio. These embeddings capture speaker-specific prosody and timbre in a disentangled latent space. In parallel, a content encoder built on a pretrained self-supervised automatic speech recognition (ASR) model extracts speaker-invariant linguistic content. Conditioned on both the content and speaker embeddings, the DDPM then generates high-quality audio by iteratively refining samples drawn from Gaussian noise. Compared with GAN-based or autoregressive models, diffusion models offer greater stability, naturalness, and diversity in speech generation. We evaluate our model on the VCTK and LibriTTS datasets using both objective measures, including word error rate (WER) and speaker verification accuracy, and subjective measures, namely mean opinion score (MOS) tests. Our system substantially outperforms previous zero-shot VC baselines in both speaker similarity and speech naturalness and intelligibility, achieving a MOS of 4.46 and a speaker verification accuracy of 89.7%. Moreover, the proposed method is highly robust to noise and remains effective even when the reference utterance is perturbed, since it reliably captures both content and voice identity.
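The conditioned diffusion sampling described in the abstract can be illustrated with a rough sketch. This is a toy illustration only: the dimensions, random projections, and noise schedule below are assumptions standing in for the paper's trained speaker encoder, ASR-based content encoder, and noise-prediction network, but the ancestral DDPM sampling loop itself follows the standard formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper does not specify these.
SPK_DIM, CONTENT_DIM, MEL_DIM, T_FRAMES, STEPS = 16, 32, 80, 50, 25

# Standard linear DDPM noise schedule.
BETAS = np.linspace(1e-4, 0.02, STEPS)
ALPHAS = 1.0 - BETAS
ALPHA_BAR = np.cumprod(ALPHAS)

# Fixed random projections standing in for pretrained encoder weights.
W_SPK = rng.standard_normal((SPK_DIM, 8000))
W_CNT = rng.standard_normal((CONTENT_DIM, 8000))
W_OUT = rng.standard_normal((CONTENT_DIM + SPK_DIM, MEL_DIM)) / 10.0

def speaker_encoder(ref_audio):
    """Stand-in for a pretrained speaker encoder: maps a few seconds
    of reference audio to one L2-normalized embedding."""
    e = W_SPK @ ref_audio
    return e / np.linalg.norm(e)

def content_encoder(src_audio):
    """Stand-in for ASR-derived, speaker-invariant content features:
    one vector, tiled over output frames for simplicity."""
    c = W_CNT @ src_audio
    return np.tile(c / np.linalg.norm(c), (T_FRAMES, 1))

def predict_noise(x_t, t, spk, content):
    """Toy eps_theta(x_t, t | content, speaker). A real system would use
    a trained network; here the 'clean' target is a fixed projection of
    the conditioning so the sampler runs end to end."""
    cond = np.concatenate([content, np.tile(spk, (T_FRAMES, 1))], axis=1)
    x0_hat = cond @ W_OUT  # pretend clean mel-spectrogram
    abar = ALPHA_BAR[t]
    return (x_t - np.sqrt(abar) * x0_hat) / np.sqrt(1.0 - abar)

def convert(src_audio, ref_audio):
    """DDPM ancestral sampling conditioned on content + speaker."""
    spk = speaker_encoder(ref_audio)
    content = content_encoder(src_audio)
    x = rng.standard_normal((T_FRAMES, MEL_DIM))  # start from pure Gaussian noise
    for t in reversed(range(STEPS)):
        eps = predict_noise(x, t, spk, content)
        mean = (x - BETAS[t] / np.sqrt(1 - ALPHA_BAR[t]) * eps) / np.sqrt(ALPHAS[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(BETAS[t]) * noise
    return x  # converted mel-spectrogram; a vocoder would render audio

mel = convert(rng.standard_normal(8000), rng.standard_normal(8000))
print(mel.shape)  # (50, 80)
```

Because the speaker embedding enters only as a conditioning input, swapping the reference audio changes the target voice without retraining, which is the essence of the zero-shot setting.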
These findings confirm that cross-speaker embeddings and diffusion-based generation form a viable framework for zero-shot VC, offering a scalable approach to high-quality voice conversion applicable to text-to-speech (TTS), multi-speaker voice dubbing, voice style transfer, and anonymity-preserving voice generation. The proposed architecture is a substantial step toward generalizable, data-efficient, and high-fidelity voice conversion systems that require no retraining on new speakers.
Downloads
Published
2025-06-16
Issue
Section
Articles
How to Cite
[1]
Prerna Dusi and F. Rahman, Trans., “Zero-Shot Voice Conversion Using Diffusion Models and Cross-Speaker Embeddings”, National Journal of Speech and Audio Processing, pp. 37–45, Jun. 2025, doi: 10.17051/NJSAP/01.03.05.