Zero-Shot Voice Conversion Using Diffusion Models and Cross-Speaker Embeddings
DOI:
https://doi.org/10.17051/NJSAP/01.03.05
Keywords:
Zero-Shot Voice Conversion; Diffusion Models; Cross-Speaker Embeddings; Denoising Diffusion Probabilistic Models (DDPM); Speaker Similarity; Voice Synthesis; Content Preservation; Speaker Identity; ASR Embeddings; Non-Parallel VC
Abstract
This paper proposes a novel zero-shot voice conversion (VC) model that combines denoising diffusion probabilistic models (DDPMs) with cross-speaker embeddings to achieve high-quality non-parallel voice conversion without any speaker-specific training. Conventional VC systems typically rely on parallel corpora or large volumes of speaker-specific data, limiting their scalability and their ability to transfer to unseen speakers. By contrast, our model leverages a strong pretrained speaker encoder to derive cross-speaker embeddings from only a few seconds of reference audio. These embeddings capture speaker-specific prosody and timbre in a disentangled latent space. In parallel, a content encoder built on a pretrained self-supervised automatic speech recognition (ASR) model extracts speaker-invariant linguistic content. Conditioned on both the content and speaker embeddings, the DDPM then generates high-quality audio by iteratively refining samples drawn from Gaussian noise. Compared with GAN-based or autoregressive models, diffusion models offer greater stability, naturalness, and diversity in speech generation. We evaluate our model on the VCTK and LibriTTS datasets using both objective measures, including word error rate (WER) and speaker verification accuracy, and subjective measures, namely mean opinion score (MOS) tests. Our system substantially outperforms previous zero-shot VC baselines in both speaker similarity and speech naturalness and intelligibility, achieving a MOS of 4.46 and a speaker verification accuracy of 89.7%. Moreover, the proposed method is highly robust to noise and remains effective even when the reference utterance is perturbed, since it reliably captures both content and voice identity.
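The conditioned diffusion sampling described in the abstract can be illustrated with a rough sketch. This is a toy illustration only: the dimensions, random projections, and noise schedule below are assumptions standing in for the paper's trained speaker encoder, ASR-based content encoder, and noise-prediction network, but the ancestral DDPM sampling loop itself follows the standard formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper does not specify these.
SPK_DIM, CONTENT_DIM, MEL_DIM, T_FRAMES, STEPS = 16, 32, 80, 50, 25

# Standard linear DDPM noise schedule.
BETAS = np.linspace(1e-4, 0.02, STEPS)
ALPHAS = 1.0 - BETAS
ALPHA_BAR = np.cumprod(ALPHAS)

# Fixed random projections standing in for pretrained encoder weights.
W_SPK = rng.standard_normal((SPK_DIM, 8000))
W_CNT = rng.standard_normal((CONTENT_DIM, 8000))
W_OUT = rng.standard_normal((CONTENT_DIM + SPK_DIM, MEL_DIM)) / 10.0

def speaker_encoder(ref_audio):
    """Stand-in for a pretrained speaker encoder: maps a few seconds
    of reference audio to one L2-normalized embedding."""
    e = W_SPK @ ref_audio
    return e / np.linalg.norm(e)

def content_encoder(src_audio):
    """Stand-in for ASR-derived, speaker-invariant content features:
    one vector, tiled over output frames for simplicity."""
    c = W_CNT @ src_audio
    return np.tile(c / np.linalg.norm(c), (T_FRAMES, 1))

def predict_noise(x_t, t, spk, content):
    """Toy eps_theta(x_t, t | content, speaker). A real system would use
    a trained network; here the 'clean' target is a fixed projection of
    the conditioning so the sampler runs end to end."""
    cond = np.concatenate([content, np.tile(spk, (T_FRAMES, 1))], axis=1)
    x0_hat = cond @ W_OUT  # pretend clean mel-spectrogram
    abar = ALPHA_BAR[t]
    return (x_t - np.sqrt(abar) * x0_hat) / np.sqrt(1.0 - abar)

def convert(src_audio, ref_audio):
    """DDPM ancestral sampling conditioned on content + speaker."""
    spk = speaker_encoder(ref_audio)
    content = content_encoder(src_audio)
    x = rng.standard_normal((T_FRAMES, MEL_DIM))  # start from pure Gaussian noise
    for t in reversed(range(STEPS)):
        eps = predict_noise(x, t, spk, content)
        mean = (x - BETAS[t] / np.sqrt(1 - ALPHA_BAR[t]) * eps) / np.sqrt(ALPHAS[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(BETAS[t]) * noise
    return x  # converted mel-spectrogram; a vocoder would render audio

mel = convert(rng.standard_normal(8000), rng.standard_normal(8000))
print(mel.shape)  # (50, 80)
```

Because the speaker embedding enters only as a conditioning input, swapping the reference audio changes the target voice without retraining, which is the essence of the zero-shot setting.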
These findings confirm that cross-speaker embeddings and diffusion-based generation form a viable framework for zero-shot VC, offering a scalable approach to high-quality voice conversion applicable to text-to-speech (TTS), multi-speaker voice dubbing, voice style transfer, and anonymity-preserving voice generation. The proposed architecture is a substantial step toward generalizable, data-efficient, and high-fidelity voice conversion systems that require no retraining on new speakers.
Downloads
Published
2025-06-16
Issue
Section
Articles
How to Cite
[1]
Prerna Dusi and F. Rahman, Trans., “Zero-Shot Voice Conversion Using Diffusion Models and Cross-Speaker Embeddings”, National Journal of Speech and Audio Processing, pp. 37–45, Jun. 2025, doi: 10.17051/NJSAP/01.03.05.