Self-Supervised Audio Representation Learning for Robust Speaker Verification

Authors

  • F. de Mindonça, Departamento de Engenharia Elétrica, Universidade Federal de Pernambuco - UFPE, Recife, Brazil
  • O.L.M. Smith, Departamento de Engenharia Elétrica, Universidade Federal de Pernambuco - UFPE, Recife, Brazil

DOI:

https://doi.org/10.17051/NJSAP/01.04.04

Keywords:

Self-Supervised Learning (SSL), Speaker Verification, Contrastive Learning, Audio Representation Learning, Voice Biometrics, Deep Neural Networks, Transformer Encoder, Equal Error Rate (EER), Noise Robustness, Unlabeled Audio Data

Abstract

This paper introduces a self-supervised learning (SSL) framework for robust speaker verification that overcomes key limitations of conventional supervised systems, particularly on noisy and domain-mismatched data. Leveraging large quantities of unlabeled audio, we use a contrastive learning paradigm to learn highly discriminative, speaker-specific embeddings without explicit identity labels. The framework combines a convolutional encoder with a transformer-based context network trained on temporally augmented audio segments, driving the model to learn features invariant to temporal shifts, noise, and signal distortions. Specifically, positive pairs are generated through dynamic masking, cropping, and augmentation strategies, whereas negatives are sampled from within the batch, enabling the system to learn fine-grained discrimination between speakers. In contrast to prior SSL frameworks designed primarily for automatic speech recognition (ASR), our model is explicitly optimized for speaker verification through an embedding-level projection and a fine-tuning stage that adapts the learned representations to speaker identity. We conduct large-scale experiments on public speaker verification datasets, including VoxCeleb1 and VoxCeleb2, under both clean and noisy conditions. Our approach consistently achieves lower Equal Error Rate (EER) and minimum Detection Cost Function (minDCF) than supervised baselines such as x-vectors and recent SSL models such as wav2vec 2.0 and HuBERT. In addition, the model generalizes well under domain shift and retains strong performance at low signal-to-noise ratios (SNRs), making it well suited to real-world applications in secure authentication, forensics, and telecommunications. These findings show that self-supervised contrastive learning, combined with speaker-aware adaptation, can advance the state of the art in speaker verification while substantially reducing reliance on labeled data. This work paves the way for exploiting large unlabeled audio corpora in voice biometric systems and provides a foundation for further research on low-resource, privacy-aware speaker recognition.
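To make the training objective concrete, the following is a minimal sketch, in PyTorch, of the in-batch contrastive (InfoNCE-style) loss the abstract describes: two augmented views of each utterance form a positive pair, while the remaining utterances in the batch serve as negatives. The function name, temperature value, and tensor shapes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(z_a: torch.Tensor,
                              z_b: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """z_a, z_b: (batch, dim) embeddings of two differently augmented views
    (e.g., cropped, masked, or noise-corrupted segments) of the same utterances."""
    z_a = F.normalize(z_a, dim=-1)           # unit-norm so dot products are cosine similarities
    z_b = F.normalize(z_b, dim=-1)
    logits = (z_a @ z_b.t()) / temperature   # (batch, batch) similarity matrix
    # Diagonal entries are positive pairs; off-diagonal entries are in-batch negatives.
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

# Illustrative usage: in the paper's setup the embeddings would come from the
# convolutional encoder + transformer context network plus a projection head.
z_a, z_b = torch.randn(32, 256), torch.randn(32, 256)
loss = in_batch_contrastive_loss(z_a, z_b)
```

Sampling negatives from within the batch, as above, avoids maintaining an explicit negative queue and scales the number of negatives with batch size, which matches the batch-level negative sampling described in the abstract.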


Published

2025-10-16

Section

Articles

How to Cite

[1]
F. de Mindonça and O.L.M. Smith, “Self-Supervised Audio Representation Learning for Robust Speaker Verification”, National Journal of Speech and Audio Processing, pp. 26–33, Oct. 2025, doi: 10.17051/NJSAP/01.04.04.