Self-Supervised Learning for Speech and Audio Analytics: A Comprehensive Review of Methods, Applications, and Future Research Directions

Authors

  • K.N. Kantor, Departamento de Engenharia Elétrica, Universidade Federal de Pernambuco - UFPE, Recife, Brazil
  • K.P. Sikalu, Electrical and Electronic Engineering Department, University of Ibadan, Nigeria

DOI:

https://doi.org/10.17051/NJSAP/01.02.02

Keywords:

Self-Supervised Learning, Speech Analytics, Audio Representation Learning, wav2vec, HuBERT, Edge AI, Multimodal Learning

Abstract

Self-Supervised Learning (SSL) has rapidly emerged as a transformative approach in speech and audio analytics, overcoming key drawbacks of supervised learning such as the need for large labeled training sets. This review critically examines SSL methods that exploit the inherent structure of audio signals through pretext tasks such as masked prediction, contrastive learning, and reconstruction. We evaluate state-of-the-art paradigms, including wav2vec 2.0, HuBERT, BYOL-A, and data2vec, detailing their architectural designs, training strategies, and benchmark results on tasks such as automatic speech recognition, speaker verification, emotion recognition, and music information retrieval. A comparative analysis highlights the trade-offs among accuracy, computational efficiency, and domain adaptability. Emerging directions are also discussed, including multimodal SSL that combines audio with visual and textual input, federated SSL for privacy-preserving learning, and edge-optimized SSL for deployment on low-power devices. Finally, the review outlines strategic directions for advancing SSL in real-world applications, identifying open research challenges such as scalability, cross-lingual generalization, and interpretability. This synthesis aims to help researchers and practitioners build efficient, effective, and ethically sound SSL systems as the field of speech and audio processing continues to mature.
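
To illustrate how pretrained SSL models of the kind surveyed here are typically consumed downstream, the minimal sketch below extracts frame-level representations from a pretrained wav2vec 2.0 checkpoint. The Hugging Face transformers library and the facebook/wav2vec2-base checkpoint are assumptions made for illustration only; the paper does not prescribe any specific implementation.

    # Minimal sketch: extracting SSL speech representations with a pretrained
    # wav2vec 2.0 model. Assumes the Hugging Face `transformers` library and the
    # publicly released "facebook/wav2vec2-base" checkpoint; neither is specified
    # by the reviewed paper itself.
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    model.eval()

    # Dummy 1-second mono waveform at 16 kHz; replace with real audio loaded
    # via e.g. torchaudio or librosa.
    waveform = torch.zeros(16000)

    inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state  # shape: (batch, frames, 768)

    # These frame-level embeddings can then feed a lightweight task head for
    # ASR, speaker verification, or emotion recognition.
    print(hidden_states.shape)

In practice, such embeddings are either fine-tuned end to end with a small labeled set or frozen and paired with a shallow classifier, which is how the trade-off between accuracy and computational cost discussed above typically plays out.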


Published

2025-03-13

Issue

Section

Articles

How to Cite

[1]
K.N. Kantor and K.P. Sikalu, “Self-Supervised Learning for Speech and Audio Analytics: A Comprehensive Review of Methods, Applications, and Future Research Directions”, National Journal of Speech and Audio Processing, pp. 10–19, Mar. 2025, doi: 10.17051/NJSAP/01.02.02.