Lightweight Speaker Identification Framework Using Deep Embeddings for Real-Time Voice Biometrics

Robbi Rahim

doi:10.17051/NJSAP/01.01.03

Authors

Robbi Rahim, Sekolah Tinggi Ilmu Manajemen Sukma, Medan, Indonesia

DOI:

https://doi.org/10.17051/NJSAP/01.01.03

Keywords:

Speaker identification, voice biometrics, deep embeddings, real-time inference, lightweight CNN, quantization, edge deployment

Abstract

Speaker identification underpins voice biometric systems and is important for secure, personalized human–machine interaction. However, deploying responsive and reliable speaker recognition models on edge devices constrained in memory, latency, and compute remains a problem. In this paper, we propose a lightweight speaker identification framework built on deep speaker embeddings produced by a considerably smaller convolutional neural network (CNN) architecture. A time-delay CNN front end extracts a fixed-length embedding from a variable-length utterance, and average pooling with a cosine-similarity-based classifier provides low-latency inference. Additionally, quantization-aware training and pruning are used to optimize runtime performance, reducing the model size by over 60% while retaining accuracy. The proposed model is evaluated on VoxCeleb1 and a custom low-resource dataset, where it achieves Top-1 accuracies of 94.2% and 92.5%, respectively, with inference latency under 30 milliseconds on a Raspberry Pi 4. These results show that deep-embedding speaker identification is practical on embedded platforms while retaining robustness and precision.
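
The pooling and scoring steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `utterance_embedding` and `identify`, the 8-dimensional features, and the synthetic "alice" and "bob" speakers are all assumptions made for illustration, and random vectors stand in for the output of the CNN front end.

```python
import numpy as np

def utterance_embedding(frame_features):
    """Average-pool (T, D) frame-level features into one unit-length D-vector.

    T may differ between utterances; the output size depends only on D.
    """
    pooled = frame_features.mean(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-12)

def identify(embedding, enrolled):
    """Return (speaker, score) for the enrolled speaker with highest cosine similarity.

    `enrolled` maps a speaker name to a unit-length reference embedding.
    """
    names = list(enrolled)
    refs = np.stack([enrolled[name] for name in names])
    # Dot product equals cosine similarity because all vectors are unit-length.
    scores = refs @ embedding
    best = int(np.argmax(scores))
    return names[best], float(scores[best])

# Hypothetical stand-in for CNN frame outputs: each speaker has a distinct
# mean direction in feature space, and every frame adds Gaussian noise.
rng = np.random.default_rng(0)
dim = 8
centers = {"alice": 3.0 * np.eye(dim)[0], "bob": 3.0 * np.eye(dim)[1]}

def fake_frames(speaker, num_frames):
    return centers[speaker] + 0.5 * rng.standard_normal((num_frames, dim))

# Enrollment and test utterances deliberately have different lengths.
enrolled = {name: utterance_embedding(fake_frames(name, 200)) for name in centers}
speaker, score = identify(utterance_embedding(fake_frames("alice", 120)), enrolled)
print(speaker, round(score, 3))
```

Because scoring reduces to one matrix-vector product over the enrolled speakers, its cost is small compared with the front end, which is consistent with the abstract's emphasis on low-latency inference.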

