Embedded Implementation of Speech-to-Text Translation Using Compressed Deep Neural Networks

Authors

  • J. Btia, Faculty of Engineering, Ain Shams University & Arab Academy for Science and Technology, Cairo, Egypt
  • Gichoya David, Department of Computing and Information Technology, Kenyatta University, Nairobi, Kenya

DOI:

https://doi.org/10.17051/NJSIP/01.03.06

Keywords:

Embedded Systems, Speech-to-Text Translation, Model Compression, Deep Neural Networks, Quantization, Pruning, Knowledge Distillation, TinyML, Low-Power AI, Edge Inference.

Abstract

Real-time speech-to-text (STT) translation has become a key component of voice-controlled embedded applications, spanning Internet of Things (IoT) devices, wearable medical devices, and voice-enabled industrial control systems. However, deploying accurate and responsive STT on embedded systems is hampered by severe constraints on processing power, memory, and energy. This paper proposes a compressed deep neural network (DNN) architecture for efficient STT translation on low-power embedded hardware. Our approach combines three complementary model-compression techniques: magnitude-based pruning to discard redundant weights, post-training quantization to reduce numerical precision and memory usage, and knowledge distillation to transfer the performance of a large teacher model to a lightweight student. These techniques are integrated into a modular STT pipeline consisting of an MFCC-based feature extractor, a compact acoustic model (CNN-RNN or Transformer based), a quantized language model, and a decoder optimized for fixed-point operation. Deployment targets include popular embedded processors such as the ARM Cortex-M7-based STM32F746 and the RISC-V-based Kendryte K210, using toolchains such as TensorFlow Lite Micro, CMSIS-NN, and the Kendryte SDK. Experiments on the LibriSpeech and Mozilla Common Voice benchmarks show that the optimized models reduce memory footprint by up to 4.2x and energy consumption by more than 35%, with less than a 2% degradation in word error rate (WER) relative to full-precision baselines. With an average inference latency below 100 milliseconds, on-device transcription approaches real-time performance without relying on cloud processing. This work demonstrates that low-latency, energy-efficient, and multilingual STT systems can be deployed on embedded targets, paving the way for privacy-preserving, offline voice interfaces in next-generation smart devices.
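
To make the compression pipeline described in the abstract concrete, the following Python sketch illustrates two of the three stages with standard tooling: magnitude-based pruning applied directly to layer weights, followed by post-training full-integer quantization via the TensorFlow Lite converter (the same flow that feeds TensorFlow Lite Micro deployments). This is a minimal sketch, not the authors' implementation; the model file name, the 50% sparsity level, and the MFCC input shape used by the representative dataset are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights (magnitude-based pruning)."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return weights * (np.abs(weights) > threshold)


# Hypothetical trained Keras acoustic model (stand-in for the paper's CNN-RNN model).
model = tf.keras.models.load_model("acoustic_model.keras")

# Prune only multi-dimensional kernels (conv/dense); leave 1-D bias vectors intact.
for layer in model.layers:
    weights = layer.get_weights()
    if weights:
        layer.set_weights(
            [magnitude_prune(w, 0.5) if w.ndim > 1 else w for w in weights]
        )


def representative_dataset():
    """Calibration batches for activation ranges; (1, 98, 40, 1) is a placeholder
    MFCC feature shape, not taken from the paper."""
    for _ in range(100):
        yield [np.random.randn(1, 98, 40, 1).astype(np.float32)]


# Post-training full-integer quantization: the resulting .tflite model executes
# with fixed-point operations via TensorFlow Lite Micro / CMSIS-NN kernels.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("stt_acoustic_int8.tflite", "wb") as f:
    f.write(converter.convert())
```

Note that one-shot pruning as sketched here zeroes weights but does not by itself shrink the stored model; the memory savings reported in the paper would additionally depend on sparse storage or structured pruning, and pruned models are typically fine-tuned briefly to recover accuracy before quantization.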

Published

2025-07-24

Section

Articles