Case Study: Multimodal Context-Aware Voice Command System for Real-Time Drone Navigation

Authors

  • Nisha Milind Shrirao, Department of Electrical and Electronics Engineering, Kalinga University, Raipur, India
  • Nidhi Mishra, Assistant Professor, Department of CS & IT, Kalinga University, Raipur, India

DOI:

https://doi.org/10.17051/NJSAP/01.02.03

Keywords:

Voice-Controlled UAV, Automatic Speech Recognition (ASR), Multimodal Fusion, Context-Aware Navigation, Obstacle Detection, Edge Computing

Abstract

This paper presents an extensive real-world case study on the design, development, and evaluation of a multimodal, context-aware voice command system for real-time navigation of an unmanned aerial vehicle (UAV). The proposed framework addresses the limitations of traditional speech-only drone control, which is known to degrade in noisy or visually cluttered conditions. The system comprises an automatic speech recognition (ASR) pipeline tailored to UAV command sets, environmental noise suppression via Wiener filtering and spectral embedding, and vision-based obstacle detection using a YOLOv5 deep learning model. These modalities are integrated by a context-aware reasoning engine that dynamically reconciles voice commands with real-time sensor data to make safe, contextually accurate navigation decisions. The system was implemented on a quadcopter platform with onboard edge-computing hardware and evaluated in indoor and outdoor testbeds under varying acoustic (35–75 dB) and lighting conditions. Experimental results show that the multimodal system achieves voice-command recognition accuracies of 92.6 percent in quiet conditions and 86.4 percent under high noise, for an overall average of 90.5 percent, exceeding the speech-only baseline (86.4 percent in quiet conditions, 84.1 percent under high noise) by an average of 6.3 percentage points. Multimodal fusion also reduced navigation errors by 32 percent, most notably in scenarios with dynamically placed obstacles and environmental interference. These results demonstrate the value of fusing speech, vision, and sensor data for efficient, low-latency, context-aware UAV navigation. The proposed system shows strong promise for mission-critical applications such as search and rescue, field inspection, and human-robot teaming, where natural and dependable voice-based control is essential.
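The abstract describes the pipeline only at a high level and the page publishes no code. As a minimal illustrative sketch of how such a speech-vision handshake could look in Python, the fragment below applies Wiener filtering to the microphone signal, runs YOLOv5 obstacle detection on a camera frame, and gates a recognized command against the detections. All function names (denoise, detect_obstacles, fuse), thresholds, and the command vocabulary are assumptions for illustration, not the authors' implementation; only the Ultralytics hub loading call and the scipy Wiener filter are real public APIs.

```python
# Illustrative sketch only: names, thresholds, and the command set below
# are assumptions, not the authors' code.
import numpy as np
import torch
from scipy.signal import wiener

# Pretrained YOLOv5 model from the Ultralytics hub (real, public API).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

COMMANDS = {"takeoff", "land", "forward", "back", "left", "right", "hover"}

def denoise(audio: np.ndarray) -> np.ndarray:
    """Suppress broadband environmental noise with a Wiener filter before
    the signal reaches the ASR stage (window size is a guess)."""
    return wiener(audio, mysize=29)

def detect_obstacles(frame: np.ndarray, conf: float = 0.5) -> torch.Tensor:
    """Run YOLOv5 on a camera frame and keep confident detections.
    Each row of the result is [x1, y1, x2, y2, confidence, class]."""
    det = model(frame).xyxy[0]
    return det[det[:, 4] >= conf]

def fuse(command: str, detections: torch.Tensor, frame_width: int) -> str:
    """Context-aware reconciliation (hypothetical policy): veto a voice
    command that would steer the UAV toward a detected obstacle."""
    if command not in COMMANDS:
        return "hover"  # unintelligible or out-of-vocabulary speech
    if command == "forward" and len(detections) > 0:
        centers = (detections[:, 0] + detections[:, 2]) / 2.0
        # Obstacle roughly centered in the field of view: hold position.
        if ((centers > 0.3 * frame_width) & (centers < 0.7 * frame_width)).any():
            return "hover"
    return command
```

In the paper's actual system, the command string would come from the UAV-tailored ASR model and the reasoning engine would also consult onboard range and inertial sensors; the sketch only shows the shape of the speech-vision reconciliation step.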

Published

2025-04-06

Section

Articles

How to Cite

[1]
Nisha Milind Shrirao and Nidhi Mishra, "Case Study: Multimodal Context-Aware Voice Command System for Real-Time Drone Navigation", National Journal of Speech and Audio Processing, pp. 20–26, Apr. 2025, doi: 10.17051/NJSAP/01.02.03.