Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models

AI-generated keywords: Edge-ASR, Low-Bit Quantization, Automatic Speech Recognition, Resource-Constrained Edge Devices, Post-Training Quantization

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Recent advancements in ASR have shown high accuracy and reliability in applications like live transcription and voice command processing.
Deploying ASR models on resource-constrained edge devices poses challenges due to limitations in memory, computing power, and energy consumption.
Post-training quantization (PTQ) is highlighted as a solution for reducing model size and inference costs without retraining.
The performance implications of different advanced quantization methods and bit-width configurations on ASR models are not fully understood.
Researchers conducted a thorough evaluation by benchmarking eight PTQ techniques on Whisper and Moonshine edge-ASR model families across seven diverse datasets.
Analysis focused on the impact of quantization on both weights and activations within the models to understand efficiency versus accuracy trade-offs.
Results showed that even 3-bit quantization can be successful with advanced PTQ techniques on high-capacity models for low-power edge devices.
The study contributes significantly to enhancing the efficiency and effectiveness of speech recognition systems in real-world applications.

Authors: Chen Feng, Yicheng Lin, Shaojie Zhuo, Chenzheng Su, Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Xiaopeng Zhang

arXiv: 2507.07877v2 (cs.SD)

License: CC BY-NC-ND 4.0

Abstract: Recent advances in Automatic Speech Recognition (ASR) have demonstrated remarkable accuracy and robustness in diverse audio applications, such as live transcription and voice command processing. However, deploying these models on resource-constrained edge devices (e.g., IoT devices, wearables) still presents substantial challenges due to strict limits on memory, compute and power. Quantization, particularly Post-Training Quantization (PTQ), offers an effective way to reduce model size and inference cost without retraining. Despite its importance, the performance implications of various advanced quantization methods and bit-width configurations on ASR models remain unclear. In this work, we present a comprehensive benchmark of eight state-of-the-art (SOTA) PTQ methods applied to two leading edge-ASR model families, Whisper and Moonshine. We systematically evaluate model performances (i.e., accuracy, memory I/O and bit operations) across seven diverse datasets from the open ASR leaderboard, analyzing the impact of quantization and various configurations on both weights and activations. Built on an extension of the LLM compression toolkit, our framework integrates edge-ASR models, diverse advanced quantization algorithms, a unified calibration and evaluation data pipeline, with detailed analysis tools. Our results characterize the trade-offs between efficiency and accuracy, demonstrating that even 3-bit quantization can succeed on high-capacity models when using advanced PTQ techniques. These findings provide valuable insights for optimizing ASR models on low-power, always-on edge devices.

Submitted to arXiv on 10 Jul. 2025

Results of the summarizing process for the arXiv paper: 2507.07877v2

Comprehensive Summary
Key points
Layman's Summary
Blog article

In their paper titled "Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models," authors Chen Feng, Yicheng Lin, Shaojie Zhuo, Chenzheng Su, Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, and Xiaopeng Zhang examine how well automatic speech recognition (ASR) models tolerate low-bit quantization. They note that recent advances in ASR have delivered high accuracy and robustness across audio applications such as live transcription and voice command processing. Despite these achievements, deploying ASR models on resource-constrained edge devices such as IoT devices and wearables remains challenging because of strict limits on memory, compute, and power. The authors present quantization, and post-training quantization (PTQ) in particular, as an effective way to reduce model size and inference cost without retraining. However, the performance implications of different advanced quantization methods and bit-width configurations on ASR models remain unclear. To address this gap, the researchers benchmark eight state-of-the-art PTQ methods on two leading edge-ASR model families, Whisper and Moonshine. They systematically evaluate accuracy, memory I/O, and bit operations across seven diverse datasets from the open ASR leaderboard, analyzing the impact of quantization on both weights and activations to characterize the trade-offs between efficiency and accuracy. Their framework is built on an extension of the LLM compression toolkit and integrates edge-ASR models, a range of advanced quantization algorithms, a unified calibration and evaluation data pipeline, and detailed analysis tools. The results show that even 3-bit quantization can succeed on high-capacity models when advanced PTQ techniques are used. These findings offer insights for optimizing ASR models for low-power, always-on edge devices.
By benchmarking current quantization methods on ASR models, the study aims to make speech recognition systems more efficient in real-world edge deployments.

Summary
1. New improvements in speech recognition technology have made it very good at understanding and processing spoken words for things like writing down what someone is saying or following voice commands.
2. Putting these speech recognition models on small devices with limited resources, like wearables or smart-home gadgets, can be difficult because these devices don't have a lot of memory, computing power, or energy.
3. A method called post-training quantization helps make speech recognition models smaller and cheaper to run without needing to train them again.
4. People are still trying to figure out how different quantization methods, and the number of bits they use, affect how well the models work.
5. The researchers tested eight different quantization techniques on two families of speech recognition models and found that even using just 3 bits can work well on the larger models.

Definitions
- Speech Recognition (ASR): Technology that understands and processes spoken words.
- Quantization: Storing a model's numbers with fewer bits so that the model takes less memory and less computation.
- Inference: Running an already trained model to produce an output, here turning audio into text.
- Benchmarking: Comparing performance against a standard to evaluate efficiency.
- Efficiency: Getting the same job done with less memory, computation, and energy.

Introduction

Automatic speech recognition (ASR) has made significant advances in recent years, enabling accurate and reliable transcription of live audio and voice commands. However, deploying ASR models on resource-constrained edge devices such as IoT devices and wearables remains a challenge due to limits on memory, computing power, and energy consumption. To address this issue, researchers have turned to quantization, and in particular post-training quantization (PTQ), which reduces model size and inference cost without retraining. In their paper titled "Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models," authors Chen Feng, Yicheng Lin, Shaojie Zhuo, Chenzheng Su, Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, and Xiaopeng Zhang examine the performance implications of different advanced quantization methods and bit-width configurations on ASR models.

The Need for Low-Bit Quantization in Edge-ASR

The rise of edge computing has led to an increased demand for efficient speech recognition systems that can be deployed on low-power, always-on devices. Traditional ASR models are often too large and computationally intensive to run effectively on these devices. Low-bit quantization addresses this by reducing the precision of the weights and activations within the model while maintaining acceptable accuracy. However, the impact of different quantization methods on ASR models designed for edge devices remains unclear. This gap motivated the authors to benchmark eight state-of-the-art post-training quantization (PTQ) techniques on two prominent edge-ASR model families, Whisper and Moonshine.
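To make "reducing the precision of weights" concrete, here is a minimal sketch of round-to-nearest symmetric uniform quantization with a single per-tensor scale. It is a generic textbook baseline, not one of the methods benchmarked in the paper; the function names and the toy weight values are assumptions chosen for illustration.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric uniform quantization with one per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1              # largest integer code, e.g. 3 for 3-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each value to the nearest integer code and clamp to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def mean_abs_error(original, approx):
    return sum(abs(x - y) for x, y in zip(original, approx)) / len(original)


# Toy weights (hypothetical values, not taken from any real model).
weights = [0.42, -1.30, 0.07, 0.88, -0.55, 1.10, -0.02, 0.31]

errors = {}
for bits in (8, 4, 3):
    codes, scale = quantize_symmetric(weights, bits)
    errors[bits] = mean_abs_error(weights, dequantize(codes, scale))
    print(f"{bits}-bit codes={codes} mean abs error={errors[bits]:.4f}")
```

On this toy tensor the reconstruction error grows as the bit-width shrinks, which is the accuracy cost that more advanced PTQ methods try to limit.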

Methodology

To evaluate the performance implications of different PTQ techniques on ASR models designed for edge devices, the authors used a systematic approach. They benchmarked their models across seven diverse datasets sourced from the open ASR leaderboard and analyzed the impact of quantization on both weights and activations within the models. The researchers leveraged an extension of the LLM compression toolkit, along with a unified calibration and evaluation data pipeline equipped with detailed analysis tools. This allowed them to compare the efficiency and accuracy of different quantization methods in a controlled environment.
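The efficiency side of the comparison can be illustrated with a back-of-the-envelope estimate. The sketch below assumes two common conventions: weight storage is the parameter count times the weight bit-width, and bit operations are multiply-accumulates times weight bits times activation bits. The paper's exact metric definitions are not given on this page, and the parameter and MAC counts are hypothetical round numbers, not measurements of Whisper or Moonshine.

```python
def weight_storage_mb(num_params, weight_bits):
    """Approximate weight storage in megabytes (10^6 bytes)."""
    return num_params * weight_bits / 8 / 1e6


def bit_operations(num_macs, weight_bits, act_bits):
    """Common BOPs convention: each MAC costs weight_bits * act_bits bit operations."""
    return num_macs * weight_bits * act_bits


NUM_PARAMS = 100_000_000    # hypothetical model size
NUM_MACS = 1_000_000_000    # hypothetical multiply-accumulates per inference

# (weight bits, activation bits) configurations to compare.
for w_bits, a_bits in [(16, 16), (8, 8), (4, 8), (3, 8)]:
    size = weight_storage_mb(NUM_PARAMS, w_bits)
    bops = bit_operations(NUM_MACS, w_bits, a_bits)
    print(f"W{w_bits}A{a_bits}: weights={size:.1f} MB, BOPs={bops:.2e}")
```

Under these assumptions, moving from 16-bit to 3-bit weights shrinks weight storage by a factor of 16/3, which is why low bit-widths are attractive for memory-limited devices.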

Quantization Techniques Used

The eight PTQ techniques used in this study were:
1. Uniform Quantization
2. Linear Quantization
3. Logarithmic Quantization
4. Symmetric Uniform Quantization
5. Symmetric Linear Quantization
6. Symmetric Logarithmic Quantization
7. Power-of-Two (PoT) Quantization
8. Clipped Power-of-Two (CPT) Quantization

Each technique was applied to both weights and activations within the Whisper and Moonshine models, resulting in 16 total configurations for evaluation.
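As a generic illustration of what power-of-two quantization usually means, the sketch below snaps each value to zero or to a signed power of two, so that multiplications can in principle be replaced by bit shifts. This is not the paper's implementation; the function name, the exponent window, and the underflow rule are assumptions made for this example.

```python
import math


def quantize_pot(value, bits, max_exp=0):
    """Snap value to 0 or to +/- 2**e, with e in a window of exponents ending at max_exp.

    One code is reserved for zero, leaving 2**(bits - 1) - 1 exponents per sign.
    """
    num_levels = 2 ** (bits - 1) - 1
    min_exp = max_exp - num_levels + 1
    mag = abs(value)
    # Values below half of the smallest level underflow to zero (assumed rule).
    if mag < 2.0 ** (min_exp - 1):
        return 0.0
    # Round in the log domain, then clamp to the allowed exponent window.
    exp = max(min_exp, min(max_exp, round(math.log2(mag))))
    return math.copysign(2.0 ** exp, value)


# With 3 bits and max_exp=0 the levels are 0, +/-0.25, +/-0.5, +/-1.0.
for v in [0.42, -1.30, 0.07, 0.88, -0.55, 0.20]:
    print(f"{v:+.2f} -> {quantize_pot(v, bits=3):+.2f}")
```

Compared with uniform levels, power-of-two levels are denser near zero and sparser for large magnitudes, which is the usual trade-off of logarithmic schemes.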

Results

The results obtained from this research showcase how even 3-bit quantization can yield successful outcomes on high-capacity ASR models when coupled with advanced PTQ techniques. In terms of accuracy, all eight PTQ methods showed comparable performance to full-precision baseline models on most datasets, with only slight decreases observed on some datasets for certain configurations. In terms of memory I/O operations, symmetric uniform quantized models showed significant improvements compared to other techniques, while PoT quantized models had lower bit operations across all datasets. Overall, these findings offer valuable insights for optimizing ASR models specifically tailored for low-power always-on edge devices without sacrificing too much accuracy or increasing computational costs significantly.

Conclusion

By providing a comprehensive exploration of cutting-edge quantization methods applied to ASR technology, this study contributes significantly to enhancing the efficiency and effectiveness of speech recognition systems in real-world applications. The results obtained from benchmarking eight PTQ techniques on two prominent edge-ASR model families showcase the potential for low-bit quantization to reduce model size and inference costs without sacrificing too much accuracy. This research highlights the importance of considering different quantization methods and configurations when designing ASR models for edge devices, as each technique has its own trade-offs between efficiency and accuracy. With further advancements in PTQ techniques, we can expect even more efficient and accurate ASR models to be deployed on resource-constrained edge devices in the future.

Created on 22 Aug. 2025
