A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency

AI-generated keywords: End-to-end Speech Recognition RNN-T LAS Latency

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Until this work, end-to-end (E2E) models had not been shown to outperform state-of-the-art conventional models in both speech recognition quality and latency.
A study by Tara N. Sainath and her team develops an E2E model that surpasses a conventional model in both respects.
The researchers incorporated a large number of utterances from varied domains to increase acoustic diversity and vocabulary exposure.
They trained the model with accented English speech to make it more robust to different pronunciations.
A varied learning rate schedule was explored because of the increased amount of training data.
To reduce latency, the team used the end-of-sentence decision emitted by the RNN-T model to close the microphone and introduced various optimizations to speed up LAS rescoring.
The RNN-T+LAS model offers a better tradeoff between word error rate (WER) and latency than a conventional model: for the same latency, it obtains an 8% relative improvement in WER while being more than 400 times smaller in model size.
The study presents an on-device E2E speech recognition system that outperforms a server-side conventional model in both quality and latency. This could matter for real-time applications such as virtual assistants or voice-controlled devices, where low latency is crucial for user experience.

Authors: Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho Jin, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirko Visontai, Yonghui Wu, Yu Zhang, Ding Zhao

arXiv: 2003.12710v2 (cs.CL)

In Proceedings of IEEE ICASSP 2020

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400 times smaller in model size.
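The abstract's quality metric and its "8% relative improvement" can be made concrete with a small sketch. The code below computes WER in the standard way (word-level edit distance divided by the number of reference words) and a relative improvement between two WER values. The function names and the example WER figures (10.0% and 9.2%) are illustrative assumptions, not numbers or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of each outer iteration, prev[j] is the edit distance
    # between the reference words processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# One deletion ("the") and one substitution ("lights" -> "light") over 5 words.
print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))

# Hypothetical WERs: dropping from 10.0% to 9.2% is an 8% relative improvement.
print(relative_improvement(10.0, 9.2))
```

Note that a relative improvement is a ratio of WERs, so 8% relative corresponds to a much smaller absolute change when the baseline WER is already low.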

Submitted to arXiv on 28 Mar. 2020

Results of the summarizing process for the arXiv paper: 2003.12710v2

Comprehensive Summary

In speech recognition, end-to-end (E2E) models had not previously been shown to outperform state-of-the-art conventional models in both quality, measured as word error rate (WER), and latency, measured as the time at which the hypothesis is finalized after the user stops speaking. A study by Tara N. Sainath and her team develops a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that together surpass a conventional model in both respects. To improve quality, the researchers incorporated a large number of utterances from varied domains to increase acoustic diversity and vocabulary exposure. They also trained the model with accented English speech to make it more robust to different pronunciations. Because of the increased amount of training data, they additionally explored a varied learning rate schedule. On the latency front, the team explored using the end-of-sentence decision emitted by the RNN-T model to close the microphone and introduced various optimizations to speed up LAS rescoring. The resulting RNN-T+LAS model offers a better tradeoff between WER and latency than a conventional model. For example, for the same latency, RNN-T+LAS obtains an 8% relative improvement in WER while being more than 400 times smaller in model size. Overall, the study presents an on-device E2E speech recognition system that outperforms a server-side conventional model in both quality and latency. This could have significant implications for real-time applications such as virtual assistants or voice-controlled devices, where low latency is crucial for user experience.
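The two-pass idea described above (a first pass proposes hypotheses, a second pass rescores them) can be sketched in a few lines. This is a minimal illustration only: the linear interpolation of log-scores, the `weight` parameter, and the toy score table standing in for a second-pass model are all assumptions made for the example, and the summary does not state how the paper combines the two passes.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    second_pass_score: Callable[[str], float],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank first-pass hypotheses using a second-pass scorer.

    nbest holds (text, first_pass_log_score) pairs. The combined score is a
    linear interpolation of the two log-scores (an assumed scheme).
    """
    rescored = [
        (text, (1.0 - weight) * first_score + weight * second_pass_score(text))
        for text, first_score in nbest
    ]
    # Highest combined score first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical first-pass n-best list and a toy stand-in for the second pass.
first_pass = [("call mum", -1.0), ("call mom", -1.2)]
toy_second_pass = {"call mum": -3.0, "call mom": -0.5}

best_text, best_score = rescore_nbest(first_pass, toy_second_pass.__getitem__)[0]
print(best_text, best_score)
```

Because the second pass only scores a short list of complete hypotheses rather than decoding from scratch, its cost scales with the size of that list, which is one reason rescoring speed is a natural target for latency optimization.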

Layman's Summary

1. A study found a new way to make computers understand what people say better and faster.
2. They used many different types of talking to teach the computer how to recognize words better.
3. They also made the computer learn from people who talk differently, so it can understand more accents.
4. They tried different ways of teaching the computer, depending on how much talking they used.
5. The new way is much better than the old way for making computers understand speech quickly and accurately.

Definitions

- End-to-end (E2E) models: A type of computer program that tries to do everything at once, instead of breaking it down into smaller steps.
- Latency: The time it takes for a computer to respond after someone talks or does something.
- Utterances: Things that people say out loud.
- Acoustic diversity: Different types of sounds and voices that the computer hears when people talk.
- Vocabulary exposure: How many different words and phrases the computer hears when people talk.

Exploring the Benefits of an End-to-End Speech Recognition System

The field of speech recognition has seen a great deal of progress in recent years, with conventional models leading the way. Until recently, end-to-end (E2E) models had not been shown to surpass state-of-the-art conventional models in both quality and latency. A study by Tara N. Sainath and her team develops a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that together surpass a conventional model in both respects.

Improving Quality Through Acoustic Diversity

To improve the quality of their model, the researchers incorporated a large number of utterances from varied domains to increase acoustic diversity and vocabulary exposure. They also trained the model with accented English speech to make it more robust to different pronunciations. Additionally, they explored using a varied learning rate schedule due to the increased amount of training data.

Reducing Latency With Optimizations

On the latency front, the team explored using the end-of-sentence decision emitted by the RNN-T model to close the microphone and introduced various optimizations to speed up LAS rescoring. The resulting RNN-T+LAS model offers a better tradeoff between word error rate (WER) and latency than a conventional model. For example, for the same latency, RNN-T+LAS obtains an 8% relative improvement in WER while being more than 400 times smaller in model size than the server-side conventional model.
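The idea of closing the microphone on an end-of-sentence decision can be illustrated with a simple threshold rule. The sketch below assumes the recognizer exposes a per-frame end-of-sentence probability and that the microphone closes at the first frame where it crosses a threshold; the function name, the `threshold` value, and the probability stream are hypothetical, and the summary does not describe how the paper actually makes this decision.

```python
from typing import Iterable, Optional


def mic_close_frame(eos_probs: Iterable[float], threshold: float = 0.9) -> Optional[int]:
    """Return the index of the first frame whose end-of-sentence probability
    reaches the threshold, or None if no frame does."""
    for frame_index, prob in enumerate(eos_probs):
        if prob >= threshold:
            return frame_index
    return None


# Hypothetical per-frame end-of-sentence probabilities from a streaming model.
stream = [0.01, 0.02, 0.10, 0.60, 0.95, 0.99]
print(mic_close_frame(stream))
```

The threshold expresses the tradeoff the text refers to: a lower value closes the microphone sooner and reduces latency, but risks cutting the user off mid-utterance, which shows up as deletions in WER.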

Implications for Real Time Applications

Overall, this study presents an on-device E2E speech recognition system that outperforms a server-side conventional model in both quality and latency. This could have significant implications for real-time applications such as virtual assistants or voice-controlled devices, where low latency is crucial for user experience.

Created on 22 Apr. 2023
