A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency

AI-generated keywords: End-to-end, Speech Recognition, RNN-T, LAS, Latency

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Before this work, end-to-end (E2E) models had not been shown to outperform state-of-the-art conventional models in both speech recognition quality and latency
  • Tara N. Sainath and her team developed an E2E model, a first-pass RNN-T with a second-pass LAS rescorer, that surpasses a conventional model in both aspects
  • The researchers incorporated a large number of utterances from varied domains to increase acoustic diversity and vocabulary exposure
  • They trained the model with accented English speech to make it more robust to different pronunciations
  • Given the increased amount of training data, they explored a varied learning rate schedule
  • To reduce latency, the team used the end-of-sentence decision emitted by the RNN-T model to close the microphone and introduced various optimizations to speed up LAS rescoring
  • The RNN-T+LAS model offers a better tradeoff between word error rate (WER) and latency than a conventional model: for the same latency, it obtains an 8% relative improvement in WER while being more than 400 times smaller in model size
  • The study presents an on-device E2E speech recognition system that outperforms a server-side conventional model in both quality and latency, which could have significant implications for real-time applications such as virtual assistants or voice-controlled devices, where low latency is crucial for user experience
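
The microphone-closing idea in the key points can be illustrated with a minimal sketch. It assumes the first-pass model exposes a per-frame end-of-sentence probability and that the microphone closes at the first frame where that probability reaches a threshold. The function name `close_mic_frame`, the threshold of 0.9, and the example probabilities are all hypothetical; the paper metadata does not describe the actual decision rule.

```python
from typing import Iterable, Optional


def close_mic_frame(eos_probs: Iterable[float], threshold: float = 0.9) -> Optional[int]:
    """Return the index of the first frame whose end-of-sentence
    probability reaches the threshold, or None if no frame does.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    for frame_index, prob in enumerate(eos_probs):
        if prob >= threshold:
            return frame_index
    return None


# Hypothetical per-frame end-of-sentence probabilities, for illustration only.
example_probs = [0.01, 0.02, 0.05, 0.40, 0.95, 0.99]
print(close_mic_frame(example_probs))
```

A lower threshold would close the microphone earlier (lower latency) at the risk of cutting the user off, which is one way to picture the quality and latency tradeoff the key points mention.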

Authors: Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho Jin, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirko Visontai, Yonghui Wu, Yu Zhang, Ding Zhao

In Proceedings of IEEE ICASSP 2020

Abstract: Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains a 8% relative improvement in WER, while being more than 400-times smaller in model size.
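
To make the quality metric in the abstract concrete, the following sketch computes word error rate as the word-level edit distance divided by the reference length, and a relative improvement between two WER values. The function names and all example inputs are hypothetical; in particular, the WER values 10.0 and 9.2 are made up to show what an 8% relative improvement means and are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical utterance pair: one deletion and one substitution over 4 words.
print(word_error_rate("turn on the lights", "turn the light"))
# Hypothetical WER values (in percent) chosen to give about 8% relative gain.
print(round(relative_improvement(10.0, 9.2), 4))
```

Note that a relative improvement is measured against the baseline WER, so 8% relative is a much smaller absolute change than 8 percentage points.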

Submitted to arXiv on 28 Mar. 2020

Results of the summarizing process for the arXiv paper: 2003.12710v2

Before this work, end-to-end (E2E) speech recognition models had not been shown to outperform state-of-the-art conventional models in both quality and latency. Tara N. Sainath and her team developed a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both aspects. To improve quality, the researchers incorporated a large number of utterances from varied domains to increase acoustic diversity and vocabulary exposure. They also trained the model with accented English speech to make it more robust to different pronunciations and, given the increased amount of training data, explored a varied learning rate schedule. To reduce latency, the team explored using the end-of-sentence decision emitted by the RNN-T model to close the microphone and introduced various optimizations to speed up LAS rescoring. The resulting RNN-T+LAS model offers a better tradeoff between word error rate (WER) and latency than a conventional model: for the same latency, it obtains an 8% relative improvement in WER while being more than 400 times smaller in model size. Overall, the study presents an on-device E2E speech recognition system that outperforms a server-side conventional model in both quality and latency. This could have significant implications for real-time applications such as virtual assistants or voice-controlled devices, where low latency is crucial for user experience.
Created on 22 Apr. 2023
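
The two-pass structure described in the summary, a first pass that proposes hypotheses and a second pass that rescores them, can be sketched as follows. This is a generic N-best rescoring illustration under stated assumptions: scores are log-probabilities, the two passes are combined by linear interpolation, and the names `Hypothesis`, `rescore`, and `weight` are hypothetical. The paper metadata does not say how the RNN-T and LAS scores are actually combined, and the example scores are invented.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    text: str
    first_pass_score: float   # log-probability from the first pass (assumed)
    second_pass_score: float  # log-probability from the rescorer (assumed)


def rescore(nbest: List[Hypothesis], weight: float = 0.5) -> Hypothesis:
    """Pick the hypothesis with the best interpolated score.

    weight=0.0 trusts only the first pass; weight=1.0 trusts only the
    second pass. Linear interpolation is an illustrative choice here.
    """
    def combined(h: Hypothesis) -> float:
        return (1.0 - weight) * h.first_pass_score + weight * h.second_pass_score

    return max(nbest, key=combined)


# Hypothetical N-best list with made-up scores, for illustration only.
nbest = [
    Hypothesis("play some jazz", first_pass_score=-1.2, second_pass_score=-0.8),
    Hypothesis("play sum jazz", first_pass_score=-1.0, second_pass_score=-2.5),
]
print(rescore(nbest).text)
```

In this toy example the first pass alone slightly prefers the second hypothesis, while the interpolated score prefers the first, which is the kind of correction a second-pass rescorer is meant to provide.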
