Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions

AI-generated keywords: Large Language Models

AI-generated Key Points

  • Recent studies have utilized prefixLM-type models that incorporate speech as a prefix to Large Language Models (LLMs) for Automatic Speech Recognition (ASR) tasks.
  • Optimization of speech prefixes is crucial for improving ASR performance.
  • The authors propose using RNNT loss for speech prefix-tuning, which shows promising results without increasing model complexity or altering the inference pipeline.
  • They also introduce language-based soft prompting to further enhance ASR performance with frozen LLMs.
  • Empirical analysis on 10 Indic languages shows that speech prefix-tuning with RNNT loss leads to improvements in both frozen and fine-tuned LLMs, resulting in a 12% relative improvement in Word Error Rate (WER) compared to baseline with a fine-tuned LLM.
  • Utilizing these approaches with frozen LLMs yields a significant 31% relative improvement over basic soft-prompting prefixLM techniques.
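
The relative improvements quoted above are ratios of WER values, not absolute differences. As a minimal sketch of that arithmetic, the following computes WER with a word-level edit distance and then the relative improvement between two systems. The function names and the numbers in the usage lines are illustrative assumptions and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values: a drop from 25.0% to 22.0% WER is a 12% relative gain.
gain = relative_improvement(25.0, 22.0)
wer = word_error_rate("the cat sat on mat", "the cat sit on")
```

Under this definition, a 12% relative improvement means the new WER is 0.88 times the baseline WER, whatever the absolute values are.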

Authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Neeraj Gaur, Zhong Meng

License: CC BY 4.0

Abstract: In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We have found that optimizing speech prefixes leads to better ASR performance and propose applying RNNT loss to perform speech prefix-tuning. This is a simple approach and does not increase the model complexity or alter the inference pipeline. We also propose language-based soft prompting to further improve performance with frozen LLMs. Empirical analysis on a real-time test set from 10 Indic languages demonstrates that our proposed speech prefix-tuning yields improvements with both frozen and fine-tuned LLMs. Our recognition results, averaged over the 10 Indic languages, show that the proposed prefix-tuning with RNNT loss results in a 12% relative improvement in WER over the baseline with a fine-tuned LLM. Our proposed approaches with the frozen LLM lead to a 31% relative improvement over basic soft-prompting prefixLM.
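
For readers unfamiliar with the RNNT loss named in the abstract, here is a minimal pure-Python sketch of the standard transducer forward recursion, which sums the probability of every alignment between T input frames and U output labels. It is a generic illustration only: how the paper attaches this loss to the speech prefix is not described on this page, and the function names, the `log_probs[t][u][k]` layout, and the choice of index 0 for blank are assumptions made for the example.

```python
import math


def log_sum_exp(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b)), tolerating -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_loss(log_probs, labels, blank=0):
    """Negative log-likelihood of `labels` under a transducer lattice.

    log_probs[t][u][k] is the log-probability of emitting symbol k at
    frame t after u labels have been emitted (shape T x (U+1) x V).
    """
    T = len(log_probs)
    U = len(labels)
    # alpha[t][u]: log-probability of reaching lattice node (t, u).
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive by emitting blank at the previous frame.
            from_blank = -math.inf
            if t > 0:
                from_blank = alpha[t - 1][u] + log_probs[t - 1][u][blank]
            # Arrive by emitting the u-th label at the current frame.
            from_label = -math.inf
            if u > 0:
                from_label = alpha[t][u - 1] + log_probs[t][u - 1][labels[u - 1]]
            alpha[t][u] = log_sum_exp(from_blank, from_label)
    # Every alignment ends with a final blank from node (T-1, U).
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])
```

Because this loss is only needed while training, a term of this kind can be dropped at decoding time, which is consistent with the abstract's statement that the inference pipeline is unchanged.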

Submitted to arXiv on 20 Jun. 2024

Results of the summarizing process for the arXiv paper: 2406.14701v1

This paper focuses on addressing constraints in applying Large Language Models (LLMs) to Automatic Speech Recognition (ASR). Recent studies have utilized prefixLM-type models that incorporate speech as a prefix to LLMs for ASR tasks. The optimization of speech prefixes has been identified as crucial for improving ASR performance. To address this, the authors propose using RNNT loss for speech prefix-tuning, which has shown promising results without increasing model complexity or altering the inference pipeline. Additionally, language-based soft prompting is introduced to further enhance ASR performance with frozen LLMs.

Empirical analysis on a real-time test set of 10 Indic languages demonstrates that the proposed speech prefix-tuning approach leads to improvements in both frozen and fine-tuned LLMs. Specifically, implementing prefix-tuning with RNNT loss results in a 12% relative improvement in Word Error Rate (WER) compared to the baseline with a fine-tuned LLM. Utilizing these approaches with frozen LLMs yields a significant 31% relative improvement over basic soft-prompting prefixLM techniques.

Insights into training and testing data statistics for various languages including Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, and Urdu are provided. The authors use universal speech models (USM) with different model complexities (300M and 600M parameters) for their speech encoder. These USM architectures leverage chunk-wise bi-directional attention and are trained using multilingual data from over 100 languages.

Details about the large language model used in the study are also provided. The LLM builds upon JAX-based M4 multipod models with different parameter sizes (128M and 500M). Both models are trained using vast amounts of text tokens and utilize relative positional embeddings and GELU activations for efficient training and inference.

In conclusion, the detailed experimentation and analysis presented in this paper demonstrate that incorporating speech prefix-tuning with RNNT loss can significantly enhance ASR performance when utilizing LLMs for various languages.
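
To make the prefixLM setup described in the summary more concrete, the sketch below assembles a decoder input from a language-specific soft prompt, a speech prefix, and text embeddings, and builds the usual prefixLM attention mask (bidirectional over the prefix, causal over the text) together with a loss mask that scores only text positions. This is an illustration under stated assumptions: the ordering of soft prompt and speech prefix, the `SOFT_PROMPTS` table, the language codes, and all function names are hypothetical and are not taken from the paper.

```python
# Hypothetical learned soft prompts, one short list of vectors per language.
SOFT_PROMPTS = {
    "hi": [[0.1, 0.2], [0.3, 0.4]],
    "ta": [[0.5, 0.6], [0.7, 0.8]],
}


def build_prefix_lm_inputs(language, speech_prefix, text_embeddings):
    """Return (sequence, attention_mask, loss_mask) for one utterance.

    speech_prefix and text_embeddings are lists of equal-width vectors.
    """
    soft_prompt = SOFT_PROMPTS[language]
    # Assumed ordering: soft prompt first, then speech, then text.
    prefix = soft_prompt + speech_prefix
    sequence = prefix + text_embeddings
    prefix_len = len(prefix)
    n = len(sequence)
    # PrefixLM mask: every position sees the whole prefix; text positions
    # additionally see earlier text positions and themselves.
    attention_mask = [
        [1 if (j < prefix_len or j <= i) else 0 for j in range(n)]
        for i in range(n)
    ]
    # Only text positions contribute to the language-model loss.
    loss_mask = [0] * prefix_len + [1] * len(text_embeddings)
    return sequence, attention_mask, loss_mask


speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three encoder frames
text = [[0.2, 0.2], [0.9, 0.9]]                # two text token embeddings
seq, attn, loss_mask = build_prefix_lm_inputs("hi", speech, text)
```

With a frozen LLM, only the soft prompt vectors and the speech-side parameters would be updated in a setup like this, while the LLM weights stay fixed.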
Created on 15 Aug. 2024
