AST: Audio Spectrogram Transformer
AI-generated Key Points
- The Audio Spectrogram Transformer (AST) is a convolution-free and attention-based model for audio classification.
- AST captures long-range global context directly from audio spectrograms, even in the lowest layers.
- Knowledge transfer from the Vision Transformer (ViT) pretrained on ImageNet enhances AST's performance.
- AST achieved state-of-the-art results at publication on several audio classification benchmarks: 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
- AST can handle variable-length inputs without requiring architectural changes, making it adaptable to different tasks.
- AST splits the 2D audio spectrogram into a sequence of patches, linearly projects each patch to a 1-D embedding, and adds positional encodings.
- A classification token is prepended to the sequence, which is then processed by a Transformer; the output for that token is passed through a linear layer to produce the final classification.
- Its benchmark results, combined with its ability to handle different input lengths and tasks with a single architecture, make AST a strong general-purpose choice for audio classification.
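The input pipeline described in the key points (patch splitting, linear projection, classification token, positional encodings) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the patch size, the non-overlapping stride, the embedding width, the spectrogram shape, and the function names `split_into_patches` and `embed` are all assumptions chosen for illustration, and the weights are random rather than learned. The Transformer itself and the final linear classifier are omitted.

```python
import numpy as np


def split_into_patches(spec, patch=16, stride=16):
    """Cut a (freq, time) spectrogram into flattened patch x patch tiles."""
    n_freq, n_time = spec.shape
    tiles = [
        spec[f:f + patch, t:t + patch].reshape(-1)
        for f in range(0, n_freq - patch + 1, stride)
        for t in range(0, n_time - patch + 1, stride)
    ]
    return np.stack(tiles)  # shape: (num_patches, patch * patch)


def embed(spec, w_proj, cls_token, pos_embed, patch=16, stride=16):
    """Project patches linearly, prepend the classification token, add positions."""
    tokens = split_into_patches(spec, patch, stride) @ w_proj       # (N, d_model)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)   # (N + 1, d_model)
    return tokens + pos_embed                                       # (N + 1, d_model)


# Illustrative sizes only (assumed, not taken from the summary above).
rng = np.random.default_rng(0)
d_model, patch = 64, 16
spec = rng.standard_normal((128, 96))        # 128 frequency bins x 96 time frames
num_patches = (128 // patch) * (96 // patch)  # 8 * 6 = 48

w_proj = rng.standard_normal((patch * patch, d_model)) * 0.02   # linear projection
cls_token = np.zeros(d_model)                                    # classification token
pos_embed = rng.standard_normal((num_patches + 1, d_model)) * 0.02

sequence = embed(spec, w_proj, cls_token, pos_embed)
print(sequence.shape)  # (49, 64): 48 patch tokens plus 1 classification token
```

In this sketch, a spectrogram with a different number of time frames only changes the number of patch tokens; the projection weights are unchanged, although the positional encodings would need to be sized to match the new sequence length.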
Authors: Yuan Gong, Yu-An Chung, James Glass
Abstract: In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.