Perceiver: General Perception with Iterative Attention

AI-generated keywords: Perceiver model

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The Perceiver model addresses the challenge of processing high-dimensional inputs from diverse modalities such as vision, audition, touch, and proprioception.
  • It builds on Transformers and therefore makes fewer architectural assumptions about the relationships among its inputs than traditional deep learning perception models.
  • Like Convolutional Neural Networks (ConvNets), the model scales to hundreds of thousands of inputs. It does so through an asymmetric attention mechanism that iteratively distills the inputs into a compact latent bottleneck.
  • The Perceiver demonstrates competitive performance on classification tasks involving images, point clouds, audio, video, and combined audio-video data without being constrained by modality-specific priors.
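
The scaling claim in the key points can be made concrete with a rough count of attention-score entries. This is a sketch under stated assumptions, not a figure taken from the paper metadata: M is the number of inputs, N is the number of latents, and N = 512 is an illustrative value chosen here. Only M = 50,000 comes from the abstract.

```latex
% M = number of inputs, N = number of latents (N \ll M).
% N = 512 is an assumed, illustrative value.
\begin{align*}
\text{self-attention over the inputs:} \quad & M^2 = 50{,}000^2 = 2.5 \times 10^{9} \\
\text{cross-attention, latents to inputs:} \quad & M N = 50{,}000 \times 512 = 2.56 \times 10^{7} \\
\text{self-attention over the latents:} \quad & N^2 = 512^2 \approx 2.6 \times 10^{5} \\
\text{ratio:} \quad & \frac{M^2}{M N} = \frac{M}{N} \approx 98
\end{align*}
```

Under these assumptions, routing attention through the latent bottleneck makes the cost linear in M rather than quadratic, which is what lets the input grow to hundreds of thousands of elements.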

Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira

ICML 2021

Abstract: Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.
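
The asymmetric attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`attend`, `perceiver_sketch`), the sizes, the random untrained weights, and the omission of layer normalization, MLP blocks, positional encodings, and multiple heads are all simplifying assumptions made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (hypothetical helper)."""
    q = queries @ w_q                          # (n_q, d)
    k = keys_values @ w_k                      # (n_kv, d)
    v = keys_values @ w_v                      # (n_kv, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_q, n_kv)
    return softmax(scores) @ v                 # (n_q, d)

def perceiver_sketch(inputs, num_latents=8, dim=16, num_iterations=2, seed=0):
    """Distill `inputs` of shape (M, C) into a small latent array.

    Assumption: weights are random and redrawn at every iteration;
    a real model would learn them (and the initial latents).
    """
    rng = np.random.default_rng(seed)
    in_dim = inputs.shape[-1]
    latents = rng.normal(size=(num_latents, dim))

    def w(rows, cols):
        return rng.normal(size=(rows, cols)) / np.sqrt(rows)

    for _ in range(num_iterations):
        # Asymmetric cross-attention: N latents query M inputs,
        # so the score matrix is (N, M) rather than (M, M).
        latents = latents + attend(latents, inputs,
                                   w(dim, dim), w(in_dim, dim), w(in_dim, dim))
        # Self-attention inside the latent bottleneck: (N, N) scores.
        latents = latents + attend(latents, latents,
                                   w(dim, dim), w(dim, dim), w(dim, dim))
    # Average over latents to get one feature vector for a classifier.
    return latents.mean(axis=0)

# Usage: 1,000 input elements with 3 channels each (e.g. a small point cloud).
inputs = np.random.default_rng(1).normal(size=(1000, 3))
features = perceiver_sketch(inputs)
print(features.shape)  # (16,)
```

The output size depends only on the latent dimensions, not on the number of inputs, so the same sketch accepts 1,000 or 100,000 input elements without any change.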

Submitted to arXiv on 04 Mar. 2021

Results of the summarizing process for the arXiv paper: 2103.03206v2

The Perceiver model, introduced in the paper "Perceiver: General Perception with Iterative Attention" by Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira, addresses the challenge of processing high-dimensional inputs from diverse modalities such as vision, audition, touch, and proprioception. The model builds on Transformers and therefore makes fewer architectural assumptions about input relationships than traditional deep learning perception models. A key feature is that it scales to hundreds of thousands of inputs, as Convolutional Neural Networks (ConvNets) do, through an asymmetric attention mechanism that iteratively distills inputs into a compact latent bottleneck. This design lets it process very large inputs without being constrained by modality-specific priors. In evaluations, the Perceiver is competitive with or outperforms strong, specialized models on classification tasks involving images, point clouds, audio, video, and combined audio-video data. Notably, it achieves results comparable to ResNet-50 and Vision Transformer (ViT) on ImageNet without using 2D convolutions, by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet. Overall, the paper presents a promising approach to general perception that combines the flexibility of Transformers with the ability to scale to large, multi-modal inputs.
Created on 21 Apr. 2024
