Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval

AI-generated keywords: Protein sequences, Deep generative models, Tranception, UniRef, ProteinGym

AI-generated Key Points

  • Accurate modeling of protein sequence fitness landscapes is crucial for various applications
  • Deep generative models trained on multiple sequence alignments have been the most successful approaches to these tasks, but they depend on sufficiently deep and diverse alignments
  • Large language models trained on non-aligned protein sequences show promise in bridging the performance gap
  • Tranception, a novel transformer architecture, leverages autoregressive predictions and retrieval of homologous sequences for state-of-the-art fitness prediction performance
  • The final Tranception model has 700M parameters and is trained on UniRef100, a large-scale protein sequence database; after preprocessing, the training set comprises approximately 250 million sequences
  • Sequences in the multiple sequence alignment are re-weighted to address biases in protein databases
  • ProteinGym, an extensive set of multiplexed assays of variant effects, is introduced to enable more rigorous testing across a wider range of protein families

Authors: Pascal Notin, Mafalda Dias, Jonathan Frazer, Javier Marchena-Hurtado, Aidan Gomez, Debora S. Marks, Yarin Gal

ICML 2022
License: CC BY 4.0

Abstract: The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is however contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers significant gain of scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym -- an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.

Submitted to arXiv on 27 May 2022

Results of the summarizing process for the arXiv paper: 2205.13760v1

Accurately modeling the fitness landscape of protein sequences is crucial for applications such as assessing the impact of human variants on disease susceptibility and predicting immune-escape mutations in viruses. Deep generative models trained on multiple sequence alignments (MSAs) have been the most successful approaches to these tasks so far. However, their performance depends on the availability of sufficiently deep and diverse alignments for training, and many protein families are hard, if not impossible, to align. Large language models trained on vast amounts of non-aligned protein sequences from diverse families avoid this dependence and show potential to close the performance gap.

Tranception is a novel transformer architecture that combines autoregressive predictions with retrieval of homologous sequences at inference time to achieve state-of-the-art fitness prediction performance. The model is trained on UniRef, a large-scale protein sequence database, and the authors conduct thorough ablations during its development. In contrast to earlier observations with masked-language-model architectures, they find that maintaining the granularity available in the dataset benefits downstream task performance. Extensive data processing and augmentation are applied to refine the training data. The final model has 700M parameters and is trained on UniRef100; after preprocessing, the training set comprises approximately 250 million sequences.

To address biases in protein databases that arise from human and evolutionary sampling, sequences in the MSA are re-weighted rather than counted equally. At inference, the log-likelihood of a protein sequence is estimated as a weighted arithmetic average of the log-probabilities obtained from the autoregressive and retrieval inference modes.

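
The re-weighting and averaging steps described above can be illustrated with a small sketch. It is not the authors' implementation: the neighbor-count weighting with a distance threshold `theta`, the pseudocount, the mixing coefficient `alpha`, and the uniform stand-in for the autoregressive model are all assumptions chosen for illustration.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def sequence_weights(msa, theta=0.2):
    """Weight each aligned sequence by 1 / (number of sequences within
    normalized Hamming distance theta of it, itself included).
    The threshold value is an illustrative assumption."""
    weights = []
    for s in msa:
        neighbors = sum(
            1 for t in msa
            if sum(a != b for a, b in zip(s, t)) / len(s) < theta
        )
        weights.append(1.0 / neighbors)
    return weights


def retrieval_log_probs(msa, weights, pseudocount=1e-2):
    """Per-column log-probabilities over ALPHABET from weighted counts."""
    columns = []
    for i in range(len(msa[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq, w in zip(msa, weights):
            if seq[i] in counts:  # ignore gap characters
                counts[seq[i]] += w
        total = sum(counts.values())
        columns.append({aa: math.log(c / total) for aa, c in counts.items()})
    return columns


def combined_log_prob(log_p_auto, log_p_retrieval, alpha=0.6):
    """Weighted arithmetic average of the two log-probabilities.
    alpha is an assumed mixing coefficient."""
    return (1.0 - alpha) * log_p_auto + alpha * log_p_retrieval


def score_sequence(seq, retrieval_cols, auto_log_prob, alpha=0.6):
    """Sum the combined per-position log-probabilities of a sequence."""
    return sum(
        combined_log_prob(auto_log_prob(seq, i), retrieval_cols[i][aa], alpha)
        for i, aa in enumerate(seq)
    )


def uniform_auto(seq, i):
    """Hypothetical stand-in for a trained autoregressive model."""
    return math.log(1.0 / len(ALPHABET))


msa = ["ACDE", "ACDE", "ACDF", "GCKE"]
weights = sequence_weights(msa)  # the two identical sequences share weight
retrieval_cols = retrieval_log_probs(msa, weights)
score_common = score_sequence("ACDE", retrieval_cols, uniform_auto)
score_rare = score_sequence("WWWW", retrieval_cols, uniform_auto)
print(weights, score_common > score_rare)
```

In this toy alignment the duplicated sequence is down-weighted so that it does not dominate the column statistics, and a sequence matching the alignment scores higher than one made of residues never observed in it.
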
Overall, Tranception performs markedly better than existing approaches on multiple mutants, is robust to shallow alignments, and can score indels, which significantly extends its scope. To enable more rigorous testing across a broader range of protein families, the authors introduce ProteinGym, an extensive set of multiplexed assays of variant effects that substantially increases both the number and diversity of assays compared to existing benchmarks.
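
One way to see why an autoregressive model can score indels is that it assigns a likelihood to a sequence of any length, so no fixed alignment coordinates are needed. The sketch below assumes, as is common for likelihood-based fitness predictors, that a variant is scored by its log-likelihood relative to the wild type; `toy_next_token_log_prob` is a hypothetical stand-in for a trained model, and the sequences are made up.

```python
import math

ALPHABET_SIZE = 20


def toy_next_token_log_prob(prefix, token):
    """Hypothetical stand-in for a trained autoregressive model.
    First position is uniform; afterwards, repeating the previous
    residue has probability 0.24 and each other residue 0.04."""
    if not prefix:
        return math.log(1.0 / ALPHABET_SIZE)
    if prefix[-1] == token:
        return math.log(0.24)
    return math.log(0.76 / (ALPHABET_SIZE - 1))


def sequence_log_likelihood(seq, next_token_log_prob):
    """Chain rule: sum of log P(x_i | x_<i) over all positions."""
    return sum(next_token_log_prob(seq[:i], seq[i]) for i in range(len(seq)))


def fitness_score(mutant, wild_type, next_token_log_prob):
    """Log-likelihood ratio of the mutant against the wild type."""
    return (sequence_log_likelihood(mutant, next_token_log_prob)
            - sequence_log_likelihood(wild_type, next_token_log_prob))


wild_type = "MKTAYIA"
variants = {
    "substitution": "MKTAYIV",
    "insertion": "MKTAAYIA",  # one residue longer than the wild type
    "deletion": "MKAYIA",     # one residue shorter than the wild type
}
scores = {
    name: fitness_score(seq, wild_type, toy_next_token_log_prob)
    for name, seq in variants.items()
}
for name, value in scores.items():
    print(name, round(value, 3))
```

All three variants are scored by the same function even though two of them differ in length from the wild type.
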
Created on 21 Feb. 2024
