Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval

AI-generated keywords: Protein sequences, Deep generative models, Tranception, UniRef, ProteinGym

AI-generated Key Points

Accurate modeling of protein sequence fitness landscapes is crucial for applications such as quantifying the disease impact of human variants, predicting viral immune-escape mutations, and designing biotherapeutic proteins
Deep generative models trained on multiple sequence alignments have been the most successful approaches to these tasks so far, but they depend on sufficiently deep and diverse alignments
Large language models trained on non-aligned protein sequences show promise in bridging the performance gap
Tranception, a novel transformer architecture, combines autoregressive predictions with retrieval of homologous sequences at inference for state-of-the-art fitness prediction performance
The final Tranception model has 700M parameters and is trained on approximately 250 million sequences from UniRef, a large-scale protein sequence database
Sequences in the multiple sequence alignment are re-weighted to address biases in protein databases
ProteinGym is introduced as an extensive set of multiplexed assays of variant effects to enable more rigorous testing across a wider range of protein families

Authors: Pascal Notin, Mafalda Dias, Jonathan Frazer, Javier Marchena-Hurtado, Aidan Gomez, Debora S. Marks, Yarin Gal

arXiv: 2205.13760v1 (cs.LG)

ICML 2022

License: CC BY 4.0

Abstract: The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is however contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers significant gain of scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym -- an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.

Submitted to arXiv on 27 May 2022

Results of the summarizing process for the arXiv paper: 2205.13760v1

Accurate modeling of protein sequence fitness landscapes is crucial for applications such as assessing the impact of human variants on disease susceptibility and predicting immune-escape mutations in viruses. Deep generative models trained on multiple sequence alignments (MSAs) have been the most successful approaches to these tasks, but their performance depends heavily on the availability of sufficiently deep and diverse alignments for training. Large language models trained on vast amounts of non-aligned protein sequences from diverse families address this limitation, offer a broader scope of application, and show promise in bridging the performance gap.

Tranception is a novel transformer architecture that combines autoregressive predictions with retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. It is trained on UniRef, a large-scale protein sequence database, and thorough ablations were conducted during its development. Unlike what has previously been observed with masked-language model architectures, maintaining the granularity available in the dataset was found to benefit downstream task performance. Extensive data processing and augmentation were also applied to refine the training dataset. The final model, with 700M parameters, is trained on UniRef100 after preprocessing, which yields a training dataset of approximately 250 million sequences.

Sequences in the MSA are re-weighted using a specific scheme to address biases in protein databases that arise from human and evolutionary sampling. The log-likelihood of a protein sequence is then estimated as a weighted arithmetic average of the log-likelihoods obtained from the autoregressive and retrieval inference modes.

Overall, Tranception performs markedly better on multiple mutants, is robust to shallow alignments, and can score indels, which widens its scope relative to existing approaches. To enable more rigorous testing across a wider range of protein families, ProteinGym is introduced as an extensive set of multiplexed assays of variant effects that significantly increases both the number and diversity of assays compared to current benchmarks.
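The weighted arithmetic average of log-likelihoods mentioned above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the per-position inputs, and the mixing weight `alpha` are assumptions introduced here, and the summary does not state what value of the weight is used.

```python
import math


def combine_log_probs(log_p_autoregressive, log_p_retrieval, alpha):
    """Weighted arithmetic average of two log-probabilities.

    alpha is a hypothetical mixing weight in [0, 1]: alpha = 0 uses only the
    autoregressive mode, alpha = 1 uses only the retrieval mode.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * log_p_autoregressive + alpha * log_p_retrieval


def sequence_log_likelihood(ar_log_probs, retrieval_log_probs, alpha):
    """Sum the combined per-position log-probabilities over a sequence.

    Both inputs are assumed to hold one log-probability per position,
    for the residue actually observed at that position.
    """
    if len(ar_log_probs) != len(retrieval_log_probs):
        raise ValueError("both modes must score the same number of positions")
    return sum(
        combine_log_probs(a, r, alpha)
        for a, r in zip(ar_log_probs, retrieval_log_probs)
    )


if __name__ == "__main__":
    # Toy per-position probabilities for a three-residue sequence.
    ar = [math.log(0.5), math.log(0.2), math.log(0.9)]
    ret = [math.log(0.25), math.log(0.4), math.log(0.8)]
    print(sequence_log_likelihood(ar, ret, alpha=0.5))
```

Averaging in log space is equivalent to a weighted geometric mean of the two probabilities, so a residue that either mode considers very unlikely pulls the combined score down sharply.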

Summary

1. Understanding how proteins work is very important for many things.
2. Some computer programs can learn from lots of protein sequences and do a good job.
3. Other programs use different methods to help improve how well they work.
4. A new type of program called Tranception is really good at predicting how well proteins will function.
5. Scientists have created ProteinGym to test different protein changes more thoroughly.

Definitions

- Protein: A type of molecule that helps our bodies work properly.
- Sequence: The order in which things are arranged or written down.
- Fitness landscapes: Describes how well a protein sequence works in different situations.
- Generative models: Computer programs that can create new data based on what they've learned.
- Autoregressive predictions: Making guesses about future data points based on previous ones.
- Homologous sequences: Proteins that are similar in structure and function due to shared ancestry.
- Database: A collection of information stored on a computer for easy access and retrieval.
- Parameters: Values used by computer programs to make decisions or calculations effectively.

Proteins are essential molecules that play crucial roles in biological processes such as catalyzing chemical reactions, transporting molecules, and providing structural support. The sequence of amino acids in a protein determines its structure and function, so it is key to understanding the molecular basis of diseases and to predicting the impact of sequence variants on human health.

In recent years, there has been significant progress in modeling protein sequence fitness landscapes with deep generative models trained on multiple sequence alignments (MSAs). These models have shown great promise for tasks such as assessing the impact of human variants on disease susceptibility and predicting immune-escape mutations in viruses. However, their performance relies heavily on the availability of sufficiently deep and diverse alignments for training. To overcome this limitation, researchers have developed large language models trained on vast amounts of non-aligned protein sequences from diverse families. These models offer a broader scope of application and show promise in bridging the performance gap with alignment-based approaches.

One such advancement is Tranception, a novel transformer architecture designed specifically for protein sequence modeling. Tranception combines autoregressive predictions with retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Because retrieval is applied only at inference, a single pretrained model can be reused across protein families instead of training a new alignment-based model for each one. The model is trained on UniRef, a large-scale protein sequence database; after preprocessing, the training dataset contains approximately 250 million sequences. Thorough ablations were conducted during development to optimize the architecture. Unlike previous observations with masked-language model architectures, maintaining the granularity available in the dataset was found to benefit downstream task performance. Extensive data processing and augmentation were also applied to refine the training dataset.

Sequences within the MSA are re-weighted to address biases caused by human and evolutionary sampling. The log-likelihood of a protein sequence is then estimated as a weighted arithmetic average of the log-likelihoods obtained from the autoregressive and retrieval inference modes.

The final Tranception model, with 700M parameters, demonstrates superior performance on multiple mutants and robustness to shallow alignments. It can also score indels (insertions and deletions), which extends its scope beyond that of existing approaches. These features make it a valuable tool for predicting the effects of genetic variations across a wide range of protein families.

To enable more rigorous testing across diverse protein families, the researchers introduced ProteinGym, an extensive set of multiplexed assays of variant effects. This benchmark significantly increases both the number and diversity of assays compared to current benchmarks, allowing for more comprehensive evaluation of fitness prediction models.

In conclusion, accurate modeling of protein sequence fitness landscapes is crucial for understanding disease susceptibility and predicting mutations in viruses. Tranception advances this field by combining a large language model trained on non-aligned sequences with retrieval of homologous sequences at inference, and ProteinGym provides a comprehensive benchmark for evaluating such models across diverse protein families.
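The text above does not spell out which re-weighting scheme is applied to the MSA. As a hedged illustration only, the sketch below uses one scheme that is common in the MSA modeling literature: each sequence is weighted by the inverse of the number of sequences that lie within a normalized Hamming distance `theta` of it, so that heavily sampled clusters of near-identical sequences do not dominate. The function names and the value `theta = 0.2` are assumptions made for this example, not details taken from the paper.

```python
def hamming_fraction(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        raise ValueError("sequences must be non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def sequence_weights(msa, theta=0.2):
    """Down-weight sequences that have many near-duplicates in the MSA.

    Each weight is 1 / (number of sequences within distance theta),
    where the count includes the sequence itself. theta is a hypothetical
    threshold chosen for illustration and must be positive.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    weights = []
    for seq in msa:
        neighbours = sum(1 for other in msa if hamming_fraction(seq, other) < theta)
        weights.append(1.0 / neighbours)
    return weights


if __name__ == "__main__":
    # Two identical sequences share their weight; the distinct one keeps full weight.
    toy_msa = ["ACDE", "ACDE", "WYWY"]
    print(sequence_weights(toy_msa))
```

The sum of the weights can be read as an effective number of sequences, which is one way to quantify how shallow an alignment is.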

Created on 21 Feb. 2024
