Accurate modeling of protein sequence fitness landscapes is crucial for applications such as assessing the impact of human variants on disease susceptibility and predicting immune-escape mutations in viruses. Deep generative models trained on multiple sequence alignments (MSAs) have been highly successful at these tasks, but their performance depends on the availability of comprehensive and diverse alignments for training. To overcome this limitation, large language models have been trained on vast amounts of non-aligned protein sequences from diverse families. These models apply to a broader range of proteins and show promise in closing the performance gap with alignment-based models. One such advance is Tranception, a novel transformer architecture that combines autoregressive predictions with retrieval of homologous sequences at inference time to achieve state-of-the-art fitness prediction performance. Tranception is trained on UniRef, a large-scale protein sequence database, and thorough ablations were conducted during its development to optimize its performance. Unlike previous observations with masked-language model architectures, maintaining the granularity available in the dataset was found to benefit downstream task performance. Extensive data processing and augmentations were also applied to refine the training dataset. The final model, with 700M parameters, is trained on UniRef100; after preprocessing, the training dataset comprises approximately 250 million sequences. To address biases in protein databases due to human and evolutionary sampling, sequences in the MSA are re-weighted. Additionally, the log likelihood of a protein sequence is estimated as a weighted arithmetic average of the log likelihoods from the autoregressive and retrieval inference modes.
Overall, Tranception outperforms existing approaches on multiple mutants, is robust to shallow alignments, and can score indels. To enable more rigorous testing across a wider range of protein families, ProteinGym is introduced: an extensive set of multiplexed assays of variant effects that substantially increases both the number and diversity of assays compared to existing benchmarks.
- Accurate modeling of protein sequence fitness landscapes is crucial for various applications
- Deep generative models trained on multiple sequence alignments have been successful but depend on comprehensive and diverse alignments
- Large language models trained on non-aligned protein sequences show promise in closing the performance gap
- Tranception, a novel transformer architecture, combines autoregressive predictions with retrieval of homologous sequences for state-of-the-art fitness prediction performance
- The final Tranception model has 700M parameters and is trained on approximately 250 million sequences from UniRef, a large-scale protein sequence database
- Sequences in the multiple sequence alignment are re-weighted to address biases in protein databases
- ProteinGym is introduced as an extensive set of multiplexed assays of variant effects to enable more rigorous testing across a wider range of protein families
Summary
1. Understanding how proteins work is very important for many things.
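2. Some computer programs learn from groups of related protein sequences that have been lined up with each other, and they do a good job.
3. Other programs learn from huge numbers of protein sequences that have not been lined up, so they can be used on more kinds of proteins.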
4. A new type of program called Tranception is really good at predicting how well proteins will function.
5. Scientists have created ProteinGym to test different protein changes more thoroughly.
Definitions
- Protein: A type of molecule that helps our bodies work properly.
- Sequence: The order in which things are arranged or written down.
- Fitness landscapes: Maps that describe how well each possible version of a protein sequence works.
- Generative models: Computer programs that can create new data based on what they've learned.
- Autoregressive predictions: Making guesses about future data points based on previous ones.
- Homologous sequences: Protein sequences that share a common ancestor and are therefore often similar in structure and function.
- Database: A collection of information stored on a computer for easy access and retrieval.
- Parameters: Values that a computer model learns during training and uses to make its predictions.
Proteins are essential molecules that play crucial roles in various biological processes, such as catalyzing chemical reactions, transporting molecules, and providing structural support. The sequence of amino acids in a protein determines its structure and function, so understanding it is key to explaining the molecular basis of diseases and predicting the impact of sequence variants on human health.
In recent years, there has been significant progress in accurately modeling protein sequence fitness landscapes using deep generative models trained on multiple sequence alignments (MSAs). These models have shown great promise in addressing challenges such as assessing the impact of human variants on disease susceptibility and predicting immune-escape mutations in viruses. However, their performance heavily relies on the availability of comprehensive and diverse alignments for effective training.
To overcome this limitation, researchers have developed large language models trained on vast amounts of non-aligned protein sequences from diverse families. These models apply to a broader range of proteins and show promise in closing the performance gap with alignment-based models. One such advance is Tranception, a novel transformer architecture specifically designed for protein sequence modeling.
Tranception combines autoregressive predictions with retrieval of homologous sequences at inference time to achieve state-of-the-art fitness prediction performance. Because the homologous sequences are used only at inference, a single trained model can be applied across protein families, whereas alignment-based models must be trained on an MSA for each family.
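The autoregressive part of this idea can be illustrated with a minimal Python sketch. It is not Tranception's implementation: `toy_next_residue_log_probs` is a hypothetical stand-in for a trained model, and scoring a variant by its log-likelihood difference to the wild type is assumed here as a common convention for likelihood-based fitness prediction.

```python
import math
from typing import Callable, Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Type of a model that maps a sequence prefix to log P(next residue | prefix).
NextResidueModel = Callable[[str], Dict[str, float]]


def toy_next_residue_log_probs(prefix: str) -> Dict[str, float]:
    """Hypothetical toy model: slightly favors repeating the previous residue.

    A real autoregressive protein language model would replace this function.
    """
    weights = {aa: 1.0 for aa in AMINO_ACIDS}
    if prefix:
        weights[prefix[-1]] += 1.0
    total = sum(weights.values())
    return {aa: math.log(w / total) for aa, w in weights.items()}


def sequence_log_likelihood(
    seq: str, model: NextResidueModel = toy_next_residue_log_probs
) -> float:
    """Sum of log P(x_i | x_1..x_{i-1}) over all positions of the sequence."""
    return sum(model(seq[:i])[seq[i]] for i in range(len(seq)))


def fitness_score(
    mutant: str, wild_type: str, model: NextResidueModel = toy_next_residue_log_probs
) -> float:
    """Assumed scoring convention: log P(mutant) - log P(wild type)."""
    return sequence_log_likelihood(mutant, model) - sequence_log_likelihood(
        wild_type, model
    )


if __name__ == "__main__":
    print(fitness_score("AACA", "AAAA"))  # substitution at the third position
    print(fitness_score("AAA", "AAAA"))   # deletion of one residue
```

Because the likelihood is a sum over however many residues the sequence has, the same function scores substitutions, insertions, and deletions without special handling.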
The model is trained on UniRef, a large-scale protein sequence database; the final training set, built from UniRef100, contains approximately 250 million sequences after preprocessing. Thorough ablations were conducted during development to optimize the architecture. Unlike previous observations with masked-language model architectures, maintaining the granularity available in the dataset was found to benefit downstream task performance.
Furthermore, extensive data processing and augmentations were performed to refine the training dataset used for developing Tranception. Separately, sequences in the MSA were re-weighted to address biases in protein databases caused by human and evolutionary sampling. Finally, the log likelihood of a protein sequence is estimated as a weighted arithmetic average of the log likelihoods from the autoregressive and retrieval inference modes.
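The re-weighting and averaging steps can be sketched in Python as follows. The text above does not specify the re-weighting scheme, so this sketch assumes a choice that is common for alignment-based models: each sequence is weighted by the inverse of the number of MSA sequences within a normalized Hamming distance `theta` of it. The retrieval distribution (weighted residue frequencies with a pseudocount), the function names, and the values of `theta`, `pseudocount`, and `alpha` are likewise illustrative assumptions, not the published settings.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_weights(msa: List[str], theta: float = 0.2) -> List[float]:
    """Assumed scheme: weight = 1 / (number of sequences closer than theta).

    Distance is the fraction of aligned positions that differ; every sequence
    counts itself as a neighbor, so weights lie in (0, 1].
    """
    weights = []
    for s in msa:
        neighbors = sum(
            1
            for t in msa
            if sum(a != b for a, b in zip(s, t)) / len(s) < theta
        )
        weights.append(1.0 / neighbors)
    return weights


def retrieval_log_probs(
    msa: List[str], weights: List[float], position: int, pseudocount: float = 1.0
) -> Dict[str, float]:
    """Weighted residue frequencies at one alignment column, as log probabilities."""
    counts = {aa: pseudocount for aa in AMINO_ACIDS}
    for seq, w in zip(msa, weights):
        if seq[position] in counts:  # skip gaps and unknown symbols
            counts[seq[position]] += w
    total = sum(counts.values())
    return {aa: math.log(c / total) for aa, c in counts.items()}


def combine_log_probs(
    autoregressive_lp: float, retrieval_lp: float, alpha: float = 0.5
) -> float:
    """Weighted arithmetic average of the two log likelihoods.

    alpha is the weight on the retrieval mode; 0.5 is an arbitrary placeholder.
    """
    return (1.0 - alpha) * autoregressive_lp + alpha * retrieval_lp


if __name__ == "__main__":
    msa = ["ACDE", "ACDE", "ACDF", "GHIK"]
    w = sequence_weights(msa)
    column = retrieval_log_probs(msa, w, position=0)
    # Pretend the autoregressive mode assigned log(0.1) to residue "A" here.
    print(combine_log_probs(math.log(0.1), column["A"]))
```

Down-weighting near-duplicate sequences keeps heavily sampled organisms from dominating the retrieval distribution, and `alpha` controls how much the family-specific evidence shifts the autoregressive prediction.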
The final Tranception model, with 700M parameters, outperforms existing approaches on multiple mutants and is robust to shallow alignments. It can also score indels (insertions and deletions). These features make it a valuable tool for predicting the effects of genetic variations in proteins across a wide range of protein families.
To facilitate more rigorous testing across diverse protein families, the researchers introduced ProteinGym, an extensive set of multiplexed assays of variant effects. It substantially increases both the number and diversity of assays compared to existing benchmarks, allowing for a more comprehensive evaluation of fitness prediction models.
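To make the evaluation step concrete, the sketch below compares model scores with assay measurements using Spearman rank correlation. The text above does not state which metric is used with ProteinGym; rank correlation is assumed here only because it is a common choice when a model's scores and an assay's readout are on different scales. All function names are illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(model_scores: Sequence[float], assay_values: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(model_scores), average_ranks(assay_values))


if __name__ == "__main__":
    predicted = [-2.1, -0.3, -1.2, 0.4]  # hypothetical model scores per variant
    measured = [0.10, 0.80, 0.35, 0.95]  # hypothetical assay readouts
    print(spearman(predicted, measured))
```

Under this assumption, a benchmark with more assays simply yields more such correlations, one per assay, which can then be summarized across protein families.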
In conclusion, accurate modeling of protein sequence fitness landscapes is crucial for various applications in understanding disease susceptibility and predicting mutations in viruses. Tranception offers a significant advancement in this field by leveraging large language models trained on non-aligned sequences and incorporating autoregressive predictions and retrieval during inference. Its superior performance on multiple mutants and robustness to shallow alignments make it a valuable tool for studying the effects of genetic variations in proteins. Additionally, ProteinGym provides a comprehensive benchmark for evaluating these models across diverse protein families, further advancing our understanding of protein sequence fitness landscapes.