Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

AI-generated keywords: Jet-Nemotron, Post Neural Architecture Search, attention blocks, hyperparameter search, throughput evaluation

AI-generated Key Points

Introduction of a novel family of hybrid-architecture language models that outperform leading full-attention models in accuracy and generation throughput
Leveraging an advanced neural architecture exploration pipeline to simplify model design process
Utilization of a pre-trained full-attention model with frozen MLP weights for exploring different attention block designs
Four key components of the pipeline: learning optimal placement and elimination of full-attention layers, selecting linear attention blocks, designing new attention blocks, and conducting hardware-aware hyperparameter search
Fine-tuning models through rigorous experimentation and training stages on various data sources from math and coding domains
Employment of stringent evaluation protocols with 4-shot or 5-shot evaluations for different tasks to ensure robust comparisons
Outperformance of state-of-the-art full-attention models, linear attention models, and hybrid models in benchmark settings such as MMLU(-Pro), mathematical reasoning, commonsense reasoning, retrieval tasks, coding challenges, and long-context tasks
Thorough throughput evaluations on a high-performance DGX H100 server setup by optimizing batch sizes through chunk-prefilling techniques within GPU memory constraints

Authors: Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai

arXiv: 2508.15884v1 (cs.CL)

Tech Report

License: CC BY-NC-SA 4.0

Abstract: We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.

Submitted to arXiv on 21 Aug. 2025

Results of the summarizing process for the arXiv paper: 2508.15884v1

Comprehensive Summary

The study introduces Jet-Nemotron, a new family of hybrid-architecture language models that matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. This is made possible by Post Neural Architecture Search (PostNAS), a neural architecture exploration pipeline that makes model design more efficient. Unlike prior approaches, PostNAS starts from a pre-trained full-attention model and freezes its MLP weights, so that different attention block designs can be explored efficiently. The research team outlines four key components of the pipeline: learning the optimal placement and elimination of full-attention layers, selecting linear attention blocks, designing new attention blocks, and conducting hardware-aware hyperparameter search. The models are trained in stages on several data sources, including math and coding data. The researchers use 4-shot or 5-shot evaluation, depending on the task, to keep comparisons consistent. Results show that Jet-Nemotron matches or outperforms state-of-the-art full-attention, linear attention, and hybrid models on benchmarks covering MMLU(-Pro), mathematical reasoning, commonsense reasoning, retrieval, coding, and long-context tasks. To measure efficiency, the research team runs throughput evaluations on a DGX H100 server, using chunk-prefilling and tuning batch sizes within GPU memory constraints. According to the abstract, Jet-Nemotron-2B delivers up to 53.6x generation throughput speedup and 6.1x prefilling speedup.

Layman's Summary

Summary
- A new type of language model has been created that is as accurate as, or more accurate than, other models while generating text much faster.
- The researchers used a special method to make designing these models easier.
- They started with an existing model and changed how it pays attention to different parts of the text.
- The process had four main steps: figuring out where to keep full attention, choosing among different types of attention, creating new ways to pay attention, and finding the best settings for the model.
- The models were tested and improved using different kinds of data, such as math problems.

Definitions
- Language Models: Computer programs that can understand and generate human language.
- Attention: A mechanism in machine learning that helps a model focus on specific parts of its input data.
- Neural Architecture: The structure or design of artificial neural networks used in machine learning.
- Pre-trained Model: A model that has already been trained on a large amount of data before being further customized for specific tasks.
- Hyperparameter Search: The process of finding the best settings for the parameters that control a model's design or training.

Blog article

Introduction

The field of natural language processing (NLP) has seen significant advancements in recent years, particularly with the development of transformer-based models such as BERT and GPT-3. These models have achieved impressive results in various NLP tasks, but they also come with high computational costs. To address this, the authors introduce Jet-Nemotron, a new family of hybrid-architecture language models that matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput.

The Pipeline

The key to the success of Jet-Nemotron lies in Post Neural Architecture Search (PostNAS), a neural architecture exploration pipeline that makes model design more efficient. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, which allows different attention block designs to be explored efficiently. The research team outlines four key components of the pipeline:

1. Learning Optimal Placement and Elimination of Full-Attention Layers

One major challenge in designing hybrid-architecture models is deciding which layers should keep full attention and which full-attention layers can be eliminated. To address this issue, PostNAS learns the placement of full-attention layers rather than fixing their positions by hand.
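
As a rough illustration of the placement decision, the sketch below assumes that every layer has already been assigned a scalar importance score and that a fixed number of full-attention layers may be kept. The scores, the `keep` budget, and the function name `place_full_attention` are hypothetical and are not taken from the paper.

```python
from typing import List


def place_full_attention(scores: List[float], keep: int) -> List[str]:
    """Label the `keep` highest-scoring layers 'full' and all others 'linear'."""
    if not 0 <= keep <= len(scores):
        raise ValueError("keep must be between 0 and the number of layers")
    # Rank layer indices by score, highest first; ties go to the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    kept = set(ranked[:keep])
    return ["full" if i in kept else "linear" for i in range(len(scores))]


# Hypothetical scores for a 6-layer model (illustrative values only).
plan = place_full_attention([0.1, 0.9, 0.2, 0.05, 0.7, 0.3], keep=2)
print(plan)
```

How the scores would be obtained is the hard part and is not shown here; this sketch only covers the final selection once some ranking of layers exists.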

2. Selecting Linear Attention Blocks

Linear attention blocks make long sequences cheaper to process: their cost grows linearly with sequence length, whereas the cost of full attention grows quadratically. Many linear attention designs exist, however, so choosing the right one is not straightforward. To overcome this challenge, PostNAS includes a selection step that picks a linear attention block based on its impact on performance.
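
For readers unfamiliar with linear attention, the following is a minimal NumPy sketch of the general idea, not of any specific block evaluated in the paper. It assumes a non-causal setting and the common `elu(x) + 1` feature map; both are illustrative choices. The point it shows is that reordering the matrix products avoids building the n-by-n attention matrix while producing the same output.

```python
import numpy as np


def feature_map(x: np.ndarray) -> np.ndarray:
    # elu(x) + 1, which is strictly positive: x + 1 for x > 0, exp(x) otherwise.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def attention_quadratic(q, k, v):
    # Reference form: materializes an (n, n) score matrix.
    a = feature_map(q) @ feature_map(k).T
    return (a @ v) / a.sum(axis=1, keepdims=True)


def attention_linear(q, k, v):
    # Same result, but the keys and values are first summarized into a
    # (d_k, d_v) matrix whose size does not depend on sequence length.
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v              # (d_k, d_v)
    z = fk.sum(axis=0)         # (d_k,) normalizer
    return (fq @ kv) / (fq @ z)[:, None]


rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 4))
out_ref = attention_quadratic(q, k, v)
out_lin = attention_linear(q, k, v)
print(np.allclose(out_ref, out_lin))
```

A language model would use a causal variant that updates the `kv` summary token by token; that is why the per-sequence memory stays constant during generation instead of growing with context length.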

3. Designing New Attention Blocks

In addition to selecting among existing linear attention blocks, PostNAS also designs new attention blocks. The resulting models are then trained on data that includes math and coding sources.

4. Conducting Hardware-Aware Hyperparameter Search

To ensure good performance on the target hardware, PostNAS conducts a hardware-aware hyperparameter search to find the best combination of hyperparameters. This allows the model to achieve high accuracy while also being efficient in terms of computation and memory usage.
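
One way to picture a hardware-aware search is as a constrained grid search: enumerate candidate attention hyperparameters, discard those that exceed a memory budget, and rank the rest. The sketch below is a toy version under loud assumptions: the budget is the linear attention state size (one `d_k` by `d_v` matrix per head), and projection parameter count stands in for accuracy. A real search would train candidates and measure throughput on the device; all names and numbers here are hypothetical.

```python
from itertools import product


def state_size(heads: int, d_k: int, d_v: int) -> int:
    # Linear attention keeps one d_k x d_v state matrix per head.
    return heads * d_k * d_v


def projection_params(d_model: int, heads: int, d_k: int, d_v: int) -> int:
    # Query and key projections: d_model x (heads * d_k) each.
    # Value and output projections: d_model x (heads * d_v) each.
    return 2 * d_model * heads * d_k + 2 * d_model * heads * d_v


def search(d_model, budget, heads_opts, dk_opts, dv_opts):
    """Return (params, heads, d_k, d_v) with the most parameters within budget."""
    best = None
    for h, dk, dv in product(heads_opts, dk_opts, dv_opts):
        if state_size(h, dk, dv) > budget:
            continue  # violates the memory constraint
        cand = (projection_params(d_model, h, dk, dv), h, dk, dv)
        # Tuple comparison: parameter count first, then an arbitrary tie-break.
        if best is None or cand > best:
            best = cand
    return best


best_config = search(1024, 8 * 64 * 128, [4, 8], [64, 128], [64, 128])
print(best_config)
```

The design choice worth noting is that the constraint is expressed in terms of what limits throughput (state or cache memory) rather than parameter count alone, so two configurations with the same memory footprint can differ in capacity.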

Evaluation and Results

The researchers trained the models in stages on several data sources, including math and coding data. They evaluated with 4-shot or 5-shot protocols, depending on the task, to keep comparisons consistent. Results show that Jet-Nemotron matches or outperforms state-of-the-art full-attention, linear attention, and hybrid models on benchmarks covering MMLU(-Pro), mathematical reasoning, commonsense reasoning, retrieval, coding, and long-context tasks. This demonstrates the effectiveness of the PostNAS pipeline in designing high-performing hybrid-architecture language models. To measure the efficiency of Jet-Nemotron, the research team ran throughput evaluations on a DGX H100 server, using chunk-prefilling and tuning batch sizes within GPU memory constraints. According to the abstract, Jet-Nemotron-2B delivers up to 53.6x generation throughput speedup and 6.1x prefilling speedup.
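
The link between GPU memory and batch size can be made concrete with a small calculation. The sketch below estimates the key-value cache of the full-attention layers for one sequence and derives the largest batch that fits in a given amount of free memory. The layer counts, head sizes, context length, and memory figure are hypothetical and are not the paper's configurations; the small constant-size state of linear attention layers is ignored.

```python
def kv_cache_bytes(full_attn_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Factor of 2 covers keys and values; 2 bytes per element assumes fp16/bf16.
    return 2 * full_attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem


def max_batch_size(free_bytes: int, per_seq_bytes: int) -> int:
    # Largest whole number of sequences whose caches fit in free memory.
    return free_bytes // per_seq_bytes


FREE = 64 * 1024**3          # hypothetical: 64 GiB left after loading weights
SEQ_LEN = 65536              # hypothetical context length

# Hypothetical model with 28 full-attention layers.
full_per_seq = kv_cache_bytes(28, kv_heads=2, head_dim=128, seq_len=SEQ_LEN)
# Hypothetical hybrid that keeps only 2 full-attention layers.
hybrid_per_seq = kv_cache_bytes(2, kv_heads=2, head_dim=128, seq_len=SEQ_LEN)

full_batch = max_batch_size(FREE, full_per_seq)
hybrid_batch = max_batch_size(FREE, hybrid_per_seq)
print(full_batch, hybrid_batch)
```

Under these assumed numbers, keeping fewer full-attention layers shrinks the per-sequence cache, which lets a much larger batch fit in the same memory. That is one mechanism by which a hybrid design can raise decoding throughput.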

Conclusion

In conclusion, Jet-Nemotron is a new family of hybrid-architecture language models that matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Its success lies in PostNAS, a neural architecture exploration pipeline that starts from a pre-trained full-attention model and searches efficiently over the placement of full-attention layers, the choice of linear attention block, the design of new attention blocks, and hardware-aware hyperparameters. The results achieved by Jet-Nemotron across these benchmarks demonstrate its potential for advancing the field of natural language processing.

Created on 26 Aug. 2025
