Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

AI-generated keywords: Jet-Nemotron, Post Neural Architecture Search, attention blocks, hyperparameter search, throughput evaluation

AI-generated Key Points

  • Introduction of Jet-Nemotron, a new family of hybrid-architecture language models that match or exceed the accuracy of leading full-attention models while significantly improving generation throughput
  • Use of Post Neural Architecture Search (PostNAS), a neural architecture exploration pipeline, to make model design more efficient
  • Utilization of a pre-trained full-attention model with frozen MLP weights for exploring different attention block designs
  • Four key components of the pipeline: learning optimal placement and elimination of full-attention layers, selecting linear attention blocks, designing new attention blocks, and conducting hardware-aware hyperparameter search
  • Fine-tuning models through rigorous experimentation and training stages on various data sources from math and coding domains
  • Employment of stringent evaluation protocols with 4-shot or 5-shot evaluations for different tasks to ensure robust comparisons
  • Outperformance of state-of-the-art full-attention models, linear attention models, and hybrid models in benchmark settings such as MMLU(-Pro), mathematical reasoning, commonsense reasoning, retrieval tasks, coding challenges, and long-context tasks
  • Thorough throughput evaluations on a high-performance DGX H100 server setup by optimizing batch sizes through chunk-prefilling techniques within GPU memory constraints

Authors: Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai

Tech Report
License: CC BY-NC-SA 4.0

Abstract: We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
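
The first pipeline component can be illustrated with a minimal sketch. This is not the authors' implementation: the parameter naming scheme (`layers.{i}.attn`, `layers.{i}.mlp`), the per-layer importance scores, and both helper functions are hypothetical, and a simple top-k selection stands in for the learned placement that the abstract describes. The sketch only shows the general shape of the idea: MLP parameters stay frozen, and each layer is assigned either a full-attention or a linear-attention block.

```python
from typing import Dict, List


def trainable_mask(param_names: List[str]) -> Dict[str, bool]:
    """Mark every parameter as trainable unless it belongs to an MLP block.

    Assumes a hypothetical naming scheme in which MLP parameters contain
    the substring ".mlp."; real checkpoints may use different names.
    """
    return {name: ".mlp." not in name for name in param_names}


def place_full_attention(scores: List[float], budget: int) -> List[str]:
    """Keep full attention in the `budget` highest-scoring layers.

    `scores` are hypothetical per-layer importance values (higher means
    the layer is assumed to benefit more from full attention). Every
    other layer is assigned a linear attention block.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:budget])
    return ["full" if i in keep else "linear" for i in range(len(scores))]


# Toy two-layer parameter list and a toy four-layer score vector.
names = [
    "layers.0.attn.q_proj.weight",
    "layers.0.mlp.up_proj.weight",
    "layers.1.attn.q_proj.weight",
    "layers.1.mlp.up_proj.weight",
]
mask = trainable_mask(names)
layout = place_full_attention(scores=[0.1, 0.9, 0.3, 0.7], budget=2)
print(mask)
print(layout)
```

With these toy scores, the two highest-scoring layers keep full attention and the rest become linear, while only the attention parameters remain trainable.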

Submitted to arXiv on 21 Aug. 2025

Results of the summarizing process for the arXiv paper: 2508.15884v1

The study introduces Jet-Nemotron, a new family of hybrid-architecture language models that match or exceed the accuracy of leading full-attention models while significantly improving generation throughput. This is made possible by Post Neural Architecture Search (PostNAS), a neural architecture exploration pipeline that makes model design more efficient. Unlike prior approaches, PostNAS starts from a pre-trained full-attention model and freezes its MLP weights so that different attention block designs can be explored efficiently. The research team outlines four key components of the pipeline: learning optimal placement and elimination of full-attention layers, selecting linear attention blocks, designing new attention blocks, and conducting hardware-aware hyperparameter search. The models are fine-tuned through multiple training stages on data sources from the math and coding domains. The researchers use 4-shot or 5-shot evaluation protocols, depending on the task, to keep comparisons consistent. Results show that Jet-Nemotron matches or outperforms state-of-the-art full-attention models, linear attention models, and hybrid models on benchmarks covering MMLU(-Pro), mathematical reasoning, commonsense reasoning, retrieval, coding, and long-context tasks. To demonstrate the efficiency of Jet-Nemotron, the research team evaluates throughput on a DGX H100 server, using chunk-prefilling and choosing batch sizes that fit within GPU memory constraints. The abstract reports up to 53.6x generation throughput speedup and 6.1x prefilling speedup for Jet-Nemotron-2B.
Created on 26 Aug. 2025
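
To make the link between attention layout, batch size, and GPU memory concrete, here is a rough sketch of the arithmetic involved. It rests on loud assumptions: all model dimensions and the memory budget are illustrative values that do not come from the paper, the constant-size state of linear attention layers is ignored, and so is the memory taken by weights and activations. The function names are hypothetical. It shows only that the key-value cache grows with the number of full-attention layers and with sequence length, so fewer full-attention layers leave room for a larger batch, and that a long prompt can be prefilled in fixed-size chunks.

```python
from typing import List, Tuple


def kv_cache_bytes(full_attn_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Per-sequence KV cache size: one key and one value tensor per full-attention layer."""
    return 2 * full_attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


def max_batch_size(memory_budget_bytes: int, per_sequence_bytes: int) -> int:
    """Largest number of sequences whose caches fit in the budget."""
    return memory_budget_bytes // per_sequence_bytes


def prefill_chunks(prompt_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a prompt into [start, end) token ranges of at most chunk_size tokens."""
    return [(start, min(start + chunk_size, prompt_len))
            for start in range(0, prompt_len, chunk_size)]


# Illustrative values only (assumptions, not figures from the paper).
budget = 40 * 2**30  # 40 GiB reserved for KV cache
full_model = kv_cache_bytes(full_attn_layers=28, kv_heads=2, head_dim=128, seq_len=65536)
hybrid_model = kv_cache_bytes(full_attn_layers=2, kv_heads=2, head_dim=128, seq_len=65536)

full_batch = max_batch_size(budget, full_model)
hybrid_batch = max_batch_size(budget, hybrid_model)
chunks = prefill_chunks(prompt_len=5000, chunk_size=2048)
print(full_batch, hybrid_batch, chunks)
```

Under these assumed dimensions the hybrid layout fits a much larger batch in the same cache budget, which is one plausible reason a hybrid model can sustain higher decoding throughput at long context lengths.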
