Nova$^+$: Generative Language Models for Binaries

AI-generated keywords: Nova$^+$, Generative Language Models, Binary Code, Pre-training Strategies, Downstream Tasks

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Challenges faced by large language models (LLMs) in modeling and learning binary code:
      • Handling hex-decimal values
      • Complex global dependencies
      • Compiler optimization levels
  • Introduction of Nova and Nova$^+$:
      • Nova pre-trained on binary corpora
      • Initially pre-trained using standard language modeling tasks
      • Demonstrates superior performance on BCSD, BCT, and BCR compared to GPT-3.5 and other techniques
      • Nova$^+$ developed as an enhancement to Nova
      • Incorporates two new pre-training tasks: optimization generation and optimization level prediction
  • Performance comparison:
      • Nova outperforms existing methods on five benchmarks across three downstream tasks
      • Nova$^+$ outperforms all other methods across all three downstream tasks on five benchmarks
  • Significance of new pre-training strategies:
      • Specialized training for LLMs when dealing with binary code is crucial
      • Innovative approaches like Nova$^+$ significantly improve LLM capabilities in various binary-related tasks

Authors: Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, Xiangyu Zhang

License: CC BY-NC-ND 4.0

Abstract: Generative large language models (LLMs) pre-trained on code have shown impressive effectiveness in code generation, program repair, and document analysis. However, existing generative LLMs focus on source code and are not specialized for binaries. There are three main challenges for LLMs to model and learn binary code: hex-decimal values, complex global dependencies, and compiler optimization levels. To bring the benefit of LLMs to the binary domain, we develop Nova and Nova$^+$, which are LLMs pre-trained on binary corpora. Nova is pre-trained with the standard language modeling task, showing significantly better capability on five benchmarks for three downstream tasks: binary code similarity detection (BCSD), binary code translation (BCT), and binary code recovery (BCR), over GPT-3.5 and other existing techniques. We build Nova$^+$ to further boost Nova using two new pre-training tasks, i.e., optimization generation and optimization level prediction, which are designed to learn binary optimization and align equivalent binaries. Nova$^+$ shows overall the best performance for all three downstream tasks on five benchmarks, demonstrating the contributions of the new pre-training tasks.

Submitted to arXiv on 22 Nov. 2023


Results of the summarizing process for the arXiv paper: 2311.13721v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

In their paper titled "Nova$^+$: Generative Language Models for Binaries," authors Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, and Xiangyu Zhang discuss the challenges that large language models (LLMs) face in modeling and learning binary code. Existing LLMs have proven effective in tasks such as code generation and program repair, but they focus primarily on source code. The main obstacles for LLMs on binary code are hex-decimal values, complex global dependencies, and compiler optimization levels. To address these challenges and bring the benefits of LLMs to the binary domain, the authors introduce Nova and Nova$^+$, LLMs pre-trained on binary corpora. Nova is pre-trained with the standard language modeling task and outperforms GPT-3.5 and other existing techniques on five benchmarks across three downstream tasks: binary code similarity detection (BCSD), binary code translation (BCT), and binary code recovery (BCR). Nova$^+$ extends Nova with two new pre-training tasks, optimization generation and optimization level prediction, which are designed to learn binary optimization and to align equivalent binaries. Nova$^+$ achieves the best overall performance on all three downstream tasks across the five benchmarks, which the authors attribute to the new pre-training tasks. Overall, the study underscores the importance of specialized training for LLMs that deal with binary code and shows how approaches like Nova$^+$ can improve their capabilities in binary-related tasks.
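
To make the BCSD task and the hex-decimal challenge more concrete, the following is a toy baseline only. It is not Nova, Nova$^+$, or any method from the paper, whose full text is not available here. The sketch assumes that two disassembled functions are given as plain text, replaces every hex-decimal literal with a placeholder so that differing offsets and constants do not dominate the comparison, and scores similarity as the cosine of the token-count vectors. The names `normalize`, `cosine`, and `similarity` are hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter

# Matches hex literals such as 0x14 in an already lowercased line.
HEX_RE = re.compile(r"\b0x[0-9a-f]+\b")
# Tokens kept for comparison: the placeholder, identifiers, decimal numbers.
TOKEN_RE = re.compile(r"<hex>|[a-z_][a-z0-9_]*|\d+")


def normalize(asm: str) -> list[str]:
    """Lowercase the assembly text, mask hex literals, and tokenize it."""
    tokens = []
    for line in asm.strip().splitlines():
        masked = HEX_RE.sub("<hex>", line.strip().lower())
        tokens.extend(TOKEN_RE.findall(masked))
    return tokens


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(asm_a: str, asm_b: str) -> float:
    """Toy BCSD score (assumption: bag-of-tokens, not a learned model)."""
    return cosine(Counter(normalize(asm_a)), Counter(normalize(asm_b)))


# Two hypothetical functions that differ only in their hex-decimal values.
func_a = "mov eax, [rbp-0x14]\nadd eax, 0x1\nret"
func_b = "mov eax, [rbp-0x24]\nadd eax, 0x2\nret"
# A hypothetical, structurally different function.
func_c = "push rbp\ncall foo\npop rbp\nret"

score_ab = similarity(func_a, func_b)
score_ac = similarity(func_a, func_c)
```

With hex literals masked, `func_a` and `func_b` produce identical token counts, so `score_ab` is higher than `score_ac`. A bag-of-tokens comparison like this ignores instruction order and would likely struggle when the same source is compiled at different optimization levels, which is one of the challenges the summary names as motivation for specialized pre-training.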
Created on 21 Nov. 2025

