Efficient Decoding Methods for Language Models on Encrypted Data

AI-generated keywords: Efficient Decoding Methods, Homomorphic Encryption, Privacy-Preserving Inference, Cutmax Algorithm, Nucleus Sampling

AI-generated Key Points

The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • Authors: Matan Avitan, Moran Baruch, Nir Drucker, Itamar Zimerman, Yoav Goldberg
  • Topic: Efficient Decoding Methods for Language Models on Encrypted Data
  • Privacy Concerns: Processing sensitive data with large language models (LLMs) on untrusted servers raises privacy concerns
  • Homomorphic Encryption (HE): Enables computation on encrypted data, and therefore secure inference
  • Cutmax Algorithm: Novel HE-friendly argmax algorithm that uses fewer ciphertext operations than prior methods
  • Nucleus Sampling Method: First HE-compatible nucleus (top-p) sampling method, built on cutmax, for efficient stochastic decoding with provable privacy guarantees
  • Benefits: Both techniques are polynomial, which supports efficient inference in privacy-preserving settings; they are also differentiable, which facilitates gradient-based sequence-level optimization
  • Theoretical Guarantees: Cutmax is proven to converge globally to a unique two-level fixed point
  • Latency Reductions: 24x-35x lower latency than baseline methods on realistic LLM outputs

Authors: Matan Avitan, Moran Baruch, Nir Drucker, Itamar Zimerman, Yoav Goldberg

Abstract: Large language models (LLMs) power modern AI applications, but processing sensitive data on untrusted servers raises privacy concerns. Homomorphic encryption (HE) enables computation on encrypted data for secure inference. However, neural text generation requires decoding methods like argmax and sampling, which are non-polynomial and thus computationally expensive under encryption, creating a significant performance bottleneck. We introduce cutmax, an HE-friendly argmax algorithm that reduces ciphertext operations compared to prior methods, enabling practical greedy decoding under encryption. We also propose the first HE-compatible nucleus (top-p) sampling method, leveraging cutmax for efficient stochastic decoding with provable privacy guarantees. Both techniques are polynomial, supporting efficient inference in privacy-preserving settings. Moreover, their differentiability facilitates gradient-based sequence-level optimization as a polynomial alternative to straight-through estimators. We further provide strong theoretical guarantees for cutmax, proving it converges globally to a unique two-level fixed point, independent of the input values beyond the identity of the maximizer, which explains its rapid convergence in just a few iterations. Evaluations on realistic LLM outputs show latency reductions of 24x-35x over baselines, advancing secure text generation.
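For readers unfamiliar with nucleus (top-p) sampling, the following is a minimal plaintext sketch of the standard procedure in Python. It is not the paper's HE-compatible method, whose details are not available from the metadata; the function names `softmax`, `nucleus_filter`, and `nucleus_sample` are hypothetical. The sort and the threshold comparison in `nucleus_filter` are examples of the comparison-based steps that are costly to express as polynomials under encryption.

```python
import math
import random


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose total
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:  # comparison: non-polynomial under HE
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def nucleus_sample(logits, top_p=0.9, rng=random):
    """Draw one token index from the top-p nucleus."""
    probs = nucleus_filter(softmax(logits), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With a very small `top_p`, the nucleus shrinks to the single most likely token, so `nucleus_sample` reduces to greedy (argmax) decoding.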

Submitted to arXiv on 10 Sep. 2025

Results of the summarizing process for the arXiv paper: 2509.08383v1

In their paper "Efficient Decoding Methods for Language Models on Encrypted Data," Matan Avitan, Moran Baruch, Nir Drucker, Itamar Zimerman, and Yoav Goldberg address the privacy concerns raised by processing sensitive data with large language models (LLMs) on untrusted servers. Homomorphic encryption (HE) enables computation on encrypted data and therefore secure inference, but text generation relies on decoding steps such as argmax and sampling, which are non-polynomial and thus expensive under encryption. The authors introduce cutmax, an HE-friendly argmax algorithm that uses fewer ciphertext operations than prior methods and makes greedy decoding under encryption practical. They also propose the first HE-compatible nucleus (top-p) sampling method, which builds on cutmax for efficient stochastic decoding with provable privacy guarantees. Both techniques are polynomial, which supports efficient inference in privacy-preserving settings, and differentiable, which allows gradient-based sequence-level optimization as a polynomial alternative to straight-through estimators. The authors prove that cutmax converges globally to a unique two-level fixed point that depends on the input only through the identity of the maximizer, which explains why it converges in just a few iterations. Evaluations on realistic LLM outputs show latency reductions of 24x-35x over baseline methods, advancing secure text generation.
Created on 10 Oct. 2025
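To give a feel for what a "polynomial" argmax can look like, the sketch below uses a generic square-and-renormalize iteration in plaintext Python. It is not cutmax, whose definition is not available from the metadata; the function name `sharpen_to_one_hot` and the iteration count are illustrative assumptions. The loop uses only multiplications, additions, and one reciprocal per step; under HE the reciprocal would itself have to be replaced by a polynomial approximation. The sketch assumes non-negative scores with a unique maximum.

```python
def sharpen_to_one_hot(scores, iterations=8):
    """Approximate a one-hot argmax indicator without comparisons.

    Assumes non-negative scores with a unique maximum. Each step
    squares every entry and renormalizes, so the ratio between any
    smaller entry and the largest one is squared at every iteration.
    This is a generic illustration, not the paper's cutmax algorithm.
    """
    total = sum(scores)
    v = [s / total for s in scores]
    for _ in range(iterations):
        squared = [x * x for x in v]
        total = sum(squared)
        v = [x / total for x in squared]  # reciprocal: approximated under HE
    return v


# Usage: the entry at the maximizer's position approaches 1.
indicator = sharpen_to_one_hot([0.1, 0.5, 0.4])
best = max(range(len(indicator)), key=lambda i: indicator[i])
```

The final `max` call is only there to read off the result in plaintext; an encrypted pipeline would instead use the near one-hot vector directly, for example to select the next token embedding with a dot product.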
