Training LLMs with Fault Tolerant HSDP on 100,000 GPUs

AI-generated keywords: Fault Tolerance

AI-generated Key Points

  • Synchronous training in large-scale systems can be inefficient with O(100K) GPUs due to frequent failures and extended recovery times
  • Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP) is proposed as a training paradigm that tolerates failures without stopping the whole job
  • FT-HSDP uses data parallel replicas for fault tolerance, allowing the training process to continue even if one replica fails
  • Techniques like Fault Tolerant All Reduce (FTAR) protocol and non-blocking catch-up protocol are implemented in FT-HSDP for efficient recovery and seamless rejoining of replicas
  • FT-HSDP reduces stall time during failure recovery from 10 minutes to 3 minutes, increasing effective training time from 44% to 80%
  • The authors report that FT-HSDP's asynchronous recovery causes no meaningful degradation in model accuracy
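
To put the stall and effective-time figures in context, here is a small arithmetic sketch. It assumes a deliberately simple model that is not taken from the paper: training alternates between a productive interval and a recovery stall, so the effective fraction is productive / (productive + stall). The function names are hypothetical.

```python
def effective_fraction(productive_minutes: float, stall_minutes: float) -> float:
    """Fraction of wall-clock time spent training in one failure cycle."""
    return productive_minutes / (productive_minutes + stall_minutes)


def implied_productive_minutes(fraction: float, stall_minutes: float) -> float:
    """Invert effective_fraction: productive time per cycle for a given fraction."""
    return fraction * stall_minutes / (1.0 - fraction)


# Synchronous baseline from the text: 10-minute stall, 44% effective time.
baseline_productive = implied_productive_minutes(0.44, 10.0)

# Assumption: same productive interval, but only a 3-minute stall.
shorter_stall_fraction = effective_fraction(baseline_productive, 3.0)

print(f"{baseline_productive:.2f} min between stalls")
print(f"{shorter_stall_fraction:.1%} effective with a 3-minute stall")
```

Under this toy model, shortening the stall alone yields roughly 72% rather than the reported 80%, so the reported figure presumably depends on details that this summary does not describe.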

Authors: Omkar Salpekar, Rohan Varma, Kenny Yu, Vladimir Ivanov, Yang Wang, Ahmed Sharif, Min Si, Shawn Xu, Feng Tian, Shengbao Zheng, Tristan Rice, Ankush Garg, Shangfu Peng, Shreyas Siravara, Wenyin Fu, Rodrigo de Castro, Adithya Gangidi, Andrey Obraztsov, Sharan Narang, Sergey Edunov, Maxim Naumov, Chunqiang Tang, Mathew Oldham

License: CC BY 4.0

Abstract: Large-scale training systems typically use synchronous training, requiring all GPUs to be healthy simultaneously. In our experience training on O(100K) GPUs, synchronous training results in low efficiency due to frequent failures and long recovery time. To address this problem, we propose a novel training paradigm, Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP). FT-HSDP uses data parallel replicas as units of fault tolerance. When failures occur, only a single data-parallel replica containing the failed GPU or server is taken offline and restarted, while the other replicas continue training. To realize this idea at scale, FT-HSDP incorporates several techniques: 1) We introduce a Fault Tolerant All Reduce (FTAR) protocol for gradient exchange across data parallel replicas. FTAR relies on the CPU to drive the complex control logic for tasks like adding or removing participants dynamically, and relies on the GPU to perform data transfer for best performance. 2) We introduce a non-blocking catch-up protocol, allowing a recovering replica to join training with minimal stall. Compared with fully synchronous training at O(100K) GPUs, FT-HSDP can reduce the stall time due to failure recovery from 10 minutes to 3 minutes, increasing effective training time from 44% to 80%. We further demonstrate that FT-HSDP's asynchronous recovery does not bring any meaningful degradation to the accuracy of the resulting model.
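
The idea of exchanging gradients across a set of replicas that can change from step to step can be illustrated with a toy, single-process sketch. This is not the paper's FTAR protocol: there is no CPU/GPU split and no networking here, the function name `average_live_gradients` is hypothetical, and dividing by the number of live replicas is an assumption about how the result might be rescaled when a participant drops out.

```python
from typing import Dict, List, Set


def average_live_gradients(
    grads: Dict[str, List[float]], live: Set[str]
) -> List[float]:
    """Average gradients over the replicas currently marked live."""
    participants = [grads[name] for name in sorted(live) if name in grads]
    if not participants:
        raise RuntimeError("no live replicas to reduce over")
    length = len(participants[0])
    # Element-wise mean over whichever replicas are participating this step.
    return [
        sum(g[i] for g in participants) / len(participants)
        for i in range(length)
    ]


grads = {
    "replica0": [1.0, 2.0],
    "replica1": [6.0, 1.0],
    "replica2": [5.0, 6.0],
}
live = {"replica0", "replica1", "replica2"}
all_replicas = average_live_gradients(grads, live)

# Simulate replica1 going offline; the remaining replicas still reduce.
live.discard("replica1")
without_replica1 = average_live_gradients(grads, live)

print(all_replicas)
print(without_replica1)
```

The point of the sketch is only that membership is an input to the reduction rather than a fixed property of the job, so removing one replica changes the participants without stopping the step.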

Submitted to arXiv on 30 Jan. 2026

Results of the summarizing process for the arXiv paper: 2602.00277v1

Large-scale training systems conventionally use synchronous training, which requires all GPUs to be operational simultaneously. At a scale of O(100K) GPUs, this approach becomes inefficient because failures are frequent and recovery takes a long time. To address this, the authors propose Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP), a training paradigm that uses data parallel replicas as units of fault tolerance. When a failure occurs, only the single data-parallel replica containing the faulty GPU or server is taken offline and restarted, while the other replicas continue training.
To make this work at scale, FT-HSDP introduces a Fault Tolerant All Reduce (FTAR) protocol for gradient exchange across data parallel replicas. FTAR uses the CPU to drive the control logic, such as dynamically adding or removing participants, and the GPU to perform the data transfer. FT-HSDP also adds a non-blocking catch-up protocol that lets a recovering replica rejoin training with minimal stall. Compared with fully synchronous training on O(100K) GPUs, FT-HSDP reduces the stall time due to failure recovery from 10 minutes to 3 minutes, increasing effective training time from 44% to 80%. The authors further report that FT-HSDP's asynchronous recovery causes no meaningful degradation in the accuracy of the resulting model.
The work by Omkar Salpekar, Rohan Varma, Kenny Yu, Vladimir Ivanov, Yang Wang, Ahmed Sharif, Min Si, Shawn Xu, Feng Tian, Shengbao Zheng, and their co-authors shows how FT-HSDP improves fault tolerance and overall efficiency in large-scale training without sacrificing model accuracy.
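
One way to picture a non-blocking catch-up is the toy simulation below. It is an assumption-laden sketch, not the mechanism described in the paper: the healthy replicas hand the recovering replica a snapshot without pausing, keep a log of the updates applied afterwards, and the recovering replica replays that log before rejoining. The names `apply_update`, `catch_up`, and `LEARNING_RATE` are hypothetical, and the model is a single scalar weight.

```python
from typing import List, Optional, Tuple

LEARNING_RATE = 0.1  # hypothetical value for the toy update rule


def apply_update(weight: float, grad: float) -> float:
    """One SGD-style step on a single scalar weight."""
    return weight - LEARNING_RATE * grad


def catch_up(
    snapshot: Tuple[int, float], update_log: List[Tuple[int, float]]
) -> Tuple[int, float]:
    """Replay logged (step, grad) updates that are newer than the snapshot."""
    step, weight = snapshot
    for logged_step, grad in update_log:
        if logged_step > step:
            weight = apply_update(weight, grad)
            step = logged_step
    return step, weight


# Healthy replicas keep training while the failed replica restarts.
step, weight = 0, 1.0
update_log: List[Tuple[int, float]] = []
snapshot: Optional[Tuple[int, float]] = None
for grad in [0.5, -0.2, 0.3, 0.1]:
    step += 1
    weight = apply_update(weight, grad)
    update_log.append((step, grad))
    if step == 2:
        # State is copied for the recovering replica; the loop does not pause.
        snapshot = (step, weight)

assert snapshot is not None
recovered = catch_up(snapshot, update_log)
print(recovered == (step, weight))
```

Because the recovering replica applies the same updates in the same order, it ends at the same step and weight as the replicas that never stopped, and those replicas were never blocked while it caught up.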
Created on 04 Feb. 2026
