Training LLMs with Fault Tolerant HSDP on 100,000 GPUs

AI-generated keywords: Fault Tolerance

AI-generated Key Points

- Synchronous training in large-scale systems can be inefficient with O(100K) GPUs due to frequent failures and extended recovery times
- Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP) is introduced as a new training paradigm for fault tolerance
- FT-HSDP uses data parallel replicas as units of fault tolerance, allowing the training process to continue even if one replica fails
- A Fault Tolerant All Reduce (FTAR) protocol and a non-blocking catch-up protocol are implemented in FT-HSDP for efficient recovery and rejoining of replicas
- FT-HSDP reduces stall time during failure recovery from 10 minutes to 3 minutes, increasing effective training time from 44% to 80%
- The authors report that FT-HSDP's asynchronous recovery process does not compromise model accuracy

Authors: Omkar Salpekar, Rohan Varma, Kenny Yu, Vladimir Ivanov, Yang Wang, Ahmed Sharif, Min Si, Shawn Xu, Feng Tian, Shengbao Zheng, Tristan Rice, Ankush Garg, Shangfu Peng, Shreyas Siravara, Wenyin Fu, Rodrigo de Castro, Adithya Gangidi, Andrey Obraztsov, Sharan Narang, Sergey Edunov, Maxim Naumov, Chunqiang Tang, Mathew Oldham

arXiv: 2602.00277v1 - DOI (cs.DC)

License: CC BY 4.0

Abstract: Large-scale training systems typically use synchronous training, requiring all GPUs to be healthy simultaneously. In our experience training on O(100K) GPUs, synchronous training results in low efficiency due to frequent failures and long recovery time. To address this problem, we propose a novel training paradigm, Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP). FT-HSDP uses data parallel replicas as units of fault tolerance. When failures occur, only a single data-parallel replica containing the failed GPU or server is taken offline and restarted, while the other replicas continue training. To realize this idea at scale, FT-HSDP incorporates several techniques: 1) We introduce a Fault Tolerant All Reduce (FTAR) protocol for gradient exchange across data parallel replicas. FTAR relies on the CPU to drive the complex control logic for tasks like adding or removing participants dynamically, and relies on the GPU to perform data transfer for best performance. 2) We introduce a non-blocking catch-up protocol, allowing a recovering replica to join training with minimal stall. Compared with fully synchronous training at O(100K) GPUs, FT-HSDP can reduce the stall time due to failure recovery from 10 minutes to 3 minutes, increasing effective training time from 44% to 80%. We further demonstrate that FT-HSDP's asynchronous recovery does not bring any meaningful degradation to the accuracy of the resulting model.

Submitted to arXiv on 30 Jan. 2026


Results of the summarizing process for the arXiv paper: 2602.00277v1


Large-scale training systems have conventionally used synchronous training, which requires all GPUs to be operational at the same time. At a scale of O(100K) GPUs, this approach becomes inefficient because failures are frequent and recovery times are long. To address this, the paper proposes a training paradigm called Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP). FT-HSDP uses data-parallel replicas as units of fault tolerance: when a failure occurs, only the single data-parallel replica containing the faulty GPU or server is taken offline and restarted, while the other replicas continue training.

To implement this at scale, FT-HSDP incorporates two techniques. The first is a Fault Tolerant All Reduce (FTAR) protocol for gradient exchange across data-parallel replicas. FTAR relies on the CPU to drive complex control logic, such as dynamically adding or removing participants, and on the GPU to perform the data transfer for best performance. The second is a non-blocking catch-up protocol that lets a recovering replica rejoin training with minimal stall.

Compared with fully synchronous training on O(100K) GPUs, FT-HSDP reduces the stall time due to failure recovery from 10 minutes to 3 minutes, which increases effective training time from 44% to 80%. The authors further report that FT-HSDP's asynchronous recovery does not meaningfully degrade the accuracy of the resulting model.

The research by Omkar Salpekar, Rohan Varma, Kenny Yu, Vladimir Ivanov, Yang Wang, Ahmed Sharif, Min Si, Shawn Xu, Feng Tian, Shengbao Zheng and their co-authors shows how FT-HSDP improves fault tolerance and overall efficiency in large-scale training without sacrificing model accuracy.

Summary
- Sometimes when many computers work together to learn something, they can have problems and take a long time to fix them.
- A new way of training computers has been created to keep working even if one computer stops working.
- This new way uses copies of the same information on different computers to make sure the learning process doesn't stop if one computer breaks.
- Special techniques are used in this new way to quickly fix problems and keep all the computers learning together smoothly.
- With this new method, the time wasted when fixing problems is reduced, allowing more time for learning.

Definitions
- Synchronous: Happening at the same time or in coordination with each other.
- Fault Tolerant: Able to continue working properly even if there are some problems or failures.
- Parallelism: When multiple tasks are carried out simultaneously by dividing them among different resources like computers.

Introduction

Training large-scale machine learning models has become increasingly important, but growth in scale brings challenges that traditional synchronous training methods struggle to handle. To address this, the paper's authors (Omkar Salpekar et al.) propose an approach called Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP). It aims to improve fault tolerance and overall efficiency in large-scale training systems without compromising model accuracy.

The Problem with Synchronous Training

Synchronous training is the conventional method used for large-scale model training, where all GPUs must be operational simultaneously. While this approach works well for smaller scales, it becomes inefficient when dealing with hundreds of thousands of GPUs. Frequent failures and extended recovery times can significantly impact the overall efficiency of the system.

Introducing FT-HSDP

To combat these challenges, FT-HSDP utilizes data parallel replicas as units of fault tolerance. In case of failures, only one replica containing the faulty GPU or server is temporarily taken offline and restarted while others continue with the training process. This approach reduces stall time during failure recovery from 10 minutes to just 3 minutes, resulting in an increase in effective training time from 44% to an impressive 80%.
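The idea of treating a replica as the unit of fault tolerance can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a hypothetical layout in which consecutive global GPU ranks are grouped into equally sized replicas, and the names `replica_of` and `ReplicaSet` are invented for this example.

```python
from dataclasses import dataclass, field
from typing import List, Set


def replica_of(gpu_rank: int, gpus_per_replica: int) -> int:
    """Map a global GPU rank to its replica (assumed contiguous layout)."""
    return gpu_rank // gpus_per_replica


@dataclass
class ReplicaSet:
    """Hypothetical bookkeeping of which data-parallel replicas are online."""
    num_replicas: int
    gpus_per_replica: int
    offline: Set[int] = field(default_factory=set)

    def report_gpu_failure(self, gpu_rank: int) -> int:
        # Only the replica that owns the failed GPU is taken offline.
        replica = replica_of(gpu_rank, self.gpus_per_replica)
        self.offline.add(replica)
        return replica

    def mark_recovered(self, replica: int) -> None:
        # Called once the restarted replica is ready to train again.
        self.offline.discard(replica)

    def healthy(self) -> List[int]:
        # Replicas that keep training while others restart.
        return [r for r in range(self.num_replicas) if r not in self.offline]


if __name__ == "__main__":
    cluster = ReplicaSet(num_replicas=4, gpus_per_replica=8)
    failed = cluster.report_gpu_failure(gpu_rank=19)
    print(failed, cluster.healthy())  # prints: 2 [0, 1, 3]
```

In this sketch a single GPU failure removes one replica from the healthy set, and the remaining replicas are unaffected, which mirrors the behavior described above.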

The Role of FTAR Protocol

One key component that enables FT-HSDP's success is its Fault Tolerant All Reduce (FTAR) protocol. This protocol leverages both CPU and GPU resources by offloading intricate control logic tasks to CPUs while relying on GPUs for optimal data transfer performance. It dynamically adds or removes participants as needed during failure recovery without interrupting the entire system.
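The split between control logic and data movement can be illustrated with a toy model. The sketch below is an assumption-laden simplification: `Membership` stands in for the CPU-driven control plane, and `fault_tolerant_all_reduce` stands in for the GPU data path by averaging plain Python lists. Averaging over the surviving replicas is a choice made for this example; the summary does not say how FTAR rescales gradients when membership changes.

```python
from typing import Dict, FrozenSet, Iterable, List


class Membership:
    """Toy control plane: tracks which replicas join the next all-reduce."""

    def __init__(self, replicas: Iterable[int]) -> None:
        self._live = set(replicas)

    def remove(self, replica: int) -> None:
        self._live.discard(replica)

    def add(self, replica: int) -> None:
        self._live.add(replica)

    def snapshot(self) -> FrozenSet[int]:
        # Freeze membership so one all-reduce sees a consistent participant set.
        return frozenset(self._live)


def fault_tolerant_all_reduce(
    grads: Dict[int, List[float]], live: FrozenSet[int]
) -> List[float]:
    """Average gradients element-wise over live replicas only (assumed policy)."""
    contributors = [grads[r] for r in sorted(live) if r in grads]
    if not contributors:
        raise RuntimeError("no live replicas to reduce over")
    n = len(contributors)
    return [sum(values) / n for values in zip(*contributors)]


if __name__ == "__main__":
    grads = {0: [1.0, 2.0], 1: [2.0, 4.0], 2: [6.0, 6.0]}
    members = Membership([0, 1, 2])
    print(fault_tolerant_all_reduce(grads, members.snapshot()))  # [3.0, 4.0]
    members.remove(1)  # replica 1 fails; the others continue
    print(fault_tolerant_all_reduce(grads, members.snapshot()))  # [3.5, 4.0]
```

Taking a membership snapshot before each reduction keeps the participant set fixed for that step, so adding or removing a replica only takes effect at the next step boundary.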

Non-Blocking Catch-Up Protocol

Another crucial aspect of FT-HSDP is its non-blocking catch-up protocol. This protocol allows a recovering replica to seamlessly rejoin training without causing any significant interruptions. It ensures that the system can quickly recover from failures and continue with the training process, further improving overall efficiency.
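The summary does not describe how the catch-up protocol works internally, so the following is only a toy timeline that illustrates what "non-blocking" means here. It assumes that recovery takes a fixed number of training steps of wall time and that healthy replicas merely check at each step boundary whether the recovering replica is ready. The function name `participants_per_step` and all of its parameters are hypothetical.

```python
from typing import List, Optional


def participants_per_step(
    total_steps: int,
    fail_step: int,
    restart_steps: int,
    num_replicas: int = 4,
) -> List[int]:
    """Return how many replicas take part in each training step.

    One replica fails at `fail_step` and needs `restart_steps` steps of
    wall time to restart and reload state. Healthy replicas never wait
    for it; they only admit it once it is ready (toy assumption).
    """
    counts: List[int] = []
    ready_at: Optional[int] = None
    for step in range(total_steps):
        if step == fail_step:
            # Recovery proceeds in the background from this step on.
            ready_at = step + restart_steps
        recovering = ready_at is not None and step < ready_at
        counts.append(num_replicas - 1 if recovering else num_replicas)
    return counts


if __name__ == "__main__":
    print(participants_per_step(total_steps=6, fail_step=2, restart_steps=3))
    # prints: [4, 4, 3, 3, 3, 4]
```

Every step in this timeline still makes progress, just with one fewer replica during recovery. In a fully synchronous setup the same failure would instead stall all replicas until the restart completed.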

Results and Impact

The research team evaluated FT-HSDP against fully synchronous training on O(100K) GPUs. FT-HSDP reduced stall time during failure recovery from 10 minutes to 3 minutes (a 70% reduction) and increased effective training time from 44% to 80%, with no meaningful degradation in model accuracy.
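The reported figures can be put in relative terms with some simple arithmetic. The snippet below uses only the four numbers quoted above; the function name `summarize_gains` is made up for this example, and the derived ratios are computed here rather than reported by the paper.

```python
from typing import Dict


def summarize_gains(
    stall_before_min: float,
    stall_after_min: float,
    effective_before: float,
    effective_after: float,
) -> Dict[str, float]:
    """Derive relative improvements from the reported before/after figures."""
    return {
        # Fraction of stall time eliminated per failure recovery.
        "stall_reduction": (stall_before_min - stall_after_min) / stall_before_min,
        # Absolute gain in effective training time, in percentage points.
        "gain_points": (effective_after - effective_before) * 100,
        # How much more useful training the same wall-clock time yields.
        "relative_gain": effective_after / effective_before,
    }


if __name__ == "__main__":
    gains = summarize_gains(10, 3, 0.44, 0.80)
    print(f"stall time cut by {gains['stall_reduction']:.0%}")            # 70%
    print(f"effective time up {gains['gain_points']:.0f} points")         # 36
    print(f"useful training per hour: {gains['relative_gain']:.2f}x")     # 1.82x
```

Put differently, going from 44% to 80% effective training time means the same cluster delivers roughly 1.8 times as much useful training per wall-clock hour.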

Revolutionizing Large-Scale Training Systems

FT-HSDP addresses two key challenges of large-scale training systems: fault tolerance and efficiency. By combining the FTAR protocol with the non-blocking catch-up protocol, it keeps healthy replicas training through failures and raises effective training time without compromising model accuracy.

Real-World Applications

The implications of this research could be far-reaching, as large-scale models are used in industries such as healthcare, finance, and transportation. Organizations that train such models at very large scale could use an approach like FT-HSDP to train more efficiently while tolerating hardware failures.

Conclusion

In conclusion, the research by Omkar Salpekar et al. shows how Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP) can improve large-scale training systems. By using data parallel replicas as units of fault tolerance and implementing the FTAR protocol and the non-blocking catch-up protocol, FT-HSDP improves fault tolerance and efficiency without sacrificing model accuracy. The approach has potential for real-world applications in industries where large-scale models support decision-making.

Created on 04 Feb. 2026
