Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices

AI-generated keywords: Moshpit SGD

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The authors address the challenge of training deep neural networks on large datasets using multiple compute nodes
  • Distributed training with specialized message-passing protocols such as Ring All-Reduce can accelerate this process across hundreds of computers
  • Running these protocols at scale requires reliable high-speed networking, but many real-world applications operate on unreliable devices with unstable network bandwidth
  • The authors propose Moshpit All-Reduce, an iterative averaging protocol that converges exponentially to the global average; the reported speedups are 1.3x for ResNet-50 on ImageNet and 1.5x for ALBERT-large on preemptible compute nodes
  • The protocol comes with strong theoretical guarantees for distributed optimization and enables decentralized training on heterogeneous unreliable devices without relying on parameter servers or gossip-based averaging

Authors: Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, Gennady Pekhimenko

Accepted to Conference on Neural Information Processing Systems (NeurIPS) 2021. Code: https://github.com/yandex-research/moshpit-sgd

Abstract: Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes. This approach, known as distributed training, can utilize hundreds of computers via specialized message-passing protocols such as Ring All-Reduce. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters. In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth. As a result, these applications are restricted to using parameter servers or gossip-based averaging protocols. In this work, we lift that restriction by proposing Moshpit All-Reduce - an iterative averaging protocol that exponentially converges to the global average. We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and 1.5x speedup when training ALBERT-large from scratch using preemptible compute nodes.
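
The abstract does not spell out how the iterative averaging works, so the following is only a minimal sketch of the general idea of reaching a global average through repeated small-group averaging. It assumes peers sit on a hypothetical 4 x 4 grid and average first within rows, then within columns; the grid layout and the names `group_average_round`, `grid_groups`, and `max_deviation` are illustrative assumptions, not the authors' implementation, and the sketch ignores peer failures and networking entirely.

```python
import random


def group_average_round(values, groups):
    """Run one averaging round: every peer takes the mean of its group."""
    new_values = list(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            new_values[i] = mean
    return new_values


def grid_groups(side, axis):
    """Partition peers on a side x side grid into rows (axis 0) or columns (axis 1)."""
    if axis == 0:
        return [[r * side + c for c in range(side)] for r in range(side)]
    return [[r * side + c for r in range(side)] for c in range(side)]


def max_deviation(values):
    """Largest distance of any peer's value from the average over all peers."""
    target = sum(values) / len(values)
    return max(abs(v - target) for v in values)


random.seed(0)
side = 4  # hypothetical grid size: 16 peers, groups of 4
values = [random.uniform(-1.0, 1.0) for _ in range(side * side)]
global_mean = sum(values) / len(values)

# Round 1: peers in the same row average their values.
after_rows = group_average_round(values, grid_groups(side, axis=0))
# Round 2: peers in the same column average the row means.
after_cols = group_average_round(after_rows, grid_groups(side, axis=1))

print("deviation before:", max_deviation(values))
print("deviation after rows:", max_deviation(after_rows))
print("deviation after columns:", max_deviation(after_cols))
```

In this idealized setting every peer holds the global average after two rounds, although each peer only ever talks to three others per round. With dropped peers or incomplete groups the result would only be approximate, which is where an iterative scheme with convergence guarantees becomes relevant.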

Submitted to arXiv on 04 Mar. 2021

Results of the summarizing process for the arXiv paper: 2103.03239v4

In their paper "Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices," authors Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, and Gennady Pekhimenko address the challenge of training deep neural networks on large datasets using multiple compute nodes. Distributed training relies on specialized message-passing protocols such as Ring All-Reduce to accelerate this process across hundreds of computers. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters.

Real-world applications such as federated learning and cloud-based distributed training often operate on unreliable devices with unstable network bandwidth. As a result, these applications are typically restricted to parameter servers or gossip-based averaging protocols.

To lift this restriction, the authors propose Moshpit All-Reduce, an iterative averaging protocol that converges exponentially to the global average. Experiments show a 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies, and a 1.5x speedup when training ALBERT-large from scratch on preemptible compute nodes. The protocol comes with strong theoretical guarantees for distributed optimization and enables decentralized training on heterogeneous unreliable devices without relying on parameter servers or gossip-based methods.

The work was accepted at the Conference on Neural Information Processing Systems (NeurIPS) 2021. It targets communication efficiency in distributed deep learning settings where reliable high-speed networking is not available. The code for Moshpit SGD is openly available on GitHub.
Created on 15 Jul. 2025
