Uber's Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure

AI-generated keywords: Global platform, Uber Failover Architecture, Service-Level Agreements (SLAs), Automated safeguards

AI-generated Key Points

- Operating a global, real-time platform at Uber's scale requires resilient and cost-efficient infrastructure
- Uber's Failover Architecture (UFA) replaces the costly 2x capacity model with a differentiated architecture aligned to business criticality
- Critical services retain failover guarantees, while non-critical services opportunistically use failover buffer capacity reserved for critical services during steady state
- UFA reduces steady-state provisioning from 2x to 1.3x, raising utilization from around 20% to approximately 30% while maintaining an availability rate of 99.97%
- UFA has hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores
- Future extensions of UFA will expand beyond stateless services to offer differentiated SLAs for stateful services
- Open directions include combining static analysis with generative AI, developing tools for certifying fail-open behavior at scale, and collaborating with cloud providers towards guaranteed elastic capacity at hyperscale

Authors: Mayank Bansal, Milind Chabbi, Kenneth Bogh, Srikanth Prodduturi, Kevin Xu, Amit Kumar, David Bell, Ranjib Dey, Yufei Ren, Sachin Sharma, Juan Marcano, Shriniket Kale, Subhav Pradhan, Ivan Beschastnikh, Miguel Covarrubias, Chien-Chih Liao, Sandeep Koushik Sheshadri, Wen Luo, Kai Song, Ashish Samant, Sahil Rihan, Nimish Sheth, Uday Kiran Medisetty

arXiv: 2603.07345v1 (cs.DC)

License: CC BY 4.0

Abstract: Operating a global, real-time platform at Uber's scale requires infrastructure that is both resilient and cost-efficient. Historically, reliability was ensured through a costly 2x capacity model (each service provisioned to handle global traffic independently across two regions), leaving half the fleet idle. We present Uber's Failover Architecture (UFA), which replaces the uniform 2x model with a differentiated architecture aligned to business criticality. Critical services retain failover guarantees, while non-critical services opportunistically use failover buffer capacity reserved for critical services during steady state. During rare "full-peak" failovers, non-critical services are selectively preempted and rapidly restored, with differentiated Service-Level Agreements (SLAs) using on-demand capacity. Automated safeguards, including dependency analysis and regression gates, ensure critical services continue to function even while non-critical services are unavailable. The quantitative impact is significant: UFA reduces steady-state provisioning from 2x to 1.3x, raising utilization from ~20% to ~30% while sustaining 99.97% availability. To date, UFA has hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores from a baseline of about four million cores.
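
The figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope calculation, not the paper's methodology: it assumes the workload stays fixed, so that utilization scales inversely with the provisioning factor, and it assumes (hypothetically) that the entire four-million-core baseline moved from 2x to 1.3x. The function names are illustrative.

```python
# Back-of-the-envelope check of the numbers quoted in the abstract.
# Assumption: the workload is fixed, so utilization is inversely
# proportional to the provisioning factor. The paper may define
# these quantities differently.

def utilization_after(util_before: float, factor_before: float, factor_after: float) -> float:
    """Utilization after changing the provisioning factor at constant load."""
    return util_before * factor_before / factor_after


def cores_freed(baseline_cores: float, factor_before: float, factor_after: float) -> float:
    """Cores freed if the whole baseline shrank by the same ratio (an idealization)."""
    return baseline_cores * (1 - factor_after / factor_before)


new_util = utilization_after(0.20, 2.0, 1.3)
freed = cores_freed(4_000_000, 2.0, 1.3)

print(f"utilization at 1.3x: {new_util:.1%}")
print(f"cores freed if all services moved to 1.3x: {freed:,.0f}")
```

Under these assumptions the result is roughly 31% utilization and about 1.4 million cores, which is consistent with the reported "~30%" and "over one million CPU cores".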

Submitted to arXiv on 07 Mar. 2026

Results of the summarizing process for the arXiv paper: 2603.07345v1

Comprehensive Summary
Key points
Layman's Summary
Blog article

Operating a global, real-time platform at Uber's scale requires infrastructure that is both resilient and cost-efficient. Historically, reliability was ensured through a costly 2x capacity model, with each service provisioned to handle global traffic independently across two regions, leaving half the fleet idle. We present Uber's Failover Architecture (UFA), which replaces the uniform 2x model with a differentiated architecture aligned to business criticality. Critical services retain failover guarantees, while non-critical services opportunistically use failover buffer capacity reserved for critical services during steady state. During rare "full-peak" failovers, non-critical services are selectively preempted and rapidly restored, with differentiated Service-Level Agreements (SLAs) using on-demand capacity. Automated safeguards, including dependency analysis and regression gates, ensure critical services continue to function even while non-critical services are unavailable.

The quantitative impact of UFA is significant: it reduces steady-state provisioning from 2x to 1.3x, raising utilization from around 20% to approximately 30% while sustaining an availability rate of 99.97%. To date, UFA has hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores from a baseline of about four million cores.

Looking towards the future, UFA extensions will expand beyond stateless services to offer differentiated SLAs for stateful services. Several open directions remain in this space: combining static analysis with generative AI to automatically fix fail-close issues, developing general-purpose tools for certifying fail-open behavior at scale, and collaborating with cloud providers towards guaranteed elastic capacity at hyperscale.

We would like to express our gratitude to the numerous individuals from various teams at Uber who have made invaluable contributions to this project. Special thanks go out to Abhishek Jha, Aditya Jain, Albert Greenberg, Arturo Bravo Rovirosa, Arun Krishnan, Christoffer Hansen, Darshil Kapadia, Deepanker Sachdeva, Egor Grishechko, Eric Chin and many others for their dedication and support throughout this endeavor. Additionally, we extend our appreciation to David A. Maltz for his feedback and guidance as our paper shepherd.

Summary

1. Uber uses a big, strong system to run its services all over the world.
2. They made a new way called Uber's Failover Architecture to save money and make things work better.
3. Important services always have backup plans in case something goes wrong, while less important ones can use extra space from backups when everything is normal.
4. The new way helps them use resources better and be available almost all the time.
5. They fixed many problems and saved lots of computer power with this new way, and they plan to make it even better in the future.

Definitions

- Resilient: Strong and able to keep working even if there are problems
- Infrastructure: The basic systems needed for something to work
- Failover: A backup plan that takes over if the main plan fails
- Provisioning: Getting ready or preparing resources for use
- Availability rate: How often something is ready and working
- Stateless services: Programs that don't need to remember past information
- SLAs (Service Level Agreements): Promises about how well a service will work
- Stateful services: Programs that need to remember past information

Operating a global, real-time platform at Uber's scale is no easy feat. With millions of users relying on the app every day, the infrastructure behind the platform must be both resilient and cost-efficient. To that end, Uber has implemented a new failover architecture called UFA (Uber's Failover Architecture), which replaces the traditional 2x capacity model with a differentiated approach aligned to business criticality.

The Need for Resilience and Cost-Efficiency

In the past, ensuring reliability at this scale relied on a costly 2x capacity model. Each service was provisioned to handle global traffic independently across two regions, which left half of the fleet idle at any given time. This approach delivered high availability, but it was not sustainable from a cost perspective.

Introducing UFA: A Differentiated Approach

To address these challenges, Uber developed UFA as an alternative to the uniform 2x model. The new architecture takes into account the varying levels of criticality among services and allocates resources accordingly.

Critical Services vs Non-Critical Services

Under UFA, critical services are those that are essential for maintaining the core functionality of Uber's platform. These include services related to ride requests, trip management, payment processing and more. Non-critical services are those that are not directly involved in providing transportation services but still play a supporting role.

Failover Guarantees for Critical Services

One key aspect of UFA is that critical services keep their failover guarantees. Even during rare "full-peak" failovers, when one region goes down completely due to unforeseen circumstances or maintenance activities, these critical services continue functioning without interruption.
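
The split between critical and non-critical services can be pictured as a small capacity-planning problem. The sketch below is a hypothetical illustration, not Uber's implementation: the `Service` fields, the `plan_failover` function, the service names, and the "keep the smallest non-critical services first" policy are all assumptions made for the example. It shows the surviving region first reserving enough cores for critical services to absorb failed-over traffic, and then preempting only those non-critical services that no longer fit.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    critical: bool
    cores: int               # cores used in the surviving region at steady state
    failover_cores: int = 0  # extra cores needed to absorb failed-over traffic


def plan_failover(services: list[Service], region_cores: int) -> list[str]:
    """Return the names of non-critical services to preempt during a failover.

    Critical services are always kept and sized for failover traffic.
    Non-critical services are kept only while spare capacity remains
    (smallest first, which is an arbitrary policy chosen for this sketch).
    """
    needed = sum(s.cores + s.failover_cores for s in services if s.critical)
    if needed > region_cores:
        raise ValueError("critical services alone exceed region capacity")

    spare = region_cores - needed
    preempted = []
    for s in sorted((s for s in services if not s.critical), key=lambda s: s.cores):
        if s.cores <= spare:
            spare -= s.cores          # still fits next to the critical services
        else:
            preempted.append(s.name)  # must give its capacity back
    return preempted


services = [
    Service("rides", critical=True, cores=40, failover_cores=40),
    Service("payments", critical=True, cores=20, failover_cores=20),
    Service("email-digest", critical=False, cores=15),
    Service("analytics", critical=False, cores=30),
]
plan = plan_failover(services, region_cores=140)
print(plan)
```

In this toy setup all four services fit at steady state (105 of 140 cores), but during a failover the critical services need 120 cores, so only the smaller non-critical service can stay and the larger one is preempted.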

Opportunistic Use of Failover Buffer Capacity

In contrast to critical services, which have dedicated failover guarantees, non-critical services opportunistically use the failover buffer capacity reserved for critical services during steady state. During normal operations, non-critical services run on buffer capacity that would otherwise sit idle.

Selective Preemption and Rapid Restoration

In the event of a full-peak failover, non-critical services may need to be preempted so that their capacity can be handed back to critical services. Preemption is selective, and preempted services are rapidly restored using on-demand capacity, under differentiated Service-Level Agreements (SLAs).

Automated Safeguards for Continuous Functionality

To ensure that critical services continue functioning even while non-critical ones are unavailable, UFA relies on automated safeguards such as dependency analysis and regression gates. These safeguards identify dependencies of critical services on non-critical ones so that they can be mitigated before a failover exposes them.

Quantitative Impact of UFA

The impact of UFA on Uber's operations has been significant. Reducing steady-state provisioning from 2x to 1.3x has raised utilization from around 20% to approximately 30%, while availability has been sustained at 99.97%.

Hardening Unsafe Dependencies and Eliminating CPU Cores

To date, UFA has hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores from a baseline of about four million cores. This improves efficiency and reduces Uber's infrastructure costs.
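
The idea of an "unsafe dependency" and a regression gate can be illustrated with a small graph check. This is a minimal sketch under stated assumptions, not the tooling described in the paper: it assumes that each call edge is annotated with whether the caller fails open (keeps working when the callee is down) or fails close, and it treats any fail-close edge from a critical service to a non-critical one as unsafe. The function names, the edge format, and the service names are hypothetical.

```python
def unsafe_dependencies(edges, critical):
    """Find edges where a critical caller hard-depends on a non-critical callee.

    edges: iterable of (caller, callee, fail_open) tuples, where fail_open is
           True if the caller degrades gracefully when the callee is unavailable.
    critical: set of service names classified as critical.
    """
    return sorted(
        (caller, callee)
        for caller, callee, fail_open in edges
        if caller in critical and callee not in critical and not fail_open
    )


def regression_gate(edges, critical):
    """Raise if a change would introduce (or keep) an unsafe dependency."""
    bad = unsafe_dependencies(edges, critical)
    if bad:
        raise RuntimeError(f"unsafe dependencies found: {bad}")


critical = {"rides", "payments"}
edges = [
    ("rides", "payments", False),      # critical -> critical: allowed
    ("rides", "promo-banner", False),  # critical -> non-critical, fail-close: unsafe
    ("payments", "analytics", True),   # critical -> non-critical, fail-open: safe
    ("analytics", "promo-banner", False),  # non-critical caller: not checked
]
found = unsafe_dependencies(edges, critical)
print(found)
```

A gate of this kind would typically run before a change is deployed, so that a new fail-close call from a critical service to a preemptible one is rejected rather than discovered during a failover.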

Future Extensions: Differentiated SLAs for Stateful Services

UFA currently focuses on stateless services (services that do not store data). There are plans to extend it to offer differentiated SLAs for stateful services (services that do store data), which would bring the same flexibility in resource management to more types of services.

Open Directions for Further Improvement

Several open directions remain. These include combining static analysis with generative AI to automatically fix fail-close issues, developing general-purpose tools for certifying fail-open behavior at scale, and collaborating with cloud providers towards guaranteed elastic capacity at hyperscale.

Acknowledgements

The development and implementation of UFA would not have been possible without the contributions of numerous individuals from various teams at Uber. Special thanks go out to Abhishek Jha, Aditya Jain, Albert Greenberg, Arturo Bravo Rovirosa, Arun Krishnan, Christoffer Hansen, Darshil Kapadia, Deepanker Sachdeva, Egor Grishechko, Eric Chin and many others who have dedicated their time and effort to this project. Additionally, we extend our appreciation to David A. Maltz for his feedback and guidance as our paper shepherd.

Conclusion

Uber's Failover Architecture (UFA) changes how reliability is ensured on a global, real-time platform. By aligning failover guarantees with business criticality, and by combining automated safeguards with selective preemption during rare full-peak failovers, UFA has raised resource utilization from around 20% to approximately 30% while sustaining 99.97% availability. Planned extensions to stateful services aim to carry these gains further.

Created on 10 Mar. 2026
