Operating a global, real-time platform at Uber's scale requires infrastructure that is both resilient and cost-efficient. Historically, reliability was ensured through a costly 2x capacity model--each service provisioned to handle global traffic independently across two regions--leaving half the fleet idle. We present Uber's Failover Architecture (UFA), which replaces the uniform 2x model with a differentiated architecture aligned to business criticality. Critical services retain failover guarantees, while non-critical services opportunistically use failover buffer capacity reserved for critical services during steady state. During rare "full-peak" failovers, non-critical services are selectively preempted and rapidly restored, with differentiated Service-Level Agreements (SLAs) using on-demand capacity. Automated safeguards, including dependency analysis and regression gates, ensure critical services continue to function even while non-critical services are unavailable. The quantitative impact of UFA is significant: it reduces steady-state provisioning from 2x to 1.3x, raising utilization from around 20% to approximately 30% while sustaining an impressive availability rate of 99.97%. To date, UFA has hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores from a baseline of about four million cores. Looking towards the future, UFA extensions will expand beyond stateless services to offer differentiated SLAs for stateful services. Several open directions remain in this space: combining static analysis with generative AI to automatically fix fail-close issues, developing general-purpose tools for certifying fail-open behavior at scale, and collaborating with cloud providers towards guaranteed elastic capacity at hyperscale. We would like to express our gratitude to the numerous individuals from various teams at Uber who have made invaluable contributions to this project. 
Special thanks go out to Abhishek Jha, Aditya Jain, Albert Greenberg, Arturo Bravo Rovirosa, Arun Krishnan, Christoffer Hansen, Darshil Kapadia, Deepanker Sachdeva, Egor Grishechko, Eric Chin and many others for their dedication and support throughout this endeavor. Additionally, we extend our appreciation to David A. Maltz for his feedback and guidance as our paper shepherd.
Source: High Availability Architecture.
- Operating a global, real-time platform at Uber's scale requires resilient and cost-efficient infrastructure
- Uber's Failover Architecture (UFA) replaces the costly 2x capacity model with a differentiated architecture aligned to business criticality
- Critical services retain failover guarantees, while non-critical services opportunistically use failover buffer capacity reserved for critical services during steady state
- UFA reduces steady-state provisioning from 2x to 1.3x, raising utilization from around 20% to approximately 30% while maintaining an availability rate of 99.97%
- UFA has hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores
- Future extensions of UFA will expand beyond stateless services to offer differentiated SLAs for stateful services
- Open directions include combining static analysis with generative AI, developing tools for certifying fail-open behavior at scale, and collaborating with cloud providers towards guaranteed elastic capacity at hyperscale
Summary
1. Uber uses a big, strong system to run its services all over the world.
2. They made a new way called Uber's Failover Architecture to save money and make things work better.
3. Important services always have backup plans in case something goes wrong, while less important ones can use extra space from backups when everything is normal.
4. The new way helps them use resources better and be available almost all the time.
5. They fixed many problems and saved lots of computer power with this new way, and they plan to make it even better in the future.
Definitions
- Resilient: Strong and able to keep working even if there are problems
- Infrastructure: The basic systems needed for something to work
- Failover: A backup plan that takes over if the main plan fails
- Provisioning: Getting ready or preparing resources for use
- Availability rate: How often something is ready and working
- Stateless services: Programs that don't need to remember past information
- SLAs (Service Level Agreements): Promises about how well a service will work
- Stateful services: Programs that need to remember past information
Operating a global, real-time platform at Uber's scale is no easy feat. With millions of users relying on the app for transportation services every day, it is crucial that the infrastructure supporting this platform is both resilient and cost-efficient. In order to achieve this, Uber has implemented a new failover architecture called UFA (Uber's Failover Architecture). This architecture replaces the traditional 2x capacity model with a differentiated approach that aligns with business criticality.
The Need for Resilience and Cost-Efficiency
In the past, ensuring reliability in such a large-scale operation required a costly 2x capacity model. This meant that each service had to be provisioned to handle global traffic independently across two regions, resulting in half of the fleet being idle at any given time. While this approach did guarantee high availability, it was not sustainable from a cost perspective.
Introducing UFA: A Differentiated Approach
To address these challenges, Uber developed UFA as an alternative to the uniform 2x model. This new architecture takes into account the varying levels of criticality among different services and allocates resources accordingly.
Critical Services vs Non-Critical Services
Under UFA, critical services are those that are essential for maintaining the core functionality of Uber's platform. These include services related to ride requests, trip management, payment processing and more. On the other hand, non-critical services refer to those that are not directly involved in providing transportation services but still play an important role in supporting them.
Failover Guarantees for Critical Services
One key aspect of UFA is that critical services retain their failover guarantees. Even during a rare "full-peak" failover, when one region goes down completely, capacity remains reserved for these critical services so that they continue to function.
Opportunistic Use of Failover Buffer Capacity
In contrast to critical services, which have dedicated failover guarantees, non-critical services opportunistically use the failover buffer capacity reserved for critical services during steady state. In other words, during normal operations non-critical services run on capacity that would otherwise sit idle until a failover.
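The sharing of buffer capacity can be illustrated with a minimal sketch. It is not Uber's scheduler: the `RegionPlan` class, its fields, and the core counts are hypothetical and chosen only to show how the same cores serve non-critical work in steady state and critical work during a failover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionPlan:
    """Hypothetical per-region capacity plan, measured in CPU cores."""
    total_cores: int        # everything provisioned in the region
    critical_steady: int    # used by critical services on a normal day
    critical_failover: int  # needed by critical services if the peer region fails

    @property
    def failover_buffer(self) -> int:
        # Reserved for critical services, but idle in steady state.
        return self.critical_failover - self.critical_steady

    def noncritical_budget(self, failing_over: bool) -> int:
        # Steady state: non-critical work may borrow the buffer.
        # Failover: the buffer goes back to critical services.
        reserved = self.critical_failover if failing_over else self.critical_steady
        return max(0, self.total_cores - reserved)


# Illustrative numbers only.
plan = RegionPlan(total_cores=1300, critical_steady=500, critical_failover=1000)
print(plan.failover_buffer)                         # 500
print(plan.noncritical_budget(failing_over=False))  # 800
print(plan.noncritical_budget(failing_over=True))   # 300
```

In this toy plan, non-critical services lose 500 of their 800 cores when a failover starts, which is why some of them have to be preempted.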
Selective Preemption and Rapid Restoration
In the event of a full-peak failover, non-critical services are selectively preempted so that the buffer capacity they occupy can be returned to critical services. The preempted services are then rapidly restored using on-demand capacity, under differentiated Service-Level Agreements (SLAs).
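One way to picture selective preemption is as a greedy choice over non-critical services. The sketch below is an assumption-laden illustration, not the UFA algorithm: the `Service` type, the numeric `tier` field (higher meaning less critical), and the tie-breaking rules are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    tier: int   # hypothetical scale: higher number = less critical
    cores: int  # cores the service currently occupies


def select_preemptions(services, cores_needed):
    """Preempt the least critical services first until enough cores are freed."""
    victims, freed = [], 0
    for svc in sorted(services, key=lambda s: (-s.tier, -s.cores)):
        if freed >= cores_needed:
            break
        victims.append(svc)
        freed += svc.cores
    return victims, freed


def restore_order(victims):
    """Restore more important tiers first, as a stand-in for differentiated SLAs."""
    return sorted(victims, key=lambda s: (s.tier, s.name))


noncritical = [
    Service("email-digest", tier=5, cores=200),
    Service("ab-reports", tier=5, cores=50),
    Service("recommendations", tier=4, cores=150),
    Service("receipts", tier=3, cores=100),
]
victims, freed = select_preemptions(noncritical, cores_needed=300)
print([v.name for v in victims], freed)
print([v.name for v in restore_order(victims)])
```

With these made-up inputs, the two tier-5 services and one tier-4 service are preempted, and the tier-4 service is first in line for restoration.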
Automated Safeguards for Continuous Functionality
To ensure that critical services continue functioning even while non-critical ones are unavailable, UFA includes automated safeguards such as dependency analysis and regression gates. These safeguards identify cases where a critical service depends on a non-critical one, so that such dependencies can be hardened before a failover exposes them.
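A dependency check of this kind can be sketched as a scan over call-graph edges. This is a simplified model under stated assumptions, not Uber's tooling: the `Dependency` record, the `fail_open` flag, and the baseline-based gate are hypothetical, and the service names are invented.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    caller: str
    callee: str
    fail_open: bool  # True if the caller tolerates the callee being unavailable


def unsafe_dependencies(deps, critical):
    """Edges where a critical caller hard-depends on a non-critical callee."""
    return [
        d for d in deps
        if d.caller in critical and d.callee not in critical and not d.fail_open
    ]


def regression_gate(deps, critical, known_unsafe):
    """Return unsafe edges missing from the accepted baseline.

    A non-empty result would block the change in this sketch.
    """
    current = {(d.caller, d.callee) for d in unsafe_dependencies(deps, critical)}
    return sorted(current - set(known_unsafe))


critical = {"trip-request", "payments"}
deps = [
    Dependency("trip-request", "payments", fail_open=False),    # critical to critical: fine
    Dependency("trip-request", "promotions", fail_open=True),   # tolerated if down
    Dependency("payments", "receipt-emails", fail_open=False),  # unsafe
]
print(regression_gate(deps, critical, known_unsafe=set()))
```

Checking only direct edges is enough in this model because every critical service is scanned, and a fail-open edge shields the caller from anything further downstream.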
Quantitative Impact of UFA
The impact of UFA has been significant for Uber's operations. By reducing steady-state provisioning from 2x to 1.3x, utilization has increased from around 20% to approximately 30%. More of the provisioned capacity now does useful work, without compromising availability, which remains at 99.97%.
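The relationship between the provisioning factor and utilization can be checked with back-of-the-envelope arithmetic. The sketch below assumes the workload stays fixed and that the whole fleet scales with the provisioning factor; both are simplifying assumptions, and the function names are hypothetical.

```python
def utilization_after(util_before, factor_before, factor_after):
    """Same work on fewer machines: utilization scales inversely with provisioning."""
    return util_before * factor_before / factor_after


def cores_saved(baseline_cores, factor_before, factor_after):
    """Cores freed if the entire baseline shrinks with the provisioning factor."""
    return baseline_cores * (1 - factor_after / factor_before)


util = utilization_after(0.20, factor_before=2.0, factor_after=1.3)
saved = cores_saved(4_000_000, factor_before=2.0, factor_after=1.3)
print(f"{util:.1%}")    # about 30.8%
print(f"{saved:,.0f}")  # about 1,400,000
```

Under these assumptions the model lands close to the reported "approximately 30%" utilization, and its estimate of cores saved is in line with the "over one million" eliminated from a baseline of about four million.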
Hardening Unsafe Dependencies and Eliminating CPU Cores
Since its implementation, UFA has successfully hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores from a baseline of about four million cores. This not only improves efficiency but also reduces costs for Uber in terms of infrastructure maintenance.
Future Extensions: Differentiated SLAs for Stateful Services
While UFA currently focuses on stateless services (services that do not store data), there are plans to expand it further by offering differentiated SLAs for stateful services (services that do store data). This will provide even more flexibility and efficiency in managing resources for different types of services.
Open Directions for Further Improvement
Despite the success of UFA, there are still open directions for further improvement. These include combining static analysis with generative AI to automatically fix fail-close issues, developing general-purpose tools for certifying fail-open behavior at scale, and collaborating with cloud providers towards guaranteed elastic capacity at hyperscale.
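The difference between fail-close and fail-open behavior can be shown with a small wrapper. This is a generic illustration of the pattern, not code from UFA: `call_fail_open`, `fetch_promotions`, and `build_trip_quote` are hypothetical names.

```python
def call_fail_open(fn, default, *args, **kwargs):
    """Call a non-critical dependency; fall back to a default if it fails."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        # Fail-open: the caller keeps working with degraded data.
        return default


def fetch_promotions(rider_id):
    # Stand-in for a non-critical service that has been preempted.
    raise ConnectionError("promotions service unavailable")


def build_trip_quote(rider_id, base_fare):
    # A fail-close version would call fetch_promotions directly and
    # propagate the error, taking the critical path down with it.
    promos = call_fail_open(fetch_promotions, [], rider_id)
    discount = sum(promos)
    return max(0.0, base_fare - discount)


print(build_trip_quote("rider-1", 12.5))  # 12.5: quote succeeds without promotions
```

Certifying fail-open behavior at scale amounts to showing that every critical call path behaves like `build_trip_quote` here when its non-critical dependencies are unavailable.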
Acknowledgements
The development and implementation of UFA would not have been possible without the contributions of numerous individuals from various teams at Uber. Special thanks go out to Abhishek Jha, Aditya Jain, Albert Greenberg, Arturo Bravo Rovirosa, Arun Krishnan, Christoffer Hansen, Darshil Kapadia, Deepanker Sachdeva, Egor Grishechko, Eric Chin and many others who have dedicated their time and effort to this project. Additionally, we extend our appreciation to David A. Maltz for his feedback and guidance as our paper shepherd.
Conclusion
In conclusion, Uber's Failover Architecture (UFA) replaces uniform 2x provisioning with a differentiated approach based on business criticality. Combined with automated safeguards and selective preemption during rare full-peak failovers, it has significantly improved resource utilization while maintaining high availability. The planned extension to stateful services and the remaining open directions are intended to build on these gains.