United Airlines implements enterprise-wide resilience program with AWS
This blog post is co-authored with Jenny Zhou, Principal Enterprise Architect at United Airlines.
In this blog, we will explore how United Airlines implemented an enterprise-wide resilience program using Amazon Web Services (AWS).
United Airlines, a major U.S. airline headquartered in Chicago, Illinois, announced its United Next plan in 2021. United Next is the airline’s plan to improve its network and enhance the customer experience. As the company transitions hundreds of applications to AWS and modernizes its critical digital systems, it must ensure 100% availability of these business-critical applications.
To meet the business requirement of a lower recovery time objective (RTO), application teams began designing multi-Region application architectures. United Airlines teams identified the need for a flexible, repeatable, and robust platform that could scale quickly. As application teams modernized on AWS, the United Airlines Platform Engineering, Database Administration (DBA), and application teams had to manage complex and time-consuming failover runbooks and procedures, and they often relied on manual failover processes that required human intervention. These processes were inefficient and error-prone, potentially causing downtime and disrupting critical business services.
To address these challenges, the United Airlines leadership tasked the Enterprise Architecture (EA) team with building a more robust, repeatable, and automated solution.
In April 2023, the EA team began rolling out the Rapid Recovery solution. Rapid Recovery is a central platform developed to enable rapid cross-Region recovery capabilities for critical applications hosted on AWS. The platform automates common recovery steps: (1) switching applications between Regions by using Amazon Application Recovery Controller (ARC), (2) performing database failover tasks such as promoting a secondary DB cluster to be the primary DB cluster, and (3) providing templates to create an observability dashboard. Rapid Recovery aims to provide enhanced business continuity and disaster recovery (BCDR) protection compared to the high availability provided in a single AWS Region. To date, more than 70 business-critical services running on AWS use this platform.
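The Region switch described above can be illustrated with a minimal Python sketch. It is not the Rapid Recovery implementation: it only builds the list of routing control state changes for a failover, assuming one ARC routing control per Region. The function name, the Region keys, and the placeholder ARNs are hypothetical. The entry shape mirrors what the boto3 `route53-recovery-cluster` client accepts in `update_routing_control_states`, but the sketch itself makes no AWS calls.

```python
from typing import Dict, List


def build_routing_control_entries(
    controls_by_region: Dict[str, str], target_region: str
) -> List[Dict[str, str]]:
    """Return routing control updates that send traffic only to target_region.

    controls_by_region maps a Region name to the ARN of the routing control
    guarding that Region (assumption: one routing control per Region).
    """
    if target_region not in controls_by_region:
        raise ValueError(f"no routing control registered for {target_region}")

    # Turn the target Region "On" and every other Region "Off".
    return [
        {
            "RoutingControlArn": arn,
            "RoutingControlState": "On" if region == target_region else "Off",
        }
        for region, arn in sorted(controls_by_region.items())
    ]


if __name__ == "__main__":
    # Placeholder ARNs, not real resources.
    controls = {
        "us-east-1": "arn:placeholder:routingcontrol/east",
        "us-west-2": "arn:placeholder:routingcontrol/west",
    }
    for entry in build_routing_control_entries(controls, "us-west-2"):
        print(entry)
```

Applying all state changes in one batch, rather than one call per Region, keeps the switch closer to atomic, so traffic is not briefly routed to both Regions or to neither.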
Figure 1: Architecture for Rapid Recovery Solution
A small team of five people built the Rapid Recovery solution in less than six months.
The above architecture has the following key features:

Figure 2: Enterprise-wide dashboard with failover history
Implementing automation allowed United Airlines to expedite the recovery process and reduce recovery time objective (RTO) during service impairment.
Besides providing Rapid Recovery, the EA team also standardized application onboarding by providing detailed guidance on disaster recovery design, performing Well-Architected reviews, and supplying starter kits (documentation and code) with example runbooks for standard application architecture patterns. These runbooks outline the preparation steps, recovery procedures, and post-failover testing requirements.
Before an application moves to production, application teams are required to perform a full application failover drill into another Region. This mandatory step validates the failover runbook and builds confidence in the team’s ability to execute a failover when needed. The EA team leads sessions with application teams, providing guidance and training to ensure the success of this initiative.
Most application architectures at United Airlines rely on human intervention to trigger cross-Region failover processes. Switching between Regions typically involves deliberate human assessment and decision-making. This approach prioritizes human oversight and control over automated failover mechanisms based on observability signals. This human-in-the-loop approach ensures careful consideration of potential impacts before executing a Regional failover, maintaining a balance between system resilience and operational control.
United Airlines has a well-defined event management process to handle critical service disruptions. This process includes an incident management team, application owners, and senior leadership to assess the impact and define next steps.
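The human-in-the-loop decision described above can be sketched as a simple approval gate. This is an illustration only. The class and function names are hypothetical, and requiring sign-off from all three groups named in the event management process is an assumption, since the source does not say who must approve a failover.

```python
from dataclasses import dataclass, field
from typing import Callable, Set

# Assumption: all three groups named in the event management process must approve.
REQUIRED_APPROVERS = frozenset(
    {"incident_management", "application_owner", "senior_leadership"}
)


@dataclass
class FailoverRequest:
    application: str
    target_region: str
    approvals: Set[str] = field(default_factory=set)

    def approve(self, role: str) -> None:
        """Record an approval from one of the required roles."""
        if role not in REQUIRED_APPROVERS:
            raise ValueError(f"unknown approver role: {role}")
        self.approvals.add(role)

    def missing_approvals(self) -> Set[str]:
        """Return the roles that have not approved yet."""
        return set(REQUIRED_APPROVERS - self.approvals)


def execute_failover(
    request: FailoverRequest, run: Callable[[FailoverRequest], None]
) -> bool:
    """Run the failover action only after every required role has approved."""
    if request.missing_approvals():
        return False
    run(request)
    return True
```

Passing the failover action in as a callable keeps the decision separate from the automation, so the same gate can sit in front of a drill or a real Regional failover.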
Failover process
Figure 3: Workflow interface provided to the application team
Resilience is a continuous process. Periodically evaluating and practicing your disaster recovery plan is essential to ensure its effectiveness and to build confidence in its implementation when needed.
To understand its enterprise-wide resilience posture, United Airlines decided to capture and monitor automated and manual process signals (including monitoring of failure modes) in an operational dashboard called the Application Reliability Dashboard (ARD). ARD is a custom application with a dedicated software development team. Its goal is to enhance customer satisfaction by ensuring that applications meet high standards of quality and dependability.
ARD provides a comprehensive overview of an application’s health and reliability. It offers a unified interface where each application service is assigned a reliability score, with a target pass criterion of 80% or higher. The score is calculated using United Airlines-specific metrics that Gartner, a leading research and advisory company, has reviewed and endorsed. The scoring model is based on a customized service reliability engineering framework tailored to United Airlines’ needs and requirements.
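As a rough illustration of a score with an 80% pass criterion, the sketch below computes a weighted average of per-metric scores. The source does not describe ARD's actual metrics, weights, or formula, so the weighted-average model, the metric names, and all function names here are assumptions made for the example.

```python
from typing import Dict

PASS_THRESHOLD = 80.0  # target pass criterion from the text: 80% or higher


def reliability_score(
    metric_scores: Dict[str, float], weights: Dict[str, float]
) -> float:
    """Weighted average of metric scores, each on a 0-100 scale.

    Metric names and weights are hypothetical; ARD's real model is not public.
    """
    if set(metric_scores) != set(weights):
        raise ValueError("metric_scores and weights must cover the same metrics")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted_sum = sum(metric_scores[name] * weights[name] for name in weights)
    return weighted_sum / total_weight


def passes(score: float) -> bool:
    """True when the score meets the pass criterion."""
    return score >= PASS_THRESHOLD


if __name__ == "__main__":
    scores = {"failover_drill": 90.0, "monitoring_coverage": 70.0, "runbook_quality": 80.0}
    weights = {"failover_drill": 2.0, "monitoring_coverage": 1.0, "runbook_quality": 1.0}
    score = reliability_score(scores, weights)
    print(score, passes(score))
```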
Figure 4: Reliability score metrics
ARD serves three primary functions:
By focusing on these areas, ARD enables application teams to deliver services that are reliable (consistently performing as expected), stable (resistant to unexpected failures or downtime), and high-performing (operating efficiently and responsively).
Figure 5: ARD Dashboard view
Striving for shorter recovery time objectives (RTO) and recovery point objectives (RPO) typically leads to increased costs in both resource allocation and operational complexity. As such, it’s advisable to select RTO and RPO targets that strike an optimal balance between recovery capabilities and cost-effectiveness for your specific workload.
When United Airlines’ application teams initially explored multi-Region deployment, their primary worry was a potential doubling of application costs. To mitigate this concern, it’s essential to select the most appropriate disaster recovery (DR) strategy for each application, as this plays a pivotal role in managing overall application cost.
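One way to reason about this trade-off is to pick the least expensive DR strategy that still meets a workload's RTO and RPO targets. The sketch below uses the four strategy names from AWS disaster recovery guidance. The RTO, RPO, and relative cost figures are illustrative assumptions, not AWS or United Airlines numbers, and the function name is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DrStrategy:
    name: str
    rto_minutes: int      # illustrative achievable RTO
    rpo_minutes: int      # illustrative achievable RPO
    relative_cost: float  # illustrative multiple of single-Region cost


# All numbers below are assumptions for the example only.
STRATEGIES: List[DrStrategy] = [
    DrStrategy("backup and restore", 1440, 1440, 1.1),
    DrStrategy("pilot light", 60, 15, 1.3),
    DrStrategy("warm standby", 15, 5, 1.6),
    DrStrategy("multi-site active/active", 1, 1, 2.0),
]


def cheapest_strategy(
    required_rto_minutes: int,
    required_rpo_minutes: int,
    strategies: List[DrStrategy] = STRATEGIES,
) -> Optional[DrStrategy]:
    """Return the lowest-cost strategy meeting both targets, or None."""
    candidates = [
        s
        for s in strategies
        if s.rto_minutes <= required_rto_minutes
        and s.rpo_minutes <= required_rpo_minutes
    ]
    return min(candidates, key=lambda s: s.relative_cost, default=None)


if __name__ == "__main__":
    choice = cheapest_strategy(required_rto_minutes=30, required_rpo_minutes=10)
    print(choice.name if choice else "no strategy meets the targets")
```

With these example figures, a workload that tolerates a 30-minute RTO does not need an active/active deployment, which is the point made above: looser targets allow a cheaper strategy.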
To further maintain cost-effectiveness, United Airlines implemented:
Figure 6: Application Recovery Controller cluster sharing using AWS RAM
United Airlines has improved its operational resilience by implementing a comprehensive, enterprise-wide program on AWS. These initiatives have enhanced the reliability of the airline’s critical applications. To date, the program has shown impressive results, with over 1,000 successful cross-Region application failovers and over 400 automated database failovers. The airline also achieved a 7% reduction in mean time to recovery (MTTR) in 2024, which led to a 5% increase in Net Promoter Score (NPS) in Q3 2024 compared to 2023. These accomplishments highlight United Airlines’ commitment to robust, uninterrupted service delivery and illustrate the effectiveness of its cloud-based resilience strategy.
AWS Well-Architected Framework – Resilience Pillar
AWS Multi-Region Fundamentals whitepaper
Disaster Recovery (DR) Architecture on AWS, Architecture Blog series
AWS Cloud Resilience
AWS Multi-Region Capabilities

Hemal Jani is a Solutions Architect with Amazon Web Services (AWS) based out of Chicago, IL. His area of focus is Enterprise Migrations & Resilience. He has 20+ years of technology leadership experience and currently works with Travel & Hospitality customers.

Jenny Zhou is a Chicago-based Principal Enterprise Architect at United Airlines. She has 20+ years of experience in the airline industry and 10+ years leading enterprise architecture initiatives. She specializes in application architecture, cloud migration and resilience, and enterprise governance.

