How can we ensure the continuous availability of critical digital infrastructure?
Last week Optus in Australia suffered a massive network outage impacting more than 10 million customers for more than 8 hours, including disruptions to emergency services, trains and digital payments. In April 2022, Atlassian experienced a full product outage that impacted 775 customer organisations for up to 2 weeks. In December 2022, Southwest Airlines experienced a scheduling crisis due in large part to technical debt and a lack of investment.
These crises inevitably result in significant financial losses and multi-year investments to restore customer trust and brand reputation.
But resiliency is hard.
Resilience is the ability to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation.
We can consider digital infrastructure as the underlying hardware, networks, and software that support the functioning of digital services. That could include a broad range of telecom & network infrastructure, cloud computing, private data centres and applications that underpin operations and essential business functions.
The causes of digital infrastructure service disruptions are typically:
Human error underpins the majority of technology failures in one form or another. Uptime Institute research indicates that human error plays a role in more than two-thirds of all data centre outages. This is clearly where the biggest improvements in resiliency can be achieved.
Digital infrastructure resilience is a complex topic, but it comes down to five key steps.
1. Define service performance targets
What is an acceptable level of service? Without a performance target, it’s hard to know if what is being designed will meet expectations.
As a basis for defining service performance, Site Reliability Engineering provides a good framework to differentiate between:
A common Service Level Agreement (SLA) for all in-scope digital services would include:
The SLA may include other elements such as hours of support coverage, scheduled maintenance, critical and non-critical operations periods and backup schedules.
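To make "acceptable level of service" measurable, a target can be expressed as a success ratio and compared with what was actually observed. The following is a minimal sketch, assuming availability is measured as the share of successful requests over a reporting period. The `ServiceTarget` class, the function names and the example figures are hypothetical illustrations, not part of any standard or vendor SLA.

```python
from dataclasses import dataclass


@dataclass
class ServiceTarget:
    """Hypothetical availability target for one service."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def meets_target(target: ServiceTarget, total: int, failed: int) -> bool:
    """True if the measured success ratio is at or above the target."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - failed) / total >= target.target


def error_budget_remaining(target: ServiceTarget, total: int, failed: int) -> float:
    """Fraction of the allowed failures still unspent (negative if overspent)."""
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1.0 - target.target) * total
    if allowed_failures == 0:
        # A 100% target leaves no budget: any failure is a breach.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


if __name__ == "__main__":
    payments = ServiceTarget(name="payments-api", target=0.999)  # assumed example
    print(meets_target(payments, total=1_000_000, failed=400))
    print(round(error_budget_remaining(payments, total=1_000_000, failed=400), 3))
```

Tracking the remaining budget rather than a simple pass/fail gives teams an early signal that a service is drifting towards a breach before the target is actually missed.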
2. Design the digital infrastructure
Now that the service performance targets are clear, the design can proceed to meet expectations with a focus on:
The language of service availability and performance differs depending on the type of infrastructure, and vendors often have their own spin.
For data centre design, the Uptime Institute's Tier Classification System (Tier I to IV) and ANSI/TIA-942 (Rating 1 to 4) are the most widely recognised and accepted standards.
For public cloud computing, AWS's published uptime SLAs typically range from a target of 99.9% to 99.99% depending on the service. Interestingly, the Route 53 DNS service has a published SLA of 100%, as does Cloudflare. Similar uptime targets exist for Azure and Google Cloud.
With telecom & network design, high levels of design resilience are achieved through:
For software infrastructure design, resiliency is an extensive field, and Site Reliability Engineering provides a good basis for design and operations practices.
To show the power of redundancy, assume that you have an internet connection from a supplier that commits to 99% uptime monthly (potentially more than 7 hours of downtime per month). Add a second internet connection from a completely independent provider with the same commitment. If the two connections fail independently and either one can carry the traffic, both are down at the same time only 1% × 1% = 0.01% of the time. That translates to an uptime of 99.99% for internet connectivity, or about 4 minutes of downtime per month.
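The redundancy arithmetic can be reproduced with a short sketch. It assumes a 30-day month, that the connections fail independently of each other, and that any single working connection is enough to keep the service up. The function names are illustrative only.

```python
import math

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def parallel_availability(availabilities):
    """Availability of redundant components when any one of them suffices.

    Assumes failures are independent, so the chance that all components are
    down together is the product of their individual unavailabilities.
    """
    all_down = math.prod(1.0 - a for a in availabilities)
    return 1.0 - all_down


def downtime_minutes(availability, period_minutes=MINUTES_PER_MONTH):
    """Expected downtime in minutes over the period for a given availability."""
    return (1.0 - availability) * period_minutes


if __name__ == "__main__":
    single = 0.99
    dual = parallel_availability([single, single])
    print(f"single link: {downtime_minutes(single) / 60:.1f} hours per month")
    print(f"dual links:  {downtime_minutes(dual):.2f} minutes per month")
```

Under these assumptions the single link allows 7.2 hours of downtime per month and the redundant pair about 4.32 minutes. In practice, shared dependencies such as a common cable duct, power feed or upstream carrier break the independence assumption and reduce the benefit.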
3. Resilient operations design
Given that most service disruptions are due to human error, operations design focuses primarily on the prevention, detection and resolution of incidents.
Service management processes provide a framework to guide teams on “how” things should be done such as the onboarding of new services, and the management of incidents, changes, requests and problems. Team procedures or checklists describe the support steps in detail.
Designing the support model provides clarity on who is responsible for what, so that the right skills are in place to monitor services and provide support. This may include a physical operations centre, on-call staff and remote support teams operating as part of an integrated support model.
A competent, well-trained, and tightly integrated support team is a critical success factor for resilient operations.
Communications design involves the channels and methods for communication between support teams, management and end users. For critical outages, a tested crisis communications process will help to minimise reputation damage. There is nothing worse for customer trust and brand reputation than no communication at all.
In addition to monitoring and alerting, operations tools are needed for team collaboration, communication, service management and knowledge management.
Training is needed on the operations processes, procedures, support model, communications and tools.
4. Test the resiliency of the design
Assuming that functional tests have been performed, resilience testing focuses on the following for each service, although the terminology may differ by type of digital infrastructure and vendor:
Testing mission-critical systems requires test environments that are as close as possible to the production environment. This is difficult to achieve but essential if the goal is to prevent failures in live operations.
5. Test the resiliency of operations
The operations design can be tested in several ways:
Tabletop scenarios - where knowledge of operating processes and procedures is tested in a group setting. By reviewing normal and service disruption scenarios jointly the team can learn, challenge, question and finally come to a shared understanding of how to deal with technology issues.
Readiness rehearsals - to test that the people and processes are in place to handle operational situations such as incidents, changes and disaster recovery. A rehearsal simulates the real operations with team members taking their assigned operational roles and dealing with both normal and abnormal situations that may occur. Readiness rehearsals achieve the following aims:
Chaos Engineering, aka failure injection testing, simulates stress by systematically disrupting different elements of the infrastructure, including:
Chaos engineering can verify that the design is robust and can gracefully handle faults. It also helps to detect and resolve issues early, before they impact users, which makes it easier to meet Service Level Agreements (SLAs).
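The idea of failure injection can be illustrated with a toy experiment. This is a minimal sketch, not a production chaos tool: a wrapper makes a dependency fail at a chosen rate, and the experiment counts how often a simple retry-then-fallback design still serves the request. All names (`with_faults`, `call_with_fallback`, `run_experiment`) and the retry count are hypothetical.

```python
import random


class InjectedFault(Exception):
    """Raised by the injector to simulate a dependency failure."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return func(*args, **kwargs)
    return wrapper


def call_with_fallback(primary, fallback, retries=2):
    """Try primary up to retries + 1 times, then use the fallback path."""
    for _ in range(retries + 1):
        try:
            return primary()
        except InjectedFault:
            # A real system would catch its own error types here.
            continue
    return fallback()


def run_experiment(failure_rate, calls=1000, seed=42):
    """Inject faults at failure_rate and report how requests were served."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    primary = with_faults(lambda: "primary", failure_rate, rng)
    results = [call_with_fallback(primary, lambda: "fallback") for _ in range(calls)]
    return {
        "served": len(results),
        "from_fallback": results.count("fallback"),
    }


if __name__ == "__main__":
    for rate in (0.0, 0.3, 1.0):
        print(rate, run_experiment(rate))
```

Every request is still served even when the primary dependency fails on every call, which is the kind of graceful degradation such an experiment is meant to confirm. Real chaos experiments apply the same principle to live infrastructure elements, with a controlled blast radius.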
With clear performance targets and a well-designed and tested infrastructure and operations, high levels of reliability and resilience can be achieved, even when unexpected disruptions occur.
To maintain resilience, organisations should be paranoid that the next critical outage may be just around the corner. Consider again that most outages are caused by human error. In an increasingly software-defined world, this becomes visible through code and configuration changes. So operations teams need to become competent at deploying changes successfully at speed, and if needed, quickly detecting and resolving incidents.