Will Regulators Mandate Multi-Region over Multi-AZ in the Cloud?

There was a time when one could simply point to zones a, b, and c of a Multi-AZ (Availability Zone) configuration as a satisfactory replacement for traditional Data Center–Disaster Recovery (DC-DR) requirements, even in regulated sectors. However, as cloud providers face more frequent outages, it’s becoming harder to consider this setup a reliable solution.

Just recently, Google Cloud (GCP) experienced a significant outage lasting about 12 hours, which disrupted services in the europe-west3 region in Frankfurt, Germany. The root cause was traced to a power and cooling failure that forced a portion of a zone offline, affecting a wide range of services from Compute to Storage.

Traditionally, disaster recovery setups have involved multiple data centers designated as Primary and Secondary (or DR centers) with data replication between them. Often, the DC and DR sites are in separate cities. In the companies where I have worked, they were at least 300 km apart. While effective, this approach is costly and challenging to maintain, with a lot of administrative overhead.

This raises questions about how fintechs and other regulated sectors meet Business Continuity and Disaster Recovery (BCPDR) requirements in the cloud.

In cloud environments, redundancy can be implemented in several ways, with Multi-AZ setups being one of the simplest. Multi-AZ configurations typically involve three data centers within a city, usually spread 30-100 kilometers apart and identified as zones “a,” “b,” and “c.” Data is replicated in real-time across these zones, allowing this setup to be marketed as an alternative to traditional DC/DR arrangements. However, Multi-AZ is not a default feature; it requires opting in and incurs extra costs.

Another approach is Multi-Region, where data is replicated across geographically distinct regions, often in different seismic zones. This setup helps mitigate the risk of a single region being impacted by events such as severe flooding, prolonged power outages, political unrest, and similar disruptions.

Interestingly, some financial institutions, especially banks, are not entirely sold on Multi-AZ setups and are pushing their technology partners to adopt multi-region architectures across separate seismic zones. While no regulatory body in India has mandated multi-region setups yet, it’s worth considering the distance between zones within each cloud provider:

  • Amazon AWS states that its Availability Zones are up to 60 miles (~100 kilometers) apart. Interestingly, AWS keeps the exact locations confidential, even from most employees. Source: https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/availability-zones.html#:~:text=Availability%20Zones%20in%20a%20Region,with%20single%2Ddigit%20millisecond%20latency
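  • Microsoft Azure cites a preferred separation of at least 300 miles (~480 kilometers), the greatest distance quoted by the major cloud providers. Note, however, that the linked documentation describes paired regions rather than AZs within a single region, so the figure is not directly comparable to the AWS one. Source: https://learn.microsoft.com/en-us/azure/reliability/cross-region-replication-azure#what-are-paired-regions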
  • Google Cloud (GCP) does not publicly disclose the distances between its AZs. Surprising!
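
To make the distance comparison concrete, here is a minimal Python sketch that computes the great-circle distance between two sites and checks it against a minimum separation. Everything in it is an assumption for illustration: the function names are hypothetical, the coordinates are approximate city centres rather than any provider's data center locations, and the 300 km threshold is simply the figure from my own DC-DR experience mentioned earlier, not a regulatory value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def meets_min_separation(site_a, site_b, min_km=300.0):
    """True if the two (lat, lon) sites are at least min_km apart."""
    return haversine_km(*site_a, *site_b) >= min_km


# Approximate city-centre coordinates, illustrative only.
MUMBAI = (19.0760, 72.8777)
PUNE = (18.5204, 73.8567)
HYDERABAD = (17.3850, 78.4867)

if __name__ == "__main__":
    for name, site in (("Pune", PUNE), ("Hyderabad", HYDERABAD)):
        km = haversine_km(*MUMBAI, *site)
        ok = meets_min_separation(MUMBAI, site)
        print(f"Mumbai to {name}: {km:.0f} km, meets 300 km rule: {ok}")
```

A pair of sites within the roughly 100 km span quoted for AWS AZs would fail such a 300 km rule, which is the gap that multi-region setups are meant to close.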

Given these developments, regulators may eventually require multi-region setups for regulated entities and fintechs using cloud services instead of relying solely on Multi-AZ. So far, Multi-AZ has offered a relatively straightforward solution to meet compliance and audit requirements for BCPDR. However, it might be time to reconsider RTO (Recovery Time Objective) and RPO (Recovery Point Objective) expectations in cloud environments.
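
As a rough illustration of what reconsidering RTO and RPO could look like, the sketch below compares the outcome of an outage or DR drill against stated objectives. The class names, the 5-minute RPO, the 4-hour RTO, and the failover timings are all hypothetical; only the roughly 12-hour outage duration comes from the incident discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo_seconds: float  # maximum tolerable data loss, expressed as time
    rto_seconds: float  # maximum tolerable time to restore service


@dataclass(frozen=True)
class OutageResult:
    data_loss_seconds: float  # replication lag at the moment of failure
    recovery_seconds: float   # time until service was restored


def evaluate(objectives, result):
    """Report whether each objective was met for a given outage or drill."""
    return {
        "rpo_met": result.data_loss_seconds <= objectives.rpo_seconds,
        "rto_met": result.recovery_seconds <= objectives.rto_seconds,
    }


# Hypothetical targets: 5-minute RPO, 4-hour RTO.
targets = RecoveryObjectives(rpo_seconds=5 * 60, rto_seconds=4 * 3600)

# Scenario 1: workload waits out a 12-hour outage with no failover.
waited_out = OutageResult(data_loss_seconds=0, recovery_seconds=12 * 3600)

# Scenario 2: assumed cross-region failover in 30 minutes with 60 s of async lag.
failed_over = OutageResult(data_loss_seconds=60, recovery_seconds=30 * 60)

if __name__ == "__main__":
    print("waited out: ", evaluate(targets, waited_out))
    print("failed over:", evaluate(targets, failed_over))
```

Under these assumed numbers, a workload that simply waits for the affected zone to return misses its RTO, while one that can fail over to another region trades a small amount of data loss for a much shorter recovery.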

Any thoughts on these developments?

  • Link to the RCA: GCP Outage Incident Report https://status.cloud.google.com/incidents/e3yQSE1ysCGjCVEn2q1h
  • Article on the 12-hour outage: https://www.theregister.com/2024/10/25/google_cloud_frankfurt_outage/