
DC 5 levels below ground

Five Floors Underground: The Wake-Up Call for Cloud Resilience

 

A few years ago, I was auditing one of the largest financial institutions in the Middle East for their cyber resiliency. During the walkthrough, I discovered their primary data center was five floors underground. Five!

Naturally curious, I asked their CISO whether there was a specific reason for going this deep. Although the facility was a legacy he had inherited, his answer was simple:

“Regional constraints.”

I didn’t fully get it then. I do now.

Last week, Iranian drone strikes took out AWS data centers in UAE and Bahrain. Banking apps went dark. Payment systems froze. Enterprise software across the region just… stopped.

The cloud, it turns out, has a very physical address. And that address can be hit.

That CISO and his predecessors weren’t being paranoid. They were being realistic. They understood something that a lot of us in infosec and cloud governance are only now waking up to: in certain parts of the world, your DR strategy isn’t just about ransomware and config drift. It’s also about missiles and drones.

This changes the conversation around cloud concentration risk entirely. When we evaluate third-party cloud providers, how many of us are factoring in geopolitical threat vectors against the physical infrastructure? How many risk registers account for kinetic attacks on a hyperscaler’s availability zone? Cloud providers’ lack of transparency about the distance between their Availability Zones isn’t helping either.

The Gulf has cheap energy, massive funding, and ambitious AI plans. But the same geography that makes it attractive also makes it a target. The $2 trillion in tech investment commitments from last year look a lot different today.

In light of the recent news about Iran targeting financial institutions in Israel and the surrounding region, keeping the DC in a secure physical location makes even more sense: https://www.reuters.com/world/middle-east/iran-will-target-us-israeli-economic-banking-interests-region-state-media-2026-03-11/

For those of us in GRC and infosec this is a wake-up call. Cyber Resilience isn’t just a checkbox on a compliance framework. Sometimes it means putting your data center five floors underground and not explaining why to auditors who ask too many questions.

That CISO’s predecessors knew the assignment.

#cyberresilience #DC #DR #BCP #Geopolitics #physicalsecurity

Will Regulators Mandate Multi-Region over Multi-AZ in the Cloud?

Multi AZ datacenters

There was a time when one could simply point to zones a, b, and c of a three-data-centre Multi-AZ (Availability Zone) configuration as a satisfactory replacement for traditional Data Center–Disaster Recovery (DC-DR) requirements, even in regulated sectors. However, as cloud providers face more frequent outages, it’s becoming harder to consider this setup a reliable solution.

Just recently, Google Cloud (GCP) experienced a significant outage lasting about 12 hours in its europe-west3 region in Frankfurt, Germany. The root cause was traced to a power and cooling failure that forced a portion of a zone offline, affecting a wide range of services from Compute to Storage.

Traditionally, disaster recovery setups have involved multiple data centers designated as Primary and Secondary (or DR centers) with data replication. Often, the DC and DR sites are in separate cities. In the companies where I have worked, they were at least 300 km apart. While effective, this approach is costly and challenging to maintain, with a lot of administrative overhead.
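A separation rule like the 300 km figure above can be checked mechanically. The sketch below is only an illustration: it uses the haversine great-circle formula, the function names are hypothetical, and the coordinates are approximate city centres chosen for the example, not the locations of any real facility.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, good enough for a policy check


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def meets_min_separation(primary, secondary, min_km=300.0):
    """True if the two (lat, lon) sites are at least min_km apart."""
    return haversine_km(*primary, *secondary) >= min_km


# Assumed, approximate city-centre coordinates (illustrative only).
MUMBAI = (19.08, 72.88)
PUNE = (18.52, 73.86)
HYDERABAD = (17.39, 78.49)

if __name__ == "__main__":
    for name, site in (("Pune", PUNE), ("Hyderabad", HYDERABAD)):
        km = haversine_km(*MUMBAI, *site)
        ok = meets_min_separation(MUMBAI, site)
        print(f"Mumbai -> {name}: {km:.0f} km, meets 300 km rule: {ok}")
```

Straight-line distance is a crude proxy: two sites can be far apart and still share a power grid, a flood plain, or a threat radius, so a check like this is a starting point for a risk review, not a substitute for one.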

This raises questions about how fintechs and other regulated sectors meet Business Continuity and Disaster Recovery (BCPDR) requirements in the cloud.

In cloud environments, redundancy can be implemented in several ways, with Multi-AZ setups being one of the simplest. Multi-AZ configurations typically involve three data centers within the same metropolitan area, usually spread 30-100 kilometers apart and identified as zones “a,” “b,” and “c.” Data is replicated in real time across these zones, allowing this setup to be marketed as an alternative to traditional DC/DR arrangements. However, Multi-AZ is not a default feature; it requires opting in and incurs extra costs.

Another approach is Multi-Region, where data is replicated across geographically distinct regions, often in different seismic zones. This setup helps mitigate the risk of a single region being impacted by events such as severe flooding, prolonged power outages, political unrest, and similar disruptions.
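The difference between the two approaches comes down to how many zones and regions hold a copy of the data. As a rough sketch (the `Replica` type and `redundancy_level` function are hypothetical names for this example, and the region and zone labels are illustrative), an inventory of replicas can be classified like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    region: str  # e.g. "europe-west3"
    zone: str    # e.g. "a", "b" or "c" within that region


def redundancy_level(replicas):
    """Classify a deployment as single-az, multi-az or multi-region."""
    if not replicas:
        raise ValueError("at least one replica is required")
    regions = {r.region for r in replicas}
    zones = {(r.region, r.zone) for r in replicas}
    if len(regions) > 1:
        return "multi-region"
    if len(zones) > 1:
        return "multi-az"
    return "single-az"


# Illustrative deployments (assumed labels, not a real inventory).
multi_az = [Replica("europe-west3", z) for z in ("a", "b", "c")]
multi_region = multi_az + [Replica("europe-west4", "a")]

if __name__ == "__main__":
    print(redundancy_level(multi_az))
    print(redundancy_level(multi_region))
```

A check along these lines could run against an asset inventory so that a workload tagged as critical is flagged whenever all of its replicas sit in one region.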

Interestingly, some financial institutions, especially banks, are not entirely sold on Multi-AZ setups and are pushing their technology partners to adopt multi-region architectures across separate seismic zones. While no regulatory body in India has mandated multi-region setups yet, it’s worth considering the distance between zones within each cloud provider:

  • Amazon AWS states that its Availability Zones are up to 60 miles (~100 kilometers) apart. Interestingly, AWS keeps the exact locations confidential, even from most employees. Source: https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/availability-zones.html#:~:text=Availability%20Zones%20in%20a%20Region,with%20single%2Ddigit%20millisecond%20latency
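  • Microsoft Azure does not state a distance between its AZs in the linked documentation. The separation it describes, a preferred minimum of about 300 miles (roughly 480 kilometers), applies to paired regions rather than to zones within a region, so it is not directly comparable with the AWS figure. Source: https://learn.microsoft.com/en-us/azure/reliability/cross-region-replication-azure#what-are-paired-regions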
  • Google Cloud (GCP) does not publicly disclose the distances between its AZs, which is surprising.

Given these developments, regulators may eventually require multi-region setups for regulated entities and fintechs using cloud services instead of relying solely on Multi-AZ. So far, Multi-AZ has offered a relatively straightforward solution to meet compliance and audit requirements for BCPDR. However, it might be time to reconsider RTO (Recovery Time Objective) and RPO (Recovery Point Objective) expectations in cloud environments.
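To make the RTO/RPO point concrete, here is a minimal sketch of how the outcome of a failover drill or a real outage could be compared against stated objectives. The class and function names are hypothetical, and the 4-hour RTO and 60-second RPO are assumed targets for illustration; only the roughly 12-hour outage duration comes from the incident discussed above.

```python
from dataclasses import dataclass


@dataclass
class Objectives:
    rpo_s: float  # maximum tolerable data loss, in seconds of writes
    rto_s: float  # maximum tolerable downtime, in seconds


@dataclass
class Outcome:
    replication_lag_s: float  # how far behind the surviving copy was
    downtime_s: float         # time from failure to service restored


def evaluate(outcome, objectives):
    """Return which objectives were met for a drill or real incident."""
    return {
        "rpo_met": outcome.replication_lag_s <= objectives.rpo_s,
        "rto_met": outcome.downtime_s <= objectives.rto_s,
    }


# Assumed targets: lose at most 60 s of data, recover within 4 hours.
targets = Objectives(rpo_s=60, rto_s=4 * 3600)

# A workload with no working failover path that simply waits out
# a 12-hour outage (the lag value here is an assumption).
waited_out = Outcome(replication_lag_s=5, downtime_s=12 * 3600)

if __name__ == "__main__":
    print(evaluate(waited_out, targets))
```

In this example the RPO is met because replication was nearly current, but the RTO is missed by a wide margin. That is the gap a Multi-AZ-only design leaves open when the failure is not contained to the zone the design assumed.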

Any thoughts on these developments?

  • Link to the RCA: GCP Outage Incident Report https://status.cloud.google.com/incidents/e3yQSE1ysCGjCVEn2q1h
  • Article on 12 hour outage: https://www.theregister.com/2024/10/25/google_cloud_frankfurt_outage/