There was a time when one could simply point to zones a, b, and c, the three data centres of a Multi-AZ (Availability Zone) configuration, as a satisfactory replacement for traditional Data Center–Disaster Recovery (DC-DR) requirements, even in regulated sectors. However, as cloud providers face more frequent outages, it is becoming harder to consider this setup a reliable solution.
Just recently, Google Cloud (GCP) experienced a significant outage lasting about 12 hours that affected the europe-west3 region in Frankfurt, Germany. The root cause was traced to a power and cooling failure, which forced a portion of a zone offline and affected a wide range of services, from Compute to Storage.
Traditionally, disaster recovery setups have involved multiple data centers designated as Primary and Secondary (or DR centers) with data replication between them. Often, the DC and DR sites are in separate cities. In the companies where I have worked, they were at least 300 km apart. While effective, this approach is costly and challenging to maintain, with a lot of administrative overhead.
This raises questions about how fintechs and other regulated sectors meet Business Continuity and Disaster Recovery (BCPDR) requirements in the cloud.
In cloud environments, redundancy can be implemented in several ways, with Multi-AZ setups being one of the simplest. Multi-AZ configurations typically involve three data centers within a city, usually spread 30-100 kilometers apart and identified as zones “a,” “b,” and “c.” Data is replicated in real time across these zones, allowing this setup to be marketed as an alternative to traditional DC/DR arrangements. However, Multi-AZ is not a default feature; it requires opting in and incurs extra costs.
Another approach is Multi-Region, where data is replicated across geographically distinct regions, often in different seismic zones. This setup helps mitigate the risk of a single region being impacted by events such as severe flooding, prolonged power outages, political unrest, and similar disruptions.
Interestingly, some financial institutions, especially banks, are not entirely sold on Multi-AZ setups and are pushing their technology partners to adopt multi-region architectures across separate seismic zones. While no regulatory body in India has mandated multi-region setups yet, it is worth considering the distance between zones within each cloud provider:
Amazon AWS states that its Availability Zones are up to 60 miles (~100 kilometers) apart. Interestingly, AWS keeps the exact locations confidential, even from most employees. Source: https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/availability-zones.html#:~:text=Availability%20Zones%20in%20a%20Region,with%20single%2Ddigit%20millisecond%20latency
Microsoft Azure mentions a minimum of ~400 kilometers, the greatest distance among the major cloud providers, although the linked page describes the separation between paired regions rather than between AZs within a region. Source: https://learn.microsoft.com/en-us/azure/reliability/cross-region-replication-azure#what-are-paired-regions
Google Cloud (GCP) does not publicly disclose the distances between its AZs, keeping this detail confidential. Surprising!
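Since the providers do not publish exact zone locations, the only separation you can actually check is the one between sites or regions whose locations you know. Here is a minimal sketch of that check in Python, using the haversine formula for great-circle distance. The city coordinates are approximate and purely illustrative, and the 300 km threshold is simply the figure from my own DC-DR experience mentioned earlier, not a regulatory number.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def meets_separation(site_a, site_b, minimum_km):
    """True if the two (lat, lon) sites are at least minimum_km apart."""
    return great_circle_km(*site_a, *site_b) >= minimum_km


# Illustrative, approximate city-centre coordinates (assumptions, not
# actual data centre locations).
mumbai = (19.0760, 72.8777)
hyderabad = (17.3850, 78.4867)

print(round(great_circle_km(*mumbai, *hyderabad)), "km")
print("Meets 300 km rule:", meets_separation(mumbai, hyderabad, 300))
```

The same function shows why a Multi-AZ setup with zones at most 100 kilometers apart would fail a 300 km separation rule no matter how the zones are laid out.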
Given these developments, regulators may eventually require multi-region setups for regulated entities and fintechs using cloud services instead of relying solely on Multi-AZ. So far, Multi-AZ has offered a relatively straightforward solution to meet compliance and audit requirements for BCPDR. However, it might be time to reconsider RTO (Recovery Time Objective) and RPO (Recovery Point Objective) expectations in cloud environments.
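To make the RTO and RPO point concrete, here is a small Python sketch that compares the result of an outage or DR drill against recovery targets. The class names, the 4 hour RTO, the 15 minute RPO and the cross-region drill numbers are all hypothetical values chosen for illustration; only the roughly 12 hour downtime comes from the incident described above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTargets:
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of lost data


@dataclass
class DrillResult:
    downtime: timedelta
    data_loss_window: timedelta


def evaluate(targets, result):
    """Report whether a drill or real outage stayed within the targets."""
    return {
        "rto_met": result.downtime <= targets.rto,
        "rpo_met": result.data_loss_window <= targets.rpo,
    }


# Hypothetical targets for a regulated workload.
targets = RecoveryTargets(rto=timedelta(hours=4), rpo=timedelta(minutes=15))

# Workload with no failover path outside the affected region: downtime
# equals the length of the outage (about 12 hours in the Frankfurt incident).
stuck_in_region = DrillResult(downtime=timedelta(hours=12),
                              data_loss_window=timedelta(0))

# Hypothetical cross-region failover with asynchronous replication.
cross_region = DrillResult(downtime=timedelta(minutes=45),
                           data_loss_window=timedelta(minutes=5))

print("No cross-region failover:", evaluate(targets, stuck_in_region))
print("Cross-region failover:", evaluate(targets, cross_region))
```

The trade-off it illustrates is the usual one: asynchronous cross-region replication may cost a few minutes of RPO, but it is what keeps the RTO from being dictated by the provider's recovery time.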
Any thoughts on these developments?
Link to the RCA: GCP Outage Incident Report https://status.cloud.google.com/incidents/e3yQSE1ysCGjCVEn2q1h
Article on 12 hour outage: https://www.theregister.com/2024/10/25/google_cloud_frankfurt_outage/
The above phrase has endured through the ages, conveying the notion that while challenges are an unavoidable part of life, our response to them can determine the extent of our distress.
In a similar vein, not too long ago, security breaches were infrequent, primarily driven by a quest for fame rather than financial gain. However, the landscape has shifted dramatically. Companies across the spectrum, from hot startups to Fortune 500 giants, from those meticulously adhering to ISO 27001 and PCI DSS standards to unregulated entities, spanning industries such as healthcare and fintech, find themselves vulnerable to cyber threats.
Given the inevitability of breaches, a fundamental question emerges: What should organizations prioritize? I posed this question to peers, friends, and numerous professionals within our industry, and a singular response echoed throughout:
“Resilience”
But what exactly is Cyber Resilience?
Cyber resilience denotes an organization’s capacity to anticipate, endure, recover from, and adapt to adverse circumstances, stresses, attacks, or compromises on systems reliant on or enabled by digital resources. In essence, it revolves around preparedness for the inevitable breach.
Can we quantify resilience?
The answer is yes, and various frameworks exist to assist in this journey. Several months ago, I had the privilege of conducting a Cyber Resiliency Assessment for a large financial institution in the Middle East. Instead of concentrating solely on detection and incident response capabilities, I looked for a framework that could guide the process. It was during this search that I encountered the Cyber Resiliency Review (CRR).
The CRR is derived from the CERT Resilience Management Model (CERT-RMM), a process improvement model developed by Carnegie Mellon University’s Software Engineering Institute for managing operational resilience. Although the CRR is meant to be an instructor-led or self-assessment exercise based on a series of questions and answers, the process itself generates thought-provoking discussion.
The principles and recommended practices within the CRR align closely with the Cybersecurity Framework (CSF) developed by the National Institute of Standards and Technology (NIST). After performing a CRR, you can compare the results to the criteria of the NIST CSF to identify gaps and, where appropriate, recommended improvement efforts.
The CRR is based on the premise that an organization deploys its assets (people, information, technology, and facilities) to support specific critical services or products. Based on this principle, the CRR evaluates the maturity of your organisation’s capacities and capabilities in performing, planning, managing, measuring and defining cybersecurity capabilities across 10 domains.
The CRR Domains:
Asset Management: Asset management is critical for cyber resilience because organizations need to understand what assets they have and where they are located. This information is necessary for effective risk management, vulnerability management, and incident response.
Controls Management: Controls management involves the implementation, monitoring, and maintenance of security controls that protect an organization’s assets. Effective controls management can prevent, detect, and mitigate the impact of cyberattacks.
Configuration and Change Management: Configuration and change management are important for ensuring that systems and applications are configured and updated securely. Changes to system configurations and applications can introduce new vulnerabilities, so effective configuration and change management is necessary to maintain cyber resilience.
Vulnerability Management: Vulnerability management involves identifying and prioritizing vulnerabilities in an organization’s systems and applications. By addressing vulnerabilities, organizations can reduce the risk of cyberattacks and minimize the impact of any successful attacks.
Incident Management: Incident management is critical for responding to cyberattacks and minimizing their impact. Effective incident management includes incident detection, response, containment, and recovery.
Service Continuity Management: Service continuity management involves planning for and responding to disruptions to an organization’s services. By planning for disruptions and developing contingency plans, organizations can maintain critical services during and after a cyberattack.
Risk Management: Risk management involves identifying, assessing, and prioritizing risks to an organization’s assets. Effective risk management can help organizations understand the likelihood and potential impact of cyberattacks and prioritize their resources accordingly.
External Dependencies Management: The purpose of External Dependencies Management is to establish processes that maintain an appropriate level of controls, so that services and assets which depend on the actions of external entities are sustained and protected.
Training and Awareness: The purpose of Training and Awareness is to develop skills and promote awareness for people with roles that support the critical service.
Situational Awareness: Situational Awareness involves monitoring the cyber threat landscape and understanding the potential impact of emerging threats. By maintaining situational awareness, organizations can proactively respond to emerging threats and maintain their cyber resilience.
Methodology
Although the CRR is meant to be an instructor-led or self-assessment exercise based on a series of questions and answers, you can use it as a reference and conduct your own assessment. You need not use it as is; you can take only the high-level methodology and customise it to your needs. Having said that, let's move on.
There are 10 domains, and each domain has its own set of goals. Each domain is composed of a purpose statement, a set of specific goals and associated practice questions unique to the domain, and a standard set of Maturity Indicator Level (MIL) questions.
The MIL questions examine the institutionalisation of practices within an organisation. The Maturity Indicator Levels (MILs) are scored from 0 to 5 and are classified, in that order, as Incomplete, Performed, Planned, Managed, Measured, and Defined.
As shown in picture below, the number of goals and practice questions varies by domain, but the set of MIL questions and the concepts they encompass are the same for all domains. All CRR questions have three possible responses: “Yes,” “No,” and “Incomplete.”
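To show how the answers roll up into a level, here is a minimal Python sketch of MIL scoring for a single domain. It assumes the levels are cumulative, meaning a domain reaches a level only when every question at that level and at all lower levels is answered "Yes". That rule, the function name and the sample answers are my own simplification for illustration; the official CRR report applies its own scoring logic.

```python
MIL_NAMES = ["Incomplete", "Performed", "Planned",
             "Managed", "Measured", "Defined"]


def domain_mil(answers_by_level):
    """Return the highest MIL (0-5) reached by one domain.

    answers_by_level maps a level (1-5) to the list of responses
    ("Yes", "No" or "Incomplete") for the questions at that level.
    Assumption: levels are cumulative, so the first level with a missing
    or non-"Yes" answer stops the climb.
    """
    achieved = 0
    for level in range(1, 6):
        answers = answers_by_level.get(level, [])
        if not answers or any(answer != "Yes" for answer in answers):
            break
        achieved = level
    return achieved


# Hypothetical responses for one domain, e.g. Asset Management.
asset_management = {
    1: ["Yes", "Yes", "Yes"],
    2: ["Yes", "Yes"],
    3: ["Yes", "Incomplete"],
    4: ["Yes"],
    5: ["No"],
}

level = domain_mil(asset_management)
print(f"MIL{level}: {MIL_NAMES[level]}")
```

In this example the "Yes" at level 4 does not count, because level 3 is not fully satisfied. That is the point of measuring institutionalisation rather than isolated good practices.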
All the questions and answers are in a Portable Document Format (PDF) file. After filling in the answers, you can generate a report with the results, which can also be mapped to the NIST CSF. Note: this requires Adobe Acrobat Reader; the form will not render in Preview on macOS.
You can use this PDF as is, or use it to understand the domains better and then add a more hands-on review of the existing architectures and practices, producing a more comprehensive offline report.
Key Takeaways
The Cyber Resiliency Review (CRR) offers valuable insight into an organization’s cybersecurity stance. The assessment raises awareness across the organization of the need for effective cybersecurity management. It evaluates the capabilities essential for sustaining critical services during operational stress and emergencies. Additionally, it validates what management has already achieved and stimulates constructive discussion among participants from different functional areas of the organization.
Furthermore, the CRR delivers a comprehensive final report, charting the relative maturity of resilience processes across the ten domains. It also presents potential improvement options for consideration, drawing upon established standards, best practices, and references to the Computer Emergency Response Team – Resilience Management Model (CERT-RMM).
In summary, while breaches remain an inevitable aspect of the digital landscape, the degree of suffering they inflict is a matter of choice. By focusing on cyber resilience, organizations can fortify themselves to emerge stronger in the face of adversity.
How are you assessing resiliency? Feel free to comment and share your thoughts and feedback.