Friday Q&A - Week 2 - Could Alibaba Cloud Survive A Nuclear War?

Why is disaster-tolerant architecture so important? Let's take a look, using a nuclear attack as our hypothetical scenario!

By Jeremy Pedersen

Welcome back for week 2 of our Alibaba Cloud Q&A blog series.

This week, I'm going to address just one question, because it's a big one!

Q: Could Alibaba Cloud survive a nuclear war?

I'm paraphrasing a bit here, but I really did have a customer ask me if Alibaba Cloud was "nuclear proof".

Let's roll up our sleeves and tackle this one step-by-step.

What do we mean by nuclear war?

Let's start by ruling out a global nuclear conflict between the major world powers, because that is a literal armageddon from which no living thing is safe. Friends, the world is ending: do you really care if your servers are still up?

Instead, let's use an imaginary terrorist attack as our scenario. Somehow, a terrorist group has built a nuclear weapon. One of two things could happen:

  1. A successful nuclear detonation
  2. A nuclear "fizzle"

In the first scenario, we will assume a terrorist organization has successfully detonated a real, functional nuclear weapon. In the second, we will assume either that the detonation was only partially successful (a "fizzle", in nuclear bomb parlance) or that the weapon was a "dirty bomb": a conventional bomb intended to spread radioactive material over a wide area. A dirty bomb forces an evacuation and a cleanup of the resulting contamination, but it does not produce a nuclear explosion.

What do we mean by survive?

As with any task in systems architecture or design, we need to decide what it means to succeed, and what metrics we should use to measure success.

Let's look at two metrics:

  • Survival of systems (i.e. systems continue to operate and serve users)
  • Survival of data (user data is preserved)

For each of these metrics, we will define 3 failure states:

  • Total (all data is lost, and/or our system is in an unrecoverable state)
  • Partial (some data is lost, and/or our system suffers some downtime)
  • None (no data is lost, and/or our system suffers no downtime)
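These two metrics and three failure states can be written down as a small data model. What follows is a minimal Python sketch; the names `Severity` and `Outcome` are made up for illustration and are not part of any Alibaba Cloud SDK.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """The three failure states, ordered from best (NONE) to worst (TOTAL)."""
    NONE = 0
    PARTIAL = 1
    TOTAL = 2


@dataclass(frozen=True)
class Outcome:
    """How one architecture fares in one disaster (hypothetical type)."""
    data_loss: Severity  # survival of data
    downtime: Severity   # survival of systems

    def worst(self) -> Severity:
        # Assumption: an architecture is judged by its weaker metric.
        return max(self.data_loss, self.downtime)


outcome = Outcome(data_loss=Severity.NONE, downtime=Severity.PARTIAL)
print(outcome.worst().name)  # PARTIAL
```

Using an ordered enum makes it easy to compare outcomes later, for example to ask whether an architecture stays at or below an acceptable level of downtime.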

Great! Now let's design a couple of different system architectures and see how each one responds to being nuked!

Designing for the bomb

You are the lead systems architect for the Acme Widget Corporation (WidgeCo for short). You have been given the task of designing a web application. You are going to need one or more web servers (Alibaba Cloud ECS), a database (Alibaba Cloud RDS), object storage (Alibaba Cloud OSS), and load balancers (Alibaba Cloud SLB). You will also use Alibaba Cloud DNS to map your hostname to the public IP address of your load balancer.

Alibaba Cloud breaks its services down by Region and Zone. A Zone is a single Alibaba Cloud datacenter. All zones are physically separate from one another and use independent networking, power, and cooling equipment. A Region is a geographical area (like Kuala Lumpur or Beijing) that contains multiple Zones.

You are considering 3 potential architectures:

Option 1: Single-Zone

Your database, web server(s), object storage, and load balancer are all located in a single Alibaba Cloud Zone (datacenter). You are not using any of the Multi-Zone features built into SLB, RDS, or OSS.

[Diagram: Single-Zone architecture]

Survivability: This architecture is vulnerable to a single-Zone failure: loss of cooling, power, or networking would cause downtime but should not cause data loss. A major disaster like a fire or explosion could cause either partial or total data loss, depending on your backup policy.

Option 2: Multi-Zone

Your web servers are deployed across multiple Zones. You are using the built-in multi-Zone capabilities of Alibaba Cloud SLB and RDS. You have turned on ZRS (Zone Redundant Storage) for your OSS buckets, and you are making regular backups of your ECS Cloud Disks.

[Diagram: Multi-Zone architecture]

Survivability: This architecture will survive the complete loss of a single Zone with no data loss and no downtime (but potentially some performance degradation). This architecture is vulnerable to the loss of multiple Zones within a Region.

Option 3: Multi-Region

Your web servers are deployed across multiple Zones in multiple Alibaba Cloud Regions. You are using Alibaba Cloud DTS to synchronize data between RDS databases in different Regions. You are using DNS health checks and load balancing to distribute traffic between load balancers in multiple Alibaba Cloud Regions. You have turned on Cross Region Replication (CRR) for your OSS buckets, to automatically replicate data across Regions.

[Diagram: Multi-Region architecture]

Survivability: This architecture can survive the total loss of all Zones within a single Region. This architecture is vulnerable to the loss of multiple Regions.
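To make the three survivability statements concrete, here is a rough Python sketch that models each option as a set of (Region, Zone) pairs and checks whether anything is left after some Zones are destroyed. The `Deployment` class and the Region and Zone names are hypothetical. The model also assumes an application "survives" as long as at least one of its Zones is still standing, which ignores capacity, replication lag, and failover time.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Where an application runs (illustrative model only)."""
    name: str
    zones_by_region: dict  # Region name -> set of Zone names

    def all_zones(self) -> set:
        # Flatten to (region, zone) pairs so Zones in different
        # Regions never collide, even if they share a name.
        return {
            (region, zone)
            for region, zones in self.zones_by_region.items()
            for zone in zones
        }

    def survives(self, destroyed: set) -> bool:
        # Simplifying assumption: one surviving Zone is enough.
        return bool(self.all_zones() - destroyed)


single_zone = Deployment("single-zone", {"region-1": {"zone-a"}})
multi_zone = Deployment("multi-zone", {"region-1": {"zone-a", "zone-b"}})
multi_region = Deployment(
    "multi-region",
    {"region-1": {"zone-a", "zone-b"}, "region-2": {"zone-a", "zone-b"}},
)

one_zone_lost = {("region-1", "zone-a")}
whole_region_lost = {("region-1", "zone-a"), ("region-1", "zone-b")}

for d in (single_zone, multi_zone, multi_region):
    print(d.name, d.survives(one_zone_lost), d.survives(whole_region_lost))
# single-zone False False
# multi-zone True False
# multi-region True True
```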

OK, let's blow some stuff up!

Scenario 1: Dirty bomb or "fizzle"

A terrorist organization sets off a dirty bomb or a partially successful nuclear weapon. Damage from the blast is limited, but radioactive fallout forces an evacuation. An Alibaba Cloud Zone is damaged in the blast, or is without power, networking, or cooling as a result of the attack.

Single-Zone Architecture:
- Data Loss: Partial (some data may be lost depending on backup settings)
- Downtime: Partial (temporary: system is recoverable)

Multi-Zone Architecture:
- Data Loss: None (all data survives)
- Downtime: None (no downtime)

Multi-Region Architecture:
- Data Loss: None (all data survives)
- Downtime: None (no downtime)

Scenario 2: Small nuclear blast

A terrorist organization succeeds in detonating a low-yield nuclear weapon. An Alibaba Cloud Zone is near ground zero. That Zone is completely destroyed. Other Zones may be damaged or lose power, networking, or cooling.

Single-Zone Architecture:
- Data Loss: Total (everything stored in the Zone is destroyed; some data may be recoverable from backups kept elsewhere, depending on backup policy)
- Downtime: Total (unrecoverable)

Multi-Zone Architecture:
- Data Loss: None (all data survives)
- Downtime: Partial (some downtime is possible if the surviving Zones are also damaged or lose power, networking, or cooling)

Multi-Region Architecture:
- Data Loss: None (all data survives)
- Downtime: None (no downtime)

Scenario 3: Large nuclear blast

A terrorist organization succeeds in detonating a large nuclear weapon, completely devastating a city. All Alibaba Cloud Zones within the affected Region are completely destroyed.

Single-Zone Architecture:
- Data Loss: Total (unrecoverable)
- Downtime: Total (unrecoverable)

Multi-Zone Architecture:
- Data Loss: Total (unrecoverable)
- Downtime: Total (unrecoverable)

Multi-Region Architecture:
- Data Loss: None (all data survives)
- Downtime: None (no downtime)
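The nine results above fit neatly into a lookup table. The Python sketch below copies them into a dictionary and adds a hypothetical helper, `architectures_meeting`, that reports which architectures stay within a given tolerance for data loss and downtime. The scenario keys, architecture labels, and function name are all illustrative choices, not anything defined by Alibaba Cloud.

```python
from enum import IntEnum


class Severity(IntEnum):
    NONE = 0
    PARTIAL = 1
    TOTAL = 2


N, P, T = Severity.NONE, Severity.PARTIAL, Severity.TOTAL
ARCHITECTURES = ["single-zone", "multi-zone", "multi-region"]

# (data_loss, downtime) for each scenario and architecture,
# transcribed from the three scenario lists above.
OUTCOMES = {
    "dirty bomb or fizzle": {
        "single-zone": (P, P), "multi-zone": (N, N), "multi-region": (N, N),
    },
    "small nuclear blast": {
        "single-zone": (T, T), "multi-zone": (N, P), "multi-region": (N, N),
    },
    "large nuclear blast": {
        "single-zone": (T, T), "multi-zone": (T, T), "multi-region": (N, N),
    },
}


def architectures_meeting(max_data_loss, max_downtime, scenarios=None):
    """Return the architectures that stay within both limits in every scenario."""
    scenarios = list(OUTCOMES) if scenarios is None else scenarios
    return [
        arch
        for arch in ARCHITECTURES
        if all(
            OUTCOMES[s][arch][0] <= max_data_loss
            and OUTCOMES[s][arch][1] <= max_downtime
            for s in scenarios
        )
    ]


# No data loss, some downtime tolerated, ignoring the large blast:
print(architectures_meeting(N, P, ["dirty bomb or fizzle", "small nuclear blast"]))
# ['multi-zone', 'multi-region']

# No data loss and no downtime in any scenario:
print(architectures_meeting(N, N))
# ['multi-region']
```

Framing the question this way mirrors how a real design review tends to go: decide which disasters you must withstand and how much loss you can tolerate, then pick the simplest architecture that qualifies.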

What have we learned?

Although a nuclear attack is not a likely scenario, we can learn a lot about disaster planning from this hypothetical exercise.

First, it's important to plan ahead for both large- and small-scale disasters by building failover and redundancy into your application from the start.

Whenever possible, avoid single points of failure (SPoFs) by backing data up across multiple Zones, taking advantage of the multi-Zone availability features in Alibaba Cloud products, and even building across multiple Regions as time and budget permit.

This will protect you against the small and mundane (power outages, fires, human error) and the large and unpredictable (hurricanes, earthquakes, tsunamis).

I've Got A Question!

Great! Reach out to me at jierui.pjr@alibabacloud.com and I'll do my best to answer in a future Friday Q&A blog.

You can also follow the Alibaba Cloud Academy LinkedIn Page. We'll re-post the more interesting questions and answers there.
