How To Make Sure It Doesn't Go Wrong, And What To Do When It Does Anyway - Friday Blog, Week 66

Learn how to make sure you are doing everything you can to avoid failure, while planning for the worst to happen anyway!

By: Jeremy Pedersen

A lot of what I'm about to say is industry-standard advice and has been for years.

If you read this and find yourself saying "well, duh!", then great! If some of this is new to you...take it to heart! It might save you some headaches one day.

I didn't know all these things when I started out in IT, and some of the things I'll discuss below are relatively new developments...it took work to make them a part of my "mental model" of IT risk factors.

That's enough background: let's jump in!

1. Don't Just Make Backups: Test Backups

Everybody knows to make backups (although not everybody follows this sage advice).

What many don't do is test their backups! How do you know your backups are working if you never check them? Answer: you don't.

In the cloud, there's no excuse for not testing your backups thoroughly: tools like Terraform make it easy to spin up a replica of your entire production environment, if you want to. Take advantage of that.

For every backup system you put in place, make sure it really, really works. Pretend your environment really has failed, or even better, break it on purpose, like Netflix does with Chaos Monkey.
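One way to check that a restore "really, really works" is to compare what came back against what went in. Here's a minimal sketch, assuming a file-level backup that you have restored into a scratch directory. The function names (`file_digest`, `tree_digests`, `verify_restore`) are made up for this example, and a database backup would need a different check, such as row counts or test queries.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(source: Path, restored: Path) -> list[str]:
    """Return a sorted list of problems; an empty list means the trees match."""
    src, dst = tree_digests(source), tree_digests(restored)
    problems = [f"missing: {name}" for name in src.keys() - dst.keys()]
    problems += [f"unexpected: {name}" for name in dst.keys() - src.keys()]
    problems += [
        f"changed: {name}"
        for name in src.keys() & dst.keys()
        if src[name] != dst[name]
    ]
    return sorted(problems)
```

Run something like this on a schedule against a fresh restore, and treat a non-empty result the same way you would treat a failed backup job.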

2. Consider Your Weak Points

You've got backups of your backups. You've got a high-availability load balancer spanning two Alibaba Cloud Availability Zones. Your database is on an Enterprise Edition RDS instance, spanning 3 different Availability Zones.

You've got copies of your RDS instance data and ECS disk images backed up to a "failover" Alibaba Cloud Region, just in case your chosen Region fails.

Then it happens: your Region has a major outage. You set things up in your backup region from your fresh backups, and are just getting ready to edit your DNS records when you realize: you can't! Turns out Alibaba Cloud DNS had a critical dependency on the failed region.

Your backup systems are up and running and...your users can't find them. You can't point yoursite.com at your infrastructure. Whoops!

What's the lesson here? Know what your weak points are, and plan around them, if possible.

3. Keys Are King, But Kings Must Die

Do. Not. Use. Passwords.

Really! I mean it. If you must use a password (we live in an imperfect world), make sure you pair it with some form of secondary authentication, like a time-based one-time password (TOTP) or SMS verification code.

On public cloud, you can avoid passwords almost entirely if you're careful. We now live in a world where it is entirely possible to do most - if not all - administration through limited-time authentication tokens.

Alibaba Cloud, Amazon AWS, GCP, and Azure all have systems in place that make it possible to issue short-term access tokens allowing limited access to cloud accounts. AWS has IAM roles, Alibaba Cloud has RAM roles, and Azure has Azure RBAC.

Of course, there's a dark side to all of this. Using "Infrastructure as Code (IaC)" tools like Terraform and integration tools like Travis CI creates lots of opportunities for you to leak your keys, as evidenced by this major Travis CI security issue. Yikes!

Keys and tokens are great, but remember to rotate them early, rotate them often, and limit their permissions.
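"Rotate them often" is easier to stick to if something nags you about it. Below is a minimal sketch of an age check. The `AccessKey` record, the `keys_due_for_rotation` helper, and the 90-day default are all assumptions made for illustration; in practice you would fill in the creation dates from whatever key listing your cloud provider gives you.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class AccessKey:
    key_id: str           # hypothetical identifier for the key
    created_at: datetime  # timezone-aware creation time


def keys_due_for_rotation(keys, max_age_days=90, now=None):
    """Return the IDs of keys created max_age_days or more before `now`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [k.key_id for k in keys if k.created_at <= cutoff]


if __name__ == "__main__":
    inventory = [
        AccessKey("ci-deploy", datetime(2020, 1, 1, tzinfo=timezone.utc)),
        AccessKey("fresh-key", datetime.now(timezone.utc)),
    ]
    for key_id in keys_due_for_rotation(inventory):
        print(f"rotate me: {key_id}")
```

Passing `now` explicitly keeps the check easy to test; leave it out and the function uses the current UTC time.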

4. Wax On, Wax Off

(Almost) every operation you do should be repeatable and reversible.

Planning to make a change to your database? What is your rollback plan, if the change fails? Know how to do, and also how to undo.

Altering your infrastructure? What's your recovery plan if terraform apply fails?

Updating an app on a Kubernetes cluster? What's your rollback plan?

You get the idea. For every action, know how to back out!
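The "know how to do, and also how to undo" idea can be sketched in a few lines: pair every step with its undo, and if a step fails, undo the completed ones in reverse order. This is only an illustration of the pattern. `run_with_rollback` and the step names are hypothetical, and real undo actions (restoring a snapshot, redeploying the previous image) can fail too, so they need their own testing.

```python
from typing import Callable

# Each step is (name, do, undo); both callables take no arguments.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_with_rollback(steps: list[Step]) -> bool:
    """Run steps in order. On failure, undo completed steps in reverse.

    Returns True if every step succeeded, False if a rollback happened.
    """
    done = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            print(f"step {name!r} failed: {exc}; rolling back")
            for _prev_name, prev_undo in reversed(done):
                prev_undo()
            return False
        done.append((name, undo))
    return True


if __name__ == "__main__":
    def failing_migration():
        raise RuntimeError("migration failed")

    ok = run_with_rollback([
        ("snapshot db", lambda: print("snapshot taken"),
         lambda: print("snapshot discarded")),
        ("migrate schema", failing_migration,
         lambda: print("schema restored")),
    ])
    print("success" if ok else "rolled back")
```

Note that the failed step's own undo is not called here, on the assumption that a failed step left nothing behind. If your steps can fail halfway, each undo has to cope with partial state.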

Tip: this is great life advice, too. Lunch with friends on Saturday? Have your excuse ready in case your plans change!

5. Collect All The Logs

Log Everything.

Running a Kubernetes cluster on ACK? Collect cluster metrics with CloudMonitor, install ARMS Prometheus, and ship container logs to Log Service with Logtail. Do it all!

Logs are usually easier to collect than you think, and cheaper to store than you imagine. Plus, you never know when a log will hold the key to solving a recurring issue, tracing an attack, or understanding the actions that led up to a failure or outage.
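Collecting logs is only half the job; they also need to be easy to search later. One common approach (an assumption here, not something any particular log service requires) is to have your app write one JSON object per line to standard output and let the log collector pick it up. A minimal sketch using only Python's standard `logging` module, with a made-up `JsonFormatter` class:

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback text when logging an exception.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("checkout")  # hypothetical service name
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("order %s accepted", "A-1001")
```

Structured lines like these can be filtered by field instead of by regular expression, which helps a lot when you are tracing an attack or an outage under time pressure.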

6. Playbooks Still Count

Automation is great when things are working well, but can fail you when things go wrong.

As the aviation industry has discovered, autopilot can dull your flying skills. Not deep enough for you? Take a look at this research report from mitre.org.

The same principle applies in IT. Used to having your databases configured for you, and your CI/CD pipeline handling all your packaging and deployments? What will happen when these things fail to work as expected?

Your chances will depend (in part) on your ability to keep a cool head during hair-on-fire emergencies, as well as your team culture, but you can still prepare.

Once again, we can take a page out of the airline playbook here. There's a surprisingly good post on aviation safety on "thepointsguy.com", which is primarily a site focused on saving money on travel.

A quick read-through will tell you right away that in an emergency, pilots have the following priorities:

  1. Fly the plane
  2. Decide who does what
  3. Run the checklists

Pilot checklists are used to troubleshoot problems in a methodical way to make sure the best possible decisions are made.

We can borrow this idea: make sure you have playbooks for the most common emergency scenarios you could encounter. These troubleshooting guides can easily keep you from making a bad situation worse, meaning better outcomes for your users and better uptime statistics for you.

7. Do The Boring Stuff

This concept never gets enough love!

Run updates and patches. Rotate keys. Clean up old accounts. Change passwords. Did I mention updates and patching? Oh, and know when your SSL certificates and domain names will expire. It's easy to miss the easy stuff!
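Certificate expiry is one piece of the boring stuff that is easy to automate. Here's a minimal sketch using Python's standard `ssl` and `socket` modules. The helper names (`days_until`, `fetch_not_after`) and the 30-day warning threshold are assumptions for this example, and `yoursite.com` is the same placeholder domain used earlier in this post.

```python
import socket
import ssl
import time
from typing import Optional


def days_until(not_after: str, now: Optional[float] = None) -> float:
    """Days from `now` (epoch seconds) until a certificate's notAfter time.

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS and return the server certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    host = "yoursite.com"  # placeholder; replace with a domain you own
    remaining = days_until(fetch_not_after(host))
    if remaining < 30:  # assumed warning threshold
        print(f"WARNING: certificate for {host} expires in {remaining:.0f} days")
    else:
        print(f"{host}: {remaining:.0f} days left")
```

Because the default context verifies the certificate, an already-expired certificate makes the connection fail with an `ssl.SSLCertVerificationError` instead of returning a negative number, so treat that error as an alert too.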

But Wait, There's More?

I could go on, but I think this covers the really important stuff, so I'll see you all next week! Have a great weekend!
