By: Jeremy Pedersen
A lot of what I'm about to say is industry-standard advice and has been for years.
If you read this and find yourself saying "well, duh!", then great! If some of this is new to you...take it to heart! It might save you some headaches one day.
I didn't know all these things when I started out in IT, and some of the things I'll discuss below are relatively new developments...it took work to make them a part of my "mental model" of IT risk factors.
That's enough background: let's jump in!
Everybody knows to make backups (although not everybody follows this sage advice).
What many don't do is test their backups! How do you know your backups are working if you never check them? Answer: you don't.
In the cloud, there's no excuse for not testing your backups thoroughly: tools like Terraform make it easy to spin up an exact replica of your entire production environment, if you want to. Take advantage of that.
For every backup system you put in place, make sure it really, really works. Pretend your environment really has failed, or even better, break it on purpose, like Netflix does with "Chaos Monkey".
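Here is one rough way to check a restore, sketched in Python. It assumes you have restored a backup into a scratch directory and still have the original files to compare against; the function names and the directory layout are made up for illustration, and a database backup would need a different kind of check (row counts, test queries, and so on).

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return a list of files that are missing or differ after a restore."""
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(restored_file):
            problems.append(f"mismatch: {relative.as_posix()}")
    return problems
```

An empty list means every source file came back byte-for-byte identical. Anything else is a backup you only thought you had.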
You've got backups of your backups. You've got a high-availability load balancer spanning two Alibaba Cloud Availability Zones. Your database is on an Enterprise Edition RDS instance, spanning three different Availability Zones.
You've got copies of your RDS instance data and ECS disk images backed up to a "failover" Alibaba Cloud Region, just in case your chosen Region fails.
Then it happens: your Region has a major outage. You set things up in your backup region from your fresh backups, and are just getting ready to edit your DNS records when you realize: you can't! Turns out Alibaba Cloud DNS had a critical dependency on the failed region.
Your backup systems are up and running and...your users can't find them. You can't point yoursite.com at your infrastructure. Whoops!
What's the lesson here? Know what your weak points are, and plan around them, if possible.
Do. Not. Use. Passwords.
Really! I mean it. If you must use a password (we live in an imperfect world), make sure you pair it with some form of secondary authentication, like a time-based one-time password (TOTP) or SMS verification code.
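If you've never looked at how TOTP codes are produced, the algorithm is short. The sketch below follows RFC 6238 with HMAC-SHA1, a 30-second step, and 6 digits by default. The function names are my own, and in production you would normally lean on your identity provider or a maintained library rather than rolling this yourself.

```python
import hashlib
import hmac
import struct
import time


def totp(secret: bytes, at_time=None, step: int = 30, digits: int = 6) -> str:
    """Compute a time-based one-time password (RFC 6238, HMAC-SHA1)."""
    if at_time is None:
        at_time = time.time()
    # The moving factor is the number of whole time steps since the Unix epoch.
    counter = int(at_time) // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte picks a 4-byte window.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify(secret: bytes, submitted: str, at_time=None) -> bool:
    """Check a submitted code against the current time step only."""
    return hmac.compare_digest(totp(secret, at_time), submitted)
```

Because the code depends on the clock as well as the shared secret, a stolen code goes stale within seconds, which is the whole point of pairing it with a password.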
On public cloud, you can avoid passwords almost entirely if you're careful. We now live in a world where it is entirely possible to do most, if not all, administration through limited-time authentication tokens.
Alibaba Cloud, AWS, GCP, and Azure all have systems in place that issue short-term access tokens granting limited access to cloud accounts. AWS has IAM roles, Alibaba Cloud has RAM roles, and Azure has Azure RBAC.
Of course there's a dark side to all of this. Using "Infrastructure as Code (IaC)" tools like Terraform and integration tools like Travis CI creates lots of opportunities for you to leak your keys, as evidenced by this major Travis CI security issue. Yikes!
Keys and tokens are great, but remember to rotate them early, rotate them often, and limit their permissions.
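"Rotate them often" is easier to enforce with a small check than with good intentions. The sketch below assumes you can export each key's creation time from your cloud provider into a dictionary; the key names and the 90-day default are placeholders, not a recommendation from any particular vendor.

```python
from datetime import datetime, timedelta, timezone


def keys_due_for_rotation(created_at_by_key, max_age_days=90, now=None):
    """Return the IDs of keys older than max_age_days, oldest first."""
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    overdue = [
        (created_at, key_id)
        for key_id, created_at in created_at_by_key.items()
        if created_at < cutoff
    ]
    return [key_id for created_at, key_id in sorted(overdue)]
```

Run something like this on a schedule and send the result to whoever owns the keys.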
(Almost) every operation you do should be repeatable and reversible.
Planning to make a change to your database? What is your rollback plan, if the change fails? Know how to do, and also how to undo.
Altering your infrastructure? What's your recovery plan if terraform apply fails?
Updating an app on a Kubernetes cluster? What's your rollback plan?
You get the idea. For every action, know how to back out!
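One way to make "know how to undo" concrete is to never write a "do" step without its matching "undo" step. The sketch below is a hypothetical runner, not taken from any particular tool: it applies steps in order and, if one fails, undoes the steps that already succeeded in reverse order.

```python
def run_with_rollback(steps):
    """Run (name, do, undo) steps in order; roll back on the first failure.

    Returns True if every step succeeded, False if a rollback happened.
    """
    completed = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as error:
            print(f"step {name!r} failed: {error}; rolling back")
            # Undo only what actually ran, most recent first.
            for done_name, done_undo in reversed(completed):
                print(f"undoing {done_name!r}")
                done_undo()
            return False
        completed.append((name, undo))
    return True
```

Real rollbacks are rarely this tidy (some changes, like a dropped table, can't be undone by a function call), but forcing yourself to write the undo alongside the do surfaces those cases before you hit them in production.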
Tip: this is great life advice, too. Lunch with friends on Saturday? Have your excuse ready in case your plans change!
Logs are usually easier to collect than you think, and cheaper to store than you imagine. Plus, you never know when a log will hold the key to solving a recurring issue, tracing an attack, or understanding the actions that led up to a failure or outage.
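Logs are far more useful later if a machine can parse them. As a minimal sketch using only Python's standard logging module, the formatter below writes one JSON object per log line; the field names and the "deploy" logger name are my own choices, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single line of JSON."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def make_logger(name="deploy"):
    """Build a logger that emits JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # Avoid stacking handlers on repeated calls.
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Lines in this shape can be shipped to whatever log service you use and filtered by level or logger name without writing regular expressions at 3 a.m.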
Automation is great when things are working well, but can fail you when things go wrong.
The same principle applies in IT. Used to having your databases configured for you, and your CI/CD pipeline handling all your packaging and deployments? What will happen when these things fail to work as expected?
Your chances will depend (in part) on your ability to keep a cool head during hair-on-fire emergencies, as well as your team culture, but you can still prepare.
Once again, we can take a page out of the airline playbook here. There's a surprisingly good post on aviation safety on "thepointsguy.com", which is primarily a site focused on saving money on travel.
A quick read-through will tell you right away that in an emergency, pilots have the following priorities:
Pilot checklists are used to troubleshoot problems in a methodical way to make sure the best possible decisions are made.
We can borrow this idea: make sure you have playbooks for the most common emergency scenarios you could encounter. These troubleshooting guides can easily keep you from making a bad situation worse, meaning better outcomes for your users and better uptime statistics for you.
This concept never gets enough love!
Run updates and patches. Rotate keys. Clean up old accounts. Change passwords. Did I mention updates and patching? Oh, and know when your SSL certificates and domain names will expire. It's easy to miss the easy stuff!
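Certificate expiry is one of the easy things to automate. The sketch below turns a certificate's "notAfter" timestamp into a number of days remaining, using `ssl.cert_time_to_seconds` from the Python standard library. How you obtain the timestamp is up to you; the function name here is hypothetical.

```python
import ssl
import time


def days_until_expiry(not_after: str, now=None) -> float:
    """Return days remaining before a certificate's notAfter timestamp.

    not_after uses the format found in certificates, for example
    "Jan  5 09:34:43 2018 GMT". A negative result means already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400
```

The dictionary returned by `ssl.SSLSocket.getpeercert()` carries a "notAfter" entry in this format, so a scheduled job could connect to each of your endpoints and warn you when the number drops below, say, 30.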
I could go on, but I think this covers the really important stuff, so I'll see you all next week! Have a great weekend!
JDP - April 29, 2022