
Sunday, April 23, 2017

DevOps Availability and Risk


Excellent points from an episode of Arrested DevOps titled "Who Owns Your Availability?" (TL;DR: you do!) https://www.arresteddevops.com/availability/

My thoughts:

- technical risk can produce business risk, ranging from "your hundred employees can't do anything for an hour" up to "your database is gone, therefore the company is gone". There are also "feature X doesn't work for user class Y" kinds of risk. Do you as a business prioritize consumers paying you, or you delivering their stuff, or your admins/phone people delighting your customers, or your developers fixing bugs?

From the show (Charity Majors, Pete Cheslock, ADO crew). Anything in quotation marks is my foggy recollection, not a verbatim quote:

- cache ("vendor") your dependencies

If you can't deploy to production because GitHub or a 3rd party package server in China is down, things are not good. Likewise, if your server is connecting to China and all your packages are local, perhaps it's time for a security check. (If you don't know what servers your server is talking to, that's another risk.)
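
To make the caching idea concrete, here is a minimal sketch in Python. It is my own illustration, not something from the show: the names `get_package`, `cache_dir`, and `fetch_upstream` are hypothetical, and real tools (a pip mirror, a vendor directory, an apt repo) do this with far more care.

```python
from pathlib import Path
from typing import Callable


def get_package(name: str, cache_dir: Path,
                fetch_upstream: Callable[[str], bytes]) -> bytes:
    """Return package bytes, preferring the local cache over the network.

    `fetch_upstream` is a hypothetical callable that downloads `name`
    from wherever the package normally lives.
    """
    cached = cache_dir / name
    if cached.exists():
        # Cache hit: no network call, so an upstream outage can't block a deploy.
        return cached.read_bytes()
    # Cache miss: fetch once, then keep a local copy for next time.
    data = fetch_upstream(name)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    return data
```

Once a package has been fetched a single time, later deploys read it from `cache_dir` and never touch the upstream server, whether or not it is up.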

- what is your Risk Profile? What is considered acceptable risk?

When your company is just starting out, it's probably fine to rely on the internet always being available. Not being able to deploy for an hour or a day might be okay. Spending resources on growing your company might be a good tradeoff vs security and availability.

- your dependencies are cached. What about deps of deps of deps?
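
One way to answer the "deps of deps of deps" question is to walk the dependency graph and compare it with what is actually cached. A rough sketch, assuming you can already export the graph as a plain dict (the `graph` shape and both function names are made up for illustration):

```python
def transitive_deps(package: str, graph: dict) -> set:
    """Return every package reachable from `package`, excluding itself.

    `graph` maps a package name to the list of its direct dependencies.
    """
    seen = set()
    stack = list(graph.get(package, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # already visited; this also guards against cycles
        seen.add(dep)
        stack.extend(graph.get(dep, []))
    seen.discard(package)  # a cycle can lead back to the starting package
    return seen


def missing_from_cache(package: str, graph: dict, cached: set) -> set:
    """Return the transitive dependencies that are not in the local cache."""
    return transitive_deps(package, graph) - cached
```

If `missing_from_cache` returns anything, a deploy still depends on somebody else's server being up.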

- "Packerize the base"

If your system has a baked, reliable base, with only a few changes on top, then it's easier to track down and fix things that break.

One mechanism is "baking" all your random dependencies into a Docker layer. Another is a network store, Amazon S3 for example (deb-s3). It can still go down, but if it's up you get everything in one place. It'll be there for you even if the original host is not happy for whatever reason. One person mentioned she had more problems with GitHub's reliability than her own.

Another failure mode: the known-good version is broken. Your business depends on the "beer-1.0" package. It's been working fine for months. A developer gets drunk and uploads a broken package, but uses the same version number, so "beer-1.0" is now broken. You can no longer make changes to your business! Since you own your availability, it's your problem.
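
A common defense against the "same version number, different contents" problem is to pin a checksum alongside the version. This was not spelled out on the show; the sketch below is my own, and `verify_pinned` and the `pins` dict are hypothetical names. The hashing itself uses Python's standard `hashlib`.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_pinned(name: str, data: bytes, pins: dict) -> None:
    """Raise if `data` does not match the checksum recorded for `name`.

    `pins` maps a package name such as "beer-1.0" to the SHA-256 hex
    digest recorded when the package was known to be good.
    """
    expected = pins.get(name)
    if expected is None:
        raise KeyError(f"no pinned checksum for {name}")
    actual = sha256_hex(data)
    if actual != expected:
        raise ValueError(
            f"{name}: checksum mismatch, expected {expected}, got {actual}"
        )
```

With a check like this in the build, a re-uploaded "beer-1.0" fails loudly at install time instead of quietly shipping to production.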

- "if you treat your devs like children, they'll act like children. They'll become subject matter experts on doing things the wrong way. We as devops can be spirit guides, career counselors for leveling up your skills." Developers own the code and the availability. Give them pagers and wake them up when the site has problems.

- site should have "circuit breakers": if the site is in "continuous partial failure", that's better than just being down for everyone, full stop.
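
For anyone who hasn't seen one, a circuit breaker can be sketched in a few lines of Python. This is a bare-bones illustration under my own assumptions, not what anyone on the show runs: the class names, `max_failures`, `reset_after`, and the injectable `clock` are all invented for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be faked in tests
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more load on a broken dependency.
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: let one trial call through (half-open).
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

The caller can catch `CircuitOpenError` and serve a degraded response (a cached page, a "recommendations unavailable" box), which is the "continuous partial failure" idea: one broken dependency disables one feature rather than the whole site.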


I dig the Arrested DevOps podcast, and listen to it often. Thanks!


