As funny as this is, Chaos Engineering is not just random failures. It's deliberately causing known failure modes to see what happens. The results may surprise the service, or even the engineers running it, but someone should know what's happening so the experiment can be turned off, behavior can be monitored and recorded, and hypotheses can be proposed and tested.
Doesn’t that kind of defeat the point of chaos engineering, that is, making the process unsupervised and triggering failure scenarios that were not yet known?
I always considered the simplest form of chaos engineering to be a process that randomly kills processes on a server so you can see what happens.
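As a minimal sketch of that idea (assuming Linux, since it picks PIDs by listing /proc, and with a dry-run guard so it doesn't nuke anything by accident):

```python
import os
import random
import signal

def random_victim(exclude):
    """Pick a random PID from /proc (Linux-only), skipping excluded PIDs."""
    pids = [int(d) for d in os.listdir("/proc")
            if d.isdigit() and int(d) not in exclude]
    return random.choice(pids)

def chaos_kill(dry_run=True):
    """Kill one random process -- the simplest possible chaos experiment."""
    # Never pick init or this script itself.
    pid = random_victim(exclude={1, os.getpid()})
    if dry_run:
        print(f"would send SIGKILL to PID {pid}")
    else:
        os.kill(pid, signal.SIGKILL)
    return pid
```

Run it on a throwaway box first; the interesting part is watching whether your supervision (systemd, your orchestrator, your alarms) actually notices and recovers.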
If, on the other hand, you mean that the “chaos” aspect needs an on/off button and that someone needs to be managing that, then I agree. :)
For some reason our customers keep asking us to launch our services in us-east-1, which I can only assume is for legacy reasons.
What I don’t understand, however, is what’s preventing Amazon from upgrading the infrastructure in us-east-1 and making it more stable. Is it about mitigating risk of unknown side effects and/or incompatibilities?
This is exciting! I've been asking for this for years. There is only so much you can do by implementing your chaos on instance. Having access to a lower level of infrastructure will be huge! Especially in a world of serverless services where we can't implement on instance nearly as easily.
AWS recently provided a library [1] for running chaos experiments [2], which looks like an evolution of the work their own employee did before [3]. I wonder if FIS is just a re-packaging and proper integration of these or something more.
Chaos Runner, or anything else that you run on your EC2 instances (or in your ECS containers), requires you to install some kind of agent or daemon. This is suboptimal: it is work in and of itself, and it is also hard to correctly simulate things like the network, or various AWS services, being down/unreachable.
If you run the agent/daemon on your production stack, then it's a potential vector for misconfiguration or attack. But if you don't run the agent/daemon in production, then it's another way in which your test stack diverges from production!
I saw various PR/FAQs related to Chaos engineering while I worked in both EC2 and the AWS developer tools org. I've been gone over a year now, but I would bet that FIS does something at the EC2 Network level so that you don't have to install stuff on your instances or containers.
Are there any tools that simulate control plane failures for AWS or other cloud providers? This looks great, but it seems to only work on compute resources.
Many AWS outages result in APIs failing or returning errors, which should be possible to simulate. I'd like to create experiments where EC2 instances won't spin up in a particular AZ, EBS disks can't be reattached, Route53 zones can't be updated, IAM changes don't take effect, etc.
There's no tool that I'm aware of, but you could implement hard failures using SCPs. That will cause an error whenever the blocked API is called. It won't, however, give you latency or any nuance in failure rates.
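For example, a hypothetical SCP (the Sid and region here are placeholders) that denies `ec2:RunInstances` in one region would approximate the "instances won't spin up" scenario, since every launch attempt in that region fails with an AccessDenied error:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SimulateEc2LaunchOutage",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:RequestedRegion": "us-east-1" }
      }
    }
  ]
}
```

You'd attach it to a sandbox OU rather than production, and detach it to end the experiment.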
Me neither. I would expect AWS to provide this service for free. It feels like paying a price just to verify that our systems stay reliable even when an AWS-managed zone/region goes down. Well - they might call this shared responsibility :)
> Me neither. I would expect AWS to provide this service for free.
A truly cunning pricing scheme would be to strike a Coasian Bargain: "we inject faults for free! If you want us to stop, you'll need to pay extra". That would mean the cost would fall mostly on those who are unable to avoid paying it.
Compute + DB should cover a lot of use cases, but it’ll be great to have support for more AWS services. DNS (Route53) issues are a pain to deal with for example.
A Route 53 propagation issue caused a bunch of our web servers to be unable to connect to the read replicas of our RDS cluster. The cluster scaled in a read replica, but the DNS change took hours to propagate, so the web servers kept trying to connect to the replica that no longer existed.
Our workaround was to put the IP address in the hosts file - not an ideal setup, but it got the job done.
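Something like this, i.e. pinning the still-live replica's IP against the cluster's reader endpoint (the hostname and address below are made up for illustration):

```
# /etc/hosts -- temporary pin until DNS catches up
10.0.12.34   mycluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com
```

The obvious downside is that you have to remember to remove the entry once propagation completes, or you'll be the DNS outage next time the replica set changes.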
Why would it? AWS was doing Chaos Engineering long before it was even called Chaos Engineering, and well before Chaos Monkey at Netflix. Most historical accounts of Chaos Engineering trace it back to Jesse Robbins at AWS pulling power cables from racks to simulate failure and running GameDays in the early 2000s.
People have been doing Chaos Engineering, and the more general category of "DevOps", for years before those things were given labels. I think we may credit Netflix with the awesome name of Chaos Monkey and for building a systemic in-house practice, but they were not the first to perform intentional destructive resilience testing in production.
Case in point, I've been yanking out active drives, cables, PSUs and CPUs since the 1980s, and not always by accident.
What's more, I was taught to do it by someone who'd been babysitting VAXes for years; they had a ceramic axe that, when I was fourteen, I'd seen them swing right through a three-phase minicomputer power feed at my local university's data center "for test purposes".
Chaos Monkey was explained to me as "The (virtual) monkey swinging around the (virtual) data center unplugging the (virtual) cables". It very well could have been based on Robbins. The point wasn't originality - the point was to have the same benefit in our burgeoning cloud environment.
Not saying you're wrong, but maybe Wikipedia is leaving off a lot of relevant history on the Chaos Engineering article then.
Edit: Funny how my original comment is downvoted for contradictory reasons, both because "obviously that was irrelevant to the state of the art in chaos engineering", and because "oh don't worry, that's relevant, they just discuss it in other contexts".
That would be superfluous to the context in which most people would refer to that document. Amazon already states it's for Chaos Engineering; they don't need to list other similar projects on their own product page.