AWS Fault Injection Simulator (amazon.com)
229 points by luhn on Dec 15, 2020 | hide | past | favorite | 57 comments


s/us-east-1/Fault Injection Simulator/g


As funny as this is, Chaos Engineering is not just random failures. It's causing known failures to see what happens. They may be a surprise to the service or even the engineers running the service, but someone should know what's happening so it can be turned off and so things can be monitored and recorded and hypotheses can be proposed and tested.


Doesn’t that kind of defeat the point of chaos engineering, that is, making the process unsupervised and triggering failure scenarios that were not yet known?

I always considered the simplest form of chaos engineering to be a process that randomly kills processes on a server, to see what happens.
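That simplest form fits in a few lines. A minimal sketch (hypothetical helper; the `/proc` scan assumes Linux, and it defaults to a dry run so nothing actually dies):

```python
import os
import random
import signal

def kill_random_process(pids=None, dry_run=True, rng=random):
    """Pick a random PID and SIGKILL it. dry_run=True only reports the victim."""
    if pids is None:
        # On Linux, every numeric entry under /proc is a live process.
        pids = [int(d) for d in os.listdir("/proc") if d.isdigit()]
    victim = rng.choice(pids)
    if not dry_run:
        os.kill(victim, signal.SIGKILL)  # the actual chaos
    return victim
```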

If, on the other hand, you mean that it’s important that the “chaos” aspect needs an on/off button and that someone needs to be managing that, then I agree. :)


Yeah, that second thing. :)

The chaos itself should be random, but the time and place should not be.


Now that there are other regions there, is there any reason, except for legacy, to use us-east-1?

New features are generally ported quite fast to the other regions, so this doesn't seem like a valid reason in 2020.


Regional costs are different. east-1 and west-2 are largely the same, but east-2 and anything outside of the US tend to be more expensive.

The proper way to handle this is using multi-region HA/DR.


For some reason our customers keep asking us to launch our services in us-east-1, which I can only assume is for legacy reasons.

What I don’t understand, however, is what’s preventing Amazon from upgrading the infrastructure in us-east-1 and making it more stable. Is it about mitigating risk of unknown side effects and/or incompatibilities?


I blame Heroku: for America, us-east is the only option, so legacy habits keep many people creating resources there.


The obvious one is if you're physically located near us-east-1 and want low latency to your office. us-east-2 is all the way in Ohio.


This was the good laugh I needed today. Thank you.


Had the same thought but with kinesis


This is exciting! I've been asking for this for years. There is only so much you can do by implementing your chaos on instance. Having access to a lower level of infrastructure will be huge! Especially in a world of serverless services where we can't implement on instance nearly as easily.


Just let me setup your cloud instances. I guarantee loads of random failures.


Inattention-to-detail as a service? I love it!

Hire me to do a half-assed job so your infrastructure can randomly fail to make it more resilient.


Yeah, I'm applying to YC next year. Let me know if you want in as a co-founder.


I would let you know but I can’t be bothered to write an official email. Inattention to detail and all.


I'm game if you need a helping hand.


AWS provided recently a library [1] for running chaos experiments [2], which look like an evolution of the work their own employee did before [3]. I wonder if the FIS is just a re-packaging and proper integration of these or something more.

[1] https://github.com/amzn/awsssmchaosrunner

[2] https://aws.amazon.com/blogs/opensource/building-resilient-s...

[3] https://github.com/adhorn/chaos-ssm-documents


Chaos Runner, or anything else that you run on your EC2 instances (or in your ECS containers) requires you to install some kind of agent or daemon. This is non-optimal as it is work in and of itself, it is also hard to correctly simulate things like the network, or various AWS services being down/unreachable.

If you run the agent/daemon on your production stack, then it's a potential vector for misconfiguration or attack. But if you don't run the agent/daemon in production, then it's another way in which your test stack diverges from production!

I saw various PR/FAQs related to Chaos engineering while I worked in both EC2 and the AWS developer tools org. I've been gone over a year now, but I would bet that FIS does something at the EC2 Network level so that you don't have to install stuff on your instances or containers.
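For reference, on-instance network faults of the kind described above are usually injected with Linux's `tc netem`. A hedged sketch (hypothetical helper; assumes Linux with the iproute2 tools, and only builds the command unless you explicitly ask it to execute):

```python
import subprocess

def inject_latency(interface="eth0", delay_ms=200, jitter_ms=50, execute=False):
    """Build (and optionally run) a `tc netem` command adding latency + jitter."""
    cmd = ["tc", "qdisc", "add", "dev", interface, "root", "netem",
           "delay", f"{delay_ms}ms", f"{jitter_ms}ms"]
    if execute:
        # Requires root; undo with `tc qdisc del dev <interface> root`.
        subprocess.run(cmd, check=True)
    return cmd
```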


Are there any tools that simulate control plane failures for AWS or other cloud providers? This looks great, but it seems to only work on compute resources.

Many AWS outages result in APIs failing or returning errors, which should be possible to simulate. I'd like to create experiments where EC2 instances won't spin up in a particular AZ, EBS disks can't be reattached, Route53 zones can't be updated, IAM changes don't take effect, etc.


There's no tool that I'm aware of, but you could implement hard failures using SCPs. That will cause an error when the blocked API call is called. It won't, however, give you latency or any nuance in failure rates.
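To sketch the SCP approach: a hypothetical helper that builds a deny policy document for the chosen API actions (attaching it to an account via AWS Organizations is left out):

```python
import json

def deny_api_scp(actions):
    """Return an SCP document that hard-denies the given API actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": list(actions),
            "Resource": "*",
        }],
    }

# e.g. block instance launches and Route 53 record changes account-wide
print(json.dumps(deny_api_scp(["ec2:RunInstances",
                               "route53:ChangeResourceRecordSets"]), indent=2))
```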


Looks like that guy's tweet was heard: https://twitter.com/cperciva/status/1292260921893457920

But AWS made it more configurable, and not in a single region.


Does it mention pricing? I couldn't find it.


Me neither. I would expect AWS to provide this service for free. It's like paying a price to ensure our systems stay reliable even when an AWS-managed zone/region goes down. Well, they might call this shared responsibility :)


> Me neither. I would expect AWS to provide this service for free.

A truly cunning pricing scheme would be to strike a Coasian Bargain: "we inject faults for free! If you want us to stop, you'll need to pay extra". That would mean the cost would fall mostly on those who are unable to avoid paying it.


The cost would fall on those who are unable to design for fault-tolerance.


Or who are unable to, due to existing constraints.

The charitable name for this lucrative segment of the market is "Enterprise".


I didn't see it either. Hoping it's semi-reasonable because I'd like to be able to really use it.


They never talk about pricing until a service launches.


Compute + DB should cover a lot of use cases, but it’ll be great to have support for more AWS services. DNS (Route53) issues are a pain to deal with for example.


What problems does Route 53 create? Is the API not available? Or the DNS servers returning no data?


A Route 53 propagation issue caused a bunch of our web servers to be unable to connect to the read replicas of our RDS cluster. The cluster scaled in a read replica, but the DNS change took hours to propagate. The web servers kept trying to connect to the replica that no longer existed.

Our workaround was to put the IP address in the hosts files - not an ideal setup, but it got the job done.


Are there similar offerings for other clouds?


Gremlin offers a chaos engineering service which I believe can be used across multiple clouds.


Just use Oracle...


you want to simulate faults, not have actual faults...


I need this now.


Looks cool, but maybe wait until after the new year before blowing up production on the team :)


Looks interesting - what stands out to you?


ctrl+f ... Nope, no mention of Netflix’s Chaos Monkey.

https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey


Why would it? AWS was doing Chaos Engineering long before it was even called Chaos Engineering, and well before Chaos Monkey at Netflix. Most historical accounts of Chaos Engineering trace back to Jesse Robbins at Amazon pulling power cables from racks to simulate failure and running GameDays in the early 2000s.

See: https://en.wikipedia.org/wiki/Jesse_Robbins#Contributions_to... https://medium.com/the-cloud-architect/chaos-engineering-ab0...


People have been doing Chaos Engineering, and the more general category of "DevOps", for years before those things were given labels. I think we may credit Netflix with the awesome name of Chaos Monkey and for building a systemic in-house practice, but they were not the first to perform intentional destructive resilience testing in production.

Case in point, I've been yanking out active drives, cables, PSUs and CPUs since the 1980s, and not always by accident.

What's more, I was taught to do so by someone who'd been babysitting VAXes for years; they had a ceramic axe that, when I was fourteen, I'd seen them swing right through a three-phase minicomputer power feed at my local university's data center "for test purposes".


One of the more memorable projects in my career had me sticking pens in fans for months on end, because it tickled an error-handling edge case.


Chaos Monkey was explained to me as "The (virtual) monkey swinging around the (virtual) data center unplugging the (virtual) cables". It very well could have been based on Robbins. The point wasn't originality - the point was to have the same benefit in our burgeoning cloud environment.


In the 80s we had a serial line in our test lab set up with a chain of paperclips on the TX and RX lines.

Made it easy to insert comms errors and failures.


Not saying you're wrong, but maybe Wikipedia is leaving off a lot of relevant history on the Chaos Engineering article then.

Edit: Funny how my original comment is downvoted for contradictory reasons, both because "obviously that was irrelevant to the state of the art in chaos engineering", and because "oh don't worry, that's relevant, they just discuss it in other contexts".


I think your comment was downvoted because it's a silly thing to focus on.

> but maybe Wikipedia is leaving off a lot of relevant history on the Chaos Engineering article then.

This is certainly true.


However the very first sentence is "AWS Fault Injection Simulator is a fully managed chaos engineering service".


I know -- still no mention of the state of the art or what product it might replace.


In Dr. Vogels' keynote presentation he brought up Netflix and Chaos Monkey specifically as precursors to this service.


It would be better if that context carried over to this document.


That would be superfluous to what most people would refer to that document for. Amazon already states it's for Chaos Engineering; they don't need to list other similar projects on their own product page.


what is with the amazon spam today?

are their SDRs trying to boost their numbers for the year?


It's AWS re:Invent [0], where they typically announce new product offerings.

[0] https://reinvent.awsevents.com/


didn't realize this. and I went to re:invent last year!


Amazon frequently releases a bunch of new cloud services all on the same day.


A new service announcement isn’t spam.



