As funny as this is, Chaos Engineering is not just random failures. It's deliberately causing known failure modes to see what happens. The results may surprise the service, or even the engineers running it, but someone should know what's happening so the experiment can be turned off, behavior can be monitored and recorded, and hypotheses can be proposed and tested.
Doesn’t that kind of defeat the point of chaos engineering, that is, making the process unsupervised and triggering failure scenarios that were not yet known?
I always considered the simplest form of chaos engineering to be a process that randomly kills processes on a server so you can see what happens.
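As a minimal sketch of that idea (assuming Linux, since it picks PIDs by listing /proc, and with a dry-run guard so it doesn't nuke anything by accident):

```python
import os
import random
import signal

def random_victim(exclude):
    """Pick a random PID from /proc (Linux-only), skipping excluded PIDs."""
    pids = [int(d) for d in os.listdir("/proc")
            if d.isdigit() and int(d) not in exclude]
    return random.choice(pids)

def chaos_kill(dry_run=True):
    """Kill one random process -- the simplest possible chaos experiment."""
    # Never pick init or this script itself.
    pid = random_victim(exclude={1, os.getpid()})
    if dry_run:
        print(f"would send SIGKILL to PID {pid}")
    else:
        os.kill(pid, signal.SIGKILL)
    return pid
```

Run it on a throwaway box first; the interesting part is watching whether your supervision (systemd, your orchestrator, your alarms) actually notices and recovers.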
If, on the other hand, you mean that the “chaos” aspect needs an on/off button and that someone needs to be managing that, then I agree. :)
For some reason our customers keep asking us to launch our services in us-east-1, which I can only assume is for legacy reasons.
What I don’t understand, however, is what’s preventing Amazon from upgrading the infrastructure in us-east-1 and making it more stable. Is it about mitigating risk of unknown side effects and/or incompatibilities?
This is exciting! I've been asking for this for years. There is only so much you can do by implementing your chaos on instance. Having access to a lower level of infrastructure will be huge! Especially in a world of serverless services where we can't implement on instance nearly as easily.
AWS recently provided a library [1] for running chaos experiments [2], which looks like an evolution of the work their own employee did before [3]. I wonder if FIS is just a re-packaging and proper integration of these or something more.
Chaos Runner, or anything else that you run on your EC2 instances (or in your ECS containers), requires you to install some kind of agent or daemon. This is suboptimal: it is work in and of itself, and it is also hard to correctly simulate things like the network, or various AWS services, being down/unreachable.
If you run the agent/daemon on your production stack, then it's a potential vector for misconfiguration or attack. But if you don't run the agent/daemon in production, then it's another way in which your test stack diverges from production!
I saw various PR/FAQs related to Chaos engineering while I worked in both EC2 and the AWS developer tools org. I've been gone over a year now, but I would bet that FIS does something at the EC2 Network level so that you don't have to install stuff on your instances or containers.
Are there any tools that simulate control plane failures for AWS or other cloud providers? This looks great, but it seems to only work on compute resources.
Many AWS outages result in APIs failing or returning errors, which should be possible to simulate. I'd like to create experiments where EC2 instances won't spin up in a particular AZ, EBS disks can't be reattached, Route53 zones can't be updated, IAM changes don't take effect, etc.
There's no tool that I'm aware of, but you could implement hard failures using SCPs. That will cause an error whenever the blocked API is called. It won't, however, give you latency or any nuance in failure rates.
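For example, a hypothetical SCP (the Sid and region here are placeholders) that denies `ec2:RunInstances` in one region would approximate the "instances won't spin up" scenario, since every launch attempt in that region fails with an AccessDenied error:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SimulateEc2LaunchOutage",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:RequestedRegion": "us-east-1" }
      }
    }
  ]
}
```

You'd attach it to a sandbox OU rather than production, and detach it to end the experiment.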
Me neither. I would expect AWS to provide this service for free. It feels like paying a price just to verify that our systems stay reliable even when an AWS-managed zone/region goes down. Well - they might call this shared responsibility :)
> Me neither. I would expect AWS to provide this service for free.
A truly cunning pricing scheme would be to strike a Coasian Bargain: "we inject faults for free! If you want us to stop, you'll need to pay extra". That would mean the cost would fall mostly on those who are unable to avoid paying it.
Compute + DB should cover a lot of use cases, but it’ll be great to have support for more AWS services. DNS (Route53) issues are a pain to deal with for example.
A Route 53 propagation issue caused a bunch of our web servers to be unable to connect to the read replicas of our RDS cluster. The cluster scaled in a read replica, but the DNS change took hours to propagate, so the web servers kept trying to connect to the replica that no longer existed.
Our workaround was to put the IP address in the hosts file - not an ideal setup, but it got the job done.
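Something like this, i.e. pinning the still-live replica's IP against the cluster's reader endpoint (the hostname and address below are made up for illustration):

```
# /etc/hosts -- temporary pin until DNS catches up
10.0.12.34   mycluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com
```

The obvious downside is that you have to remember to remove the entry once propagation completes, or you'll be the DNS outage next time the replica set changes.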
Why would it? AWS was doing Chaos Engineering long before it was even called Chaos Engineering, and well before Chaos Monkey at Netflix. Most historical accounts of Chaos Engineering trace it back to Jesse Robbins at AWS pulling power cables from racks to simulate failure and running GameDays in the early 2000s.
People have been doing Chaos Engineering, and the more general category of "DevOps", for years before those things were given labels. I think we may credit Netflix with the awesome name of Chaos Monkey and for building a systemic in-house practice, but they were not the first to perform intentional destructive resilience testing in production.
Case in point, I've been yanking out active drives, cables, PSUs and CPUs since the 1980s, and not always by accident.
What's more, I was taught to do it by someone who'd been babysitting VAXes for years; they had a ceramic axe that, when I was fourteen, I'd seen them swing right through a three-phase minicomputer power feed at my local university's data center "for test purposes".
Chaos Monkey was explained to me as "The (virtual) monkey swinging around the (virtual) data center unplugging the (virtual) cables". It very well could have been based on Robbins. The point wasn't originality - the point was to have the same benefit in our burgeoning cloud environment.
Not saying you're wrong, but maybe Wikipedia is leaving off a lot of relevant history on the Chaos Engineering article then.
Edit: Funny how my original comment is downvoted for contradictory reasons, both because "obviously that was irrelevant to the state of the art in chaos engineering", and because "oh don't worry, that's relevant, they just discuss it in other contexts".
That would be superfluous to the context in which most people would refer to that document. Amazon already states it's for Chaos Engineering; they don't need to list other similar projects on their own product page.