Kudos for a thorough and transparent writeup, and (by the looks of things) understanding that processes fail rather than individuals.
That said, I have to admit to having at least three eye-bulge WTF moments while reading this.
I'm also surprised that there isn't a Remediation step of "firewall the development machines away from the production database".
(And isn't the change to database_cleaner to make it throw when run against remote databases by default a serious break of API compatibility? What if someone's depending on that behavior?)
What a great way to see failure: what went wrong, how to improve without blaming it on the poor guy. I wish all companies were like this - very forward thinking!
I think it is now widely understood that best practice is to blame processes, not people.
However, I do have a niggling worry that letting people off the hook risks stunting their personal growth. If you can make mistakes and not bear any responsibility for it, what is driving you to get better at doing your job?
> If you can make mistakes and not bear any responsibility for it, what is driving you to get better at doing your job?
It could be any of the following:
1) ambition: desire to improve yourself
2) fear: desire not to screw up (and not to feel that you are making mistakes)
3) fear: job security
In all cases what's essential is a feedback system. The fact that you are not directly the one to be blamed doesn't imply you are automatically not accountable for anything you do. Everyone shares at least part of the blame.
Creating a blameless culture and having a good feedback system in place, with maybe some goals, objectives, and ways to improve, is probably something you need if you want to keep your employees and have them grow with you, and not treat them as expendable.
Good management should be able to do this. If someone is not fit for a role, a good manager will be able to find another way for this person to be useful. The last thing you want to do is to fire someone. At least that's my opinion. Again, management.
IMO it's about distinguishing mistakes from pure negligence.
A developer who's able to easily run scripts in production without any checks in place is different from a developer who knowingly sidesteps existing processes in order to perform a task faster or easier.
The punishment for one versus the other should be very different.
These things can be different: whether you take personal responsibility for a mistake, whether you are legally responsible for a mistake, whether your manager holds you responsible for a mistake, whether your team holds you responsible for a mistake, and whether a post-mortem report blames you for a mistake. It may be easier to take personal responsibility for a mistake if this isn't coupled with these other things.
I think when the circumstances actually show someone was grossly negligent, it's very hard to convincingly pretend that only processes are to blame. Remember that the current thinking is working against thousands of years of a cultural pattern assigning responsibility.
It's also interesting how many seem to be taking this idea, meant to be applied to internal processes and individuals, and applying it to the companies themselves.
The most significant example was Gitlab: they had something like four backup mechanisms for their database, yet three(!) of them had not been working for months, with nobody noticing. Then, someone actually dropped the production database. The last working backup system also had some problems I can't remember right now.
Yet in the comments here, there was barely a hint of criticism of Gitlab. Instead, they were lauded for not lynching the poor guy who made the last of what must have been dozens of terrible decisions leading to the incident.
Travis' incident here does actually seem like the sort of freak accident that one can't rule out completely, at least not at a company of their size. But forgiveness might not always be warranted.
> I think when the circumstances actually show someone was grossly negligent, it's very hard to convincingly pretend that only processes are to blame.
On the contrary - the entire reason why you transition from culture to process as you grow is because when you measure by results, there's no difference between unintentional negligence (leaving a production terminal open à la Travis), intentional negligence (let me bypass this annoying check, let me procrastinate on fixing the backups à la GitLab), and malice.
If you get large enough, you will have grossly negligent people, by statistical inevitability. You can either accept this statistical inevitability and design your process for it or you can continue to believe that you (and everybody else) really actually do only ever hire the very best.
GP is pretty clearly talking about blaming the company (processes) and not individuals.
The entire post laments the fact that entities external to the company will repeatedly forgive the company if the company goes through the motions of publishing a blameless post-mortem. Even if those post-mortems repeatedly indicate the company really didn't learn anything, or change the processes that led up to the incident.
But by golly, they're a forward-thinking company, and therefore blameless.
Isn't the fact that there were "dozens of terrible decisions" leading to one poor guy dropping the production database (in a way that was extremely hard to recover) absolutely a failure of process rather than a failure of a person? He made the last mistake in a big series of mistakes, so unless he was also responsible for all the prior mistakes and failed/non-existent processes leading to his mistake, it seems very unfair to blame him. It is, of course, perfectly fair to blame Gitlab, and plenty of people did just that in the comment thread about it at the time.
> Yet in the comments here, there was barely a hint of criticism of Gitlab. Instead, they were lauded for not lynching the poor guy who made the last of what must have been dozens of terrible decisions leading to the incident.
The threads are still there. You can go back and read them, and notice that quite a few comments are along the lines "not an individual fault, but catastrophic on the group level", "I don't think I can trust Gitlab to run a production service", "How the hell do you have all your backups failing and not notice?!".
While in a huge company it might be hard, I feel like any small engineering team can be composed entirely of self-motivated people.
If you make a mistake (a real one that causes problems for the company and its users), that is enough for most people to want to avoid it happening again.
At the very least, different credentials in the database for each environment would make sense.
I prefer thorough E2E testing to mocked large-component integration testing (i.e. spinning up a db vs mocking it out) so I can totally understand needing to run tests against a database but yeah reading this was a collar-pulling "yikes" moment for me.
Their remediation steps are good though, for a start... They might want to look into spinning up completely new databases programmatically for tests rather than connecting to an existing one...
This seems like it fits the narrative. Still, WTF? A quick google tells me that PostgreSQL supports SSL cert-based authN. The prod environment (and possibly an Ops host) should be the only ones with the right certs to connect to the prod DB.
Creating an SSL connection to the Postgres server might add extra connection time (yes, this could be prevented by persistent connections, but that's not always feasible, especially in a cloud-based environment), so it's not the best solution.
Postgres pg_hba.conf does support access by ip address easily enough. For access to the production databases the ip addresses should be limited to those that need access. Or if not in postgres, a firewall could limit it.
So if a developer accidentally tried to wipe it, they wouldn't be allowed to connect. If they actually needed to connect, they would first have to add their ip address (or use a bastion host).
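As a sketch (hypothetical database names and addresses, not anyone's real config), the pg_hba.conf entries for that setup might look like:

```conf
# Allow the app servers' subnet and a single bastion host; reject everyone else.
# TYPE  DATABASE   USER  ADDRESS          METHOD
host    prod_db    app   10.0.1.0/24      md5
host    prod_db    ops   10.0.2.5/32      md5
host    all        all   0.0.0.0/0        reject
```

A developer machine would then fail at connect time, long before any destructive query could run.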
The GP was presumably talking about TLS with mutual auth. There’s no excuse for not having TLS at all enabled, but mTLS is a great extra security feature. The cost of mTLS (performance wise) is negligible, and if you can’t keep a persistent database connection up (because cloud?!) but also can’t do TLS because perf then I don’t know what to tell you. The only performance issue TLS has is that it isn’t used enough.
As someone who has helped many, many companies with Postgres I can tell you that nobody runs PG over TLS in production if everything’s talking over private links.
If you are running PG like a KV store (simple lookups/simple updates/inserts) where most of the dataset fits in memory, then it quickly becomes CPU bound, and changes in PG configuration/access patterns can have large effects. For example, I've seen running pgbouncer on the same host drop performance by about 33%, and switching from simple protocol to extended protocol single-use prepared statements drop performance by 33%.
It's possible they disabled the localhost check at some point.
Edit: It does now, after Travis folk fixed it. :D
And, Rails (which I assume they're using) typically loads DB credentials from a `database.yml` file for a particular `RAILS_ENV` (local dev, local automated testing, prod). By setting the `DATABASE_URL` env var, they are overriding that (this is fine/expected behavior for app deployment as far as Rails is concerned; it's comparatively less common for localdev). So that bypassed the `RAILS_ENV=prod` check that might also have caught this (e.g. `RAILS_ENV` was undefined or `development`).
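To make the override concrete, here's a minimal database.yml sketch (illustrative names, not Travis's actual config); whatever entry `RAILS_ENV` selects, an exported `DATABASE_URL` silently wins over it:

```yaml
# config/database.yml
development:
  adapter: postgresql
  database: myapp_development   # ignored if DATABASE_URL is set in the shell
test:
  adapter: postgresql
  database: myapp_test          # likewise ignored; tests then hit DATABASE_URL
```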
So I'd really have loved it if they dug into some of the UX factors to this issue:
* Why do they allow remote access to their production database? (Is it too hard to get a local database copy? "For debugging purposes"? Why was the eng in prod the day before? Was it unintentional? Should we revisit those assumptions? Should we resolve whatever issue that caused us to shortcut by allowing remote, direct prod db access?)
* Why, for localdev, do they override database.yml and use env vars (which I've found is more cumbersome to use and not as common)? Yeah, in production you should use RAILS_ENV/DATABASE_URL/etc., so are they attempting to have prod parity? Why or why not?
* Why are folks routinely logging in (for debugging purposes) with a DB user that is, effectively, root? I bet that user can `drop table` too. Should folks have a limited access account and then jump through a hoop or two for root?
It sounds like they want to "debug" prod by running some local rails tools (rails c?) with DATABASE_URL set to the prod db. Is that the "best" way to do it? heroku etc. actually open a rails console in the environment and not locally.
So one of the ways of analyzing the root cause here is that the autoincrement index of the user table in their database is security-sensitive, and relatively normal DB operations like "Let's roll back the DB" have serious security implications involving ID reuse. What are some ways to make this less dangerous? (The rest of it was an operational failure, but it would have been less trouble if it weren't a security failure.)
I can think of the following:
- Don't use auth cookies that are signed messages consisting of a UID + expiration date and other data, use auth cookies that are opaque keys into some valid-auth database. This is significantly less efficient (every operation needs a lookup into the DB before you can do anything; if you move it into a cache you now risk the cache being out-of-date with your DB). AFAIK using signed UIDs has no security downside other than this, right?
- Identify users by usernames, not by UIDs. This makes renaming users (which GitHub allows, so Travis is forced to allow) difficult and security-risky.
- Use UIDs that are selected from a large random space to make collisions unlikely, e.g., UUIDs or preferably 256-bit random strings. This seems fine and probably preferable from a security point of view. Is this fine from the DB point of view?
Anything else? Maybe a DB restore-from-backup option that preserves autoincrement counters and nothing else - is that a standard tool?
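For the large-random-space option, Python's `secrets` module is enough to sketch it (the function name is mine):

```python
import secrets

def new_user_id():
    """256 bits of randomness, URL-safe text. Collisions are cryptographically
    negligible, so restoring the DB from backup can never cause ID reuse."""
    return secrets.token_urlsafe(32)  # 32 bytes = 256 bits
```

From the DB point of view the cost is index size and random insert order, which the sibling threads discuss.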
I use a mix: UIDs (auto-incrementing primary keys) for internal app use (e.g. joins), but use UUIDs for referencing records outside the app (anything sent out over the API).
The UIDs just make for easier to read logs, and easier to inspect and hand-write db queries. The UUIDs just seem much more secure when communicating with client applications.
I also agree that authing with a DB is preferable, if you can afford to do that at the scale your app needs (which is most apps out there).
Why not just use a UUID for the primary key also, and only have one column? Having 2 columns seems like extra complexity for not much gain. Modern databases usually have a native UUID type column which stores and compares better than using the string/char type.
One more benefit of using UUIDs for primary keys is that clients can generate models along with the primary key and know what the key will be BEFORE they submit it to the database. That is, it works really well with distributed systems. An auto-incrementing primary key is really state stored at the database level and a source of contention.
I would even argue that UUIDs make logs far easier to read. You grep for that UUID in the log and it's easier to find than grepping for integer primary keys. If your log contains disparate models, then UUIDs will find your entries more easily than integers.
If you ever have to roll back your database but keep some of the new data, it's easier also. You just export the data you want to keep, do your rollback, and then import that data, knowing that auto-incrementing primary keys will never be an issue.
You can copy objects easily from one database to another and know the primary key will be the same, and not clash.
For hand writing db queries, it's also better to use a UUID since then I can use that same query cut and pasted into other databases with the same data and know that I don't have to check primary keys. The UUID is just a copy/paste.
A UUID primary key has to live in every index in the table and every foreign key that references it.
We have a table with a UUID as its primary key and it alone consumes 22GB; then an index references it, so now another 22GB... That one primary key uses over 100GB of storage. A developer recently went to add another table that references it, and we had to decide whether we were okay taking another hit. If we used your example of using a char type (we use blob) it would be double the size...
It doesn't sound like much, but in a database with all primary keys being UUIDs you are going to inflate the size of your DB quickly, or have to forego using foreign keys. I imagine if we used only UUIDs we would have to double or triple our database infrastructure. Additionally, we would be forced to introduce partitioning sooner.
Now we have a tech debt ticket to add an auto increment to that table so we can reclaim disk / memory.
>I imagine if we used only UUIDs we would have to double or triple our database infrastructure
I think you're greatly exaggerating how big UUIDs are compared to ints. The upper bound is 4x, and that's if your table is keys only, with no other data. That falls down to 2x if you're using bigints. In a typical user table with a bunch of other fields, like a username + password hash, the storage used by the keys is quickly dwarfed by the storage used for "other data".
I would argue that's a reasonable hit for the gains. I would never add a tech debt ticket to change just to reclaim disk / memory; that's an anti-pattern.
Memory and disk space is cheap.
Also, UUIDs are just 128-bit integers internally. My example clearly points out that most modern databases have a native UUID type (and if not, use a 128-bit integer).
Using a char type to store UUIDs is very wasteful and no wonder your indexes are so huge.
The tech debt ticket you really have is to convert type of uuid column from char/blob to native.
Note also that if you need to expose your data, you can either expose the auto-incrementing primary key (which leaks data, in that other people can work out growth rates of various models, like how England worked out how many German tanks were being built based on auto-incrementing serial numbers), or you create another column (as someone else suggested) which is a UUID. Which you would need to index anyway.
If index size is an issue then you can just use integers which are randomly generated. You can calculate your collision chance based on how large your integer is and how often you create. If you happen to pick 128-bit integers, congratulations, you're using UUIDs.
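That collision chance is just the birthday bound; a quick sketch (the helper name is mine):

```python
from math import expm1

def collision_probability(n_ids, bits):
    """Approximate chance of at least one collision among n_ids values drawn
    uniformly at random from a 2**bits space: p ~= 1 - exp(-n^2 / 2^(bits+1))."""
    return -expm1(-(n_ids ** 2) / 2.0 ** (bits + 1))
```

For example, a million random 64-bit IDs carry only about a 3-in-100-million collision chance, while at 128 bits the chance is negligible for any realistic table size.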
>Using a char type to store UUIDs is very wasteful and no wonder your indexes are so huge.
That was a hypothetical; it's 22GiB using BLOB storage.
>Note also that if you need to expose your data, you can either expose the auto incrementing primary key (which leaks data, in that other people can work out growth rates of various models, like how England worked out how many German tanks were being built based on a auto incrementing serial number), or you create a another column (as someone else suggested) which is a uuid. Which you would need to index anyway.
I stated that we use the auto-increment inside of the application; everything exposed by the API is a UUID. While we lose 32 bits of space per row, just one foreign key or index saves enough space to justify it.
Best of both worlds.
>Memory and disk space is cheap.
Not when you need to move from multiple r4.8XL to r4.16XLs (think multi-az, multi region). It would work out to around double the price, we can hire a developer for the cost of hosting that infrastructure. Even when your business has the money, you don't want to be the guy telling them to drop another $24k a month on hosting costs because we didn't think one year ahead.
This of course ignores all the issues that come with replication and partitioning, multi-master etc.
>"We have a table with a UUID as its primary and it alone consumes 22GB, then a index reference it, so now another 22GB... That one primary key uses over 100GB of storage,"
I'm not following your math here. Can you elaborate? I see 22 and 22 and the 100.
Aren't UIDs effectively the same as integers in this case? Seems like it's conceptually similar to using the MD5 hash of the integer ID. Wouldn't it still cause exactly the same issues mentioned in the article? Anything non-random would.
The ORM I wrote (and use...) automatically generates an internal and an external identifier for every persisted object. Works pretty well in practice.
There are simpler solutions. When you recover your database, simply add a large constant to all autoincrement columns; that should prevent any overlap. At least for existing setups this is ideal.
The other option is to use snowflake IDs with a random suffix. They will still sort like normal IDs but won't collide if you roll back (or are unlikely to collide).
UUIDs or fully random strings are a bit meh as primary column in a database. They disconnect all meaning and potential sorting from the records (sorting by primary key is meaningless and random PKs may ruin database performance depending on setup).
Snowflakes will behave like a UUID or random string when properly engineered but are still sortable.
Identifying users by usernames also leads to all kinds of silliness; internally you should be using the ID and not the username. Usernames are meant for human consumption, not machine consumption.
"Snowflakes" meaning like Twitter Snowflake, an autoincrement with a timestamp folded in (and maybe some ability for multiple read/write database servers to generate IDs without actively coordinating with each other)? That would work - can I easily get Postgres or something to generate these?
I usually prefer Snowflakes with a random suffix behind a timestamp. A 34-bit UNIX timestamp sidesteps the 2038 issue, and 30 bits of randomness are good enough to prevent collisions with p < 0.0001 at up to 400 inserts per second (or 1500 inserts per second for p < 0.001). And you can always tweak this to match your workloads (use a bit more randomness, use a bit more timestamp, maybe even use a custom epoch instead of 1970 to reduce the number of bits necessary, or use millisecond-accurate timestamps).
PostgreSQL supports calling functions as the default value of a column so it should be possible to simply put in a function that returns such a snowflake (either in pure SQL or in PL/Python) and PG will use that.
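A minimal sketch of that layout (34-bit seconds timestamp in the high bits, 30-bit random suffix; the function name and exact bit layout are my assumptions, mirroring the parent's numbers):

```python
import os
import time

TS_BITS = 34    # seconds since the UNIX epoch; 2**34 seconds lasts until ~2514
RAND_BITS = 30  # random suffix to avoid collisions within the same second

def snowflake(now=None):
    """Build a 64-bit, time-sortable ID: [34-bit timestamp][30-bit random]."""
    ts = int(time.time() if now is None else now)
    rand = int.from_bytes(os.urandom(4), "big") >> (32 - RAND_BITS)
    return (ts << RAND_BITS) | rand
```

A PL/pgSQL or PL/Python equivalent can then be set as the column's DEFAULT, as described above.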
I’ve tried a similar approach and ran into an issue of id generation during tests. Tests run in parallel and quick succession. Ran into collisions right away. Any suggestions on how to mitigate that?
First, you should still use a UNIQUE column, that prevents the most urgent collision issues.
Second, if your tests run a lot faster than actual loads, consider using a special test timestamp format that allows for higher throughput at the expense of longevity (i.e., a 10-bit timestamp since start of test + 54 bits of random data), or alternatively throttle the tests to more closely resemble real workloads (within one or two orders of magnitude).
You can also replace the random part with a Node ID + atomic counter. Each node that generates IDs gets a unique identifier which is inserted after the timestamp. Then the last few bits of the ID are used by an atomic counter. This ensures that nodes don't generate colliding IDs with each other (the Node ID part) or with themselves (the atomic part).
> First, you should still use a UNIQUE column, that prevents the most urgent collision issues.
Right, of course. I was using the UNIQUE/PK columns. When I said collision, I did mean that db threw errors on duplicates. Which is, of course, expected.
> You can also replace the random part with a Node ID
The solution is to stop using auto-incrementing IDs generated by the database, and UUIDs are not the only answer.
It is trivial to have an app request and reserve a pool of IDs and assign from them as needed. It guarantees no overlap, is completely distributed, requires no db roundtrip to insert, can remain as numeric data, allows every row to be uniquely identified across tables or databases, still has loose ordering, and it only requires a system with atomic increments to implement.
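A sketch of that pattern (the in-memory `reserve` counter below stands in for whatever shared store with atomic increments you actually use; all names are mine):

```python
import threading

class IdAllocator:
    """Hands out IDs from locally reserved blocks, one round trip per block.

    reserve(n) must atomically advance a shared counter by n and return the
    new high-water mark; the locally owned block is then [hwm - n, hwm).
    """
    def __init__(self, reserve, block_size=1000):
        self._reserve = reserve
        self._block = block_size
        self._lock = threading.Lock()
        self._next = 0
        self._limit = 0

    def next_id(self):
        with self._lock:
            if self._next >= self._limit:            # block exhausted
                self._limit = self._reserve(self._block)
                self._next = self._limit - self._block
            self._next += 1
            return self._next - 1

# Toy backend: the only state that must live in a durable shared store.
_state = {"hwm": 0}
def reserve(n):
    _state["hwm"] += n  # stands in for one atomic increment (e.g. Redis INCRBY)
    return _state["hwm"]
```

Rolling back the main database then never reissues an ID, because the high-water mark lives outside it.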
From where? If the counter for this pool of IDs lives in the database being rolled back, it is vulnerable in the same way as auto-incrementing IDs. If it doesn't, it isn't - but the safety you're getting comes simply from the fact that it doesn't get rolled back, not the rest of the machinery. It would be enough to make sure you don't roll back autoincrement counters when rolling back the rest of the database.
That said, I am a fan of the approach you outline. I learned it from the IdGenerator in ATG Dynamo (c. 2500 - 2200 BC).
You're right, it's an external system. Left it out since a key/value db is super simple to operate and very cheap by just using dynamo/cosmos db/cloud datastore to maintain it globally.
All the benefits help offset the extra dependency and make database backups much simpler, especially since every row has a unique id.
Is there a reason to do this other than security against DB rollbacks? It seems like a lot of complexity to add another stateful (micro?)service just to solve this problem. It's certainly trivial to implement—it's probably even interview-sized—but it doesn't sound trivial at all to operate, since it has reliability, scalability, security, and possibly NIH concerns. (Am I building this on top of another database? On top of a Paxos of some sort? What happens during an extended network partition?)
All you need is a basic key/value store. Zookeeper/consul/etc if that's already running, or use the numerous cloud services where it would cost pennies to have a globally replicated key/value db.
Increment the "ids" key by Y amount, get the current number N, and you now have N - Y as the range to work with. No paxos or network splits to worry about, and far better performance since its 1 call to get a batch of ids.
If I already have a high-availability key-value DB with persistence, sure - but probably I don't, most designs I've seen for SQL + KV on top involve using it as a pure cache with explicit permission to drop the KV store instead of restoring that to a disk snapshot whenever something goes funny. And for this use case (robustness against reverting the SQL database), making the KV store a mere cache of data in the SQL database defeats the point entirely.
And, honestly, most sites I've seen don't need the cache, they can just run on top of the SQL database itself. The fact that you're not serving hundreds of queries per second doesn't mean you don't need security.
Having run Zookeeper, "No paxos or network splits" isn't really how I would describe it. (Nor "super simple to operate.") When it works, it works well, but it's another thing that you need to operate. Adding a dependency on a cloud service just to supply autoincrement doesn't really sound like a great solution. Especially because, by the nature of the problem, I am outsourcing auth to the cloud service: a malicious service can pretty easily issue themselves token for an arbitrary user. And if I'm not already comfortable outsourcing auth, e.g., because everything is in this cloud already, it doesn't seem like this actually solves my original problem which was increasing security.
I think you're misunderstanding. Use postgres or whatever database you want, just generate the ids separately in your app and use a simple key/value store to maintain the counter. The SLAs of something like Azure CosmosDB are several times higher than any app you can build yourself.
None of this is directly related to security. The issue for them was putting ids in the tokens, and there's no way to solve that other than not doing it. However if they didn't reuse ids, then it wouldn't have mattered, and that part can be solved by moving the id generation out of the database so that backup/restores of the database are simpler. This is in addition to all the other benefits of separate id generation I first mentioned.
> The shell the tests ran in unknowingly had a DATABASE_URL environment variable set as our production database. It was an old terminal window in a tmux session that had been used for inspecting production data many days before. The developer returned to this window and executed the test suite with the DATABASE_URL still set.
I was expecting something like this. I remember I configured my terminal windows to change their background when I'm on production systems [2], around the time I read about the GitLab database incident [1].
Though note that wouldn't help here, at least according to a common sense reading of this part of the post:
> we connected our development environment to a production database with write access
i.e. they weren't logged in to a production machine, they were logged in to a development machine that was permitted to connect directly with write access to the production database.
Since it's a local shell doing something like this is a lot easier - just change $PS1 to add some colors when certain variables are set.
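For example, a ~/.bashrc fragment along these lines (a sketch, untested against any particular prompt setup) makes a stale DATABASE_URL impossible to miss:

```sh
# Print a red warning tag whenever DATABASE_URL is set in this shell.
__db_warn() { [ -n "$DATABASE_URL" ] && printf '[DB SET] '; }
PS1='\[\e[41;97m\]$(__db_warn)\[\e[0m\]\u@\h:\w\$ '
```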
I use something like this today on my personal machine to distinguish between my personal and work email addresses in $EMAIL (for silly firewall reasons it's easier to originate work-owned OSS on my personal machine), and on my work machine to tell me which production zone I'm talking to - but I don't have it color when I have any variable set at all, I probably should do that.
I did a similar thing as a result of the gitlab incident. I use a custom iterm2 profile called "prod" that sshs into the production "shell box" and sets up a badge + bright colour.
I've similarly started to rely on Google Cloud Shell as the only way to access our production database. That way, firewalls never have to allow external access, and it's very clear whether I'm on production.
Adding a Metabase instance with read-only credentials has helped this a lot - because executing one-off data queries is easy (and safe) now.
I was surprised about a tmux session connected to the production DB for days. Though it was idle, there are a lot of things that can go wrong during window switching. My colleague also pointed out the subtle error of relying on whatever DATABASE_URL happens to be set in the environment instead of having the test script set it explicitly, which could have avoided this.
That being said I am amazed at their transparency over the whole issue and a thorough write up of the whole incident. It's something we can all learn from.
This is why I've grown to consider some actions to taint terminal windows, and tainted windows must be destroyed asap.
For example, administrative access to Vault requires a root token in the environment of the shell, so that shell is tainted with too much access. Having an ssh command to a production data store also taints a shell, because it's really dangerous to accidentally up-arrow-enter and end up on a production data store.
History has been a bit annoying with that, but since I installed histdb, that has been a non-issue.
I'm a heavy user of my shell history. Frequent creation and destruction of different concurrent shell sessions clobbers the history quite well until there's nothing really usable left. This resulted in some inertia to keep windows around.
zsh-histdb replaces the single-file ~/.zsh_history with an sqlite3-database - and this database is session aware. This way I can just have as many concurrent session as I want, and create / destroy shell sessions as much as I want and the shell history remains intact.
That file is created on (almost, see below) each and every machine I administer as part of the initial configuration/provisioning process. By itself, it's nothing spectacular, but it sounds like it would have prevented this incident (as would have several other things, of course).
For the hosts with csh as the default, it's:
$ cat /etc/profile.d/timeout.csh
set -r autologout=10
I am not a tmux user so maybe this is obvious, but how are the environment variables transferring from the production machines to where the tests are being run?
Or, is this stating that the developer started running tests from the production environment?
> but how are the environmental variables transferring from the production machines to where the tests are being run?
tmux[1] is a terminal multiplexer, a screen alternative.
Some speculation: they might have started the session on a remote machine, set `DATABASE_URL`, used it for something, detached from the session, and later reattached with the bash env still intact, without the developer being aware of it when he executed the test commands.
Another tmux "problem" is developers using it to tail logs with an infinite back-scroll, only to fill up server memory. One solution was to kill all tmux sessions nightly via cron. That might have helped here.
Finally, as a TUI lover: try tmux! It really keeps the frustration at bay to know that if you ssh to a box and your session drops, your terminal workspace is still intact.
The tmux thing is a red herring; I encounter similar environment variable issues plenty just with open terminal windows (though not for DATABASE_URL).
I read it as: a couple days ago, in their local terminal, they exported the prod db as an environment variable, ran some script or console or something that needed to run against prod, and didn't unset the variable. Then the next time they ran any command in that terminal, it hit prod.
Maybe it was a remote machine, but there wasn't an active db connection, just a hidden config that exploded the next time a command ran.
My guess is the second one where the developer assumed that it's a testing environment tmux session but started the test script in a production environment.
If you are interested in a list of steps you can take to avoid this happening to your data, here are some suggestions. I don't believe that any single measure is sufficient. And I also believe that it's valid to balance the strictness of your controls against the the amount of protection you really need.
1. Vault the passwords. People and machines should fetch passwords on-demand using identity credentials.
2. Create a read-only database account. In all cases, use the account that matches the need. Running reports? Use the read-only account.
3. Restrict access to read-only and read-write database accounts. Provide this account information to a limited set of people and tools.
4. Provide a fairly straightforward way for people to get temporary elevated access. If it's easy to get elevated access, then users will not be tempted to "hold on" to elevated access longer than they should (e.g. by leaving a terminal open for a very long time).
5. Rotate the credentials of all the accounts regularly. This ensures that temporary elevated access does not become long-term access. It also greatly reduces the harm created when credentials are leaked, exposed, or forgotten about (e.g. in an environment variable in an old window).
Note that none of the steps above require a heavy investment in automation. You can start with basic (even fully manual) processes for key management and access management, and evolve to automation as you grow.
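Steps 1 and 4 together can be sketched roughly like this (all names and the lease shape are made up for illustration; a real setup would call your vault's API with the caller's identity):

```ruby
# Hypothetical sketch: fetch a short-lived, least-privileged credential on
# demand instead of keeping a long-lived password in an environment variable.
Lease = Struct.new(:url, :expires_at) do
  def expired?
    Time.now >= expires_at
  end
end

def checkout_credential(role, ttl_seconds: 15 * 60)
  # A real implementation would authenticate to the vault here; we just
  # fabricate a lease to show the shape of the interaction.
  Lease.new("postgres://#{role}:temp-secret@db.internal/app", Time.now + ttl_seconds)
end

lease = checkout_credential("report_ro")
# Callers re-check the lease before each use and check out a fresh one when
# it has expired, so nothing "holds on" to elevated access in an old window.
```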
Finally, keep in mind that this type of accident is not just a "small company" problem. Recall this AWS ELB outage on Christmas Eve of 2012 - https://aws.amazon.com/message/680587/
"The data was deleted by a maintenance process that was inadvertently run against the production ELB state data."
> 4. Provide a fairly straightforward way for people to get temporary elevated access. If it's easy to get elevated access, then users will not be tempted to "hold on" to elevated access longer than they should (e.g. by leaving a terminal open for a very long time).
Or to hack around it in some other way, e.g. by adding some sort of backdoor to a production application.
This is such an important lesson, and one that unfortunately is lost on many command-and-control type IT departments.
I think the root issue here is that the production database "user" has too many privileges, and the reason for that is migrations. This is compounded by the test user essentially needing to be a db superuser to create and destroy test databases, as well as run the migrations for them. I've noticed this lately with Django, but I'm guessing that it's a general problem.
When I design a DB system, ideally the production 'user' can only do the things we reasonably expect it to do, and truncating isn't one of them. Neither is dropping tables, potentially not even deleting entries, nor running maintenance tasks. DDL modifications are right out.
Those tasks can be run as a specific user, locked down to certain types of connections that aren't allowed from production.
When i first worked with migrations, they were run manually as part of the release process. Over time, we automated them - by making them part of the release tool. It never occurred to us to make them part of the application itself. For us, it was therefore completely natural that the app's DB credentials did not have DDL permissions.
The whole idea of migrations being run by the app still seems really silly to me. I suppose this is the obvious move for developers who (very sensibly!) build applications, but don't build their own release tools, to whom it would never occur to put migrations anywhere except the application.
> The whole idea of migrations being run by the app still seems really silly to me.
I have one app that kinda runs migrations - there's a helper app which is built at the same time and (normally) runs before the main app. But this is based on a per-user database which means there's no "release" process that could run the migrations.
(Similar to iOS apps, really - when you get a new version of Teleappchatbook, it'll often have an "upgrading..." step which I assume is migrations / index updates, etc.)
Yeah, I think it's interesting that companies will set up completely separate dev/prod databases but then grant all permissions to the user. Roles are powerful and seem underutilized. Most of the time I only need read access, so perhaps it's particular to my work, but why not use a dev role, with read access only, for developing against production data?
Totally agree with you, and now that you bring it up in this way I think I want to change how we execute (Django) migrations at work.
Create a separate user with higher privileges that can run migrations. Use that connection during the deployment phase that executes migrations. Otherwise, the app gets a standard "user" without DDL access.
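That split can be sketched in a few lines (connection URLs and names assumed): the running app gets a user without DDL rights, and only the deploy phase that executes migrations uses the elevated user.

```ruby
# Hypothetical sketch: separate connection URLs per privilege level.
# app_rw can read/write rows but has no DDL; migrator can ALTER/CREATE.
APP_DB_URL     = ENV.fetch("APP_DATABASE_URL", "postgres://app_rw@db.internal/app")
MIGRATE_DB_URL = ENV.fetch("MIGRATE_DATABASE_URL", "postgres://migrator@db.internal/app")

def db_url_for(phase)
  phase == :migrate ? MIGRATE_DB_URL : APP_DB_URL
end

db_url_for(:serve)    # normal traffic: DDL is impossible by construction
db_url_for(:migrate)  # deploy-time only: elevated user runs the migrations
```

The nice property is that a bug (or a stray environment variable) in the running app physically cannot drop or alter tables, because its database user lacks the permission.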
Wow, the issue with the signed token is very interesting. Found it surprising the authentication method specifically wasn't mentioned in the remediation.
Food for thought: the security issue wouldn't have happened if Travis had (1) used UUIDs instead of sequential IDs as a pkey, or (2) used a secret token for auth instead of a signed (presumably JWT) token.
1) They don't need UUIDs as a pkey, but just as another field. I like using auto-incrementing IDs as the primary key, and UUIDs for referencing records over internet-facing APIs. Best of both worlds!
JWT itself is a nice container for signing a small amount of JSON and being able to easily pass that around. I use it a lot for situations where I want to ensure someone hasn't futzed with the data, and/or I want an auto-expiring token of some sort.
JWT, by itself, is not an authentication and authorization system, but people often use it as such.
If you use a secret token, you would compare the token from the first user 123 to what's stored in the DB (for the second user with that ID), see that it doesn't match, and not log them in.
Because they used a signed user ID, there was no way to differentiate the user 123 from before the system restore and after.
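The difference can be shown in a few lines of Ruby (the signing scheme here is assumed for illustration; the post doesn't describe Travis's actual token format):

```ruby
require "openssl"

SECRET = "demo-signing-key" # stands in for the server's signing secret

# A signed token proves only "we once issued a token for user ID 123";
# it says nothing about which account row currently owns ID 123.
def signed_token(user_id)
  sig = OpenSSL::HMAC.hexdigest("SHA256", SECRET, user_id.to_s)
  "#{user_id}|#{sig}"
end

def user_id_from(token)
  id, sig = token.split("|")
  sig == OpenSSL::HMAC.hexdigest("SHA256", SECRET, id) ? id.to_i : nil
end

token = signed_token(123)
user_id_from(token) # still verifies after the truncate + restore, even if
                    # row 123 now belongs to a different person

# A random token stored in the users table fails closed instead: the
# truncate wipes the stored value, so the presented token matches no row.
```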
When I look at our development, devops and operations people, the likelihood of making stupid mistakes seems to be correlated with skill level rather than with the person's title. The developers who work part time with operations seem to cause fewer problems than pure operations people, maybe because they only do it part time and are therefore a bit more careful.
But I'm sure you can enlighten me with some actual research-based facts?
I thought the same thing. The production database should not even be on the same LAN segment as the development stuff, you should have to VPN or tunnel into it specifically to query it.
Only devops should have this info, and they should guard it fiercely and only use it when they know why and when... It would be easy enough to mirror all production data to a second DB with full read/write access that's updated daily or weekly from the source. That should give devs plenty of data to work with.
Least privileged access isn't necessarily the antithesis of devops.
You could argue that the devops model includes automation for operations on production systems, with direct access to production systems limited to a reduced set of staff. Developers can create the change sets to modify infrastructure, but those change sets are reviewed, validated, tested, and then executed on production via CI/CD automation.
Infrastructure as Code, and Immutable Infrastructure lends itself well to that approach.
The idea that developers should actually be involved in ops doesn't seem to be DevOps anymore. Apparently that's called "NoOps" now, and DevOps is just ops people scripting things.
TL;DR don't rely on humans to "do the right thing", even if they're supposed to know what they're doing (i.e. ops people).
As part of a team (and an organization) that practices "devops" heavily, all of our devs do ops. We maintain the separation using an oncall/ops rotation and only touching prod from designated "ops" hosts. We also follow a "2 person" rule when touching production. We use a secrets management system to deliver credentials to hosts and alarming setup when the "wrong" hosts (e.g. dev hosts with access to prod creds) have access to certain creds.
Finally I can show a solid example to my team mates who ridiculed me when I said we needed restrictions on access to prod servers. This is a great write-up!
Um, you shouldn't need a solid example for such a crystal-clear best practice. I mean, everyone restricts access to PROD; it's simply too valuable and too costly to have to bring back up after a fuckup.
>" Using our API logs, and with information from our upstream provider about the IP address the query originated from, we were able to identify a truncate query run during tests using the Database Cleaner gem."
I'm assuming by "upstream provider" here they mean ISP/IaaS provider. Either way they didn't have enough information under their control to identify the source of the query. The reliance on a third party for accurate logging information seems like a big blind spot.
What if the upstream provider didn't have the logs? Or the request for access to those took an excessive amount of time? I didn't see anything in the remediation steps to address this.
One thing that immediately caught my attention: the fact that it is possible for a single query/command/request to wipe everything.
To be frank, at a place I worked there had always been something like this too: if you were logged in as super admin, wiping all data was just one POST request away. That was super convenient when testing things, but having the same in production made me uneasy. Fortunately, before any incident happened, I added additional checks that required special command-line flags to enable this API. Perhaps still not foolproof, but I felt much better.
I wish people would talk about this more. I am of the belief that if a single command can lead to a failure like this, you can’t simply plan on that accident not reoccurring. You should basically assume that it will reoccur.
Ideally, I think databases should integrate checks such as these. For example, how often does a production users table need to be truncated intentionally? Even by superusers? Usually very, very rarely. So imagine if the database made you jump through hoops before you could do that. People rely on permissions for this sort of thing, but given how complicated permissions can become as the team and the database grow, it's not hard to screw up the permissions and access control. I believe catching this kind of doomsday scenario works best when it's built in deeply, at a very low level.
What about a database that would require separate DDL from data commands? Instead of having a single super-user that can do both DDL and INSERT/UPDATE/DELETE, you would instead have a DDL user, and a data user. That would probably prompt people to only use the data user in their application?
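Short of database support for that, the "jump through hoops" idea can be approximated at the application layer. A hypothetical sketch (the host pattern, function name, and confirmation API are all made up): refuse destructive statements against anything that looks like production unless the caller explicitly confirms.

```ruby
# Hypothetical guard: destructive SQL against production-looking hosts
# requires an explicit confirmation flag from the caller.
PROTECTED_HOSTS = /prod|production/i

def execute!(sql, db_url, confirmed: false)
  destructive = sql.match?(/\A\s*(truncate|drop|delete)\b/i)
  if destructive && db_url.match?(PROTECTED_HOSTS) && !confirmed
    raise "refusing #{sql.split.first.upcase} against #{db_url} without confirmed: true"
  end
  :sent # a real implementation would hand the statement to the driver here
end
```

It's not foolproof (the pattern match can miss, and the flag can be cargo-culted), but it converts a silent one-command wipe into a deliberate two-step action.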
Shouldn’t there be a remediation step of making it impossible to log in to another user’s session? E.g. generate a random number for every provisioned user and add it to the token.
All of my production terminals have dark-red background and my screen hardstatus also red. This is my default in rc files, and I have to explicitly link rc files to get my dev-only black background with lime hardstatus.
Lots of focus in the comments on the database access issue, but trusting the user-specified (signed) token doesn't seem like a great idea. Not validating the token against the database seems like a painful shortcut.
This makes me think of the Google SRE book. They advise that, if there is a problem this big, any SRE should have the power to turn off the production load balancers until the problem is fixed.
I don't think that TravisCI did anything wrong. However, if they had turned off the load balancers as soon as they realized that there was a huge issue, it might have protected customer data more. They optimized uptime over completely fixing the issue. Also, perhaps nobody felt that they had the authority to turn off the production service.
So the session keys mapped to usernames, rather than IDs in the database? Otherwise, when the database is restored with the old user IDs, the session would become invalid instead of continuing to work. This is what I'm seeing:
1. Tables truncated.
2. In this window, someone creates an account with a username that existed in the dropped database.
3. They see a blank user page because a new user record was created.
4. Database restored.
5. It's as if you're logged into the original user's account.
This is not what happened. The tokens were mapped to user IDs and when people signed in, the db created new users which may have had the same IDs as old deleted accounts. When they restored the DB, these tokens pointed to other users and granted access to these other users' accounts. Quite an unfortunate situation. May have been mostly avoidable if UUIDs were used instead of incrementing IDs, but hindsight is 20/20.
The part of that that I don't get is how a new user could have the same ID as an old (truncated) user since "our system created new records for them, with primary keys generated from the existing sequence (PostgreSQL does not reset id sequences on truncate)."
Do they mean that the only potentially exposed accounts are those that signed up after the database was restored?
Yeah they must mean new accounts, if not then I'm lost. I guess it could have reset autoincrement but they said it didn't. The only other thing I can think of is that the signed token that's put in localStorage is sent to the server like "someuser|sometoken", the server inspects sometoken, says it checks out, then takes the client at its word that it's someuser.
Classic case of a developer returning to a window with a prod env set up. I am sure it was a "blameless post-mortem", i.e. the action items contain changes to tooling and processes rather than attempts to change human behaviour.
Aren’t these the folks that spammed every github repo with a spam pull request to integrate their system into your code? I kinda lost all respect for this project and their developers after that incident.
Apologies that you were affected by this. The script creating these pull requests was created and run by a third party not affiliated with the company. We were similarly upset by this.
Correct, this was from an overenthusiastic user, someone we did not know and had no direct contact with.
At the time we weren't actually making money yet, and most of the contributions to Travis CI came from outside collaborators.
To add to the confusion, we did indeed have a bot in the early days that would comment on pull requests, but only if the repository was using Travis CI already (this has now been replaced by GitHub's status API). However, this was not the same bot account that kept opening unsolicited pull requests on random projects.
I was looking for one as well, but it seems we did not write a blog post. I will do some digging when I find the time, as I know we at least messaged some people that voiced their frustration directly.
I will answer in the context of PostgreSQL which is the system I know the best. PostgreSQL allows you to create "copies" of the database using replication. A database which is replicating is always read-only. The database from which it is replicating can be read-write (a master) or read-only. In other words, replicas can replicate from replicas in a tree structure.
Any PostgreSQL server which is replicating is referred to in the documentation as a "standby" or "slave". The terms are used interchangeably.
The replication can be synchronous, in which case the "master" and "slave" have exactly the same set of committed transactions at all times. Or, it can be asynchronous, in which case the slave might possibly lag behind the master. Only a master can have a synchronous replica. It used to be that you could only have one synchronous standby, but with recent PostgreSQL you can have as many as you like.
A standby is called "warm" if it is not serving client requests. It's "hot" if it is. Of course, standbys can only serve read traffic.
To answer your question, "read-only follower" is probably a "hot standby". At least, a Heroku follower is generally there to serve read traffic.
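For reference, a minimal hot-standby setup in modern PostgreSQL (12+) looks roughly like this; the host and user names are assumed:

```
# postgresql.conf on the standby
primary_conninfo = 'host=primary.example.com user=replicator'
hot_standby = on   # serve read-only queries while replicating

# plus an empty standby.signal file in the standby's data directory,
# which tells the server to start up as a replica
```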
"master/slave" evokes the unhappy history of slavery, which people care about to widely varying degrees.
"leader/follower" or "primary/replica" are much more neutral terms that won't prompt negative emotions in many people.
People who choose one of the latter two options do so either because they feel it is more accurate, or because they wish to avoid the negative connotations of "master/slave", or a mix of both.
It's better to stick to tried and true terminology that everybody understands than to needlessly introduce new and redundant designations for concepts that have been in use for decades, just to avoid upsetting rather irrational American sensibilities.
Leader/follower and primary/replica are common terminology now. It's unfair to paint the sensibilities as irrational - this is exactly an example of the kind of nonchalance that helps lead to underrepresentation of minorities in tech.
You might not care, but there are a lot of people who do. Taking steps like this improves the comfort level of others while affecting yours not at all. Why is that not worth it?
So now we have three pairs of terms that apparently mean exactly the same thing. I'm sure that the "minorities in tech" are more interested in not being needlessly confused than in not seeing the word "slave". Who would even think of this in terms of human slavery, and why would it be connected to American minorities only?
I'm currently job-seeking and I've seen many jobs ads asking for CI experience.
I'm not fond of using SaaS solutions and would like to fiddle with CI in private (e.g. using a private GitLab repo).
What would be the steps to set up my own private, open-source CI solution for, say, a Go, PHP, or JavaScript project?
You can run GitLab yourself, for free. Or even just supply your own runner on GitLab.com: the builds run on your own equipment, but you don't need to host your own GitLab instance.
Presumably because that's already SOP. It sounds like the query was supposed to run against a development or staging DB, but an environment variable that the dev wasn't aware of caused it to run against prod instead.