I was a fly on the wall as this work was being done, and it was super interesting to see the discussions. I was also surprised that Jepsen didn't find critical bugs. Clarifying the docs and the unusual (intentional) behaviors was a very useful outcome. It was a very worthwhile confidence-building exercise given that we're running a bank on Datomic…
Given that Rich Hickey designed this database, the outcome is perhaps unsurprising. What a fabulous read. Anytime I feel like I'm reasonably smart, it's good to be humbled by a Jepsen analysis.
A good design does not guarantee the absence of implementation bugs, but it can make introducing bugs harder and less probable. This must be such a case, and then it's one worth studying and maybe emulating.
In practical terms, if you are a database and Jepsen doesn't find any bugs, that's as much assurance as you are going to get in 2024 short of formal verification.
Formal verification is very powerful but still not full assurance. Fun fact: testing and monitoring of Datomic has sometimes uncovered design flaws in underlying storage services that formal verification missed.
To start with, you usually perform verification on a behavioral model rather than on the code itself. This opens up the possibility of behavioral differences between the code and the model that wouldn't be caught.
Thank you. I've updated my initial estimate of p(no critical bugs | Jepsen did not find critical bugs) from 0.5 to 0.82, given my estimate of the likelihood and base rates.
Evidence of absence ("we searched really carefully and nothing came up") does update the Bayesian priors significantly, so the probability of absence of bugs can now be estimated as much higher.
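For the curious, here's a minimal sketch of the update being described. The 0.5 prior comes from the comment above; the ~0.78 detection rate (the chance Jepsen would have caught a critical bug if one existed) is an assumed figure chosen purely so the arithmetic lands near 0.82:

```python
# Bayesian update: how much should "Jepsen found nothing" shift our belief?
# Both the prior (0.5) and the detection rate (0.78) are illustrative assumptions.
prior_no_bugs = 0.5                # P(no critical bugs) before the analysis
p_clean_given_no_bugs = 1.0        # a clean report is certain if there's nothing to find
p_clean_given_bugs = 1.0 - 0.78    # chance Jepsen misses critical bugs that do exist

# P(no bugs | clean report) via Bayes' rule
evidence = (p_clean_given_no_bugs * prior_no_bugs
            + p_clean_given_bugs * (1.0 - prior_no_bugs))
posterior_no_bugs = p_clean_given_no_bugs * prior_no_bugs / evidence
print(f"P(no critical bugs | clean Jepsen report) = {posterior_no_bugs:.2f}")  # 0.82
```

Under those assumptions the posterior lands at roughly 0.82; the more likely you think Jepsen is to catch a real bug, the more a clean report should move you.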
I doubt any organization that isn't directly putting lives on the line tests database technology as thoroughly and competently as Jepsen. A bank's job is to be a bank, not to be Jepsen.
I would have thought they would be more rigorous, since mistakes could threaten the very viability of the business? Which is why I assume most are still on mainframes. (I've never worked at a bank.)
Banks have existed since long before computers, and so they have ways to detect and correct errors that are not purely technological (such as double-entry bookkeeping, backups, supporting documentation, and separate processes). A bank can survive a DB doing nasty things at a low enough frequency that it isn't detected beforehand, so they don't need to "prove in Coq" that everything is correct.
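As a sketch of the double-entry point: each transaction is recorded as a matching debit and credit, so the books balance to zero by construction, and a lost or corrupted entry is detectable without trusting the database itself. The account names and amounts below are invented for illustration:

```python
from collections import defaultdict

# Minimal double-entry ledger: every transaction posts an equal debit and
# credit, so the sum over all accounts must stay zero. A silently dropped
# or corrupted entry breaks that invariant and can be caught by audit.
ledger = defaultdict(int)

def transfer(debit_account: str, credit_account: str, amount_cents: int) -> None:
    ledger[debit_account] -= amount_cents
    ledger[credit_account] += amount_cents

transfer("alice_checking", "bob_checking", 12_500)
transfer("bob_checking", "fees", 150)

assert sum(ledger.values()) == 0, "books out of balance: an entry was lost or corrupted"
```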
Mistakes don't threaten them that much. When Equifax (admittedly not a bank) can make massive, negligent fuckups and still be a going concern, there isn't much heat there. Most fuckups a bank makes can be unwound.
Mainframe systems aren't tested to Jepsen's level of rigor just because they were built on mainframes in the 70s/80s. In fact, quite the opposite.
Banks are not usually run by people who go for the first fad.js they see; they can also usually think further ahead than five minutes.
Also, I'm sure they engineer their systems so that every operation and action is logged multiple times, with multiple layers of redundancy.
A main transaction DB will not be a "single source of truth" for any event. It will be the main source of truth, but the ledger you see in your online bank is only a simplified view into it.
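As a rough sketch of what "a simplified view" means here (the event records and field names below are invented for illustration, not Datomic's actual schema): the customer-facing ledger can be derived on demand from the underlying transaction history, rather than being stored as truth itself.

```python
# Hypothetical event records from a main transaction DB; the online-banking
# "ledger" is just a projection over them, not the source of truth.
events = [
    {"account": "alice", "kind": "deposit",      "amount_cents": 50_000, "tx_id": 1},
    {"account": "alice", "kind": "withdrawal",   "amount_cents": 12_500, "tx_id": 2},
    {"account": "alice", "kind": "fee_reversal", "amount_cents": 150,    "tx_id": 3},
]

def ledger_view(account: str) -> list[tuple[int, int]]:
    """Derive the simplified (tx_id, running balance) view a customer sees."""
    balance, rows = 0, []
    for e in sorted(events, key=lambda e: e["tx_id"]):
        if e["account"] != account:
            continue
        sign = -1 if e["kind"] == "withdrawal" else 1
        balance += sign * e["amount_cents"]
        rows.append((e["tx_id"], balance))
    return rows

print(ledger_view("alice"))  # [(1, 50000), (2, 37500), (3, 37650)]
```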