
My favorite database by far today is LMDB (B+Tree).[0] Performance is insane, and very low-variance. Reads scale linearly with core counts, and it has a lot of useful index types and knobs to get maximum performance.

What am I most looking forward to using later this year? ScyllaDB[1] and CockroachDB[2], both in conjunction with LMDB.

[0] http://104.237.133.194/doc/

[1] http://www.scylladb.com/

[2] https://www.cockroachlabs.com/



If I'm not mistaken, doesn't LMDB have a single, global write lock?


> doesn't LMDB have a single, global write lock?

It does, yes. The best architecture when using LMDB is to have a single writer thread, and many independent reader threads.
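To make the shape of that architecture concrete, here's a toy sketch of the single-writer/many-reader pattern in C++ — not LMDB's actual API, just an illustration of the structure: every mutation funnels through one writer thread, while readers grab an immutable snapshot and never take a lock while reading (all names here are made up for the example):

```cpp
#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>

// Toy key-value store: one writer thread owns all mutations; readers
// work against an immutable snapshot they load atomically.
struct Snapshot { std::unordered_map<std::string, std::string> data; };

class SingleWriterStore {
public:
    SingleWriterStore()
        : snap_(new Snapshot), writer_([this] { WriterLoop(); }) {}

    ~SingleWriterStore() {
        { std::lock_guard<std::mutex> l(m_); done_ = true; }
        cv_.notify_one();
        writer_.join();
    }

    // Any thread may enqueue a write; only the writer thread applies it.
    void Put(std::string key, std::string val) {
        { std::lock_guard<std::mutex> l(m_);
          queue_.push({std::move(key), std::move(val)}); }
        cv_.notify_one();
    }

    // Readers take a consistent snapshot; no locks held while reading it.
    std::shared_ptr<const Snapshot> Read() const {
        return std::atomic_load(&snap_);
    }

private:
    void WriterLoop() {
        std::unique_lock<std::mutex> l(m_);
        for (;;) {
            cv_.wait(l, [this] { return done_ || !queue_.empty(); });
            if (queue_.empty()) return;  // done_ set, nothing left to apply
            // Copy-on-write: apply all pending writes to a fresh snapshot,
            // then publish it atomically. Readers never block on the writer.
            auto next = std::make_shared<Snapshot>(*std::atomic_load(&snap_));
            while (!queue_.empty()) {
                auto kv = std::move(queue_.front());
                queue_.pop();
                next->data[kv.first] = std::move(kv.second);
            }
            std::atomic_store(&snap_, std::shared_ptr<const Snapshot>(next));
        }
    }

    mutable std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::pair<std::string, std::string>> queue_;
    std::shared_ptr<const Snapshot> snap_;
    bool done_ = false;
    std::thread writer_;
};
```

LMDB's MVCC design gives you the same property for real: readers run against a stable page tree and never contend with the single writer.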

That said, it is nice to be able to run the occasional separate process to do some maintenance task on the database (e.g. to do some writes), and the global lock comes in handy then.

The main reason a single-write-thread design isn't a problem is that you can do a million or more separate, fully transactional and isolated write transactions per second on a single core with LMDB. Writes are simply not the bottleneck.

You can't get anywhere near that kind of write performance with a fully transactional and concurrent write system like CockroachDB, not even with 100x the hardware. However, some of the eventually consistent databases, like ScyllaDB, can scale up to more total writes per second (but of course you give up the transactions, and the consistency).

The actual issue I have with using LMDB in this way is keeping it fed! My code is now written in C++, I'm using Aeron for high-performance messaging and Cap'n Proto for messages (eliminates serialization/deserialization) as well as Cap'n Proto for storing objects inside LMDB. I'm nowhere near LMDB's write thread being the bottleneck, and I'll only get close to it once I'm on a 20Gbit/sec machine.

IMO a properly designed system will always and everywhere be bandwidth constrained, not CPU or "architecture" constrained. (Put another way, everyone has access to supercomputers these days, we just use them poorly most of the time.)


> You can't get anywhere near that kind of write performance with a fully transactional and concurrent write system

Sure you can. MSR's Bw-tree[1] is my favorite example, but there are others. I've learned some interesting things from both you and Howard, and thank you for it, but the hype around LMDB is a little thick at times. It is possible to build lock-free, low-latency, concurrent transactional datastores.

Edit for folks too lazy to scan the whole paper:

They test a real-world workload sampled from Xbox Live services. It's roughly 100-byte keys, 1 KB values, and around an 8:1 read-to-write ratio. They sustained over 10 million ops a second on a 4-core machine. There are no locks, and all transactions can read or write as they like, concurrently.

[1]: https://www.microsoft.com/en-us/research/wp-content/uploads/...


> Sure you can.

Where's the source code for that? (I read the paper when it came out, good stuff!)

In my experience, I don't fully trust research results—especially for databases—until they've been replicated in production. Especially on high-performance systems, small changes (oh, we added error handling, or oh, now we're actually calling fsync) can easily drop actual achieved performance 1 or even 2 orders of magnitude. (I'm not saying this is true for the linked paper, just that I'd want to verify the actual achieved performance on my own workloads before I switched to it. I do believe it's possible FWIW, I just haven't seen it in the real world.)

So if there's an LMDB-style codebase out there I can spin up that uses a Bw-tree internally, I'm definitely interested and would love a link. :-)

Also FWIW, what I like about LMDB isn't just the speed, it's the entire architecture. This fall I'm looking into using ScyllaDB for page storage, and storing the meta pages inside CockroachDB, with a user-per-LMDB-environment model. LMDB's essential design makes that (relatively) easy, whereas with, say, RocksDB, not so much (at least for me).


> Where's the source code for that?

Closed. It's backing Hekaton in MS SQL and some features of Azure DocumentDB.

While there are many systems in the literature that demonstrate some variation of lock free log and indirection map, they're all either research demonstration codebases or closed source to the best of my knowledge.

The reason I have an axe to grind here is I think the generalizations around LMDB made by you and others are making this situation worse, by convincing people that haven't read the last few years of literature that LMDB's design is all there is. LMDB is great but narrow. It has no answer for many workloads. It's inevitable that better things will be devised. I want those things to be FOSS. Do the over-reaching and over-simplifying comments about LMDB make that more or less likely?


> I think the generalizations around LMDB made by you and others are making this situation worse

Fair enough. For anyone else following, @jasonwatkinspdx is correct that LMDB is not the right (= best available today) choice for certain workloads. I personally used TrailDB for event traces, and if I needed prefix compression on keys (as the CockroachDB team does), RocksDB would arguably be a better choice.

What I don't agree with is that LMDB's existence or users are causing open source database research to stagnate. Quite the contrary, I think the benefits of LMDB's architecture have not yet been fully realized and I would love to see more research be done to improve/extend it.

Most of the research action today (at least to me) seems focused on making log-structured merge trees and variations thereof less sucky at the margin, instead of identifying what makes LMDB's design actually tick and then improving on it. There's still a lot of low-hanging fruit in the LMDB architecture waiting to be explored.


Thanks for a really civil reply to what was admittedly a pretty barbed comment on my part. I wasn't aware of TrailDB, I'll check it out. Thanks!


Do you have any opinions on Cap'n Proto vs Extprot[0]?

[0]: https://github.com/mfp/extprot


I hadn't heard of Extprot before so I took a quick look. It appears to be heavily inspired by Protobuf, having almost exactly the same arrangement on the wire, adding a few tweaks but keeping many idiosyncrasies.

The Cap'n Proto documentation compares it extensively with Protobuf and most of those comparisons probably apply to Extprot as well.

(Disclosure: I am the author of Cap'n Proto and also Protobuf v2.)


I hadn't seen Extprot before you mentioned it. That said, it appears to require deserialization, and that's something we avoid. LMDB can hand you back a pointer to your object in the memory map, and we can easily use that pointer with Cap'n Proto without any calls to malloc().
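To illustrate what "no deserialization" buys you: with a flat, fixed-layout encoding, "decoding" a record is just casting a pointer into the buffer the database hands back. This toy sketch uses a hand-rolled packed struct rather than Cap'n Proto's actual wire layout (all names here are invented for the example):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// A fixed-layout record read in place from a buffer, the way a memory-mapped
// database can hand one back. No parse step, no allocation: the "decode" is
// a cast. (Real formats like Cap'n Proto add pointers, bounds checks, and
// schema evolution on top of the same zero-copy principle.)
#pragma pack(push, 1)
struct UserRecord {
    uint64_t id;
    uint32_t score;
    char     name[16];  // fixed-width, NUL-padded
};
#pragma pack(pop)

// In real use `buf` would point into the database's memory map; the caller
// must guarantee the buffer is at least sizeof(UserRecord) bytes.
const UserRecord* ViewRecord(const uint8_t* buf) {
    return reinterpret_cast<const UserRecord*>(buf);  // zero-copy view
}
```

The catch, of course, is that the view is only valid as long as the read transaction (and thus the mapped page) stays alive.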


What are your thoughts on SBE vs Cap'n Proto? I don't know much about serialization libraries, but I was thinking about looking into both soon. Since you're using Aeron, I assume you looked into SBE (same developer for both) and decided not to go with it.


SBE is better if you don't have (much) in the way of message format evolution and you intend to read each message in full.

Cap'n Proto has a way better compatibility story, and there's no need to read the entire message just to read a particular part.

SBE is very good for its intended use case: high-frequency trading, with tons of tiny messages. Cap'n Proto is better as a stable (but evolvable) message format between systems (e.g. RPCs), or for on-disk storage if you've got a database (like LMDB) that doesn't force you to copy read-only data.


Thank you! I'm very newbish on this stuff, but I'm trying to learn.


Really liking CockroachDB so far.

Also, reading through their site, I didn't know Google's Spanner required atomic clocks! Holy deep pockets, Batman!


I don't think such clocks are that expensive. A quick search shows several companies offering them ready to go. I'd guess that even relative to the engineering time involved, the cost of the clocks was irrelevant.

GPS NTP servers start at $1000, but I dunno how much accuracy that delivers. The actual GPS signal is probably accurate enough, but the rest of the hardware chain on a $1000 box might not be <1ms. Google might also have wanted a backup in case antennas are disturbed or the GPS signal is somehow interrupted?

But even then - just because Google chose 7ms doesn't mean everyone else has to. Drop a GPS NTP master at each site and live with that? I wouldn't imagine it'd get much worse than 10-20ms?
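For context on why that error bound matters so much: Spanner's trick is "commit wait" — assign a timestamp past the clock's upper uncertainty bound, then hold the acknowledgment until that timestamp is guaranteed to be in the past everywhere. The latency cost scales directly with the uncertainty, which is why Google spent money shrinking it. A toy sketch of the arithmetic (names and numbers are illustrative, not Spanner's actual API; everything in milliseconds):

```cpp
// Commit-wait sketch: epsilon is the worst-case clock error bound.
// A bigger epsilon means every write transaction waits longer before
// it can be acknowledged, so clock quality directly buys write latency.
struct CommitPlan {
    long commit_ts;   // timestamp assigned to the transaction
    long ack_after;   // earliest local time at which we may acknowledge
};

CommitPlan PlanCommit(long now, long epsilon) {
    long ts = now + epsilon;     // guaranteed >= every replica's current clock
    return {ts, ts + epsilon};   // wait until ts is definitely in the past
}
```

With a 7ms bound you pay up to ~14ms before acking; with a 20ms bound from a cheap GPS NTP setup, up to ~40ms — workable for many systems, just not free.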

I don't really know a whole lot about this though.


Not quite atomic - but an interesting Raspberry Pi project is an NTP server with a GPS unit. It doesn't take deep pockets at all, and it offers far more accurate time sync than the average person needs. NTP will rate it at a lower (better) stratum than any network-only time server.
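For anyone wanting to try it: the usual arrangement is gpsd feeding ntpd through the shared-memory refclock (driver 28 in ntpd's numbering). A minimal ntp.conf fragment might look like this — the time1 fudge value is board-specific and purely illustrative:

```conf
# gpsd publishes NMEA time via shared-memory refclock driver 28, unit 0.
# NMEA alone is only good to ~100 ms, so fudge out the serial-path delay.
server 127.127.28.0 minpoll 4 maxpoll 4
fudge  127.127.28.0 time1 0.130 refid GPS

# Unit 1 carries the PPS pulse from the GPS module: microsecond-level,
# so mark it preferred.
server 127.127.28.1 minpoll 4 maxpoll 4 prefer
fudge  127.127.28.1 refid PPS
```

The PPS line is what gets you from "pretty good" to sub-millisecond; without a module that exposes the pulse, you're stuck with NMEA-level accuracy.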

I had one in the NTP pool for a while.


I used to think this as well, until I started talking to more shops that own their own hardware. As it turns out, having atomic clocks isn't quite as extraordinary as you think. If there's enough competition in the public cloud, we might be seeing them there pretty soon.


Only a small number, used as insurance/backup for problems with GPS. It's not really a huge problem for someone buying many racks of their own equipment.



