Btree vs. LSM

hendzen · on Aug 30, 2016

I have a lot of respect for the author of that page, Alex Gorrod. However, this benchmark is quite dated (2014). The WiredTiger codebase was still immature when it was written. Since WiredTiger's acquisition by MongoDB (the company) and integration in to MongoDB (the database), the btree and LSM implementations have undergone extensive changes and are now much more hardened and adapted to production workloads. Furthermore it would be worth considering other engines such as InnoDb (btree) and RocksDB (LSM), which along with WiredTiger are considered the leading open source storage engines.

For more recent and realistic benchmarks you should look at the work of Mark Callaghan.

erpellan · on Aug 30, 2016

Mark's blog: http://smalldatum.blogspot.co.uk/

3pt14159 · on Aug 29, 2016

Great analysis but can someone fork this and update the graphs to improve some things? I'd do it but I'm on a really bad internet connection right now.

1. Use SI magnitudes, so that we don't have to count the zeros. (so 100k instead of 100000.)

2. Either use the same chart for LSM vs Btree for each of read and write, or at the very least use the same y axis between the two.

3. It's hard to see the difference on the limited write benchmarks. Can we make it either logarithmic or another graph with these differences highlighted?

edejong · on Aug 29, 2016

4. Please use graphs which highlight the distribution of the write/read latencies. This points to possible outliers and reduces the reliability on averages. Possible example graphs are a violin plot[1] or a scatterplot with jitter [2].

[1] https://en.wikipedia.org/wiki/Violin_plot [2] http://stackoverflow.com/questions/23675735/how-to-add-boxpl...

ddorian43 · on Aug 29, 2016

Many benchmarks on different dbs using either lsm/btree on different environments/data-sets: https://symas.com/products/lightning-memory-mapped-database/... (including wired tiger)

petergeoghegan · on Aug 29, 2016

I agree that the quality of each implementation is very pertinent here. I wonder if the B-Tree implementation in question used the Lehman & Yao B-link technique, in order to obviate the need for "latch coupling"/"crabbing". At the same time, it's interesting to consider if the LSM implementation is similarly optimized to minimize low-level locks/semaphore overhead, if in fact that's actually possible.

In general, B-Tree index performance has plenty to do with low-level optimizations, beyond things that are published in academic papers. I'm talking about things like micro-optimizing the number of CPU cache misses on internal pages, the use of interpolation search, and so on.

erichocean · on Aug 29, 2016

My favorite database by far today is LMDB (B+Tree).[0] Performance is insane, and very low-variance. Reads scale linearly with core counts, and it has a lot of useful index types and knobs to get maximum performance.

What am I most looking forward to using later this year? ScyllaDB[1] and CockroachDB[2], both in conjunction with LMDB.

[0] http://104.237.133.194/doc/

[1] http://www.scylladb.com/

[2] https://www.cockroachlabs.com/

hendzen · on Aug 30, 2016

If I'm not mistaken, doesn't LMDB have a single, global write lock?

erichocean · on Aug 30, 2016

> doesn't LMDB have a single, global write lock?

It does, yes. The best architecture when using LMDB is to have a single writer thread, and many independent reader threads.

That said, it is nice to be able to run the occasional separate process to do some maintenance task on the database (e.g. to do some writes), and the global lock comes in handy then.

The main reason a single write thread design isn't a problem is you can do a million or more separate, fully transactional and isolated write transactions per second on a single core with LMDB. Writes are simply not the bottleneck.

You can't get anywhere near that kind of write performance with a fully transactional and concurrent write system like CockroachDB, not even with 100x the hardware. However, some of the eventually consistent databases, like ScyllaDB, can scale up to more total writes per second (but of course you give up the transactions, and the consistency).

The actual issue I have with using LMDB in this way is keeping it fed! My code is now written in C++, I'm using Aeron for high-performance messaging and Cap'n Proto for messages (eliminates serialization/deserialization) as well as Cap'n Proto for storing objects inside LMDB. I'm nowhere near LMDB's write thread being the bottleneck, and I'll only get close to it once I'm on a 20Gbit/sec machine.

IMO a properly designed system will always and everywhere be bandwidth constrained, not CPU or "architecture" constrained. (Put another way, everyone has access to supercomputers these days, we just use them poorly most of the time.)

jasonwatkinspdx · on Aug 31, 2016

> You can't get anywhere near that kind of write performance with a fully transactional and concurrent write system

Sure you can. MSR's bw-tree[1] is my favorite example but there are others. I've learned some interesting things from both you and Howard and thank you for it, but the hype around lmdb is a little thick at times. It is possible to build lock free low latency concurrent transactional datastores.

Edit for folks to lazy to scan the whole paper:

They test a real world workload sampled from xbox live services. It's roughly 100 byte keys, 1k values, and around 8:1 read to write ratio. They sustained over 10 million ops a second on a 4 core machine. There are no locks and all transactions can read or write as they like concurrently.

[1]: https://www.microsoft.com/en-us/research/wp-content/uploads/...

erichocean · on Aug 31, 2016

> Sure you can.

Where's the source code for that? (I read the paper when it came out, good stuff!)

In my experience, I don't fully trust research results—especially for databases—until they've been replicated in production. Especially on high-performance systems, small changes (oh, we added error handling, or oh, now we're actually calling fsync) can easily drop actual achieved performance 1 or even 2 orders of magnitude. (I'm not saying this is true for the linked paper, just that I'd want to verify the actual achieved performance on my own workloads before I switched to it. I do believe it's possible FWIW, I just haven't seen it in the real world.)

So if there's an LMDB-style codebase out there I can spin up that uses a Bw-tree internally, I'm definitely interested and would love a link. :-)

Also FWIW, what I like about LMDB isn't just the speed, it's the entire architecture. This fall I'm looking into using ScyllaDB for page storage, and storing the meta pages inside CockroachDB, with a user-per-LMDB-environment model. LMDB's essential design makes that (relatively) easy, whereas with, say, RocksDB, not so much (at least for me).

jasonwatkinspdx · on Aug 31, 2016

> Where's the source code for that?

Closed. It's backing Hekaton in MS SQL and some features of Azure DocumentDB.

While there are many systems in the literature that demonstrate some variation of lock free log and indirection map, they're all either research demonstration codebases or closed source to the best of my knowledge.

The reason I have an axe to grind here is I think the generalizations around LMDB made by you and others are making this situation worse, by convincing people that haven't read the last few years of literature that LMDB's design is all there is. LMDB is great but narrow. It has no answer for many workloads. It's inevitable that better things will be devised. I want those things to be FOSS. Do the over-reaching and over-simplifying comments about LMDB make that more or less likely?

erichocean · on Aug 31, 2016

> I think the generalizations around LMDB made by you and others are making this situation worse

Fair enough. For anyone else following, @jasonwatkinspdx is correct that LMDB is not the right (= best available today) choice for certain workloads. I personally used TrailDB for event traces and if I needed prefix compression on keys (like the CockroachDB team), RocksDB would be an arguably better choice.

What I don't agree with is that LMDB's existence or users are causing open source database research to stagnate. Quite the contrary, I think the benefits of LMDB's architecture have not yet been fully realized and I would love to see more research be done to improve/extend it.

Most of the research action today (at least to me) seems focused on making log-structure merge trees and variations thereof less sucky at the margin instead of identifying what makes LMDB's design actually tick and then improving on it. There's still a lot of low-hanging fruit in the LMDB architecture waiting to be explored.

jasonwatkinspdx · on Aug 31, 2016

Thanks for a really civil reply to what was admittedly a pretty barbed comment on my part. I wasn't aware of TrailDB, I'll check it out. Thanks!

lj3 · on Aug 30, 2016

Do you have any opinions on Cap'n Proto vs Extprot[0]?

[0]: https://github.com/mfp/extprot

kentonv · on Aug 30, 2016

I hadn't heard of Extprot before so I took a quick look. It appears to be heavily inspired by Protobuf, having almost exactly the same arrangement on the wire, adding a few tweaks but keeping many idiosyncrasies.

The Cap'n Proto documentation compares it extensively with Protobuf and most of those comparisons probably apply to Extprot as well.

(Disclosure: I am the author of Cap'n Proto and also Protobuf v2.)

erichocean · on Aug 30, 2016

I haven't seen Extprot before you mentioned it. That said, it appears to require deserialization and that's something we avoid. LMDB can hand you back a pointer to your object in the memory map, and we can easily use that pointer with Cap'n Proto without any calls to malloc().

AnbeSivam · on Sept 1, 2016

What are your thoughts on SBE vs Cap'n Proto. I don't know much about serialization libraries, but was thinking about looking into these both soon. Since you are using Aeron, I thought you would have looked into SBE( since developer being same for both) and not decided to go with it.

erichocean · on Sept 1, 2016

SBE is better if you don't have (much) in the way of message format evolution and you intend to read each message in full.

Cap'n Proto has a way better compatibility story, and there's no need to read the entire message just to read a particular part.

SBE is very good for its intended use case: high-frequency trading with tiny, but tons, of small messages. Cap'n Proto is better for use as a stable (but evolvable) message format between systems (e.g. RPCs), or for on-disk storage if you've got a database (like LMDB) that doesn't force you to copy read-only data.

lj3 · on Sept 1, 2016

Thank you! I'm very newbish on this stuff, but I'm trying to learn.

filereaper · on Aug 30, 2016

Really liking CockroachDB so far.

Also reading through their site, I didn't know Google's Spanner required AtomicClocks! Holy deep pockets Batman!

MichaelGG · on Aug 30, 2016

I don't think such clocks are that expensive. A quick search shows several companies offering them ready to go. I'd guess that even relative to the engineering time involved, the cost of the clocks was irrelevant.

GPS NTP servers start at $1000, but I dunno how much that delivers. The actual GPS signal is probably accurate enough, but the rest of the hardware chain on a $1000 box might not be <1ms. Google might have also wanted a backup in case antennas are disturbed or GPS signal somehow interrupted?

But even then - just because Google chose 7ms doesn't mean everyone else has to. Drop a GPS NTP master at each site and live with that? I wouldn't imagine it'd get much worse than 10-20ms?

I don't really know a whole lot about this though.

technion · on Aug 30, 2016

Not quite atomic - but an interesting Raspberry Pi project is an NTP server with a GPS unit. It doesn't take deep pockets a all and it offers far more accurate time sync than any average person needs. NTP will rate it at a lower stratum than any network-only time server.

I had one in the NTP pool for a while.

jzelinskie · on Aug 30, 2016

I used to think this as well, until I started talking to more shops that own their own hardware. As it turns out, having atomic clocks isn't quite as extraordinary as you think. If there's enough competition in the public cloud, we might be seeing them there pretty soon.

jasonwatkinspdx · on Aug 31, 2016

Only a small number, used as insurance/backup for problems with GPS. It's not really a huge problem for someone buying many racks of their own equipment.

jules · on Aug 30, 2016

B-epsilon trees are a modification of B-trees that support much higher write throughput. It is done by adding a buffer to each node in the tree so that you don't need to travel all the way down the tree every time. When a buffer is full all its contents are pushed down to the next level in one batch. It would be interesting to see how well they stack up against LSM's write throughput.

ddorian43 · on Aug 30, 2016

Looks like fractal trees: https://en.wikipedia.org/wiki/Fractal_tree_index

jules · on Sept 2, 2016

Thanks for the link, that cleared up a confusion that I've had for a while. I've read a lot of contradictory things about fractal trees, but apparently they changed the data structure from a COLA to a B-epsilon tree at some point but kept the name fractal tree.

armon · on Aug 30, 2016

The used synthetic benchmark is not a great indicator of performance in most (any?) production environments. It's generally useful and interesting to understand the tradeoffs between an LSM and B-Tree. Particularly, if you have an update or delete heavy workload the compaction cost of an LSM can become an issue. B-Tree's don't suffer from compaction issues, so in some sense trade consistently slower writes instead of amortizing out an I/O intensive compaction.

Some of the worst production experiences I've had came from exhausting I/O on the database, and then having LevelDB / RocksDB / LSM stores kick off their compaction. B-Tree will give you a very consistent redline in terms of performance generally.

TL;DR: There are trade offs between the two, but this benchmark is not particularly insightful given that it doesn't really test any of the interesting boundary conditions or real world query patterns.

andyjpb · on Aug 30, 2016

B-trees (especially Copy on Write B-trees) tend to be limited by the disk seek performance. The write performance of a virgin B-tree will also vary substantially depending on the keys that are being inserted. Random keys will lead to a lot of splitting of the tree which can be slow as it requires a lot of locking and allocation. Sequential keys will amortise the expensive part of building the structure of the tree.

This test uses a 10GB test size which is very small compared to the size of RAM available for the page cache. This will hide the seek penalties that the B-Tree takes on reads. Over a much larger test one would expect the write performance of a well written LSM to flatten out to a constant, indefinitely sustainable, rate. A (COW) B-tree would be expected to gradually degrade over time as it gets more fragmented and takes more seek penalties for writes and range reads.

The seek penalty is relevant for SSDs as there are still only a finite number available: you can expect on the order of 100K/s rather than the 100/s you might expect from a conventional disk. You still want to be mostly operating those things in streaming read mode where possible.

filereaper · on Aug 30, 2016

I'd love to see a breakdown of Databases broken down by their primary underlying storage mechanism (eg):

- RDBMS: B-Tree Layout (good for lookups)

- No-SQL (like) DB's: LSM (good for heavy write throughputs)

And then there's ones like Dremel which opt for high-octane full-table scans.

jeffdavis · on Aug 30, 2016

RDBMS often support a variety of primary and secondary index types. There's nothing about the relational model or SQL that won't work with an LSM tree.

rawnlq · on Aug 29, 2016

Might be a stupid question, but is this performance of just the data structure in memory or with reading/writing to disk also?

mnw21cam · on Aug 30, 2016

Let's see - they are generating 10GB of data, but they have 144GB of RAM. I'd say those reads are coming from RAM. The writes are quite possibly waiting for disc though.

avita1 · on Aug 29, 2016

LSMs and Btrees are both data structures designed for use with disks, so the tests were probably run with disks.

alexnewman · on Aug 30, 2016

Anything with blocks. SSD have blocks. nvme on the other hand....

hans · on Aug 30, 2016

here's a comparison of the LSM and Fractal trees.

http://highscalability.com/blog/2014/8/6/tokutek-white-paper...

professorm · on Aug 29, 2016

I noticed they used LD_PRELOAD=/usr/lib64/libjemalloc.so

glibc malloc not up to the task?

wolf550e · on Aug 29, 2016

People often run MySQL with jemalloc preloaded. Here's a benchmark comparing glibc ptmalloc, jemalloc and tcmalloc for some storage engines: http://lmdb.tech/bench/inmem/malloc/

mtanski · on Aug 29, 2016

From my experience and extensive testign while building a OLAP systems jemalloc performs much faster then glibc's malloc.

It's just as tcmalloc or faster (in new versions). Unlike tcmalloc is has very low memory fragmentation and is aggressive by default in returning it to the OS.

Returning memory to the OS is very valuable in DBMS since every extra page not used in userspace is a extra page in page cache which in my testing leads to better query performance under load.