MondayDB: A new in-house data engine from monday.com (medium.com/liranbrimer)
48 points by yla92 on Oct 9, 2023 | hide | past | favorite | 53 comments


> Unlimited tables

If you ever find yourself in a position where you need a database table per instance of an abstraction, you've almost certainly failed to model the domain appropriately.

If you find yourself in a position where you think you need to write your own database engine because nothing on the market can even approach your problem, you probably need to go take a nap.

The only instancing that makes sense to me is at the database level (aka one per customer/board/etc), and if you're going to be doing a lot of database files... Did SQLite even remotely cross the mental threshold here? I can see several ways to make it solve this problem if you are going to insist on one table/file per real world thing.

One SQLite database per board with everything defined in SQL command text using application defined functions to bind to platform features. Customization can be performed deeply and at the grain of each board. You could allow enterprise customers to bring custom schemas or even code modules to inject and call. Would also make import/export absolutely trivial for any customer.
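The per-board idea can be sketched with Python's built-in sqlite3 module; the table layout and the bound `display_name` function here are hypothetical stand-ins for "platform features":

```python
import sqlite3

# Hypothetical sketch of one SQLite file per board, with an
# application-defined function exposing a made-up platform
# feature (a name formatter) to board-local SQL.
def open_board(path):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS items "
               "(id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
    # bind a Python function so SQL in this board can call it
    db.create_function("display_name", 1, lambda s: s.strip().title())
    return db

db = open_board(":memory:")  # a real file path per board in practice
db.execute("INSERT INTO items (name, status) VALUES (?, ?)",
           ("  deploy api  ", "done"))
row = db.execute("SELECT display_name(name) FROM items").fetchone()
# row[0] == "Deploy Api"
```

Import/export then really is trivial: each board is a single ordinary file.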

Or we can spend all our special innovation tokens on developing a goddamn database engine from zero...


There are RARE exceptions.

AirTable developed their own DB, but their case may be justified. I listened to an interview with a founder once - very thoughtful and reasonable. They fully understood what they were getting into.

But by default, NEVER:

* Write your own DB

* Write your own search engine to match Elasticsearch or other serious technologies

It will take over your life, and then ruin it.


Not sure why they needed their own DB. Fibery.io has a similar domain and we built everything on Postgres. Works like a charm, and you don't even have Airtable's base-connectivity problem. We have a schema-per-customer and table-per-entity-type model, and performance is quite good.


Shhh, don't tell them about SQLite! It's the secret sauce behind Grist (I am a founder), and we are sort of competitors. Wouldn't want Monday.com knowing about our competitive advantage!


One table or schema per customer is possibly a valid model for segregation in a multi-tenant system. I’ve never felt the need to do it before, but maybe it could work, if you squint?


If you never touch the schema in any way, perhaps...? It sounds like an operational nightmare otherwise.


I have read about table inheritance but never used it. It seems like it would make it possible to alter the schema of the root table, but it probably introduces other issues that I am not aware of. You would still need to manage indexes, constraints and default values for each child table, but the structure of the table itself would be inherited.


But the advantages are - better for compliance, faster analytical queries, possible to upgrade people at different paces. I agree it would be painful from a database migration perspective but I can see situations where it might be the right call.


How do you manage all these sqlite dbs? Throw them on 1 large server, develop some scheme to sprinkle them across multiple servers, something else?

Redis and Cassandra already handle distribution out of the box.

It seems like their product is basically a web GUI for creating DB tables with workflow automation, so it doesn't seem like too far of a stretch to have one table per customer table. I guess the alternative is designing some system for shoving arbitrary schemas into another schema.

I guess https://en.m.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%... is an option
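For what it's worth, the EAV idea fits in a few lines of SQLite; the table and attribute names here are purely illustrative:

```python
import sqlite3

# Minimal entity-attribute-value (EAV) sketch: every "cell" of
# every board lives in one narrow table, at the cost of pivoting
# back into rows at query time.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE cell (
    board_id INTEGER, item_id INTEGER,
    attr TEXT, value TEXT,
    PRIMARY KEY (board_id, item_id, attr))""")
rows = [(1, 1, "name", "Ship v2"), (1, 1, "status", "done"),
        (1, 2, "name", "Fix bug"), (1, 2, "status", "open")]
db.executemany("INSERT INTO cell VALUES (?, ?, ?, ?)", rows)

# "items on board 1 where status = 'open'"
open_items = [r[0] for r in db.execute(
    "SELECT item_id FROM cell "
    "WHERE board_id = 1 AND attr = 'status' AND value = 'open'")]
# open_items == [2]
```

The usual caveat applies: multi-attribute filters turn into self-joins or GROUP BY pivots, which is exactly where EAV gets painful.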


To have any hope at HA with SQLite, you’d have to use something like RQLite [0], and at that point you’re already building a far more complex system than Cassandra + Redis.

They didn’t build a DB, they taped some products together with middleware. That isn’t to say what they did is bad, just that it’s not “let’s write a DB from scratch” as the headline implies.

[0]: https://github.com/rqlite/rqlite


rqlite[1] creator here, happy to answer any questions.

[1] https://www.rqlite.io


Hi! Your product looks quite good to be clear, and that was in no way a slam on it. I’ve long wanted a reasonable excuse to use it, just to see how it works at scale.

One thing that stopped me from trying it in the past for a project was this:

> Is rqlite a good match for a network of nodes that come and go – perhaps thousands of them?

> Unlikely.

While I wasn’t going to get to thousands, it was going to be hosted on K8s, and it seemed like I’d have to write lifecycle hooks to ensure nodes gracefully left consensus, and others felt that was unnecessary complexity for the problem. Can you speak more to this point, since in the FAQ you also say it can run on K8s? If a node dies due to, say, hardware failure, when it rejoins is it seen as a slightly out-of-date existing node since it’s part of a statefulset? If so, assuming quorum is maintained, this is probably a lot easier than I was making it out to be.


When I say "thousands coming and going" I'm talking about trying to run rqlite on, say, a network of cellphones. When people ask me about this, it usually means they have the wrong mental model of what rqlite is.

However, running it on Kubernetes should work quite well. Check out the documentation at https://rqlite.io/docs/guides/kubernetes/

>If a node dies due to, say, hardware failure, when it rejoins is it seen as a slightly out-of-date existing node since it’s part of statefulset? If so, assuming quorum is maintained, this is probably a lot easier than I was making it out to be.

Assuming I follow you, yes, this should work fine -- and exactly as you expect. I suggest you try it out, and if you don't understand what you're seeing file a ticket on GitHub, or join the rqlite Slack[1].

[1] https://rqlite.slack.com/


Eh - sometimes you gotta solve hard problems and no amount of simplification is going to net good results.


Eh - we're talking about a glorified project management application here.


From ~8 seconds to ~3 seconds to fetch 5,000 items seems like going from horrible to still pretty bad.

Just about any SQL server with well defined schema and indexes would blow this out of the water by several orders of magnitude.

Schemaless, in this old dev’s opinion, is basically always a bad idea. Your data has a schema, whether or not you wish to define it. In not bothering to do so, you’re not doing yourself any favors.
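To put a rough number on the "blow this out of the water" claim, here's a plain-SQLite sanity check (illustrative schema, timings will vary by machine):

```python
import sqlite3
import time

# Filtering 5,000 rows on an indexed column in stock SQLite:
# typically a sub-millisecond operation, not a multi-second one.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, status TEXT)")
db.executemany("INSERT INTO items (status) VALUES (?)",
               [("open" if i % 3 else "done",) for i in range(5000)])
db.execute("CREATE INDEX idx_status ON items(status)")

t0 = time.perf_counter()
n = db.execute("SELECT COUNT(*) FROM items "
               "WHERE status = 'done'").fetchone()[0]
elapsed = time.perf_counter() - t0
# n counts the 'done' rows; elapsed is usually far below a millisecond
```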


The upside is that it is like AirTable, except you have to stand up a server with a custom service if your logic or computation doesn't fit into their no code tooling.


You’re correct on all points, but at some point you begin to hit practical limits on things like maximum tables per DB. You could shard of course, but then you have to handle that, as well as failover, data locality, etc.


    Until a while ago, when a user landed on their
    board, we threw all the board data right into
    the client (usually a web browser running on a
    desktop computer).

    the client is limited in its resources. Depending
    on the client device and board structure, it
    started to struggle and finally, crashed after a
    few thousand items (“table rows” in monday.com
    terminology). If we really pushed it, we could
    handle up to 20k items. Beyond that, it was game
    over.
This does not make sense. Why would a browser not be able to handle more than 20k table rows?

I have not tested it, but I would think juggling arrays with a few million entries should not make a browser sweat.
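A quick back-of-envelope check (in Python rather than JS, so only roughly comparable) supports the intuition that raw data volume alone isn't the problem:

```python
import time

# A million plain records: filter and sort them, and see that it
# takes a fraction of a second, not anything crash-worthy.
rows = [{"id": i, "status": "open" if i % 2 else "done"}
        for i in range(1_000_000)]

t0 = time.perf_counter()
open_rows = [r for r in rows if r["status"] == "open"]
open_rows.sort(key=lambda r: r["id"], reverse=True)
elapsed = time.perf_counter() - t0
# len(open_rows) == 500_000; the crash risk in a browser comes from
# rendering DOM nodes for every row, not from holding the data
```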


That is exactly what I came here to say as well. In my custom datatable query layer, I can pull down 1 million rows and display them instantly in a continuous list on the client, lazy-load related table rows, and neither the web client nor the traditional web server (backed by a traditional SQL Server, where I dynamically build the queries and can filter on any column) breaks a sweat. The server is 10-year-old hardware that still runs spinning disks, and the clients are often older, lower-powered computers. And that's without optimizing performance beyond thinking about it lightly up front. EDIT: Yes, the tables are sometimes medium-wide, 20-50 columns, buffered on the client for those rows.


I think it's because their whole original logic was client-side: generating a filtered view required loading all the data into the client first and then applying the filter.


Or just re-query the server with the new filters applied.


You're right, a modern browser can handle it without issues using virtual scroll [0].

[0] https://shlomiassaf.github.io/ngrid/demos/virtual-scroll-per...
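Stripped of any framework, the core of virtual scroll is just computing which rows intersect the viewport; this sketch (function name and overscan choice are illustrative) shows the whole idea:

```python
# Given the scroll offset, only the rows visible in the viewport
# (plus a little overscan above and below) get rendered; the rest
# of the list is just empty spacer height.
def visible_range(scroll_top, viewport_height, row_height,
                  total_rows, overscan=5):
    first = max(0, scroll_top // row_height - overscan)
    last = min(total_rows,
               (scroll_top + viewport_height) // row_height + 1 + overscan)
    return first, last

# 100k rows of 24px in a 600px viewport, scrolled to 48,000px:
first, last = visible_range(48_000, 600, 24, 100_000)
# only rows first..last get DOM nodes, regardless of total_rows
```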


Virtualizing (or windowing, as it is sometimes called) is definitely one way to tackle this problem. However, it's not a silver bullet. The more complex the UI/rendering is, the more optimizations you will need. The demo you link shows very simple rendering with no complex user interactions, mouseover events, etc.


Works fine until you try to do it on a mobile device


Mobile will crash the page simply from memory use, especially on older phones.


Agreed. Obviously you shouldn't render all of that, just keep it in memory and render the rows that are scrolled into view.


Honestly, the end result was a lot worse than I expected: using Redis & Cassandra as one DB, when your requirements were unlimited tables, fast filter/sort, etc. Literally the worst of all worlds imaginable.

Imagine maintaining that (two separate distributed clusters that are async by default, with no transactions or indexes, limited types, etc.).

In this day and age, when everyone is going sync-by-default because it's dramatically easier on the developers using the DB.


Mostly curious how they have Redis set up. The article mentioned they're using it to buffer writes before hitting Cassandra, but that seems like an easy way to lose data unless they're running Redis in sync mode (afaik that's a lot slower).

I've seen a similar architecture using Kafka to buffer writes but then you lose kv lookup at the buffer.
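The loss window being described can be shown with a toy write-behind buffer (plain dicts standing in for Redis and Cassandra; nothing here reflects monday.com's actual setup):

```python
# Anything acked from the buffer but not yet flushed to the
# backing store is lost if the buffer node dies before a flush.
buffer, backing = {}, {}

def write(key, value):
    buffer[key] = value        # the client gets its ack here

def flush():
    backing.update(buffer)     # asynchronous in a real system
    buffer.clear()

write("item:1", "a")
flush()
write("item:2", "b")
buffer.clear()                 # simulate the buffer dying pre-flush
# backing has item:1, but item:2 is gone despite being acked
```

Running Redis with `appendfsync always` narrows that window, at a real throughput cost.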


A lot of the architectural decisions make sense from a high level: compute and storage are separated, and Redis and Cassandra are used instead of reinventing those. It's a bit OLTP and a bit OLAP: users make a few point updates to things here and there (OLTP), but the filtering, aggregating, and showing all sorts of views is clearly in the analytics domain (OLAP), hence the columnar setup.

All that said, I'd like to hear about this being open-sourced, self-hostable, and Jepsen-tested, or at least a clear statement that this DB is, and will only ever be, an in-house DB, so those requests will never happen.
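The columnar point is easy to make concrete; this toy (lists standing in for column chunks, data invented) shows why the board's filter/aggregate access pattern likes a column layout:

```python
# In a columnar layout a filter scans only the one column it
# touches, instead of every field of every row.
columns = {
    "id":     list(range(6)),
    "status": ["open", "done", "open", "done", "open", "done"],
    "owner":  ["ann", "bob", "ann", "cyd", "bob", "ann"],
}

# filter: row indices where status == "open" (one column scanned)
idx = [i for i, s in enumerate(columns["status"]) if s == "open"]
# aggregate: pull the owner column only for the matching rows
owners = [columns["owner"][i] for i in idx]
# idx == [0, 2, 4]; owners == ["ann", "ann", "bob"]
```

The flip side is the OLTP half: a point update to one item now touches every column list, which is why pure columnar stores make poor transactional engines.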


> make sense: compute and storage are separated

I am wondering why the opposite wouldn't make sense; you could greatly reduce network traffic for heavy queries.


It’s about scale. Of course the compute nodes need some storage for when they spill to disk, but they compute what is asked of them and then submit their work. So a coordinator can split the work, send it to workers replicated across N nodes to achieve a parallel speed-up, and then reconstitute the results.

As you mention, there’s an overhead, or a tax, to this, but the overall speed-ups are there.

If you’re talking about one big node with gobs of fast storage and CPUs for compute then that’s more your traditional scale up instead of scale out approach and for many years that was the conventional wisdom in databases: when things got slow just move your DB to a bigger and bigger machine. But that has its limits. So scaling horizontally was the answer and with it came all of its complexity sure but it allowed for much better scaling.


If a single node containing the relevant partition(s) has all the data, then the compute could happen there and the result sent back, with no map-reduce work needed.


I envision it as many nodes storing many partitions; initial compute happens locally, and results are re-partitioned if needed for the next step.

The trivial case is predicate pushdown, where the predicate is applied on the storage backend rather than the compute node.

I think ClickHouse is an example of such architecture, and they don't have dedicated compute nodes.
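The pattern being described can be sketched in a few lines (lists standing in for partitions on storage nodes, the fan-out done serially for clarity):

```python
# Each "storage node" applies the predicate locally (predicate
# pushdown) and ships only matching rows; the coordinator merges
# the partial results.
partitions = [
    [{"id": 1, "v": 10}, {"id": 2, "v": 99}],   # node A's partition
    [{"id": 3, "v": 70}, {"id": 4, "v": 5}],    # node B's partition
]

def storage_node_scan(partition, predicate):
    # runs where the data lives; only matches cross the network
    return [row for row in partition if predicate(row)]

matches = []
for part in partitions:                  # parallel fan-out in reality
    matches.extend(storage_node_scan(part, lambda r: r["v"] > 50))
matches.sort(key=lambda r: r["id"])      # coordinator's merge step
# only 2 of 4 rows ever leave their node
```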


Yeah there's no perfect way to do this but there are advantages to being flexible and the use-case you mention is a subset of what I am describing.

Imagine a workload that is write-heavy and handles very few reads, say 20:1 writes to reads. Then having many storage nodes could speed up the writes, and if you have, say, a fixed set of nodes, allocating 20x more storage nodes than compute nodes might make sense.

The whole thing is a science and I do like the flexibility.


There are many nodes in both my description and ClickHouse.


This is quite confusing - you shouldn’t need a columnar engine to filter/aggregate 20k rows.

Postgres / MySQL with a table per board (or one giant partitioned table with a JSONb column) would have been totally fine, as would have one SQLite database per customer that you can load the initial page of and then ship the whole thing to the client for richer interactivity.
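The table-per-board variant takes only a few lines to sketch in SQLite (table names and columns here are invented for illustration):

```python
import sqlite3

# One table per board: schema customization stays per-board while
# everything lives in a single, ordinary engine.
db = sqlite3.connect(":memory:")

def create_board(board_id, extra_cols=""):
    db.execute(f"CREATE TABLE board_{board_id} "
               f"(id INTEGER PRIMARY KEY, name TEXT, status TEXT{extra_cols})")

create_board(1)
create_board(2, ", due_date TEXT")   # this board gets a custom column
db.execute("INSERT INTO board_2 (name, status, due_date) VALUES (?, ?, ?)",
           ("write report", "open", "2023-11-01"))
rows = db.execute("SELECT name, due_date FROM board_2 "
                  "WHERE status = 'open'").fetchall()
# rows == [("write report", "2023-11-01")]
```

At a few thousand rows per board, any index-backed filter on such a table is effectively instant.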


medium.com is getting worse and worse. Don't post content there.

https://imgur.com/a/5yXPDST

"Distraction-free reading." riiiiight

I suspect the original poster wanted to tell me something. Too bad I shall never find out.


Replace medium.com with scribe.rip.


I'm amazed that a non-database company thought it was a good idea to write their own database in 2023, and astounded that they admit it publicly. Unless this is a clever application of https://meta.m.wikimedia.org/wiki/Cunningham%27s_Law


Is your post an example of Cunningham's Law on purpose?

As far as I can see they didn't implement their own DB - they just built a service composed of existing DBs, like Cassandra and Redis.


The article shows that Cassandra is slow to retrieve data, so they added Redis as a cache and fetch recent events from there.
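That's the classic cache-aside pattern; a toy version (dicts standing in for Redis and Cassandra, key names invented) looks like this:

```python
# Check the cache first; on a miss, hit the backing store and
# populate the cache for next time.
cache, store = {}, {"event:1": {"type": "update", "board": 7}}
misses = []

def get_event(key):
    if key in cache:            # fast path: recent events in "Redis"
        return cache[key]
    misses.append(key)          # slow path: go to "Cassandra"
    value = store[key]
    cache[key] = value          # populate the cache
    return value

first = get_event("event:1")    # miss, fetched from the store
second = get_event("event:1")   # hit, served from the cache
# only one store lookup happened across both calls
```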


It is now planet scale! Proper database design and indexes be damned!


Curious to see how this stacks up against a more specialised HTAP database like SingleStore / TiDB


It sounds similar to how Keen.io works but all these concepts are fairly established: https://softwareengineeringdaily.com/2016/05/23/kafka-storm-...

Considering they use many distributed systems under the hood, it's not intended for users to interact with the database by running it locally, but maybe eventually they want to be a "data cloud" company. The use cases seem to match HTAP databases, and I wonder why they didn't try any HTAP database on the market.


Hmm, maybe I missed something, but isn't Monday some kind of do-all task management system? A client of mine forced us onto it, and the un-opinionated style made it quite crappy. You basically have to do whatever all the managers came up with randomly, and it sucked. Does this need a custom DB, or is there a lot more to this product?


Nice take, mixing both OLTP and OLAP to achieve low latencies for queries while still staying consistent for transactional queries.

Curious: What could have been the deciding factor to choose Cassandra over ClickHouse?


ClickHouse is not a good choice for OLTP.


Perhaps their team could have found something useful in KalDB from Slack which is based on Lucene:

https://github.com/slackhq/kaldb


But think of their resumes. Their performance reviews.

Saying "Improved loading times by fixing our fucked up indexes and query patterns on current database" sounds weak. Sounds *shudder* incremental.

But implementing an in-house database engine for a literal CRUD app? That takes being a visionary. There's so much bikeshedding for you to stamp down on and show leadership (never mind that it's your fault there's any bikeshedding to start with).

And I mean the article says it all at the end:

> Looking ahead, we’re weighing up our next moves. We might refactor some of our logic to be executed with highly-performant tech such as DuckDB. We might take advantage of columnar formats such as Arrow and Parquet. We may even refactor our logic using Rust language as a side-car or dedicated microservice. I’m very excited about the future, and will keep you updated!

They're already salivating over the impact they just unlocked by smashing their fist through the beating heart of what the last guys did and taking a triumphant bite. It's not just this first iteration of the platform: they've given themselves the momentum to start tearing down everything that ever touched a database at that company! This is how you break the IC glass ceiling!!!!

_

And I absolutely love this line: Eventually, we found that none of these options completely met our requirements.

Related tangent: there's an in-house serialization format at the AV company I work at that's been a massive pain in my ass since I got there. I'm one of those evil tech leads who wants to do things that actually, measurably deliver value, so instead of building a SaaS product inside my tech company I like to do things like build tools with the data we have... but it turns out rolling your own not-Protobuf also means having to write your own client libraries! Which means I consistently run into stupid bugs and edge cases that the genius who invented this format doesn't have to deal with, because they long ago impacted their way into a barely-technical role, and now random people get the joy of relearning this guy's invented solution to fix it while they try to get useful things done.

One day I get so annoyed by this mess that I go and look up why the fuck someone made their own Protobuf. I dig through old Slack messages until I get to a presentation deck, and that golden nugget, almost word for word, was sitting there on a slide: none of the existing options completely met our requirements. And on the slide you've got 10 perfectly good solutions like Protobuf, Cap'n Proto, MsgPack...

So what they mean of course is, quite literally, no one checked every single box. There were options that checked 9/10 boxes, but the moment that 10th box wasn't checked... they had their excuse. It's like a toddler hitting you with the "you said sit in my room but you didn't say which room!!!!!"

And of course never mind that "settling" for 9/10 boxes would have enabled an infinitely better solution for the 9 that are checked: because when asked why on earth this project is needed, I'll get to show a nice table with a big red X on every single solution.

_

Now excuse me while I go throw up at the idea of birthing a company with blood sweat and tears only to have it infected with this kind of sickness.

I swear, tech is the only industry where if you walked into a room and asked what one improvement to their work they'd like, it'd be something at odds with your customers:

I can walk into McDonalds and the suggestions will be things like "move appliance X closer to Y so I can do Z that makes my job easier".

If they were tech workers I'd get "replace the coffee machine with a $50,000 italian espresso maker so I can practice for my next job at the Dorsia."


On point and genuinely funny. Five stars.

The envisioned duckdb/parquet/arrow/rust journey is so 2023 HN it could be satire.


The fundamental UI is just a single row-based table at a time, and their initial DB was a graph database.


Looks horrific to work with.



