
In these setups, compute is ephemeral and decoupled from storage (you would normally use an object storage offering that is redundant out of the box), so the 2TB is for working memory, the OS, and so on, while the 20TB NVMe is purely for local spilling and a local cache, so you can save on storage reads.

If a node fails while running a process (e.g. for an external reason unrelated to your own code or data, like your spot EC2 instance terminating due to high demand), you just run it again. When your processes are done, the processing node is normally terminated entirely.

tl;dr: you treat the nodes like cattle with a very short lifecycle around data processes; the specifics of resource/process scheduling depend on your data needs.



But DuckDB has its own on-disk format, and it is generally the best-performing thing to scan (as opposed to always just querying, say, Parquet files in S3).

So if you want to use the DuckDB native format and you have a lot of data... what do you do? How do you keep your DuckDB file up to date with incoming data in your data lake? Maybe you just... don't? Have a daily rebuild on your ephemeral-ish DuckDB node?

It seems like it would be kind of a pain to manage once you get to moderate scale (say, TB+ datasets).



