
In these setups, compute is ephemeral and decoupled from storage (you would normally use an object storage offering that is redundant out of the box), so the 2TB is for working memory, the OS, and so on, while the 20TB NVMe is purely for local spilling and a local cache, so you can save on storage reads.

If a node fails while running a process (e.g. for an external reason unrelated to your own code or data, like your spot EC2 instance terminating due to high demand), you just run it again. When your processes are done, the processing node is normally terminated entirely.

tl;dr: you treat the nodes like cattle with a very short lifecycle around data processes; the specifics of resource/process scheduling depend on your data needs.



But DuckDB has its own on-disk format, and it is generally the best-performing thing to scan (as opposed to always just querying, say, Parquet files in S3).

So if you want to use the DuckDB native format and you have a lot of data... what do you do? How do you keep your DuckDB file up to date with incoming data in your data lake? Maybe you just... don't? Have a daily rebuild on your ephemeral-ish DuckDB node?

It seems like it would be kind of a pain to manage once you get to moderate scale (say, TB+ datasets).



