OSS Delta supports deletion vectors. The problem is that the OSS Deltalake Python library (based on Delta-rs) does not, and this prevents engines like DuckDB and Polars from writing to such tables. I'm pretty sure DV has been in OSS since 3.1.
FYI I had V-Order and Optimized Write disabled in the benchmark. The only write difference was that I enabled deletion vectors in Spark, since it supports them and the other two don’t.
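For readers unfamiliar with the feature, here is a minimal conceptual sketch of what a deletion vector buys you on writes. It is a toy model in plain Python, not Delta's on-disk format or any library's API: `DataFile`, `delete_copy_on_write`, and `delete_with_vector` are hypothetical names made up for illustration, and a plain `set` of row positions stands in for the vector.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Toy stand-in for an immutable data file (hypothetical, for illustration)."""
    rows: list
    # Row positions marked as deleted; a simplified stand-in for a deletion vector.
    deleted: set = field(default_factory=set)

    def live_rows(self):
        """Merge-on-read: skip rows whose positions are marked as deleted."""
        return [r for i, r in enumerate(self.rows) if i not in self.deleted]


def delete_copy_on_write(file, predicate):
    """Without deletion vectors: write a new file containing only surviving rows."""
    return DataFile(rows=[r for r in file.live_rows() if not predicate(r)])


def delete_with_vector(file, predicate):
    """With deletion vectors: leave the data as-is and record positions only."""
    for i, r in enumerate(file.rows):
        if predicate(r):
            file.deleted.add(i)
    return file


original = DataFile(rows=[{"id": 1}, {"id": 2}, {"id": 3}])
rewritten = delete_copy_on_write(original, lambda r: r["id"] == 2)
marked = delete_with_vector(original, lambda r: r["id"] == 2)
```

In this toy, `marked.rows` still holds all three rows and only `live_rows()` hides the deleted one. That is also why an engine that does not understand the vector cannot safely work with such a table: it would treat the deleted row as live.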
Miles Cole here: I’d love to see Daft on Ray become more widely used. Same DataFrame API, and you can run it in either single- or multi-machine mode. The only thing I don’t love about it today is that their marketing is a bit misleading. Daft is distributed VIA Ray; Daft itself is not distributed.
Thanks for the feedback on marketing! Daft is indeed distributed using Ray, but doing so involves architecting Daft very carefully for distributed computing (e.g. using map/reduce paradigms).
Ray fulfills almost a Kubernetes-like role for us in terms of orchestration/scheduling (admittedly it does quite a bit more as well especially in the area of data movement). But yes the technologies are very complementary!
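To illustrate the map/reduce remark: a minimal sketch, in plain Python, of an aggregation split into a per-partition map step and an associative reduce step. This is not Daft's or Ray's code; `map_partition`, `merge_counts`, and the sample partitions are hypothetical. The point is only that each partition is processed independently, which is the property a scheduler needs in order to place the work on separate machines.

```python
from collections import Counter
from functools import reduce


def map_partition(rows):
    """Map step: compute a partial count per key using only this partition's rows."""
    return Counter(row["key"] for row in rows)


def merge_counts(left, right):
    """Reduce step: combine two partial results; associative, so merge order is free."""
    return left + right


# Hypothetical partitions; in a distributed run each could live on a different worker.
partitions = [
    [{"key": "a"}, {"key": "b"}],
    [{"key": "a"}, {"key": "a"}],
    [{"key": "b"}],
]

partials = [map_partition(p) for p in partitions]
totals = reduce(merge_counts, partials, Counter())
```

Here the list comprehension runs the map step locally; in a distributed setup that loop is the part that would be handed to the scheduler, while the engine still has to express its operators in this partition-friendly shape.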
Author of the blog here: fair point. Pretty much every published benchmark has an agenda that ultimately skews the conclusion. I did my best here to be impartial, i.e., I fully designed the benchmark and each test before running code on any engine, to mimic typical ELT demands without having the opportunity to optimize Spark, since I know it well.
I think you did a good job for these workloads. I did some informal experimenting last year when I had to implement an ELT-type system and I ended up doing it in Spark as well. It was my last choice, because I find operating and debugging Spark to be a huge pain. But everything else I tried was way slower.
I didn't think that people used Polars a lot for ELT. I've usually seen it used for aggregations with small outputs (which, as you called out, it does a great job at).
Miles Cole here… thx for the correction, another reader just noted this as well. I’ll get this corrected tomorrow and possibly retest after verifying I have spill set up. Thx!
Great! A small correction on your post: Polars does have SQL support. It isn't the main use case, so it isn't as good as that of Spark and DuckDB, but it does exist and is being improved.