Hacker News | past | comments | ask | show | jobs | submit | mwc360's comments

OSS Delta supports deletion vectors. The problem is that the OSS deltalake Python library (based on delta-rs) does not, and this prevents engines like DuckDB and Polars from writing to such tables. I'm pretty sure DVs have been in OSS Delta since 3.1.


FYI, I had V-Order and Optimized Write disabled in the benchmark. The only write-path difference was that I enabled deletion vectors in Spark, since it supports them and the other two engines don't.
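For reference, enabling deletion vectors on a Delta table in Spark is a single table property. A sketch (the table name is a placeholder):

```sql
-- Enable deletion vectors on an existing Delta table (Spark SQL).
-- 'my_table' is a hypothetical name.
ALTER TABLE my_table
SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');
```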


Thanks for the clarification. I didn't see it in the article.


Author here: that’s exactly what I was trying to communicate but you said it better :)


There is a Spark API[1] being built using their Relational API[2].

Progress is being tracked on Github Discussions[3].

[1]: https://duckdb.org/docs/api/python/spark_api.html

[2]: https://duckdb.org/docs/api/python/relational_api.html

[3]: https://github.com/duckdb/duckdb/discussions/14525


Very cool! This seems like fantastic functionality and would make it super easy to migrate small Spark workloads to DuckDB :)


Miles Cole here: I’d love to see Daft on Ray become more widely used. Same DataFrame API, and you can run it in either single-machine or multi-machine mode. The only thing I don’t love about it today is that their marketing is a bit misleading: Daft is distributed via Ray; Daft itself is not distributed.


Hey, I'm one of the developers of Daft :)

Thanks for the feedback on the marketing! Daft is indeed distributed using Ray, but doing so required architecting Daft very carefully for distributed computing (e.g. using map/reduce paradigms).

Ray fills an almost Kubernetes-like role for us in terms of orchestration/scheduling (admittedly it does quite a bit more as well, especially around data movement). But yes, the technologies are very complementary!


Author of the blog here: fair point. Pretty much every published benchmark has an agenda that ultimately skews the conclusion. I did my best here to be impartial, i.e. I fully designed the benchmark and each test to mimic typical ELT demands before running code on any engine, so I didn't have the opportunity to optimize for Spark just because I know it well.


I think you did a good job for these workloads. I did some informal experimenting last year when I had to implement an ELT-type system and I ended up doing it in Spark as well. It was my last choice, because I find operating and debugging Spark to be a huge pain. But everything else I tried was way slower.

I didn't think people used Polars much for ELT. I've usually seen it used for aggregations with small outputs (which, as you called out, it does a great job at).


Miles Cole here… thx for the correction, another reader just noted this as well. I’ll get this corrected tomorrow and possibly retest after verifying I have spill configured. Thx!


Hi - Miles Cole here… I used lazy APIs where available, i.e. everything up to write_delta() is lazy in the Polars (Mod) variant.

Yeah I was debating whether to share all of the source code. I may share a portion of it soon.


Great! A small correction on your post: Polars does have SQL support. It isn't the main use case, so it isn't as polished as Spark's or DuckDB's, but it does exist and is being improved.


Ritchie - thx for graciously correcting some things I got wrong, will get it corrected!

