I intentionally kept it lightweight. Just Parquet files + simple partitioning + commits on Hugging Face. That already covers most of what I need, without introducing a heavier stack or extra dependencies.
Also, I wanted something that is easy to consume anywhere. With this setup, you can point DuckDB or Polars directly at the data and start querying, no catalog or special tooling required.
[Author here] The whole pipeline runs on a single ~$10/month VPS, but it can process hundreds of TB even with just 12GB RAM and a 200GB SSD.
The main reason I built this was to have HN data that is easy to query and always up to date, without needing to run your own pipeline first. There are also some interesting ideas in the pipeline, like what I call "auto-heal". Happy to share more if anyone is interested :)
A lot of the choices are trade-offs, as usual with data pipelines. I chose Parquet because it is columnar and compressed, so tools like DuckDB or Polars can read only the columns they need. This matters a lot as the dataset grows.
I went with Hugging Face mainly because it is simple and already handles distribution and versioning. I can just push data as commits and get a built-in history without managing extra infrastructure (and, more conveniently, if you read the README, you can query it directly using Python or DuckDB).
The pipeline is incremental. Instead of rebuilding everything, it appends small batches every few minutes using the API. That keeps it fresh while staying cheap to run. The data is also partitioned by time, so queries do not need to scan the entire dataset (and I use very simple tech, just a Go binary running in a "screen" session, using only a few MB of RAM for the whole pipeline).
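As a sketch of what time partitioning can look like (the real layout is the author's; this year/month scheme is just an assumed example), each item's timestamp maps deterministically to one partition file, so a query over a date range only touches a few files:

```python
from datetime import datetime, timezone

def partition_path(unix_time: int) -> str:
    """Map an item's Unix timestamp to a year/month Parquet
    partition (hypothetical layout, not the project's actual one)."""
    dt = datetime.fromtimestamp(unix_time, tz=timezone.utc)
    return f"items/{dt.year:04d}/{dt.month:02d}.parquet"

print(partition_path(1700000000))  # items/2023/11.parquet
```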
@keepamovin thanks, your project was a big inspiration for this.
I built my own pipeline with a slightly different setup. I use Go to download and process the data, and update it every 5 minutes using the HN API, trying to stay within fair use. It is also easy to tweak if someone wants faster or slower updates.
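The incremental-batch idea can be sketched like this (the real pipeline is a Go binary and its batch size is unknown to me; the 500-item cap here is an invented placeholder): each run remembers the last item id it saw and fetches only the ids between that and the HN API's current max item, capped to keep each 5-minute cycle small:

```python
def next_batch(last_seen: int, max_item: int, batch_limit: int = 500) -> list[int]:
    """Ids to fetch this run; the cap keeps each cycle small
    and within fair use (batch_limit is an assumed value)."""
    end = min(max_item, last_seen + batch_limit)
    return list(range(last_seen + 1, end + 1))

print(next_batch(100, 103))        # [101, 102, 103]
print(len(next_batch(0, 10_000)))  # 500
```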
One part I really like is the "dynamic" README on Hugging Face. It is generated automatically by the code and keeps updating as new commits come in, so you can just open it and quickly see the current state.
The code is still a bit messy right now (I open sourced it together with around 3.6M lines across 100+ other tools, hidden in a corner of GitHub; anyone interested can play Sherlock Holmes and find it :) ), but I will clean it up, release it as a clearer standalone repository, and write a proper blog post explaining how it works.
Connecting directly with the author of the project that inspired me is awesome.
Let's collaborate and see how we can make our two projects work together.
DuckDB has an extension that can write to SQLite: https://duckdb.org/docs/stable/core_extensions/sqlite. Starting from the Parquet files, we could use DuckDB to write into SQLite databases. That could cut ingest time from about a week to around five minutes.
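A rough sketch of what that could look like in DuckDB SQL, using the sqlite extension from the linked docs (the file names and glob are hypothetical):

```sql
-- One-time setup: fetch and load the sqlite extension.
INSTALL sqlite;
LOAD sqlite;

-- Attach a SQLite database file (created if it does not exist).
ATTACH 'hn.db' AS hn (TYPE sqlite);

-- Bulk-copy the Parquet partitions into a SQLite table.
CREATE TABLE hn.items AS
    SELECT * FROM read_parquet('items/**/*.parquet');
```

Since DuckDB reads the Parquet files in parallel and writes SQLite in bulk, this skips the per-item API ingest entirely.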
If I have some free time this weekend, I would definitely like to contribute to your project. Would you be interested?
As for my background, I focus on data engineering and data architecture. I help clients build very large-scale data pipelines, ranging from near real-time systems (under 10 ms) to large batch processing systems (handling up to 1 billion business transactions per day across thousands of partners). Some of these systems use mathematical models I developed, particularly in graph theory.
One of the things I got interested in from the comments on my Show HN was Parquet; everyone was raving about it. Happy to see a project using it today.
---
You should drive the car to the car wash, but you should walk yourself.
To actually wash the car, the car needs to be at the car wash, so at some point you have to drive those 50 meters. A sensible approach is:
- Drive the car the 50 meters to the wash bay.
- Park or queue as required, then get out and do the wash.
- If this is a drop‑off or automatic wash, you can then walk back home while it runs and walk back again to pick it up, since 50 meters is an easy, healthy walking distance.
DevSecOps Engineer
United States Army Special Operations Command · Full-time
Jun 2022 - Jul 2025 · 3 yrs 2 mos
Honestly, it is a little scary to see someone with a serious DevSecOps background ship an AI project that looks this sloppy and unreviewed. It makes you question how much rigor and code quality made it into their earlier "mission critical" engineering work.
Maybe, but the group of people they are/were working with are Extremely Serious, and Not Goofs.
This person was in communications of the 160th Special Operations Aviation Regiment, the group that just flew helicopters into Venezuela. ... And it looks like a very unusual connection to Delta Force.
Considering how many times I've heard "don't let perfection be the enemy of good enough" when my code was not only incomplete but didn't even do most of what was asked (yet), I'd wager quite a lot.