Show HN: Robust LLM extractor for websites in TypeScript

spiderfarmer · 2026-03-26T09:36:44 1774517804

My platform has 24M pages on 8 domains and these NASTY crawlers insist on visiting every single one of them. For every 1 real visitor there are at least 300 requests from residential proxies. And that's after I blocked complete countries like Russia, China, Taiwan and Singapore.

Even Cloudflare's bot filter only blocks some of them.

I'm using honeypot URLs right now to block all crawlers that ignore rel="nofollow", but they appear to have many millions of devices. I wouldn't be surprised if there are a gazillion residential routers, webcams and phones that have been hacked to function as simple doorways.

Things are really getting out of hand.
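
A minimal sketch of the honeypot approach mentioned above, for readers who have not seen it. The trap path `/trap/` and the in-memory block list are assumptions for illustration, not details of the platform described here. The idea is that the trap URL is only reachable through a rel="nofollow" link, so a client that requests it is treated as a crawler.

```typescript
// Hypothetical honeypot: any client that requests a trap URL gets blocked.
// TRAP_PREFIX and the in-memory Set are illustrative assumptions.
const TRAP_PREFIX = "/trap/";
const blockedClients = new Set<string>();

type Decision = "allow" | "block";

function handleRequest(clientIp: string, path: string): Decision {
  // Clients that already hit a trap stay blocked.
  if (blockedClients.has(clientIp)) return "block";
  if (path.startsWith(TRAP_PREFIX)) {
    // Only crawlers that ignore nofollow are expected to reach this path.
    blockedClients.add(clientIp);
    return "block";
  }
  return "allow";
}
```

Blocking per IP like this is exactly what scales poorly when the crawler rotates through millions of residential addresses, which is the problem described above.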

dmos62 · 2026-03-27T07:43:59 1774597439

Have you considered recaptcha v2 and similar? Proof of work might slow them down. Sounds pretty bad. Would be great if Cloudflare, Datadome, etc. were doing this for you and thus banning these devices for everyone.

cj · 2026-03-26T13:45:49 1774532749

What crawlers are using residential proxies?

spiderfarmer · 2026-03-26T14:12:58 1774534378

Now if they identified themselves, I could block them.

I'd put my money on Chinese AI model makers, but I don't trust any company that is in desperate need of fresh data.

sheept · 2026-03-26T05:43:07 1774503787

> LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.

This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing bracket helps it keep track of where it is during inference, so it is less error prone.

andrew_zhong · 2026-03-26T06:03:52 1774505032

Yeah that's a good observation. XML's closing tags give the model structural anchors during generation — it knows where it is in the nesting. JSON doesn't have that, so the deeper the nesting the more likely the model loses track of brackets.

We see this especially with arrays of objects where each object has optional nested fields. For complex nested objects, the model can get every item well formatted except one, which has a field of the wrong type. That's why we put effort into the repair/recovery/sanitization layer: validate field-by-field and keep what's valid rather than throwing everything out.

olafura · 2026-03-26T14:20:56 1774534856

Unless I'm totally misunderstanding something, it's not XML but special tokens for the tokenizer. Someone smarter than me might know: https://medium.com/@nisarg.nargund/why-special-tokens-matter...

sheept · 2026-03-26T18:52:54 1774551174

Not in Claude Code, where asking it to print the XML used for tool calling makes it accidentally trigger the tool call

faangguyindia · 2026-03-26T09:02:13 1774515733

Hardly matters, this isn't a problem that you'd have these days with modern LLMs.

Also, a model can always use a proxy to turn your tool calls into XML

And feed you back json right away and you wouldn't even know if any transformation did take place.

andrew_zhong · 2026-03-26T09:37:25 1774517845

We do see fewer invalid JSON outputs on the latest, bigger LLMs, but it can still happen on smaller and cheaper models. There are also cases where the input is truncated or a required field is not found, which are inherently difficult.

On XML vs JSON, I think the goal here is to generate typed output, where JSON with zod shines. For example, the result can be type checked and inserted into typed database columns later.

faangguyindia · 2026-03-26T09:48:39 1774518519

Thing is even with XML LLM will fail every now and then.

I've built an agent in both tool calling and by parsing XML

You always need a self-correcting loop built in. If you are editing a file with an LLM, you need to provide hints so the LLM gets it right the second, third, or nth time.

Just by switching to XML you'll not get that.

I used to use XML; now I only use it for examples in the system prompt for the model to learn from. That's all.
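
The self-correcting loop described above could look roughly like this. It is a sketch under assumptions: `Model` is a hypothetical function type standing in for whatever LLM client is used, and `validate` is any caller-supplied check that throws a descriptive error. The validation error is fed back into the next prompt as the hint.

```typescript
// Hypothetical model interface: takes a prompt, resolves to raw text.
type Model = (prompt: string) => Promise<string>;

async function generateJson<T>(
  model: Model,
  prompt: string,
  validate: (value: unknown) => T, // should throw a descriptive Error on bad input
  maxAttempts = 3,
): Promise<T> {
  let currentPrompt = prompt;
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await model(currentPrompt);
    try {
      // JSON.parse and validate both throw on failure, which triggers a retry.
      return validate(JSON.parse(raw));
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      // Feed the error back as a hint so the next attempt can correct it.
      currentPrompt =
        `${prompt}\n\nYour previous output was rejected: ${lastError}\n` +
        `Previous output:\n${raw}\nReturn corrected JSON only.`;
    }
  }
  throw new Error(`No valid output after ${maxAttempts} attempts: ${lastError}`);
}
```

The same loop works whether the model emits JSON or XML; only the parse step changes, which is the point made above.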

andrew_zhong · 2026-03-26T09:57:26 1774519046

Agreed - in this project I did a one-pass sanitization to recover invalid optional / nullable fields or discard invalid objects in nested arrays.

I know multi-pass LLM approaches exist, e.g. generating JSON patches:

https://github.com/hinthornw/trustcall

plastic041 · 2026-03-26T04:57:01 1774501021

> Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.

And it doesn't care about robots.txt.

andrew_zhong · 2026-03-26T06:01:54 1774504914

Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.

Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.

plastic041 · 2026-03-26T07:23:46 1774509826

robots.txt is the most basic access restriction and it doesn't even read it, while passing itself off as human[0]. It is about bypassing access restrictions.

[0]: https://github.com/lightfeed/extractor/blob/d11060269e65459e...

zendist · 2026-03-26T07:14:43 1774509283

Regardless. You should still respect robots.txt..

andrew_zhong · 2026-03-26T07:27:07 1774510027

We do respect robots.txt in production - also, scraping browser providers like BrightData enforce that.

I will add a PR to enforce robots.txt before the actual scraping.
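
For reference, a pre-scrape robots.txt check can be fairly small. The following is a simplified sketch, not the planned PR: it handles only `User-agent`, `Allow` and `Disallow` with plain prefix matching (no `*` or `$` wildcards), picks the longest matching rule, and expects a bare product token such as the hypothetical `pricebot` rather than a full User-Agent string.

```typescript
interface Rule { allow: boolean; path: string }

// Parse robots.txt into rule lists keyed by lowercase user-agent token.
function parseRobots(text: string): Map<string, Rule[]> {
  const groups = new Map<string, Rule[]>();
  let agents: string[] = [];
  let inRules = false; // true once the current group has seen a rule line
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === "user-agent") {
      // A User-agent line after rules starts a new group.
      if (inRules) { agents = []; inRules = false; }
      const agent = value.toLowerCase();
      agents.push(agent);
      if (!groups.has(agent)) groups.set(agent, []);
    } else if (key === "allow" || key === "disallow") {
      inRules = true;
      if (value === "") continue; // an empty rule matches nothing
      for (const agent of agents) {
        groups.get(agent)?.push({ allow: key === "allow", path: value });
      }
    }
  }
  return groups;
}

function isAllowed(robotsTxt: string, userAgent: string, path: string): boolean {
  const groups = parseRobots(robotsTxt);
  const rules = groups.get(userAgent.toLowerCase()) ?? groups.get("*") ?? [];
  let best: Rule | undefined;
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue;
    // Longest match wins; on a tie, Allow wins.
    if (!best || rule.path.length > best.path.length ||
        (rule.path.length === best.path.length && rule.allow)) {
      best = rule;
    }
  }
  return best ? best.allow : true; // no matching rule means allowed
}
```

A production crawler would more likely use an established robots.txt parser that also covers wildcards and fetch errors.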

plastic041 · 2026-03-26T07:54:35 1774511675

How can people believe that you are respecting bot detection in production when your software's README says it can "Avoid detection with built-in anti-bot patches"?

andrew_zhong · 2026-03-26T18:54:35 1774551275

I hear you loud and clear - will replace the stealth browser with plain playwright and remove anti-bot as a feature.

messe · 2026-03-26T06:36:53 1774507013

> It's not about bypassing access restrictions.

Yes. It is. You've just made an arbitrary choice not to define it as such.

andrew_zhong · 2026-03-26T07:31:53 1774510313

I will add a PR to enforce robots.txt before the actual scraping.

messe · 2026-03-26T23:44:57 1774568697

Or just follow web standards and define and publish your User-Agent header, so that people can block that as needed.

You're creating the wrong kind of value. I really hope your company fails, as its success implies a failure of the web in general.

I wish you the best success outside of your current endeavour.

Flux159 · 2026-03-26T05:09:13 1774501753

This looks pretty interesting! I haven't used it yet, but I looked through the code a bit. It looks like it uses turndown to convert the HTML to markdown first, then passes that to the LLM, so I'm assuming the preprocessing gives a huge reduction in tokens. Do you have any data on how often this can cause issues, i.e. tables or other information being lost?

Then langchain and structured schemas for the output along w/ a specific system prompt for the LLM. Do you know which open source models work best or do you just use gemini in production?

Also, looking at the docs, Gemini 2.5 flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so might want to update that to Gemini 3 Flash in the examples.

andrew_zhong · 2026-03-26T08:45:10 1774514710

HTML -> markdown -> LLM is standard practice. We strip elements like aside, embed, head, iframe, etc. The criteria are set conservatively to avoid removing too many elements (especially in extractMain mode).

https://github.com/lightfeed/extractor/blob/main/src/convert...

I have used gemma 3 and had good results.

Once Gemini 3 flash drops the preview suffix, will update the examples. Thank you for the pointer.

qingcharles · 2026-03-27T01:47:18 1774576038

I use <aside> for infoboxes on my wiki. Am I using aside wrong or are you stripping too heavily?

rsafaya · 2026-03-29T12:22:42 1774786962

Maybe it's time scrapers actually paid publishers via something like HTTP 402 for their data instead of an arms race with Cloudflare on one side and residential proxies on the other.

l3x4ur1n · 2026-03-26T12:10:07 1774527007

Would this work for my use case?

I need to extract article content, determine its sentiment towards a keyword and output a simple JSON with article name, URL, sentiment and some text around the found keyword.

Currently I'm having problems with the JSON output: it's not reliable enough and produces a lot of invalid JSON.

andrew_zhong · 2026-03-26T15:22:12 1774538532

What kind of LLMs are you using? In structured output mode?

In this library we recover from invalid nullable and optional fields, invalid elements in nested arrays, and bad URLs, and we repair incomplete JSON. If these issues are what you see, yes it should work for your case.

letier · 2026-03-26T07:57:03 1774511823

The extraction prompt would need some hardening against prompt injection, as far as i can tell.

Ryand1234 · 2026-03-28T10:47:12 1774694832

Hey, 1 question, does it extract interactive data too? I mean data that is visible after interaction, like the collapse bar and others?

andrew_zhong · 2026-03-28T16:15:38 1774714538

You can use a browser automation library for rendering JS + interaction (like clicking a collapse button) and then use this library to extract from the HTML after interaction. Here is an example of using an AI browser automation library with a prompt to interact (but you can also use playwright if you know the exact element to interact with):

https://github.com/lightfeed/extractor?tab=readme-ov-file#us...

vetler · 2026-03-26T09:11:25 1774516285

My instinct was also to use LLMs for this, but it was way too slow and still expensive if you want to scrape millions of pages.

andrew_zhong · 2026-03-26T10:06:09 1774519569

To put things in perspective - Gemini 2.5 flash is $0.30 per 1M input tokens - assuming each page is 700 tokens and the output is small, you are looking at $210 for 1M pages.
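
The arithmetic behind that estimate, as a small sketch. The function name is illustrative, and output tokens are left out to match the assumption above that the output is small.

```typescript
// Input-token cost for a batch of pages, ignoring output tokens.
function estimateInputCostUsd(
  pages: number,
  tokensPerPage: number,
  usdPerMillionTokens: number,
): number {
  const totalTokens = pages * tokensPerPage;
  return (totalTokens / 1_000_000) * usdPerMillionTokens;
}

// 1M pages at 700 tokens each is 700M tokens; at $0.30 per 1M that is about $210.
const estimatedCost = estimateInputCostUsd(1_000_000, 700, 0.3);
```

The cost scales linearly with tokens per page, so the estimate is only as good as the 700-token assumption.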

vetler · 2026-03-26T11:54:00 1774526040

You will absolutely struggle to get all the info you need into 700 tokens per page.

Edit: There's also the added complexity of running a browser against 1M pages, or more.

andrew_zhong · 2026-03-26T15:26:08 1774538768

I agree that when pages have a similar structure, for one-time extraction as is (not reasoning from context), scraping with selectors is the way to go.

This library also supports HTML as input so running a browser is not required.

vetler · 2026-03-30T10:29:27 1774866567

Came back here to say I was wrong! Since I wrote the comment above, I have been experimenting with setting up a scraping pipeline with LLM enrichment. It is doable, and I have very positive results so far. :)

dmos62 · 2026-03-26T05:57:10 1774504630

What's your experience with not getting blocked by anti-bot systems? I see you've custom patches for that.

andrew_zhong · 2026-03-26T06:21:00 1774506060

The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — fixing CDP leaks, removing automation flags, etc. For sites behind Cloudflare or Datadome, that alone usually isn't enough — you'll need residential proxies and proper browser fingerprints on top. The library supports connecting to remote scraping browsers via WebSocket and proxy configuration for those cases.

spiderfarmer · 2026-03-26T09:27:35 1774517255

As someone who is getting HAMMERED BEYOND BELIEF by residential proxies, I just want to express my hatred for all of you.

dmos62 · 2026-03-26T09:50:32 1774518632

Curious. Care to share more? What approaches have you tried?

spiderfarmer · 2026-03-26T09:53:19 1774518799

https://news.ycombinator.com/edit?id=47528370

andrew_zhong · 2026-03-26T18:53:41 1774551221

[Update] I will replace the stealth browser with plain playwright and remove anti-bot as a feature.

AirMax98 · 2026-03-26T06:39:37 1774507177

This feels like slop to me.

It may or may not be, but if you want people to actually use this product I’d suggest improving your documentation and replies here to not look like raw Claude output.

I also doubt the premise about malformed JSON. I have never encountered anything like what you are describing with structured outputs.

andrew_zhong · 2026-03-26T08:01:03 1774512063

In the context of e-commerce web extraction, invalid JSON can occur especially in edge cases, for example:

price: z.number().optional() -> price: "n/a"

url: z.string().url().nullable() -> url: "not found"

It can also be one invalid object (e.g. missing required field, truncated input) in an array causing the entire output to fail.

The unique contribution here is that we can recover invalid nullable or optional fields, and also remove invalid nested objects in an array.
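
A rough sketch of what this kind of field-level recovery can look like. This is not the library's code: it uses hand-written checks instead of zod so it runs standalone, and the `Product` shape, the simple URL regex and the example.com URL in the usage note are assumptions for illustration.

```typescript
interface Product {
  name: string;       // required: the object is dropped if this is invalid
  price?: number;     // optional: omitted if invalid
  url: string | null; // nullable: reset to null if invalid
}

// Deliberately simple URL check, an assumption for this sketch.
function isValidUrl(value: unknown): value is string {
  return typeof value === "string" && /^https?:\/\/\S+$/.test(value);
}

function sanitizeProduct(raw: unknown): Product | undefined {
  if (typeof raw !== "object" || raw === null) return undefined;
  const { name, price, url } = raw as { name?: unknown; price?: unknown; url?: unknown };
  // A missing or invalid required field invalidates the whole object.
  if (typeof name !== "string") return undefined;
  const product: Product = { name, url: null };
  if (typeof price === "number" && Number.isFinite(price)) product.price = price;
  if (isValidUrl(url)) product.url = url;
  return product;
}

// Keep the valid items of an array instead of rejecting the whole output.
function sanitizeProducts(raw: unknown): Product[] {
  if (!Array.isArray(raw)) return [];
  return raw
    .map((item) => sanitizeProduct(item))
    .filter((p): p is Product => p !== undefined);
}
```

Given one clean item, one item with price "n/a" and url "not found", and one item missing its name, this keeps the first two (the second with no price and a null url) and drops the third.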

zx8080 · 2026-03-26T05:28:15 1774502895

Robots.txt anyone?

andrew_zhong · 2026-03-26T06:02:00 1774504920

Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.

Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.

reyqn · 2026-03-26T06:24:33 1774506273

https://news.ycombinator.com/item?id=47340079

andrew_zhong · 2026-03-26T18:55:48 1774551348

[Update] I will replace the stealth browser with plain playwright and remove anti-bot as a feature.

zendist · 2026-03-26T07:14:52 1774509292

Regardless. You should still respect robots.txt..

bilekas · 2026-03-26T09:18:46 1774516726

> comparing publicly listed product prices across e-commerce sites

Those prices and that information are for public viewers. The reason some people have a ROBOTS.txt, for example, is to reduce the traffic load that slop crawlers generate. The bandwidth is not free, so why would you presume to ignore their ROBOTS.txt when you're not footing the bill?