Hacker News | larodi's comments

(not trying to pick a fight, but...) what difference does it make if an agent spent $100 to understand and came to the same result, given that those who use it are the ones who benefit from it, hence the craze here.

note: i'm not saying the author did not improve his skills overall, but "6 years" perhaps also means a fair amount of digging through the web with search engines, which are... like AI 0.1


The difference is that despite the fact that I will never use this, I am nonetheless here to celebrate the effort. What difference does it make if someone runs a marathon or takes an Uber, given they may arrive at the same location?

There's an algorithm called dynamic time warping (DTW) that is very often overlooked. My wild guess is that it's at play at Shazam.

Ayyy I used DTW to track bots on a certain social media site. They tend to act in herds so DTW helps smooth out delayed, repeat actions.

It's a brilliant algorithm, and it also works for multi-dimensional data. You can choose different distance functions and it still works. Perhaps Dijkstra-shortest-path-level significant for the robotics/AI era.
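For anyone curious, the core of DTW is a small dynamic program. A rough Python sketch (names are mine, not from any particular library); the distance function is pluggable, which is exactly why it also works on multi-dimensional data — swap in a vector metric:

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Return the DTW cost of aligning sequences a and b.

    cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j];
    each step may repeat an element of either sequence, which is what
    lets DTW absorb delays and time stretching.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

For example, `dtw([0, 0, 1, 2, 0], [0, 1, 2, 0, 0])` comes out 0: the two series are the same shape, just shifted in time, which Euclidean point-by-point comparison would heavily penalize.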

Can someone please explain: does the Cursor EULA really allow it to train on my code? I really don't expect Claude Code or Codex to do that either.

They will, because there is no way to prove they didn't.

It does, unless you opt out.

There's one in BG that does that, and to great success. Reply here if this is a destination you have access to; I'll provide details.

I am open to shops with a good reputation.

This whole poisoning intent is so incredibly misguided that I feel sad about it. First of all, there is enough content to train on already that is not poisoned; and second, the other new content is largely produced in an automated manner from the real world, and by workers in large shops in Africa who are being paid not to produce shit.

So yes, you can pollute the good old internet even more, but no, you cannot change the arrow of time. And then there's already the growing new internet of APIs and public announcement federations, where all this matters very little.


This is an interesting sentiment given how desperate AI labs seem to be to source any new internet content from any walled-garden platform willing to take their money (and how willing they are to try and take it even if you don't consent).

Abusive, sneaky scraping is absolutely through the roof.


I feel as though you are confusing AI use in scraping by random companies with actual AI companies scraping. The AI companies seem to see value in walled-garden sources like Reddit, Stack Overflow, etc. However, I don't think there has been any major instance of a major American AI company doing aggressive website scraping and not respecting robots.txt.

Per https://thelibre.news/foss-infrastructure-is-under-attack-by..., all of the major American AI companies are not respecting robots.txt and are participating in the AI-fueled DDoS of the internet.

The issue is that UA strings are editable by the user, and there is no proof that some random person/scraper isn't just using a suspected trusted bot's UA string. Every ethical service also posts what IP addresses it uses, so that people can compare the traffic they get to see if it is actually that bot scraping. What this article describes is the game of every third-party unethical scraper: they do anything and everything to try to get their requests through. They steal UA strings, they steal residential IP addresses through botnets, they attempt to circumvent CAPTCHAs using AI, etc. So the behavior in this article is not proof of any major AI provider doing unethical scraping.

There may be plenty of content out there, but everyone with any content on the internet is struggling to keep out AI crawlers that they never authorized. In many cases, people are having to do so just to protect their infrastructure from request spamming.

Since AI crawlers don't obey any consent markers denying access to content, it makes sense for content owners who don't want AI trained on their content to poison it if possible. It's possibly the only way to keep the AI crawlers away.
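For reference, those consent markers are just robots.txt directives; something like the following is what site owners are adding (the bot names here are illustrative, and whether any given crawler honors the rules is exactly what's in dispute):

```
# robots.txt — ask AI crawlers to stay out (bot names are examples)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# everyone else may crawl
User-agent: *
Allow: /
```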


I don't think this traffic is actually coming from crawlers for training.

Think about it: why would a training scraper need to hit the same page hundreds of times a day? They only need to download it once.

I think this is LLMs doing web searches at runtime in response to user queries. There's no caching at this level, so similar queries by many different users could lead the LLM to request the same page many times.


> It's possibly the only way to keep the AI crawlers away.

Unfortunately, that won't work. If you've served them enough content to have a noticeable poisoning effect, then you've allowed all that load through your resources. It won't stop them coming either: for the most part they don't talk to each other, so even if you drive some away, more will come; there is no collaborative list of good and bad places to scrape.

The only halfway useful answer to the load issue at the moment is PoW tricks like Anubis, and they can inconvenience some of your target audience as well. They don't protect your content at all; once it is copied elsewhere for any reason, it'll get scraped from there. For instance, if you keep some OSS code off GitHub, and behind some sort of bot protection, to stop it ending up in Copilot's dataset, someone may eventually fork it and push their version to GitHub anyway, thereby nullifying your attempt.
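For context, the PoW idea is hashcash-style: the server hands out a challenge, the client burns CPU finding a nonce whose hash clears a difficulty threshold, and the server verifies with a single cheap hash. A rough Python sketch of the generic scheme (not Anubis's actual protocol; function names are mine):

```python
import hashlib

def solve_pow(challenge: str, difficulty: int = 12) -> int:
    """Brute-force a nonce so that sha256(challenge:nonce), read as a
    256-bit integer, has at least `difficulty` leading zero bits.
    Expected work: ~2**difficulty hashes — this is the client's cost."""
    target = 1 << (256 - difficulty)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify_pow(challenge: str, nonce: int, difficulty: int = 12) -> bool:
    """Server side: one hash, constant cost regardless of difficulty."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty))
```

The asymmetry (exponential cost to solve, constant cost to check) is what makes it a throttle: a human's browser pays the toll once per session, while a scraper hammering thousands of pages pays it thousands of times.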


My point is that if crawlers have to worry about poison, that may make them start to respect robots.txt or something. It's a bit like a "Beware of Dog" sign.

Unfortunately, the use of the sign often highlights what the scrapers want most, so if they pay attention to it (rather than just completely ignoring it, as most do now), it will be specifically to follow it to where they are told not to go.

The scrapers ideally want content that is original. Content that is also new is often more highly prized, but not as much as you might think⁰. This will only become more of a driver as the amount of LLM-generated content out there to be mixed in increases; to limit the Habsburg problem, they won't want too much regurgitated content in the training data.

Bad content from before LLM scraping became a resource problem¹ is highly unlikely to be marked in robots.txt, and the same goes for content newly generated by an LLM. People attempting to fend off scrapers and other bots with robots.txt entries are likely protecting exactly the sort of content the scrapers actively want: original output that they've put some time into, or code in a repo they don't want scraped (as scraping a repo is incredibly inefficient and resource-heavy from the PoV of the repo owner).

I strongly suspect that the amount of desirable content behind robots.txt “blocks” is far too valuable to ignore, despite the poisoned content traps, or things otherwise not worth the time scouring through, that might also be there. A “beware of the dog” sign is no protection when the reader actively wants to see the doggies!

--------

[0] if scraping for training an LLM, you don't want just new content; you would prefer as much of your input data as possible to be as few steps as possible from original

[1] and a copying concern, though I'll avoid that discussion as it can get quite thorny; whichever side of the fence you are on in that matter, the resource consumption is objectively a problem all the same.


How would that become a strong, stable signal, if both highly valuable and highly slopified content will use robots.txt?

For clarification, poisoning and slop are different concepts. Slop is the output of AI. Poisoning is making your content (which may otherwise be good content) mess things up in the internals of an LLM. The classic example is the Nightshade attack on image generators.

One could imagine an open-source project that doesn't want to be ingested by an LLM. They could try to put that in the license, but of course the license won't be obeyed. Alternately, if they could alter the code such that the OSS project itself remains high quality, but any coding LLM trained on it outputs code full of SQL injection exploits (for instance), or maybe just bogus uncompilable stuff, then the LLM authors would suddenly have a reason to start respecting your license and excluding the code from their index.


To me, slop is anything that makes sense at the surface level but falls apart upon closer examination, since there's "nobody home, semantically".

So why would this not be poisoning with (in this case human-generated) slop?


I may be picking nits on nits here, but… If slop indicates content with no reasoning, then deliberate slop isn't slop as it is generated with both reason and purpose known to, and understood by, the creator. Though if someone deliberately uses a generative model to create slop that line of reasoning might eat itself…

Maybe we should extend "death of the author" to non-human author entities as well :)

My bet is many of these crawlers collect price matching, socio-political and other data.

It is curious how it gets decided that all spiders crawl for training. In fact the walled data is much more interesting, particularly Reddit, X, and FB data, where we still have indications that human, or at least correct, data lives.

These cannot be poisoned that easily.


If you put something on the open web, as I see it, you only get so much say in what people do with it.

Yes, they can't publish it without attribution and/or compensation (copyright, at least currently, for better or worse). Yes, they shouldn't get to hammer your server with redundant brainless requests for thousands of copies of the same content that no human will ever read (abuse/DDOS prevention).

No, I don't think you get to decide what user agent your visitors are using, and whether that user agent will summarize or otherwise transform it, using LLMs, ad blockers, or 273 artisanal regular expressions enabling dark/bright/readable/pink mode.

> it makes sense for content owners who don't want AI trained on their content to poison it if possible. It's possibly the only way to keep the AI crawlers away.

How would that work? The crawler needs to, well, crawl your site to determine that it's full of slop. At that point, it's already incurred the cost to you.

I'm all for banning spammy, high-request-rate crawlers, but those you would detect via abusive request patterns, and that won't be influenced by tokens.



Yes, you _can_, but you probably won't.

> there is enough content to train on already, that is not poisoned

This is true. Some documentation of stuff I've tinkered with (though it isn't actually published as such, so it won't get scraped until/unless it is) has content sufficiently out of the way of humans, including those using accessibility tech, yet likely to be seen as relevant by a scraper. It will not be enough to poison the whole database/model/whatever, or even to poison a tiny bit of it significantly. But it might turn any net gain from ignoring my “please don't bombard this with scraper requests” signals into a big fat zero, or maybe a tiny little negative. If not, then at least it was a fun little game to implement :)

To those trying to poison with some automation: random words/characters isn't going to do it; there are filtering techniques that easily identify and remove that sort of thing. Juggled content from the current page and others topologically local to it, maybe mixed with extra morsels (I like the “the episode where” example, but for that to work you need a fair number of examples like that in the training pool), could on the other hand weaken links between tokens as much as your “real” text reinforces them.

One thing to note is that many scrapers filter obvious profanity, sometimes rejecting whole pages that contain it, so sprinkling a few offensive sequences (f×××, c×××, n×××××, r×××××, farage, joojooflop, belgium, …) where the bots will see them might have an effect on some.

Of course, none of this stops the resource hogging that scrapers can exhibit: even if the poisoning works or they waste time filtering it out, they will still be pulling it, using up bandwidth.


You may be underestimating the power of trillions of parameters in a model. With this many parameters, overfitting is inevitable. Overfitting here means you are reproducing (or outputting) the errors in your data instead of interpolating (or inferring) any trends in it.

In fact, given this many parameters, poisoning should be relatively easy in general, but extremely easy on niche subjects.
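A toy illustration of that overfitting point (a pure-Python sketch of curve fitting, obviously not an LLM): a model with as many free parameters as data points reproduces a poisoned outlier exactly, while a two-parameter least-squares line barely moves toward it.

```python
def interp_eval(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through the n points
    (xs, ys) at x, via Lagrange's formula. One parameter per data point:
    the fit passes through every point exactly, poison included."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = float(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def ls_line(xs, ys):
    """Ordinary least-squares line (slope, intercept): two parameters,
    so a single outlier only nudges the fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Four clean points on y = x, plus one "poisoned" point at x = 2.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 100, 3, 4]

print(interp_eval(xs, ys, 2))  # 100.0 — the saturated model repeats the poison
slope, intercept = ls_line(xs, ys)
# the line still predicts roughly 21.6 at x = 2, pulled only partway
```

The analogy is loose (and double descent, mentioned downthread, complicates it for neural networks), but it shows what "traversing every data point" means for a poisoned sample.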

https://www.youtube.com/watch?v=78pHB0Rp6eI


>With this many parameters overfitting is inevitable.

Nope. Go look up double descent. Overfitting turns out not to be an issue with large models.

Your video is from a political activist, not anyone with any knowledge about machine learning. Here's a better video about overfitting: https://youtu.be/qRHdQz_P_Lo


I am not a professional statistician (only a BSc dropout), so I won't be able to gain the expertise required to evaluate the claim here: that double descent eliminates overfitting in LLMs.

That said, I see red flags here. This is an extraordinary claim, and extraordinary claims require extraordinary evidence. My actual degree (not the dropped-out one) is in psychology, and I used statistics a lot during it, but it is only a BSc, so again, I cannot claim expertise here either. But this claim, and the abstracts I scanned in various papers to evaluate it, ring alarm bells all over. I don't trust it. It is precisely the thing we were told to beware of when we were taught scientific thinking.

In contrast, this political activist provided an example (an anecdote, if you will) which showed how easy it was for an actual scientist to poison LLM models with a made-up symptom. This looks like overfitting to me. Those two Medium blog posts very much feel like errors in the data set which the models are all too happy to output as if it were inferred.

EDIT: I just watched that video, and I actually believe the claims in the video; however, I do not believe your claim. If we assume the video is correct, your errors will only manifest as fewer hallucinations. Note that in the demonstration the higher-parameter regression models traversed every single data point in the sample, and that there was an optimal model with fewer parameters which had a better fit than the overfitted ones. This means that trillions of parameters indeed make a model quite vulnerable to poison.


Almost certainly those weren't even in the training data. They showed up too soon; LLMs are retrained only every 6-12 months.

Instead, the LLM did a web search for 'bixonimania' and summarized the top results. This is not an example of training data poisoning.

>This is an extraordinary claim, and extraordinary claims require extraordinary evidence.

Well, I don't know what to tell you; double descent is widely accepted in ML at this point. Neural networks are routinely larger than their training data, and yet still generalize quite well.

That said, even a model that does not overfit can still repeat false information if the training data contains false information. It's not magic.


> even a model that does not overfit can still repeat false information

A good model will disregard outliers, or at the very least the weight of the outlier is offset by the weight of the sample. In other words, a good model won't repeat false information. When you have too many parameters, the model will traverse every outlier, even the ones that are not representative of the sample. This is the poison.

To me it sounds like data scientists have found an interesting and seemingly real phenomenon, namely double descent, and LLM makers are using it as a magic solution to whisk away all sorts of problems that this phenomenon may or may not help with.

> Instead, the LLM did a web search for 'bixonimania' and summarized the top results. This is not an example of training data poisoning.

Good point, I hadn't considered this. Although it is probably more likely it did a web search with the list of symptoms and output the term from there, especially considering the research papers which cited the fictitious disease probably did not include the made-up term in the prompt.


You should check out "model collapse". It seems that an abundance of content that is more and more AI-generated these days may not be a viable option. There is also a vast amount of data that is increasingly going private or behind paywalls.

People love harping on this one, but model collapse hasn't turned out to be an issue in practice.

“It’s been a whole year or two and nothing bad has happened, checkmate doomers!”

It’s pretty shocking how much web content and forum posts are either partially or completely LLM-generated these days. I’m pretty sure feeding this stuff back into models is widely understood to not be a good thing.


It feels like, if it does happen, it will take a lot longer to show up. Also, I doubt they would ship a model that turns out corrupted stuff like this.

It won't mean we see the model collapse in public; more that we struggle to get to the next quality increase.


There have been symptoms of it showing up, such as the colloquially named "piss filter" and the anime mole-nose problem, but so far they've been symptoms rather than a fatal expression of a disease. That they are symptoms, however, shows they can be terminal if exploited properly and profusely. So far we haven't seen anyone capable of the "profusely" part.

Besides models get distilled for fun and profit all the time, which on its own does not support the theory of model collapse.

It doesn't seem like anything has changed to preclude it as a possible outcome yet.

I don't really understand why model collapse would happen.

I understand that if I have an AI model and then feed it its own responses, it will degrade in performance. But that's not what's happening in the wild; there are extra filtering steps in between. Users upvote and downvote posts, people post the "best" AI-generated content (the content they prefer), the more human-sounding AI gets more engagement, etc. All of these things filter AI output, so it's not the same thing as:

AI out -> AI in

It is:

AI out -> human filter -> AI in

And at that point the human filter starts acting like a fitness function for a genetic algorithm. Can anyone explain how this still leads to model collapse? Does the signal in the synthetic data just overpower the human filter?


> Users upvote and downvote posts, people post the "best" AI generated content (that they prefer), the more human sounding AI gets more engagement etc. All of these things filter AI output

At the same time, though, AI-generated content can be produced much, much faster than human-generated content, so eventually AI slop drowns out everything else. You only have to check the popular social media platforms to see this in action: AI-generated posts are widely promoted and pushed on users, the same way most web searches return results with AI-generated pages ranked highly.

Humans can't keep up, and companies are actively working to bypass the human filter and intentionally promote AI-generated content.


The past is not a good predictor of future performance.

>You should check out "model collapse". It seems that an abundance of content, that is more and more AI generated these days, may not be a viable option.

Doom-saying about "model collapse" is kind of funny when OpenAI and Anthropic are mad at Chinese model makers for "distilling" their models, i.e. using their outputs to train their own models.


Totally different use cases. If you have nothing, getting 90% of a SOTA model is very valuable. If you have a SOTA model, it's just a worse model.

Isn't there a difference between: distilling specific AI input/output vs scraping whatever random AI output (with unknown input)?

I'm looking forward to Claude starting to talk like a Nigerian prince

Like the other commenter already pointed out, almost every AI bot out there thinks Fortnite is real, yet it is completely made-up poison.

Waiting for 6502 implemented on top of Sudoku!

How would that work?

I wonder what really stops them from having an agent dig for a night and have this compatibility in place. Even if it means they say: this is very unstable, use with caution.

> Even if it means they say: this is very unstable, use with caution.

AFAIK, the entire point of that reference platform is that nothing is "very unstable" or even "unstable", but instead a stable target to develop against. I'm guessing adding something like that would defeat the purpose somehow, and risk making studios wary enough that it's not worth it.


SideFX has some compatibility flexibility around this with Houdini, but they're the exception. Autodesk has very tight annual release schedules for Maya (and its other DCCs), where actual feature development only has months allocated (and several months for beta). They rarely skip years either, with 2020 being the last one.

in Teknium's words: "Hermes Agent is now the #1 coding app and closing in fast to be the #1 app globally on OpenRouter!!"

> We are strongly, strongly evolutionary oriented away from 'murder' - it's the original sin

Very strong statement given the massive killing of kettle and poultry per second.

Also, given all the wars, including those currently raging, I think it's rather untrue.

Besides, the killing a lion does is not over resources; the kill is the resource itself.


Since you're using Biblical language, I just want to point out that you're not Biblically accurate. Murder isn't the original sin.

no, for sure it's not. it's giving birth, or rather, conception.

Psalm 51:5 — "Behold, I was shapen in iniquity; and in sin did my mother conceive me."


This is a pretty terrible misrepresentation of original sin. I see you managed to copy it straight out of the Wikipedia page and completely ignore the sentence before it:

> In Christian theology, original sin is the condition of sinfulness that all humans share, which they inherit from the Fall of Adam and Eve.
> https://en.wikipedia.org/wiki/Original_sin


what does it mean, a condition of sinfulness? the tendency?

like, if you ask a theosophist, which I did the other day, he claimed it is not about sexuality or human nature at all. It is the sin of attempting to create a reality without God.

so, go figure.


Kettle? Cattle?

Cattle, of course. But not an LLM text :) which is good

Here's one: scratches are officially no longer an argument for a price discount on a second-hand Mac.

Drop them like it’s hot!

