
It is learning though. It’s not just copying the code.

Code gets turned into tokens and then it learns the next most likely token.
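A toy sketch of what "learning the next most likely token" means, using crude word-level counts instead of a real neural tokenizer (everything here is illustrative):

    import collections

    # Tiny "training corpus" of code, split into crude word-level tokens.
    corpus = "def add(a, b): return a + b".split()

    # Count which token follows which: the world's smallest language model.
    counts = collections.defaultdict(collections.Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    # "Predict" the most likely token after "return".
    print(counts["return"].most_common(1))  # [('a', 1)]

A real model learns a smooth function over billions of such statistics rather than a lookup table, but the training objective is the same shape: given the tokens so far, predict the next one.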

The issue that I see most people talk about is the scale at which it's learnt.

A human will learn from other people's code, but not from every person's code.


The issue is that of copyright law WRT derivative works. Machine transformations of original works do not create a new copyright for the person who directed the transformation. That's why you can't get away with pirating a bunch of media by simply adding a red pixel to the right-hand corner or by color-shifting the video.

Copyright law is very clear that if a machine does the transformation, the original copyright on the input is kept. This is why your distributed binaries are still copyrighted: the machine transformed the source code, very significantly, into a binary, and the copyright is maintained throughout.

It would be inconsistent for the courts to suddenly decide that "actually, this specific type of machine transformation is innovative."

I know this is generally really bad for the AI industry, so they just ignore it until a court tells them they can't anymore. And they might get away with it as I don't have faith that the courts will be consistent.


Shredding is a machine transformation. Does it mean that shreds retain original copyright even if the content can't be restored and the provenance can't be traced? It's just an example of how treating all machine transformations equally, with no regard to the specifics, doesn't make much sense.

And the specific property of autoregressive pretraining is that it is lossy compression. Good luck finding which copyrighted materials have made it into the final weights.
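Back-of-the-envelope arithmetic makes the "lossy" part concrete. The figures below are illustrative assumptions, not any specific model's real numbers:

    # Illustrative only: a model trained on ~15T tokens of text, stored
    # in ~70B 16-bit parameters.
    training_bytes = 15e12 * 2            # ~30 TB of training text
    weight_bytes = 70e9 * 2               # ~140 GB of weights
    print(training_bytes / weight_bytes)  # ~214x smaller than the input

Most of the input can't survive a ratio like that verbatim; what survives is mostly statistics, plus rote-learned passages where the data was heavily duplicated.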


> Does it mean that shreds retain original copyright even if the content can't be restored?

Yup, it absolutely does. In fact, that's why you are still violating copyright law by using BitTorrent even though each user is only giving out a small slice, or shred, of the original content.

The US has an affirmative defense for cases like shredding, called "fair use", but that doesn't mean or imply that the copyright is void simply because of a fair use claim.

> And the specific property of autoregressive pretraining is that it is lossy compression.

That doesn't matter. Why would it? If I take a FLAC recording and convert it to an MP3, the fact that it was a lossy transform doesn't suddenly give me the legal right to distribute the MP3.

> Good luck finding which copyrighted materials have made it into the final weights.

That's what the NYT v. OpenAI lawsuit is all about. And for earlier models they could, in fact, pull out full NYT articles which proved they made it into the final weights.

Further, the NYT is currently in discovery, which means OpenAI must open up to the NYT what goes into their weights. A move that, if OpenAI loses, other litigants can also use, because there's a real good shot that OpenAI included their works in the dataset too.


> Yup, it absolutely does

Well, it's not the first time the law has contradicted the laws of nature (to the entertainment of future generations). BitTorrent is not a relevant example, because the system is designed to restore the work in its fullness.

> in fact, pull out full NYT articles

That's the case where they used their knowledge of the exact text they wanted to "retrieve" in order to get the text out? It wouldn't be so efficient with a random number generator, but it's doable.


> BitTorrent is not a relevant example, because the system is designed to restore the work in its fullness.

You can restore shredded documents with enough time and effort. And if you did that and started making photocopies, even incomplete ones, you would run afoul of copyright law.

Bittorrent is a relevant example because it shows that shredding doesn't destroy copyright.

Remember, copyright is about the right to copy something. Simply shredding or destroying a thing isn't applicable to copyright. Nor is giving that thing away. What's applicable is when you start to actually copy the thing.


I meant idealized shredding: a destructive transformation that is still a machine transformation (think blender instead of shredder). When you need exact knowledge of a thing to make an (imperfect) copy of it using some mechanism, it doesn't mean that the mechanism violates copyright.

EDIT: I'm not saying that neural networks can't rote-learn extensive passages (it's an effect of data duplication). I'm saying that they are not designed to do that, and that it's possible to prevent it (as demonstrated by the latest models).


I'd assume it's still a copyright violation if you copied and distributed the shredded copy.

The way I arrive at that: imagine you add just one pixel of static to a video; that'd still be a copyright violation. Now imagine you slowly keep adding random pixels. Eventually you get to the point where the whole video is just static, but at some point along the way it wasn't.
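A rough numpy sketch of that thought experiment (frame size and iteration count are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    frame = np.zeros((8, 8), dtype=np.uint8)  # stand-in for one video frame

    # Overwrite one random pixel per step with noise. There is no single
    # step at which the frame obviously flips from "the work plus a few
    # bad pixels" to "just static"; it degrades gradually.
    for _ in range(500):
        y, x = rng.integers(0, 8, size=2)
        frame[y, x] = rng.integers(0, 256)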

Now, would any media company or court sue over that? Probably not. But I believe that still falls under copyright (but maybe fair use?).

The issue with neural networks is they aren't people. Even when you point your LLM at a website and say "summarize this", the output of that summarization would be owned by the website itself, by nature of it being a machine-transformed work.

Remember, it's not just rote recitation that violates the law; any transformation counts as well. The fact that AI companies are preventing recitation doesn't really solve the problem that they are, in fact, transforming multiple copyrighted works into their responses.


When you point your browser at a website the browser creates a (transformed) local copy of the information that is owned by the website itself. The browser needs to do that to render the website on your screen. Is it a violation of copyright (that the website is willing to tolerate because it profits from advertisements)?

A human is not a commercial product. Here we have commercial product that was created by using a lot of various copyrighted and protected IP, without licensing agreements, without paying, without even citing it.

This reminds me of https://dnhkng.github.io/posts/rys/

David looks inside the LLM, finds the "thinking" layers, makes duplicates of them, and puts the copies back to back.

This increases the LLM's scores with basically no overhead.

Very interesting read.
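If I'm reading the technique right, the self-merge idea boils down to something like the sketch below. It assumes a Hugging Face-style causal LM whose decoder blocks live in model.model.layers as an nn.ModuleList (attribute names vary by architecture); this is the general trick, not David's exact code:

    import copy
    import torch.nn as nn

    def repeat_layers(model, start, end):
        # Deep-copy a slice of decoder blocks and splice the copies in
        # right after the originals, back to back. No retraining needed.
        layers = list(model.model.layers)
        copies = [copy.deepcopy(layer) for layer in layers[start:end]]
        model.model.layers = nn.ModuleList(layers[:end] + copies + layers[end:])
        model.config.num_hidden_layers = len(model.model.layers)
        return model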


Jeff Dean says models hallucinate because their training data is "squishy."

But what's in the context window is sharp, the exact text or video frame right in front of them.

The goal is to bring more of the world into that context.

Compression gives it intuition. Context gives it precision.

Imagine if we could extract the model's reasoning core and plug it anywhere we want.


LLMs "hallucinate" because they are stochastic processes predicting the next word without any guarantees at being correct or truthful. It's literally an unavoidable fact unless we change the modelling approach. Which very few people are bothering to attempt right now.

Training data quality does matter, but even with "perfect" data, and even with the prompt itself in the training data, it can still happen. LLMs don't actually know anything, and they also don't know what they don't know.

https://arxiv.org/abs/2401.11817
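The stochastic part is easy to see in miniature. A toy sketch with made-up logits: the wrong answer always keeps some probability mass, so sampling will eventually emit it just as fluently as the right one:

    import numpy as np

    rng = np.random.default_rng()
    tokens = ["Paris", "Lyon", "Berlin"]  # candidate next words
    logits = np.array([4.0, 1.0, 0.5])    # made-up model scores

    probs = np.exp(logits) / np.exp(logits).sum()  # softmax
    # "Berlin" still carries ~3% probability; sample enough completions
    # and the model will assert it with the same confident wording.
    print(dict(zip(tokens, probs.round(3))))
    print(rng.choice(tokens, p=probs))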


> they also don't know what they don't know

they sort of do tho:

https://transformer-circuits.pub/2025/introspection/index.ht...


I won't quibble even though I likely should. Have to remember this is HN and companies need to shill their work otherwise ... Yes.

I will play along and assume this is sound. 10-40% +/- 10% is along the lines of "sort of" in a completely unreliable, unguaranteed, and unproven way, sure.


That’s not the only issue. They also have the problem that they’re built to always give an affirmative answer and to use authoritative wording, even when confidence is low. If they were trained to answer “I don’t know” instead of guessing, they’d hallucinate a lot less, but nobody seems to want that.

It calls to mind the issue of search engines that refuse to return “0 results found” anymore. Now they all try to give you related but ultimately incorrect results.

To me, that feels like gaslighting. It’s like if you ask someone to buy cheddar cheese at the store and they come back with mozzarella, and instead of admitting that the store was out of cheddar, they try to convince you that you actually really want mozzarella.


> If they were trained to answer “I don’t know”

If they were trained that an answer of "I don't know" was acceptable, the model would be prone to always saying "I don't know", because it's a universally acceptable answer.

It's a better answer even if it does "know".


That could be fixed with the right scoring scheme in training. The SAT exam (for college-bound high school students in the US) used a scheme like this for multiple choice questions. Correct answers are awarded 3 points (with choices a,b,c,d), incorrect answers are penalized with -1 point, and leaving the answer blank (equivalent to "I don't know") is worth 0 points. This way, the expected value of guessing a random answer when the student doesn't know is 0 points so you might as well leave it blank if your confidence in the answer is no better than a random guess.
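The arithmetic checks out; as a quick sketch:

    choices = 4
    correct, wrong = 3, -1  # blank answers score 0

    # Expected value of a blind guess among 4 choices.
    ev_guess = (1 / choices) * correct + ((choices - 1) / choices) * wrong
    print(ev_guess)  # 0.0 -- no better than leaving it blank

    # Guessing only pays once your confidence p clears the break-even:
    # 3p - (1 - p) > 0  =>  p > 0.25

The analogous training signal would reward "I don't know" over a low-confidence guess without making it dominate confident correct answers.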

That just sounds like a very fancy/marketing way of saying "models will hallucinate because you cannot compress all the facts in the world into the model size." (Without even getting into any other things that could cause plausible-but-incorrect output.)

>Imagine if we could extract the model's reasoning core and plug it anywhere we want.

Aren't a lot of the latest model variants doing something very similar? Stuff more domain-relevant knowledge into the model itself on top of a core generally-good reasoning piece, to reduce need to perfectly handle giant context?


I would assume that if you are invited to join this round you will be sent the questions, and that they would also fall under NDA.

Some IDEs already have this. In Zed you can stick it in “ask” mode.

Being able to use it as a rubber duck while it can also read the code works quite well.

There are a few APIs at work I have never worked on, and the person who wrote them no longer works with us, so AI fills that gap well.


Extra high burns tokens, I find. I run 5.4 on medium for 90% of the tasks, and high if I see medium struggling; it's very focused and makes minimal changes.

Yeah but it also then strikes the perfect balance between being meticulous and pragmatic. Also it pushes back much more often than other models in that mode.

Rework burns tokens.

Note: mini-high is similar perf/latency to medium, but much cheaper.

Not a problem if they're offering unlimited, lol

It allows you to track a browser forever because it is a stable fingerprinting point. This helps a great deal with long-term tracking.
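A toy sketch of why one stable value matters (the attribute names here are purely illustrative): fingerprints are typically a hash over many attributes, and a value that never changes keeps the identifier linkable across visits:

    import hashlib

    def fingerprint(attrs):
        # Hash a sorted dump of browser attributes into one identifier.
        blob = "|".join(f"{k}={v}" for k, v in sorted(attrs.items()))
        return hashlib.sha256(blob.encode()).hexdigest()[:16]

    visit = {"user_agent": "Firefox/145", "screen": "2560x1440",
             "stable_quirk": "0x3fe9ea5dcc253e26"}  # the value that never changes
    print(fingerprint(visit))  # same inputs give the same ID on every visit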

If I understand correctly, it was only stable until you restarted Firefox / your computer.

OK, that changes it a bit, but on the other hand I’ve had my browser open for weeks now, and I only restart it when the “update” button turns red lol

Correct. The ordering persists for as long as the original process continues to run.

For agents we have ACP [0]; surely their time would be better spent building this sort of abstraction for computer use than simply teaching an AI to use a mouse?

The computer UI is the way it is because that is optimal for humans. If your plan is to replace humans, why not just replace the whole stack, OS and all, with something these models already know how to use?

[0] https://zed.dev/blog/acp-registry



Most APIs provide some sort of documentation. If it's Swagger, you can just update the application from that.
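For example, a Swagger/OpenAPI spec is machine-readable, so the endpoints can be pulled straight out of it (the URL below is a placeholder; specs usually live somewhere like /swagger.json or /openapi.json):

    import json
    import urllib.request

    # Placeholder URL -- substitute the service's real spec location.
    with urllib.request.urlopen("https://api.example.com/openapi.json") as r:
        spec = json.load(r)

    for path, methods in spec["paths"].items():
        print(path, list(methods))  # e.g. /users ['get', 'post']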

It’s never free; you’re shifting costs from paying a company for their API to the power costs of running the model locally.
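As a rough sketch of that trade-off, with purely illustrative numbers (the API price, token rate, GPU draw, and electricity rate are all assumptions):

    # All figures are illustrative assumptions, not quoted prices.
    tokens = 1_000_000
    api_cost = tokens / 1e6 * 10.0        # $10 per million output tokens
    gpu_hours = tokens / 50 / 3600        # local rig generating 50 tok/s
    local_power = gpu_hours * 0.5 * 0.30  # 500 W draw at $0.30/kWh
    print(f"API: ${api_cost:.2f}, local power: ${local_power:.2f}")

Power is only one of the local costs, of course; the hardware itself is the big line item.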

Sure, but it’ll be orders of magnitude cheaper in a few years. The consumer industry is already moving in this direction, with Apple leading the pack
