For something to show up verbatim in the output of a textual AI model, it needs to appear in the training input many times.
I wonder if the problem is not Copilot, but many people using this person's code without license or credit, and Copilot being trained on those copies as well. Copilot may just be exposing a problem rather than creating one.
I don't know much about AI, and I don't use Copilot.
Microsoft has publicly stated that they don't use proprietary code, only public code with public licenses. They have a lot of companies as customers who use GitHub, and they also use a lot of third-party code in their own products.
Even BSD et al. have attribution requirements, so the code that could be used without attribution must be a vanishingly small amount. Methinks the people who run GitHub (who have apparently decided to abandon the core business for the latest fun project) aren't being entirely upfront.
With the amount of resources that Microsoft has, how hard can it be for them to exclude proprietary code that other people have stolen? I’d bet it is easy for them, but they won’t do it. Because they don’t care, because who is gonna take them on?
Will they “accidentally” include proprietary code from, say, Oracle? Nope. They’ll make sure of it. But Joe Random? Sure.
Because Microsoft is known to be extremely protective of their code. There is just no way they would expose their internal code to being straight up decoded from the model when they can just train the model on the huge public data of GitHub.