If it is true that Copilot only generates small snippets that aren't under copyright, then why doesn't Microsoft train it on their own internal source code? More training data is good, and they claim there is nothing to worry about. Seems very hypocritical.
The output of a machine simply does not qualify for copyright protection – it is in the public domain. That is good news for the open movement and not something that needs fixing.
This is great. Someone should train a machine learning model on leaked Windows source code and use it to generate a public domain implementation of Windows. The same should be possible with music or movies. But it can't be a way to strip open source licenses while leaving proprietary copyright intact.
If Copilot leads to copyright being abolished completely, I am all for it.
Let's take those AI generated anime faces… they are in the public domain because a machine created them (great news, I wasn't sure about that before), but if you narrow the AI model's parameters so that the resulting image looks (nearly) exactly like an existing copyrighted image, then you (!) are still committing copyright infringement.
Or to put it differently: a robot pen that paints random but nice-looking lines can't create a copyrighted work, but if you restrict the pen so that it makes a picture that looks very close to an existing copyrighted artwork, then that is copyright infringement regardless of how the image was created.
Edit: AI generated code and art will make copyright mostly worthless, but not void.
How is a programmer who uses Copilot going to know that the snippet being suggested to them comes from a GPL-licensed project? At the moment that's impossible, so it can't be the standard assumption that the output is public domain.
How is a programmer going to know that the person who posted code on Stack Overflow hasn't taken it from a GPL-licensed project? But that question is beside the point and irrelevant to the question of whether the ML model itself is (legally speaking) a derivative of the training data used. IANAL, but this is currently not the case under existing copyright legislation around the globe.
As for the output itself: there is a legal concept in copyright law that really small snippets of text or sound cannot be copyrighted. If the AI then assembles genuinely new code and functionality from these snippets (theoretically feasible, but not what the co-pilot does), then the resulting code is in the public domain, since (IANAL) a machine currently cannot hold copyright (and whether its owners can claim copyright AFAIK hasn't been fully established in the courts). But if a human programmer uses a tool like the co-pilot to assemble these snippets, he or she can claim copyright to the result.
But if the result is nearly indistinguishable from a copyrighted piece of code, then that programmer will not be able to prove that it wasn't in fact a copyright violation, and thus in practice it is one.
Posting code on Stack Overflow doesn't magically put it in the public domain, as Copilot allegedly does.
(theoretically feasible, but not what the co-pilot does)
I am not considering what Copilot could or might do in the future. I am talking about what it does now, and that is generating exact copies of 10+ lines, including license texts, which it certainly didn't assemble on its own.
No one is claiming that the co-pilot magically puts all the code it suggests in the public domain. That is just a straw man argument.
If the code snippets it suggests have insufficient technical complexity to be considered copyrightable, then, like any other such text snippet (regardless of the source), they are in the public domain. This is half-line or single-line auto-completion level stuff.
If the programmer chooses to keep pressing the autocomplete button so that a sufficiently complex piece of code is pasted into their editor, then that programmer has to be aware that this is likely a copyright violation, just as if he or she were cutting and pasting large pieces of code from Stack Overflow or any other source where the license isn't clear.
Will Copilot warn the original author of the stolen code in that case, so that they can sue the copyright violator? Why does Copilot even allow inserting more than one line in that case? If you are right, that means it is actively encouraging copyright violation, which puts it on the same level as thepiratebay.org.
Will your preferred code editor warn the original author that you just cut and pasted some copyrighted code into it? How would it even know?
It allows inserting more than one line because it is dumb and cannot know whether the piece of code it referenced is copyrighted or who wrote it. It just looks at the immediate context around your cursor, then looks at its database where it says "usually these three words are followed by these other three words or letters" and suggests that (very roughly speaking).
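The "these three words are usually followed by" picture above resembles a classic n-gram model. As a hedged illustration only (Copilot itself is a large neural network, not a lookup table), a minimal trigram-based suggester along those lines might look like this:

```python
from collections import Counter, defaultdict

def build_model(corpus_tokens, n=3):
    """Count which token most often follows each (n-1)-token context."""
    model = defaultdict(Counter)
    for i in range(len(corpus_tokens) - n + 1):
        context = tuple(corpus_tokens[i:i + n - 1])
        nxt = corpus_tokens[i + n - 1]
        model[context][nxt] += 1
    return model

def suggest(model, context):
    """Suggest the most frequent continuation for the given context, if any."""
    counts = model.get(tuple(context))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Toy "training corpus": a single tokenized line of code.
tokens = "for i in range ( n ) : total += i".split()
model = build_model(tokens, n=3)
print(suggest(model, ["i", "in"]))  # suggests "range"
```

Note that the model keeps only aggregate counts, which is exactly why a tool built this way cannot tell you where a suggested continuation originally came from.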
And no, it is not anywhere close to The Pirate Bay ;)
If it is true that Copilot only generates small snippets that aren't under copyright, then why doesn't Microsoft train it on their own internal source code? More training data is good, and they claim there is nothing to worry about. Seems very hypocritical.
This is great. Someone should train a machine learning model on leaked Windows source code and use it to generate a public domain implementation of Windows. The same should be possible with music or movies. But it can't be a way to strip open source licenses while leaving proprietary copyright intact.
If Copilot leads to copyright being abolished completely, I am all for it.
Well, I wouldn’t want my code autocomplete to learn from MS’s code…
No, you misunderstood that sentence.