
Forum: Author Hangout

Humorous AI weirdness

Soronel

I decided to give AI a try, ran something I'd already written through it to see how well its summary function works.

It somehow transformed a raw and primal ground-bound dragon mating fight, followed by the fucking of the exhausted loser, into "a beautiful mating flight".

Oh, and it completely refused to acknowledge one character being trained to shoot a pistol. Simply not a word about the topic. Although it did a good job describing her reaction to watching a rather horrific public execution.

I also did this with a different AI, asking it to summarize some computer code. I will admit everything it came up with was true and accurate, but it also completely missed some very important requirements. If I were grading it, I'd say that one was worth about 75%.

The Outsider

@Soronel

Sounds like it's improving, but it's not there yet... I'll keep trying to coax my muse out of the dark corner she's hiding in before I go that route, though...

Switch Blayde

@Soronel

I'm not a fan of using AI to help my writing, especially critiquing my writing. And especially suggesting how to improve it. I hate the way the AI rewrites it.

I have used AI to come up with a simile. It gives me a bunch of awful ones, but when I keep asking I get a few good ones.

I did use it recently to come up with the title of my just published novel. I originally called it "The Accident." The opening line/paragraph in the novel is "I was an accident." But the names of the hero and heroine are Chance and Charity, respectively. The AI came up with "A Chance for Charity." Perfect title so that's what it is.

And as I said, I use Meta AI to create images for book covers.

Switch Blayde

@Soronel

Humorous AI weirdness

You know, when I first saw this thread I read the capital "I" in "AI" as a lower case "L" ("l") and thought it was going to be about the comedian Weird Al Yankovic.

Soronel

@Soronel

I also keep seeing pieces saying that computer programmers need to be worried, but every time I have tried AI in that arena the results are so far from what I actually wanted that it just left me scratching my head.

Replies:   REP
REP

@Soronel

it just left me scratching my head.

Yeah, but why are you scratching your head?

An AI is fed data by humans to create a database. The human-created database includes fictional stories to define real life and human emotions. The AI's programming starts with the assumption that all of the data in the database is complete and valid, which is a false assumption. The AI then tries to merge valid and invalid data into a summation that matches real life. The invalid data introduces errors into the summation. Since AIs cannot distinguish valid data from invalid data without human intervention, their summations will always be corrupted.

A human can discriminate between valid and invalid data and can reject invalid or incomplete data when creating their summation. Therefore, a human's summation is more accurate to life than what an AI will create.

Using an AI's output is okay if you are using the AI's summation as the input for a story or similar use, for you can take the AI's summation and modify it to create a story that is close to real life.

I have read comments in this forum about AIs adding to their database from external sources. If that is true, the question for me is - since an AI can't distinguish between valid and invalid data, is the data being added to the database increasing or decreasing the degree of database corruption?

A human has experienced real life and real emotions. Putting those experiences into words does not provide an accurate description of the personal impact of pain, loss, joy, and other emotions on a person. Therefore, inputting a written description of the emotional impact on a person to an AI does not provide the AI with accurate information about the impact the emotions have on a person.

If you are scratching your head because an AI's output is not close to the real world, just remember the saying - Garbage In, Garbage Out. If you are scratching your head at the idea of an AI becoming a threat to someone's job, ask yourself who is writing the article you are reading. The article's writer is obviously a proponent of AI technology, or possibly someone using an AI to write their article; if so, the person defined what the article was to say, so Garbage In, Garbage Out.

Replies:   awnlee jawking  Joe Long
awnlee jawking

@REP

The human-created database includes fictional stories to define real life and human emotions.

One of the stories I'm following on SOL has AI generated scenes. I've googled unique sentences from the story and found exact matches on social media.

I think it's likely the AI in question (identity unknown) was trained on social media. And we all know how accurate that is.

(The story contains the usual AI bloopers - character names wrong, timeline inconsistencies between scenes, etc.)

AJ

Replies:   Paladin_HGWT  Joe Long
Paladin_HGWT

@awnlee jawking

The story contains the usual AI bloopers - character names wrong, timeline inconsistencies between scenes, etc.

Humans, and not just on SoL, have done all of those things too...

I haven't read any AI-cribbed stories, just a few bits, and have only listened to a few AI video "shorts", and I am appalled by the foibles and the completely wrong mess generated by AI.

What little I have seen produced by AI would, at best, get a 4 out of 10 rating, and I feel I am being generous.

That said, I do use the thesaurus in LibreOffice, but that is just an occasional substitute for one of the several print thesauruses on my shelf, among other books containing various similes and alternative words.

I have not run any of my writing through anything other than a spelling (and pseudo grammar) checker. I have considered using Scrivener or similar writing programs to review my writing, because my self-editing is not up to the standards I would prefer to achieve.

Lacking a proofreader, let alone a competent Editor, I am considering alternatives.

Even when I had a proofreader, I didn't always accept all of their suggestions. In fact, I probably rejected more than half of them, excluding corrections of misspelled or wrong words ("their" instead of "there", etc.).

It is my story, NOT another's story. I don't want any unintended misspellings or bad grammar.

I do think there are significant differences between running your own writing through an AI and then considering whether to make changes yourself, compared to a person feeding prompts into an AI which then "writes" a story (garbage)...

Replies:   awnlee jawking  Joe Long
awnlee jawking

@Paladin_HGWT

compared to a person feeding prompts into an AI which then "writes" a story (garbage)...

I think that's exactly what the author is doing, only the readers rate it very far from garbage despite the faults.

AJ

Joe Long

@Paladin_HGWT

I have not run any of my writing through anything other than a spelling (and pseudo grammar) checker. I have considered using Scrivener or similar writing programs to review my writing, because my self-editing is not up to the standards I would prefer to achieve.

Lacking a proofreader, let alone a competent Editor, I am considering alternatives.

Even when I had a proofreader, I didn't always accept all of their suggestions. In fact, I probably rejected more than half of them, excluding corrections of misspelled or wrong words ("their" instead of "there", etc.).

It is my story, NOT another's story. I don't want any unintended misspellings or bad grammar.

This is exactly how I use AI: critique my drafts during development, copy-edit when I'm done, after which I review every correction - but I don't let it change anything I've written. All the ideas remain mine alone.

Joe Long

@awnlee jawking

One of the stories I'm following on SOL has AI generated scenes. I've googled unique sentences from the story and found exact matches on social media.

I asked ChatGPT to critique a first draft of a scene. I'd ended with the MC asking a question, thinking I'd let it hang. The AI suggested a response from the other character, just one sentence, and I thought, "Yeah, I like that." A few days later I read the same sentence, in response to a similar question, in "Summer Camp Swingers." But it does work well in my text.

Joe Long

@REP

The AI's programming starts with the assumption that all of the data in the database is complete and valid, which is a false assumption.

I believe that is a false assumption. What is stopping the programmers who are training the AI from categorizing and ranking sources, and teaching the AI to do comparative analysis between the different sources, mimicking what a rational human would do?

TheDarkKnight

@Soronel

I'm not interested in trying to use AI (or A1, as Linda McMahon, our Secretary of Education, calls it) to help me. Maybe I'm just masochistic, but I enjoy the challenge of writing, even when it's hard.

Soronel

@Soronel

I realized I forgot the bit I found most humorous. The AI's summation said the dragons' mating flight "shook the very ground". Now, the dragons' fight/mating in my story did shake the ground, but not their flying away afterwards.

fohjoffs

@Soronel

Yep, I have used LLM machines to help me with WTF moments when I have been baffled by others' code. This has been (somewhat) effective.

The other recent use has been for antenna design. NASA was using 'AI' to do antenna design almost 30 years ago. I use it because I am not an RF engineer. So, other than generating empirical data, I cannot verify the solution.

The moral of the story is that if you use an LLM-based engine to do work outside of your education and experience, you may be swimming in shark-infested waters with one hand on the keyboard.

I can generate mixed metaphors all day long.

irvmull

@Soronel

"Garbage in, garbage out" is understandable.

However, what happens when the input is simple and known to be accurate? AI still gets it wrong. Very wrong.

I have made requests like: "Using the latest GTK4 official online documentation, create a list of all properties for a gtk4 window widget".

I get back a list with only a few items that are actually part of the GTK documentation, and lots that just seem to be dreamed up. Many properties that are listed on the documentation web page fail to make it onto the AI-generated list as well.

Worse, when I point out the errors and omissions, AI offers a new "corrected" and "accurate" list, which is usually even worse than the first try.

It's worse than useless for something like that.
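
For what it's worth, this particular list doesn't need an AI at all - GTK can enumerate its own properties through GObject introspection. A minimal sketch, assuming Python with PyGObject and GTK4 installed (list_properties() and the GParamSpec fields are standard GObject introspection; the print formatting is just my choice):

    import gi
    gi.require_version("Gtk", "4.0")
    from gi.repository import Gtk

    # Ask the installed GTK4 itself for GtkWindow's properties, so the
    # list matches the real library rather than whatever an LLM dreams up.
    for pspec in Gtk.Window.list_properties():
        print(f"{pspec.name} ({pspec.value_type.name}): {pspec.blurb}")

That guarantees the list matches the GTK version actually on your machine, which even a perfect AI summary of the online docs couldn't promise.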

Replies:   Grey Wolf
Grey Wolf

@irvmull

This is partly why I'm not at all certain that the lawsuits over AI as (output-wise) a 'copyright infringer' are all that likely to succeed. Yes, with the proper prompt, on certain AIs, you can reproduce certain sequences from books with a moderately high chance of success.

Mind you, for one of the ones tested ('Harry Potter and the Sorcerer's Stone') the model could produce 50-token sequences from 42% of the book with a 50% chance of success. That sounds big, right? But that means it can't even get 50-token sequences from 58% of the book, and half of the ones in the 42% it does 'know' are still wrong. That's not all that great. If model trainers should be on the hook for damages because their models could be used to make really lousy and often incorrect copies of books, it sure seems like photocopier makers should be out of business forthwith, since their products can be used to make very good and accurate copies of books with a whole lot less effort.
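
(To put numbers on that: if 42% of the book yields a 50-token sequence with 50% success, then roughly 0.42 x 0.5 = 21% of the book's 50-token sequences come out correct on a given try, and the remaining 58% to 79% come out wrong or not at all, depending on how you count the near misses.)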

AIs have a fair bit of utility for some jobs, and they can do extremely well with constrained data sets and careful tuning. If you were to have an AI pointed to an indexed version of the GTK4 documentation, it would do enormously better while also being better at search than most anything else. But, if you're just counting on general training, it'll fall on its face a large percentage of the time. Software API documentation 'looks like' software API documentation, so it's hardly surprising that it might flop around quoting things from other, similar-in-style documentation sets.

Replies:   irvmull  julka
irvmull

@Grey Wolf

If you were to have an AI pointed to an indexed version of the GTK4 documentation, it would do enormously better while also being better at search than most anything else.

And yet, if you read my post, that is exactly what I did: I instructed it to use the official indexed documentation. And it got it enormously wrong. A 10-year-old with a pencil and paper could look at the web page (only one page needed - all the info is right there in a nice table) and write it down in less time than it took to get imaginary results from AI.

Replies:   Grey Wolf
Grey Wolf

@irvmull

That's not something most production AIs can do all that easily, as far as I know (maybe there are exceptions). It's not going to look up the documentation and refer to it. You need to use something like RAG, or put the documentation within its context buffer, to get it to do that. You might have instructed it to do that, but that's not within its capabilities.

There are exceptions - some AIs have limited RAG capability built in - but most of them cannot refer outside of their model 'on the fly', and simply telling it to 'go look at the webpage' likely won't accomplish what you want.
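
To make the distinction concrete, here's a minimal sketch of the 'put the documentation within its context buffer' approach, assuming Python with the openai client library (the model name and the docs file are placeholders of my choosing, not recommendations):

    from openai import OpenAI

    # Paste the actual GTK4 reference page into the prompt, so the model
    # summarizes supplied text instead of recalling (or inventing)
    # training data.
    docs = open("gtk4_window_docs.txt").read()  # hypothetical local copy

    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer using only the documentation provided."},
            {"role": "user",
             "content": docs + "\n\nList every property of GtkWindow."},
        ],
    )
    print(reply.choices[0].message.content)

Telling the model to 'use the official documentation' without actually supplying it does nothing like this.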

Replies:   Joe Long
Joe Long

@Grey Wolf

I'm a heavy AI user, and I'm reading this thread thinking "I haven't seen the AI be this bad in a year or two." I agree with your assessment that it's better to provide a link or upload a doc rather than just naming it.

So I'd like to believe that my better results have come as a result of 1) familiarity with the tech, leading me to be better at prompting, and 2) having also used that time to train my instance of the AI and allow it to become more familiar with me.

julka

@Grey Wolf

it sure seems like photocopier makers should be out of business forthwith, since their products can be used to make very good and accurate copies of books with a whole lot less effort.

I think you'll find - and I admit I'm not a lawyer, so I'm going out on a little bit of a limb here, but I'm fairly confident - that using a photocopier to make a copy of a book you don't own the rights to, which you then go and sell to other people, is in fact also copyright infringement.

I'm glad to see you've come around on the idea that selling access to works you don't own the rights to is in fact against the law, even if you do it in a novel (ha!) way.

Replies:   Dominions Son  Grey Wolf
Dominions Son

@julka

using a photocopier to make a copy of a book you don't own the rights to, which you then go and sell to other people, is in fact also copyright infringement.

True, but that liability is on the user of the photocopier, not the manufacturer.

Replies:   julka
julka

@Dominions Son

"User" is a broad and imprecise term; let's be more precise and say the liability is on the individual who produced and distributed the copies.

Now, if you have a magic box where somebody can walk up and flip a coin and if they flip heads, they can open the box and take out a free copy of the book, do you think liability is on the person flipping the coin or the person who filled the box with infinite copies of the book?

Dominions Son

@julka

Now, if you have a magic box where somebody can walk up and flip a coin and if they flip heads, they can open the box and take out a free copy of the book, do you think liability is on the person flipping the coin or the person who filled the box with infinite copies of the book?

This is in no way an accurate representation of how LLMs work.

Replies:   julka
julka

@Dominions Son

Around two or three weeks ago I was seeing conversations with Grok where people would ask for, and receive, the full text of the first chapter of Harry Potter and the Sorcerer's Stone. My representation of ChatGPT/Grok/other commercially popular general purpose LLMs as "a magic box that you flip coins at to get a book" is inaccurate for its simplicity, but mostly because I've discarded all the other non-copyrighted material that happens to be in the box as well.

Replies:   Grey Wolf
Grey Wolf

@julka

Grok can indeed do that, because it knows where to find a copy of a text file with the text of that chapter. The reason it knows is that there are hundreds of such files out there on the net. The liability for the infringement goes to the person who originally published the text, not Grok - it's just pointing you to web resources.

Again, the public library can also produce the first chapter of Harry Potter and the Sorcerer's Stone at will. So can Google Chrome, by the same mechanism Grok is using (referring to a text file). The library, and Chrome, are at least as much infringing as Grok is.

akarge

@julka

Now, if you have a magic box where somebody can walk up and flip a coin and if they flip heads, they can open the box and take out a free copy of the book, do you think liability is on the person flipping the coin or the person who filled the box with infinite copies of the book?

Obviously, it is the fault of whoever minted the coin.

Replies:   julka  AmigaClone
julka

@akarge

Agreed - all our problems are actually caused by capitalism at their roots, but I think people here tend to get mad when you say that.

Replies:   awnlee jawking  Not_a_ID
awnlee jawking

@julka

all our problems are actually caused by capitalism

When writing letters by hand, I capitalise names and addresses. Therefore I'm a capitalist and I cause all the world's problems ;-)

If you don't count door frames (and I don't), there are no arches on my property so I'm also an anarchist.

My garden is nature-friendly, particularly for birds and bees. I think bumblebees are great because they don't know they're too heavy to fly. You could call that communing with nature, making me a communist too.

AJ

Not_a_ID

@julka

Agreed - all our problems are actually caused by capitalism at their roots, but I think people here tend to get mad when you say that.

To paraphrase the refrain about democracy being the worst form of government -- except for all of the other forms tried before.

Capitalism is the worst form of economics, except for the forms that were tried before.

Mostly because without the concept of capital, you have no economy.

All economic theories and alternative systems are capitalist in nature, even communism. Barter is also a system of capital (resource) exchange, so it is capitalist as well.

Where and how they differ is over who gets to control the capital, and how it gets used.

julka

@Not_a_ID

All economic theories and alternative systems are capitalist in nature, even communism. Barter is also a system of capital (resource) exchange, so it is capitalist as well.

I desperately want you to explain what you think "capitalism" means because the answer is not "every economic system that includes the concept of productive inputs" and that's how I think you are using it here.

Grey Wolf

@Not_a_ID

This feels so broad as to be useless. If 'all economic theories and alternative systems are capitalist in nature,' then there can be no other 'forms that were tried before,' by definition. Any form that was tried before was 'capitalistic by nature'.

It also means 'all our problems are actually caused by capitalism at their roots' would collapse down to 'all our problems are caused by having an economy', mind you.

And, heck, maybe that's true. If we had no economy, we would have plenty of problems, but they certainly wouldn't be the same problems we have now.

awnlee jawking

@Not_a_ID

Mostly because without the concept of capital, you have no economy.

What about the barter economy - I'll build you a house and a barn in exchange for my pick of your daughters, a goat and half a dozen eggs?

Does the lack of currency tokens changing hands mean it's not an economy because it's not measurable?

AJ

AmigaClone

@akarge

Actually, it's the fault of the person who provided the finances for the entity that purchased the refined metal to mint the coin.

Grey Wolf

@julka

Now, if you have a magic box where somebody can walk up and flip a coin and if they flip heads, they can open the box and take out a free copy of the book, do you think liability is on the person flipping the coin or the person who filled the box with infinite copies of the book?

The user of the box. Your 'magic box' is called a public library with a photocopier. Anyone can walk into a public library, photocopy a book, and walk out with as many copies as they want. The liability for the infringement is upon the person making the copies, not the library, notwithstanding that the library has provided everything necessary to generate copies of thousands of copyrighted works.

If the library is not liable, surely the AI producer cannot be. The library offers the ability to make perfect copies. The AI producer is merely offering the ability to make wildly imperfect copies with probabilistic behavior. How could one possibly believe the AI was more infringing than the library?

Grey Wolf

@julka

using a photocopier to make a copy of a book you don't own the rights to, which you then go and sell to other people, is in fact also copyright infringement.

Yes, exactly. And, if AI makers should be liable for that infringement, so should copier manufacturers.

But, in fact, it is the user of the copier who is held liable, just as it should be the user of the AI who is held liable, not the provider of the AI.

I'm glad to see you've come around on the idea that selling access to works you don't own the rights to is in fact against the law

I never said otherwise. Current AIs do not 'sell access to works [the AI operator doesn't] own the rights to' - that was the point. At their high point, the AI might sell access to some small percentage of the work, and that's generally considered legal within US law, especially since the level of access is probabilistic at most.

(Minor edit) - see below - unless the AI is 'selling access' to the results of a web search which finds the infringing material. But, by that standard, Google Chrome is a far bigger infringement problem than AIs are.

Replies:   Joe Long
Joe Long

@Grey Wolf

Current AIs do not 'sell access to works [the AI operator doesn't] own the rights to' - that was the point. At their high point, the AI might sell access to some small percentage of the work, and that's generally considered legal within US law, especially since the level of access is probabilistic at most.

As I write, I also read a lot, to see other authors' styles or be inspired by some theme or topic. With the AI, I can discuss the book as I'm reading, ranging from "Looking for Alaska", "Eleanor and Park", "The Perks of Being a Wallflower", "Less Than Zero", "Play It As It Lays", "The Catcher in the Rye", "A Portrait of the Artist as a Young Man" and many others. The AI has already 'read' those books and engages me in conversation, acting as a scholarly mentor with me as its literature student. It compares and contrasts those books to the one I'm writing. That's fair use. Reading a book isn't the problem. It's a matter of how the author is financially compensated and that the reader doesn't violate copyright laws in how it shares what's been read with others.

jimq2

@Soronel

Try googling "first chapter of Harry Potter and the Sorcerer's Stone" and see what you get. Google's AI gives multiple links to a full copy of the actual chapter, including one at SDSU (San Diego State University).

Some contain a lot more than the first chapter.

Replies:   julka
julka

@jimq2

I'll go out on a limb and assume this is actually directed at me and not Soronel, since I'm the only person in the thread who's brought up Harry Potter - sincere apologies if you were actually talking to him!

At the time that I observed the behavior, Grok had been configured not to search the web to find an answer, so it nominally should have been relying only on its training data.

Now, is the first chapter of Harry Potter available broadly on the internet? Sure. It's also available in book stores across the globe. I assume, although I have not verified, that at least some of those reproductions of the chapter are in some way licensed by the publisher, and honestly it's probably a safe assumption that some of them weren't.

That said, it's not super relevant to the point I'm making - the material itself is copyrighted, the copyright doesn't go away just because you can find it in various places online, and a computer program which spits out copyrighted material it has consumed, on demand, even probabilistically, is infringing that copyright unless it has licensed the material for distribution.

Replies:   Grey Wolf
Grey Wolf

@julka

At the time that I observed the behavior, Grok had been configured not to search the web to find an answer, so it nominally should have been relying only on its training data.

Grok's training data includes quite a bit of web-search material. If you look at the output, it's clearly just referencing a text file that was indexed as part of its training data.

A bit more experimentation with Grok actually produces output making it crystal clear that it's sourcing the text of Harry Potter and the Sorcerer's Stone from live web pages, not from within its training data. It's acting as an advanced web browser, in other words.

a computer program which spits out copyrighted material it has consumed, on demand, even probabilistically, is infringing that copyright unless it has licensed the material for distribution

And a photocopier which spits out copyrighted material it has consumed, on demand, is infringing that copyright unless that material is licensed for distribution.

And a web browser which spits out copyrighted material it has consumed, on demand, is infringing that copyright unless that material is licensed for distribution.

In both cases, the liability falls on the user of the tool. Some liability might also fall on the person who improperly published the copyrighted material, if they did not have a license to do so.

In neither case does the liability fall on the maker of the tool. Why should an AI be held to a different standard than other tools that can infringe copyright? We've already been through this with photocopiers, personal computers, and VCRs, among other tools. There's an enormous amount of legal history built up holding that it's the user of a tool that can be used to infringe copyright who bears the legal responsibility for such infringing, not the maker of the tool, unless the tool has no legal use except to infringe. But AIs have numerous legal uses that do not infringe, so that standard wouldn't apply.

Replies:   julka
julka

@Grey Wolf

Gonna condense a few threads in to one here for ease of reading and my own sanity.

Your 'magic box' is called a public library with a photocopier.

Well, not quite - for one thing, public libraries have licensed the works they loan out and users have a temporary license in turn. In that case, copying the work would be infringement committed by the user. An LLM did not license the work, and so when the LLM provides a copy of it, it's infringing by itself. The user who requests may also be infringing, but that's not a super interesting question to me and I feel like we should probably hold multi-billion dollar companies to a higher standard than an internet rando anyways.

Grok's training data includes quite a bit of web searches.

I mentioned this elsewhere in the thread, so you could have easily missed it; at the time I saw the behavior, Grok was explicitly configured to NOT perform web searches. If your point here is that Grok went out and searched for copyrighted data and then included that in its training, and then returned that data on request, then yes; I agree, that's exactly my point. That's a bad thing!

Regardless, though, Grok is definitely returning copyrighted information beyond just "Chapter One of Harry Potter" - you can see Ed Newton-Rex's twitter post[1] where he shows Grok returning a variety of copyrighted data, ranging from images generated with copyrighted characters (Homer Simpson, Marvel Studios' Iron Man) to Harry Potter to a recent New York Times article. Can you find all of those things through a browser? Yes, absolutely. That doesn't make Grok a browser - just because some other tool can be used to do a bad thing does not mean the makers of the LLM have abdicated the responsibility they hold for the tool they made.

Why should an AI be held to a different standard than other tools that can infringe copyright?

Because copyrighted material wasn't consumed in the production of a web browser, but it was in the training of the LLM!
The circumstances are different, and so the standards differ as a result. If, for example, a photocopier was shipped with a PDF copy of John Grisham's legal thriller "The Firm" and the test print button ran off a full copy of it, that would be copyright infringement by the maker of the photocopier even if you also want to point the finger at whoever presses the test print button.

[1]: https://x.com/ednewtonrex/status/1942263535364686163

Editing periodically for cleanup - the original post was being written while my infant was in the beginning stages of A Sad and now he's napping on my chest - neither circumstance is especially great for perfect writing and formatting.

Replies:   Grey Wolf
Grey Wolf

@julka

Well, not quite - for one thing, public libraries have licensed the works they loan out and users have a temporary license in turn. In that case, copying the work would be infringement committed by the user. An LLM did not license the work, and so when the LLM provides a copy of it, it's infringing by itself.

This is seriously incorrect as to how libraries work. Libraries buy books. The people who trained the AI bought books. If either doesn't buy the book, that's an infringement. If they did, there's no infringement. Reading a lawfully obtained book is fine, whether that reading is done by a human or a computer.

I mentioned this elsewhere in the thread, so you could have easily missed it; at the time I saw the behavior, Grok was explicitly configured to NOT perform web searches.

But it did it anyway. Grok is very clear about where the data is coming from. Yes, it's a bad thing that it's ignoring the 'don't do web searches' flag (which no longer exists), but it doesn't create copyright infringement.

If Grok was merely told to not search the web, that likely has as much power as telling Grok to search the infinite cosmos. Telling AIs to do / not do things that their programming doesn't support is doomed to failure.

But I think you're missing the point. Grok didn't do a web search. It was trained to 'know' that chapters of Harry Potter can be found at link X, Y, Z. There's no 'search' required. It's just reporting what it was trained to report.

That doesn't make grok a browser - just because some other tool can be used to do a bad thing does not mean the makers of the LLM have abdicated the responsibility they hold for the tool they made

What 'responsibility' do they have to produce less copyrighted material than Chrome produces? From where does this responsibility flow? It's not in US law.

Because copyrighted material wasn't consumed in the production of a web browser, and it was in the training of the LLM!

If the copyrighted material was legally obtained, using it to train an LLM is also legal. There is already case law to that effect, and it's entirely consistent with how Fair Use has been interpreted for decades.

The user of a tool is responsible for infringement, not the maker of the tool. That's been the law for decades. There is no reason to turn that on its ear just because of a new technology.

Replies:   julka
julka

@Grey Wolf

The people who trained the AI bought books.

They did not. Meta, for example, used pirated material from Libgen [1].

What 'responsibility' do they have to produce less copyrighted material than Chrome produces?

Interesting choice of words! Chrome as a browser absolutely has a responsibility to not produce copyrighted data! When Chrome retrieves data from a server, it's not producing anything; just retrieving and rendering. An LLM is absolutely either producing copyrighted material or retrieving copyrighted material it was trained on without license. The standards are different because fundamentally a different thing is happening.

The user of a tool is responsible for infringement, not the maker of the tool. That's been the law for decades. There is no reason to turn that on its ear just because of a new technology.

Mmm, not sure that's the case? If you start selling a Raspberry Pi with an SNES emulator on it, loaded up with a few hundred ROMs of various games, I'm fairly confident you can get nailed for infringement even if you yourself never turn on one of your devices.

[1]: https://www.transparencycoalition.ai/news/so-meta-pirated-your-books-and-articles-heres-what-you-can-do

Replies:   Grey Wolf
Grey Wolf

@julka

They did not. Meta, for example, used pirated material from Libgen [1].

In which case, they should be held liable for obtaining the pirated material. No argument there. Other AI models are known to have not used pirated material (there have already been court cases to that effect). And, if the material was pirated, the legal recourse is exactly the same as if you went to a library and photocopied a book. The act of copying is the infringement. What use the copy is put to, if any, is entirely irrelevant to the infringement - it's not 'more legal' or 'less legal' because it was used for training.

When Chrome retrieves data from a server, it's not producing anything; just retrieving and rendering.

And when Grok retrieves data from a server, it's not producing anything, either.

But, note - your argument there would say that, if the AI companies were able to 'retrieve and render' something, then train based on it, there would be no infringement. The actual training is Fair Use, after all.

So, if they merely 'retrieved and rendered' something from Libgen, that would be fine based on your own view of the legality of Chrome. If they retained it, though, that would amount to creating an illegal copy (and, yes, they may well have retained it and should be held liable for doing so).

An LLM is absolutely either producing copyrighted material or retrieving copyrighted material it was trained on without license.

That is factually incorrect. It may be producing noncopyrighted material (an enormous amount of input training data is not under copyright), it may be retrieving data from a public web server (for instance, Grok, in the case of Harry Potter), or it may be retrieving copyrighted material it was legally trained on. Each of those is entirely legal.

Or consider Google itself, rather than Google Chrome. A Google image search will produce a mountain of copyrighted material quickly, right there on your screen. Is Google liable for infringing e.g. Disney's copyright because you can search for and find pictures of Iron Man, right there in your browser? Are you? Or is the person who posted the picture of Iron Man (in violation of Disney's copyright) liable?

Mmm, not sure that's the case? If you start selling a raspberry pi with an SNES emulator on it, loaded up with a few hundred ROMs of various games, I'm fairly confident you can get nailed for infringement even if you yourself never turn on one of your devices.

In that case, the person is selling copyrighted material directly. None of the AI companies are selling copyrighted material; they are, at most, selling a set of weights which are transformative works based on copyrighted material - and that is if you have to pay for the model itself and then download it, not use the tool. So, the analogy thoroughly fails.

But I will also agree with you. If e.g. the makers of Grok, ChatGPT, etc. were to say 'Hey, pay for our AI and you can produce copies of Harry Potter, Iron Man, and other copyrighted works!' that might well be legally actionable. But that doesn't appear to be something they've ever done. They're not in the business of selling access to copyrighted works. I doubt anyone at all has ever subscribed to any of the AIs with the goal of obtaining copies of books or other copyrighted material - especially since there are far easier ways of obtaining that same material via normal web searches with no AI involved, and you don't have to wonder whether the AI will give you an incorrect copy.

Replies:   julka
julka

@Grey Wolf

A Google image search will produce a mountain of copyrighted material quickly, right there on your screen.

Okay. You know what it's not doing? Producing new images using copyrighted material, as we can see Grok doing in the twitter post I linked upthread.

the makers of Grok, ChatGPT, etc were to say 'Hey, pay for our AI and you can produce copies of Harry Potter, Iron Man, and other copywritten works!' that might well be legally actionable.

It's not necessary to make the statement, just to do the thing. If you sell access to "a video server with home videos on it" and the server also happens to have copies of recent Hollywood movies, you're infringing despite only advertising that you have home movies. If the "transformative set of weights" can reproduce the copyrighted material (and we have seen that it can), then I fail to see how the LLM isn't generating copyrighted material - it's literally outputting it.

edit:

So, if they merely 'retrieved and rendered' something from Libgen, that would be fine based on your own view of the legality of Chrome.

Woah, hang on - you're doing something very very sneaky here. Sometimes you're talking about "they" as in the tool and sometimes you're talking about "they" as in the developers and then you're equating two non-equivalent things. The developers of Chrome are not held liable for what Chrome is used to render; the developers of e.g. Llama or OpenAI are absolutely liable for what they use to train Llama, and should be liable for what Llama produces as a result of that training. Remember, copyrighted content wasn't consumed in the production of Chrome.

Replies:   Grey Wolf
Grey Wolf

@julka

Sometimes you're talking about "they" as in the tool and sometimes you're talking about "they" as in the developers and then you're equating two non-equivalent things.

I am talking about 'they' as the developers of the AI model. There are multiple 'tools' here. The 'tool,' in this case, is the tool used to train the AI. By your argument, the developers of the training tool and of the resulting AI are not liable for using anything they could 'retrieve and render'. If Chrome can do it legally, even without copyright permission, so can the training tool.

the developers of e.g. Llama or OpenAI are absolutely liable for what they use to train Llama

I partially agree with you, but the point here is that your own argument does not agree with you. If 'retrieve and render' is legal, 'retrieve and render' is legal, whether that is for display or training purposes. And that is consistent with case law as well: transiently referencing things that can be downloaded (whether or not the material was placed on the internet legally) for training or display purposes seems to be legal, but downloading the material and holding onto it does not seem to be legal. But that is the infringement, right there: downloading and retaining copyrighted material without authorization. What use, if any, that material is put to is entirely irrelevant to the infringement - the infringement is retaining it. If the material is used in otherwise legal ways (e.g. training an AI), there is no additional infringement.

and should be liable for what Llama produces as a result of that training

Why, when that contradicts decades of case law and is inconsistent with US Fair Use doctrine? That would make the developers of a photocopier liable for what the photocopier outputs, the developers of a VCR liable for what the VCR outputs, and so forth.

Yes, I know, 'the photocopier was not trained.' That's irrelevant, since training itself is already determined to be legal even when copyrighted material is used. Why shouldn't this be regulated with respect to the user, not the tool, as every other copyright-infringing technology is regulated? Why should we flip precedent on its head and say that, in this case only, developing a tool using legal means, consistent with the Fair Use doctrine, is nonetheless legally actionable because there is a possibility that some user will use it to infringe copyright?

It is almost unquestionably the case that a far higher percentage of VCRs were used to store and retain copyrighted material than AIs are used to generate (much less retain) material in violation of copyright. It seems extremely unlikely that the majority of uses of any AI involve requesting pages from Harry Potter (or any other book) instead of just going to Chrome and downloading them there. If VCRs were 'Fair Use' - notwithstanding that they were unquestionably used to copy massive amounts of copyrighted material and were marketed as tools to do so - why in the world would AIs not be?

Remember, copyrighted content wasn't consumed in the production of Chrome.

That's both irrelevant and unknown. It's unknown (at least in my opinion) because I have no way of knowing whether any developer of Chrome ever downloaded and retained copyrighted material to use for testing Chrome. I could tell you stories about non-AI software products developed by Fortune 500 companies in which the datasets used to test those products consisted of illegally obtained copyrighted material. Are those products illegal?

It's irrelevant because it's legal to use legally obtained copyrighted content to train AIs. There is already case law to that effect. It's also consistent with Fair Use doctrine. Merely using copyrighted content to train the AI does not make the AI model infringing by nature. The only way to judge infringement is on the output, and the output is requested by the user, not the developer.

And, stepping back, the purpose of copyright (in the US) is 'to promote the progress of science and the useful arts.' Barring the use of lawfully obtained copyrighted material in training AI models clearly hinders the progress of science and the useful arts. Keeping those goals ('promot[ing] the progress of science and the useful arts' and preserving the copyright holder's interest in benefiting from their creativity) in balance is partly why the US has the Fair Use doctrine, and it's why courts have ruled that using such material in training is legal.

Replies:   julka
julka

@Grey Wolf

By your argument, the developers of the training tool and of the resulting AI are not liable for using anything they could 'retrieve and render'. If Chrome can do it legally, even without copyright permission, so can the training tool.

If you're going to apply my argument in a different circumstance, at least apply it in the same way. When you take an argument I make about Chrome The Browser and try to use it to draw conclusions about OpenAI The Developers, it's intellectually dishonest of you and it makes me deeply uninterested in continuing this conversation. Be better than that - you know that computer programs aren't people, so if you're going to argue they should be treated the same, you should back that up with your own words instead of jamming them into mine.

Nothing else you write addresses my core point of "LLMs are trained using copyrighted data and can also produce that copyrighted data on demand", and no matter how many times you argue that copyrighted data can be copied in many different ways, you still don't fundamentally explain why it's okay for a piece of software to a) use stolen data and b) reproduce that stolen data on demand. And now that you're just inventing shit about what I'm saying, I'm pretty done engaging with you on this.

Replies:   Grey Wolf
Grey Wolf

@julka

When you take an argument I make about Chrome The Browser and try to use it to draw conclusions about OpenAI The Developers, it's intellectually dishonest of you and it makes me deeply uninterested in continuing this conversation.

You're twisting my argument into a pretzel, so I'll make the same request: at least apply it in the same way. I'm arguing that the tools - Chrome, and the training software used by e.g. OpenAI - have an obvious parallel. I'm not referring to the developers (human beings) at all. How does a human being 'retrieve and render' anything, exactly? How could that even be an argument? What made you think I was referring to human beings (the developers) and not the tools they were using?

Your refusal to acknowledge an obvious parallel in behavior by tools is intellectually dishonest. 'Retrieve and render' is the same thing regardless of whether it's used for display or training. If you disagree, actually give a reasoned argument.

LLMs are trained using copyrighted data and can also produce that copyrighted data on demand

That's factually incorrect in the way you're stating it, and I've already addressed it repeatedly. LLMs are partly trained using copyrighted data, partly using non-copyrighted data. They can produce some probabilistic subset of that copyrighted data on demand. Sometimes that subset is as high as perhaps 10-15% of a document correct in short stretches (excepting 'I learned a link to this thing, and I'll regurgitate what's at that link for you' - Google Search can do that, and it's not infringing when it does so). Sometimes it's tenths of a percent.

Your phrasing is intellectually dishonest. If you had phrased it as 'LLMs are partly trained using copyrighted data and can also produce some of that copyrighted data upon request, but often garble it,' that would be much more honest. It would obviously make it hard for you to argue a case as to why LLMs require extraordinary legal scrutiny when things that reliably produce exact copies of copyrighted data are wildly available and do not face such scrutiny, but it would at least be honest.

you still don't fundamentally explain why it's okay for a piece of software to a) use stolen data

Never said it was. The use of stolen data in training is a copyright violation and can be prosecuted as such. The model doesn't 'use stolen data', though. Training the LLM on copyrighted data is legal regardless of whether the data was obtained legally or not - that has already been determined in court. If you're trying to split hairs as to whether 'legal' means 'okay', then that's fine, but don't pretend it's not legal. And my argument as to why it's 'okay' is based both on your own 'retrieve and render' argument (ephemeral use of copyrighted data appears to be legal if no copy is retained) and upon the Fair Use element of using something for a transformative purpose (training an LLM is nearly the definition of a transformative purpose). You still haven't made a case as to why it's 'okay' to massively hamstring progress because of the mere possibility that some users will ask for copyrighted material to be produced, especially since that very same copyrighted material is widely and trivially available in the absence of LLMs.

b) reproduce that stolen data on demand

That is a function of the user. A photocopier can do the same thing. So can the VCR. There's nothing particularly interesting about an LLM doing it, and both of them are much better at that task. Using an LLM to reproduce Harry Potter is crazy - it leaves you the task of going through it word for word and looking for the incorrect passages (which, according to research, will be somewhere between 58% and 79% of the 50-token sequences in the resulting output). It actually is impressive that it gets that close, but as tools for reproducing copyrighted text, LLMs are miserably bad.

You're basically arguing that, because this specific thing could be used for copyright violations, it should fall into some new and novel area of law, notwithstanding that devices actually used in widespread ways to copy copyrighted materials do not fall into that area of law. That sort of claim is extraordinary, but you have no particular argument in favor of it except that 'it's trained on copyrighted data!' But that's legal.

now that you're just inventing shit about what I'm saying

No, I'm not. You're just refusing to read what I'm writing and replying to it as if I said something I was not saying. Meanwhile, you 'invent shit about what I'm saying' repeatedly. Pot, meet kettle.

Replies:   julka
julka

@Grey Wolf

I'm not referring to the developers (human beings) at all. How does a human being 'retrieve and render' anything, exactly? How could that even be an argument? What made you think I was referring to human beings (the developers) and not the tools they were using?

I thought that because you said

So, if they merely 'retrieved and rendered' something from Libgen, that would be fine based on your own view of the legality of Chrome.

And then followed it up with

I am talking about 'they' as the developers of the AI model.

I agree it's a bad argument; that's why I thought it was stupid that you made it. If you want to accuse me of making stuff up because I pointed out your bad argument, consider not making bad arguments.

Edit: okay, fine, I'll try one more time.

There are multiple forms of copyright infringement. One way to do it is by using a copyrighted work in a non-permitted way; as you have observed, courts ruled that training an LLM is fair use, so that doesn't apply here.

Another way to infringe copyright is by distributing the work without permission. When a photocopier makes a copy of a book, it's not distributing that copy; that's done by some other entity. Similarly, a web browser doesn't distribute the data it renders, even if that data is copyrighted; the distribution is being done by the server serving the data (and, by extension, whoever uploaded the data to the server and made it available).

When an LLM reproduces copyrighted works, I am viewing that as a distribution of that copyrighted work, in much the same way as a web server configured to provide it. The mechanism of that distribution is different, but I don't feel that it is meaningful to the ultimate outcome of "receiving a copyrighted work that was distributed by somebody who does not have the rights to distribute it". In this way, both parties are at fault - the one who requested the work, and the one who provided it. We can see this sort of precedent in individuals who were fined for copyright infringement after e.g. operating a torrent tracker - they made copyrighted works available on request, and that's not allowed.

And since the server does not do anything by itself, blame should travel up to whoever configured it and made the copyrighted works available; since the works are available as a result of having been included in the training data, it is that inclusion (and whoever made the call to include it) that I feel is liable for the infringement. The training was fair use, but the outcome of the training, in terms of reproducing and distributing copyrighted works, is not fair use - you don't get to break the law and argue that it's fine because the steps you took to break the law were, by themselves, legal.

Replies:   Grey Wolf  Joe Long
Grey Wolf

@julka

When an LLM reproduces copyrighted works, I am viewing that as a distribution of that copyrighted work

That makes no sense to me. The user of the AI does any 'distributing' that happens.

By analogy, opening a book in a library is just as much 'distributing' a copyrighted work as an AI outputting something. The user of the book / AI determines whether there's actual distribution. The owner / provider of the book / AI does not. Thus, liability flows to the user, not the owner / provider.

Similarly, a web browser doesn't distribute the data it renders

It 'distributes' it to exactly the same level as the AI does. The web browser renders work (copyrighted or not) on some sort of output device. So does the AI. They 'distribute' to the exact same level. The data is in the model or on the server serving the data; the output is 'distributed' to the same extent.

In this way, both parties are at fault - the one who requested the work, and the one who provided it.

There is no illegality on the part of the AI maker. They (legally) take a mix of copyrighted and non-copyrighted works and create a new transformative work based on those works. That new transformative work (the model) may, at times, produce some subset of the copyrighted works, but there's no reason to believe that's illegal. Providing a subset of a copyrighted work is legal and commonly done, after all.

Again: there are far more infringing sorts of technology, and they don't fall under such new and novel legal scrutiny. Your torrent example is perfect. Why should torrenting software be 'just fine' but AIs be under some wildly higher level of scrutiny?

The training was fair use, but the outcome of the training, in terms of reproducing and distributing copyrighted works, is not fair use

This makes no sense from a practical standpoint and is contradictory to the purposes of the US copyright system. What sense is there in the argument that 'You can legally build this very cool thing, and you can legally use these inputs to build it. That's fine! But use the thing? That's illegal!'?

It's not going to help your case to claim that the tool 'contains' copyrighted information, by the way. A library 'contains' an enormous amount of copyrighted information, yet no one argues that libraries should be sued out of existence if they also provide photocopiers, or if librarians don't constantly oversee every user of the library to make sure those users aren't writing out the content of the books. If 'providing' an AI, the use of which might produce copyrighted material, should be strictly regulated, a library should be under far higher scrutiny, since nearly everything it 'provides' is copyrighted material.

Suppose 100% of the AI's output is copyrighted. That's exactly the same as the output of someone pulling a book off the shelf and looking at a page of it. The user determines whether that material is further distributed or not, not the provider of the book or the provider of the AI output.

Meanwhile, the library's copyrighted book collection has no purpose other than to provide copyrighted information to users, while the AI has many other purposes that don't involve providing copyrighted information. Your argument is analogous to saying that it's just fine for a library to amass copyrighted books. Letting anyone look at them, though? Nope! It might cause infringement!

you don't get to break the law and argue that it's fine because the steps you took to break the law were, by themselves, legal.

Which is why it makes far more sense to say that the AI / AI tools / etc are not breaking the law. Without upending decades of legal precedent (which, certainly, the courts could do), it's obvious that they're not breaking the law.

Again, photocopiers and VCRs are far more designed to infringe copyright than AIs are, but they don't fall under this new and novel legal theory. Use a VCR in the way it was intended to be used and marketed to be used and you will infringe on copyright in nearly all cases (less so with photocopiers, but copying from books and magazines was a major use case for decades). That may be Fair Use - most such uses actually turn out to be - but they're still copying copyrighted material without permission.

Use an AI in the way that it's intended to be used and marketed to be used and you might, maybe, potentially infringe on some copyright. Tell it to do something that might violate copyright and you increase the odds of it doing so, but it still might fail to do so.

Explain to me again why the tool that is far less likely to infringe, one with an enormous variety of noninfringing uses (ones it's designed for and marketed for), should fall under higher legal scrutiny than the tool that is much more likely to infringe, because I'm just not seeing the logic in that approach.

Replies:   julka
julka ๐Ÿšซ
Updated:

@Grey Wolf

It 'distributes' it to exactly the same level as the AI does. The web browser renders work (copyrighted or not) on some sort of output device. So does the AI.

See, this is what makes no sense to me. A browser is a thin program running on an endpoint device that requests data from a server and renders it. ChatGPT/Llama/the vast majority of LLMs are not running on endpoint devices; they are running in a separate location and communicating back to the endpoint via the browser. The LLM doesn't just render information it requests from somewhere else; it is generating text based on the inputs and its training weights, and then providing that text (or image or whatever) back to the browser that requested it. I don't see any similarities between what a web browser does and what an LLM does, and I see plenty of similarities between what a web server does and what an LLM does.
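To make the two shapes concrete, here's a minimal sketch (Python, standard library only; the LLM endpoint URL and JSON shape are made up for illustration, not any vendor's real API):

import json
import urllib.request

# Browser-style: fetch bytes that already exist on someone's server and
# render them locally. The client displays; it doesn't create anything.
page = urllib.request.urlopen("https://example.com/page.html").read()

# LLM-style: send a prompt to a remote service; the *server* generates
# new text from its model weights and ships it back. The endpoint and
# payload here are hypothetical.
req = urllib.request.Request(
    "https://llm.example.com/v1/generate",
    data=json.dumps({"prompt": "Summarize Moby-Dick"}).encode(),
    headers={"Content-Type": "application/json"},
)
reply = json.loads(urllib.request.urlopen(req).read())
print(reply["text"])  # text that didn't exist until the server produced it

One fetches, one generates; that's the whole difference I'm pointing at.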

What sense is there in the argument that 'You can legally build this very cool thing, and you can legally use these inputs to build it. That's fine! But use the thing? That's illegal!'?

Plenty of sense? You had it right above that quote.

Why should torrenting software be 'just fine' but AIs be under some wildly higher level of scrutiny?

I'm proposing the same level of scrutiny! Operating a torrent tracker is legal; operating a torrent tracker that serves copyrighted data is illegal! Operating an LLM is legal! Operating an LLM that returns copyrighted data on request should be illegal! There's no problem (according to the courts) with legally building a cool thing, whether that cool thing is a torrent tracker or an LLM; that doesn't mean that unrestricted operation of the torrent tracker or LLM is fine. There are restrictions on what can be served by torrent, and there should be restrictions on what can be served by the LLM.

Edit:

the library's copyrighted book collection has no purpose other than to provide copyrighted information to users

Worth noting here that this is a clever red herring; a library's collection of copyrighted books is, as you've noted, governed by copyright law and one of the rights you get under that law is the right to lend or resell your physical copy of a book. The reason nobody tries to sue a library out of existence is because they are doing something which is explicitly legal; digital copies of works do not have the same rights associated with them and so the library comparison isn't nearly as apples-to-apples as you seem to be implying.

Replies:   Grey Wolf
Grey Wolf ๐Ÿšซ

@julka

I'm proposing the same level of scrutiny!

No, you're not. Everything else places the liability on the user. You are placing the liability on the provider. You are, by analogy, arguing that libraries should not exist because a user could copy the books.

Operating an LLM that returns copyrighted data on request should be illegal!

Then operating a web search service (Google, not a browser) that returns copyrighted data on request should be illegal, no? Yet you're not campaigning against Google, just the new technology. And Google (not the browser) is far better at returning copyrighted data on request than any currently extant LLM.

Operating a torrent tracker is legal; operating a torrent tracker that serves copyrighted data is illegal!

This is incorrect. One can operate a torrent tracker freely, whether that tracker returns torrents that are copyrighted or not copyrighted. The liability is completely on the user of the torrent, not on the tracker operator.

a library's collection of copyrighted books is, as you've noted, governed by copyright law and one of the rights you get under that law is the right to lend or resell your physical copy of a book.

That's not the argument I'm making. I'm making the argument that the library provides a photocopier, which can be used to copy any of those books (without lending or reselling). By your argument, one should not be allowed to operate a library if, at any time, anyone infringes copyright while reading or borrowing a book. After all, the library 'returned copyrighted data on request,' did it not? And that's your standard above for what should not be allowed, is it not?

Why is it fine for the library to 'return copyrighted data on request', and for Google to 'return copyrighted data on request', but not for an LLM? Particularly because, of the three, the one with the lowest chance of returning copyrighted data is the LLM?

The reason nobody tries to sue a library out of existence is because they are doing something which is explicitly legal

Then what the AIs are doing is also explicitly legal. You can't have it both ways. The AI was legally trained with resources, and it is providing those resources. The library was legally stocked with resources, and it is providing those resources.

digital copies of works do not have the same rights associated with them

So, in theory, it's fine if the AI was trained by chopping up physical books (as some were) but not if it was trained by using electronic copies of the exact same book?

I agree - the current licensing model for electronic books is a scam, and electronic books should have the same rights associated with them. But, regardless of that, your argument seems to have shifted to the exact sourcing of the material used for training, regardless of whether it is the same material or not.

So, let's try that theory. Under that theory, an AI that was trained by 'reading' (scanning, etc) physical books is fine. No legal issues. But an AI that was trained by 'reading' electronic copies of books is potentially infringing and problematic.

And a library which only provides physical books inherits some sort of protection from contributing to copyright infringement, notwithstanding that it is the library which knowingly and intentionally distributed the copyrighted material which was then infringed. But a library which provides ebooks (as many US libraries do) should be illegal and shut down if any of those ebooks is ever copied, in whole or part - even a tiny part! - by so much as a single user of that ebook.

Does that make sense to you? Because it makes no sense to me.

And I'll go back to a point you never responded to. The purpose of copyright, in the United States, is to 'promote the progress of science and the useful arts.' However, the entire thesis of your argument is that we should use copyright as a weapon to halt the progress of science and the useful arts, lest 'science and the useful arts' produce a thing that can (inaccurately) reproduce some small subset of a copyrighted work upon demand.

So, I will repeat: why is copyright a justification for going after LLMs, and not for banning torrent send/receive software (not trackers), VCRs, and other things that are far more likely to be used to violate copyright than LLMs are? The average LLM is used for a far higher percentage of non-infringing uses than the average torrent software is. Why should torrent software not be subject to the same level of scrutiny as LLMs (e.g. liability for the torrent software maker, not just the user)?

Why are you so concerned with the creation and operation of software with a very low probability of meaningful copyright infringement (LLMs) and totally fine with the creation and operation of things associated with far higher rates of copyright infringement (torrent software, VCRs, photocopiers, etc)? Or libraries, for that matter? In none of those cases are you claiming the provider/manufacturer/etc should carry liability, only the user who uses the tool to infringe.

Replies:   julka
julka ๐Ÿšซ
Updated:

@Grey Wolf

One can operate a torrent tracker freely, whether that tracker returns torrents that are copyrighted or not copyrighted.

No, dude, that's just not true. A trivial search brought up Artem Vaulin [1], who was indicted after founding and running a massive torrent tracker - "the core theory underlying the indictment is that Vaulin aided, abetted, and conspired with users of his network to commit criminal copyright infringement in the United States". Kim Dotcom got indicted for, among other things, copyright infringement as part of operating Megaupload. Liability does not fall purely on the user; the operator of the platform has a responsibility to avoid serving illegal content. YouTube doesn't operate a team of people to respond to DMCA requests out of the goodness of their heart, they do it because it gives them safe harbor from copyright infringement. You are extremely wrong on this point, and you will learn that if you do like five minutes of research on this.

Why is it fine for the library to 'return copyrighted data on request', and for Google to 'return copyrighted data on request', but not for an LLM?

Because the library has the right to lend the physical copies of books they own; like I said, that's enshrined in the copyright law.

But a library which provides ebooks (as much US libraries do) should be illegal and shut down if any of those ebooks is ever copied, in whole or part - even a tiny part! - by so much as a single user of that ebook.

When a library is loaning out an ebook, that's because they've licensed the file (at some cost!) from the publisher. It's a problem for libraries, because the licenses are expensive! Sometimes publishers refuse to license the ebooks at all, and then libraries can't lend them out. The library is covered because they are acting within the terms of their license, which I assume helps indemnify them against what a user does with the file on loan.

If OpenAI et al. are negotiating licenses with publishers or paying a fee to the publisher every time their model returns copyrighted content, then okay! But I don't think that's what's happening. And, of course, an LLM is not a library and is not lending content temporarily, so any comparison you want to make with a library should take into account the fact that they are fundamentally different and doing fundamentally different things, and when you do something different the law is different. That's a key point that I feel like I need to make here.

The purpose of copyright, in the United States, is to 'promote the progress of science and the useful arts.'

Okay, we can talk about that, but first let's finish the quote! You chopped it in the middle. It continues,

by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.

Authors are granted, by copyright, the exclusive right to their writings and discoveries. If LLMs can only function by being trained on copyrighted works and then the cost of that is that the LLM will occasionally reproduce copyrighted works, those authors are having their rights denied. If you can't build a technology without denying people their rights, build something else. If you can build the technology in a way that doesn't deny people their rights, build it that way.

[1]: https://www.hollywoodreporter.com/business/business-news/judge-rules-kickasstorrents-founder-properly-charged-criminal-copyright-conspiracy-1026890/

Replies:   jimq2  Grey Wolf
jimq2 ๐Ÿšซ

@julka

When I check out an ebook from the local library, I can't copy it to a second reader. After 3 weeks it disappears off my reader if I don't renew it. I made the mistake of starting it on my desktop computer then wanted it on my tablet. I had to return the ebook so it got deleted from my desktop, and then check it out a second time on my tablet.

Grey Wolf ๐Ÿšซ

@julka

A trivial search brought up Artem Vaulin [1], who was indicted after founding and running a massive torrent tracker - "the core theory underlying the indictment is that Vaulin aided, abetted, and conspired with users of his network to commit criminal copyright infringement in the United States".

That's not just 'operating a torrent tracker'. That's operating a criminal conspiracy (see your own article!) that happened to include a torrent tracker. Merely operating a tracker is legal.

Kim Dotcom got indicted for, among other things, copyright infringement as part of operating Megaupload.

Megaupload hosted copyrighted files. Torrent trackers themselves do not host copyrighted information - they just provide pointers to where it's hosted. A tracker, by itself, hosts less copyright-infringing information than Google does.

Liability does not fall purely on the user, the operator of the platform has a responsibility to avoid serving illegal content.

That is true. But torrent trackers, by themselves, do not serve illegal content. You may be getting confused between a tracker (which points to torrent files - themselves pointers to content) and a torrent client (which actually serves the content pointed to by the torrent file).
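To make the split concrete, here's a rough sketch of which side holds what (grossly simplified, and the names are mine, not the BitTorrent spec's):

# A tracker maps an info_hash (a fingerprint of a torrent's metadata) to
# a list of peer addresses. It stores and serves *no file content*.
tracker_db = {
    "info_hash_abc123": [("203.0.113.5", 6881), ("198.51.100.7", 6881)],
}

def tracker_announce(info_hash, peer_addr):
    # Everything a tracker does: remember this peer, return the others.
    peers = tracker_db.setdefault(info_hash, [])
    if peer_addr not in peers:
        peers.append(peer_addr)
    return [p for p in peers if p != peer_addr]  # addresses only

# The *client* is what actually serves pieces of the file to other
# clients. The bytes below are a placeholder; real clients read from disk.
def client_serve_piece(piece_index):
    return b"...bytes of the shared file..."

The copyrighted bytes only ever pass between clients; the tracker never touches them.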

Youtube doesnt operate a team of people to respond to DMCA requests out of the goodness of their heart, they do it because it gives them safe harbor from copyright infringement.

YouTube hosts content. Torrent trackers do not.

You are extremely wrong on this point, and you will learn that if you do like five minutes of research on this.

If so, none of your comments show that I'm wrong. They just show that you don't understand what a torrent tracker is. If you could actually point to something that says I'm wrong, that might help. I've done far, far more than five minutes of research on this and I'm quite confident that I'm correct.

Because the library has the right to lend the physical copies of books they own; like I said, that's enshrined in the copyright law.

And the LLM (operator) has the right to use the transformative work (the model) they legally made in order to answer queries, thus far. That could change, but there is currently nothing to indicate that they cannot use that transformative work for the legal purposes for which it was intended.

And I wasn't referring to lending the physical copies. I was referring to allowing the physical copies to be used to create infringing copies. If the LLM operator has the responsibility to prevent their legal transformative work from being used to create infringing copies, why do librarians not have the responsibility to prevent their legal copies from being used to create infringing copies?

You're dodging the entire point.

When a library is loaning out an ebook, that's because they've licensed the file (at some cost!) from the publisher. It's a problem for libraries, because the licenses are expensive!

I'm well aware of that. It has nothing to do with my point.

The library is covered because they are acting within the terms of their license, which I assume helps indemnify them against what a user does with the file on loan.

Not in any ebook license I've seen, but if you can point to one that does that, I'd be interested in seeing it. Seems to me that would be a 'go right ahead - copy this ebook!' sort of license. Not something I expect publishers would want to have out there, but maybe they do.

Authors are granted, by copyright, the exclusive right to their writings and discoveries.

As long as that exclusive right serves the goal of advancing science and the useful arts. In this case, you're trying to say that exclusive right should be allowed to block the advancement of science and the useful arts. That's not why copyright is granted, and copyright used that way is in violation of the very reason we have copyright in the first place.

That's why we have a Fair Use doctrine - to balance the exclusive right of authors with the right of others to advance science and the useful arts. You're acting as if there is no such thing and copyright is absolute. It's not, and never has been, at least not in the United States. You're consistently pretending that, once one has a copyright on a work, that covers every possible use of it by anyone else. It doesn't.

If LLMs can only function by being trained on copyrighted works

An untrue statement, so anything that flows from it is also untrue.

those authors are having their rights denied

Not if the use falls under Fair Use, which (so far) it does.

And, repeating myself yet again: those authors' rights are being violated by the user of the LLM. You're attempting to argue that the provider of the LLM should face sanctions, or have their product shut down, because some minuscule percentage of users might choose to use their product in such a way as to infringe copyright. You're basically saying that e.g. Sony should not have been allowed to build VCRs because some percentage of users would copy entire copyrighted works without authorization, or Xerox should not have been allowed to make photocopies because some percentage of users would copy entire copyrighted works. In both of those cases, that's not what happened. Why is this different?

And, again, don't say 'because it contains copyrighted information.' That's already been determined to be legal, so that can't be a reason to treat it differently than any other potentially copyright-infringing technology.

You're still refusing to answer the question about why LLMs, which have a multitude of legal uses that do not involve infringing copyrights, should be held to a far higher level of scrutiny than torrent software (not trackers! again, those are legal), for which by far the most common use is to infringe copyright. If Meta should have to shut down their LLMs because someone might happen to use one to infringe copyright, why in the world should qbittorrent stay under active development when we know that a nonzero number of qbittorrent users use it to infringe copyright?

Replies:   julka
julka ๐Ÿšซ
Updated:

@Grey Wolf

Jesus christ dude come on.

https://www.justice.gov/archives/opa/pr/us-authorities-charge-owner-most-visited-illegal-file-sharing-website-copyright-infringement

Artem Vaulin, 30, of Kharkiv, Ukraine, was arrested today in Poland and is charged by criminal complaint, filed in U.S. District Court in Chicago, with one count of conspiracy to commit criminal copyright infringement, one count of conspiracy to commit money laundering and two counts of criminal copyright infringement.

Edit: yeah okay, I'm for real done now. Your rebuttals are at the level of "nuh uh" and I'm pretty tired of getting dragged down into discourse at the level of "here is how a library is not like an LLM" - you're refuting evidence with nothing to back you up, selectively ignoring sentences you don't like and selectively, misleadingly quoting sentence fragments so that you can misinterpret the fragment. This level of argument demeans you and I'm not interested in participating further.

Replies:   Grey Wolf
Grey Wolf ๐Ÿšซ

@julka

You keep trying to conflate file sharing sites and torrent trackers as if they're the same thing. That is the same argument as saying that Google is the same thing as the websites it indexes. Merely repeating an incorrect argument is not helping your case.

And the Justice Department posting you cited 1) lists charges filed, not any outcomes and 2) contains numerous factual errors. For instance, 'KAT has consistently made available for download' is as correct as saying 'Google has consistently made Harry Potter available for download'. KAT hosts nothing, in actual fact.

'Several motion pictures currently available for download and sharing on KAT'

There are zero motion pictures 'available for download and sharing' on KAT. That's not what it does, and saying something like that is counterfactual.

Had you actually read the link you posted, and if you had actually spent even five minutes understanding what torrent trackers are, what KAT is/was, and how this works together, you would have known the article was full of factual inaccuracies.

'The charges and allegations contained in the complaint are merely accusations.'

This, however, is at least a true statement. Since the complaint alleges things that are factually impossible, how could it be otherwise? Not only that, but what Vaulin was actually charged with in court - not in that article - was 'aiding and abetting', on the theory that offering links to content aided and abetted those who actually violated copyright. That's similar to charging Google with aiding and abetting because you can Google up sites offering Harry Potter and then download the text from those sites, but at least that's a viable thing for them to actually charge that a judge won't immediately throw out.

The case, as far as I can tell, has never gone to trial, so we have no idea if that legal theory will hold water.

Now, you do have a valid underlying point. 'Torrent trackers', as a technology, can be used to aid and abet illegal copying, in exactly the same way as Google can be used to aid and abet illegal copying. And, if Mr. Vaulin (or if Google) were to say 'Hey! Come to my site and you can find all the copyright-infringing content you want!' that might well count as criminally 'aiding and abetting.' Nor would I argue otherwise. A court might well find Mr. Vaulin guilty of aiding and abetting, if that was his business model and he was profiting from providing what amounts to links to infringing content.

But that doesn't make the tracker, as a technology, illegal. There are hundreds of trackers out there listing torrents both legal and illegal. There does not seem to be a viable argument that those trackers either 1) violate the law themselves, or 2) that the software used to run trackers should be banned. Yet torrent trackers are far more useful in aiding and abetting copyright infringement than LLMs are.

If Meta, OpenAI, et al were to say 'Hey! Come use our AIs and you can get copyrighted output!', I agree that would be legally problematic. But you haven't provided any evidence, or even an assertion, that they are doing such a thing. In the absence of them doing such a thing, there's no parallel to be made here.

you're refuting evidence

What evidence? If there was actual evidence, I would refute it (or, at least, attempt to). All you have is the assertion that copyright is absolute and without bounds (but only in the case of authors vs LLM developers) and supersedes everything else, with no evidence presented for that viewpoint. Oh, and that LLMs are far, far more dangerous to copyright than things specifically designed to copy things, even though LLMs do a lousy job at actually copying things. I'm still waiting for evidence on either point.

by selectively ignoring sentences you don't like

Care to reference any? This is an absurdly 'pot calling the kettle black' argument, as you selectively ignore words, sentences, and entire paragraphs you don't like over and over again.

This level of argument demeans you and I'm not interested in participating further.

Actually replying to your arguments, such as they are, is 'demeaning' to me, while completely misunderstanding what a torrent tracker is, making incorrect assertions as to what LLMs do, asserting copyright absolutism far beyond the limits of the law, ignoring dozens of things that contradict your argument, and making many other misstatements is demeaning to you? That's a very interesting take on things.

I might feel differently if you'd even deigned to offer counterarguments for most of my points, but you haven't. You just ignore them.

here is how a library is not like an LLM

I know how a library is not like an LLM. But a library is far better at enabling copyright infringement than an LLM is. That's the point you've completely ignored, all while trying to keep pointing out that libraries are not LLMs. Photocopiers aren't LLMs either, but they're also better at enabling copyright infringement than LLMs are. The point isn't somehow trying to claim that an LLM is a library, the point is saying that you're all in against LLMs while ignoring target after target that are far better at infringing copyright than LLMs are.

If my goal was to obtain an infringing copy of a copyrighted work, an LLM wouldn't be in my top five places to look. It probably wouldn't even be in the top ten. Why not worry about the much more likely paths to copyright infringement first? You feel like a police officer ticketing a red car for going one mile an hour over the speed limit while waving on cars of other colors zipping by far faster than the red car was going, simply because you've decided you hate red cars.

Your point seems to be that, even though LLMs are really lousy at infringing copyright and there are many more efficacious ways of getting the text of Harry Potter, and even though LLMs have an enormous number of noninfringing uses, and even though LLMs are not marketed or intended to infringe copyrights, and even though training them is legal, and even though they are transformative works, and even though it's the user, not the LLM provider, who determines whether the LLM is used for infringement, nevertheless they are some new and spectacularly awful threat to copyright that justifies taking far more action than was taken with respect to devices that were literally designed and marketed to copy copyrighted things. I have seen not a shred of evidence to justify such a view, and you keep dancing around that.

So, if you are still at all interested in supporting your point, that's where you might want to start. Give some actual, evidenced (or at least reasoned) argument as to why LLMs are uniquely disastrous for copyright, much worse than VCRs, photocopiers, torrent software, torrent trackers, file storage sites, Google, and on and on - none of which are adjudicated at the product supplier level and all of which are regulated at the user level.

Or admit that you have no evidence or reasoned argument and you're just a fan of copyright absolutism with this one technology (but not with dozens of other technologies, all actively being used to violate copyright). That's fine, too.

You hate LLMs. There's absolutely nothing wrong with that. But that doesn't amount to any sort of reasoning, much less evidence, as to why we should upend decades of copyright law just in the case of LLMs, and nothing else in this discussion honestly matters to that point.

Replies:   julka
julka ๐Ÿšซ
Updated:

@Grey Wolf

Holy hell you're good at nerdsniping me by saying confidently wrong shit. You dragged me back in just after I thought I was out. I have an idea, though, and maybe it'll work. One moment.

And the Justice Department posting you cited 1) lists charges filed, not any outcomes and 2) contains numerous factual errors. For instance, 'KAT has consistently made available for download' is as correct as saying 'Google has consistently made Harry Potter available for download'. KAT hosts nothing, in actual fact.

Okay seriously, you are wrong on this and you should just take the L.

You're arguing a few different things, and I will try to sum them up to show that I have read your argument and am engaging with it in good faith.

1) Saying that "KAT made [files] available for download" is similar to saying that Google makes files available for download, and Google is not illegal

2) Artem is charged with crimes, but has not been convicted of them; his guilt is not a settled matter

3) Artem was charged, not with copyright infringement, but with "aiding and abetting" copyright infringement by providing links, in the same manner as Google

For 1, I think my response is "eh, that's splitting hairs". Yes, technically a torrent tracker from which you download a torrent file does not host the file itself, but the claim here isn't "KAT hosted files", the claim is "KAT made files available for download" and look, I think you agree that that's exactly the point of a tracker. It organizes torrent files so that you can find the one you need and thus download the file you are looking for.

For 2, the answer is Yes, but it kind of doesn't matter? For the purpose of this conversation, we're not actually debating whether Artem is guilty of the actions he is alleged to have performed; we're debating about whether the alleged actions themself are illegal. Normally, you're correct; Artem has not been tried, the case has not gone to trial, and so his guilt is not a settled matter. As an argument for whether his charges represent crimes, though, I think a failure to go to trial on account of he jumped bail and is currently at-large is not especially compelling and I think you are likely to agree on that front as well.

For 3, you are for real for real just wrong on this, you keep on insisting you are not wrong, and you keep on being wrong. You argued that he was charged with a criminal conspiracy, or with "aiding and abetting", and just no, dude. That's not what he was charged with.

From at least as early as in or about November 2008, to on or about July 8, 2016, in the Northern District of Illinois, Eastern Division, and elsewhere, Artem Vaulin, also known as "tirm," defendant herein, conspired with others to: (1) willfully infringe, for purposes of commercial advantage and private financial gain, at least ten copies and phonorecords of one or more copyrighted works with a total retail value of more than $2,500 during a 180-day period, in violation of Title 17, United States Code, Section 506(a)(1)(A) and Title 18, United States Code, Section 2319(b)(1); and (2) willfully infringe, for purposes of commercial advantage and private financial gain, a copyright by distribution of a work being prepared for commercial distribution, by making it available on a computer network accessible to members of the public, when defendant knew and should have known that that work was intended for commercial distribution, in violation of Title 17, United States Code, Section 506(a)(1)(C) and Title 18, United States Code, Section 2319(d)(2), all in violation of Title 18, United States Code, Section 371.

Yes, he was charged with a criminal conspiracy, but that has to be in service of some actual crime! It's not illegal to just conspire - in this case, he conspired to infringe copyright by distribution. You pointed out that he didn't get arrested for operating a torrent tracker, he got arrested for a conspiracy of which the tracker was a part; yes! Exactly! You cannot operate a torrent tracker that serves copyrighted data, it's a crime, that's what I've been saying!

Now, if we look back at point 1 briefly, you'll see that a comparison to Google is made - I'mma set that aside for one second, if you'll bear with me, but I promise I'm coming back to it.

Okay, here's my idea. I'm going to describe a black box, which takes an input and provides an output. The way the black box works is, in my opinion (and I hope, in yours) immaterial - all we're concerned with is the input and the output.

The black box takes an input: you can interact with the box by providing a query, in human-readable language.

The black box provides an output: It generates information related to that query; in some (but not all!) cases, that information is copyrighted. The black box, and its operator, do not have permission to distribute that copyrighted material.
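If it helps, here's the same black box pinned down as a sketch (hypothetical names, just to fix the interface I mean):

from typing import Protocol

class BlackBox(Protocol):
    # Anything that takes a human-readable query and returns generated
    # output. How it works inside is deliberately out of scope.
    def query(self, text: str) -> str:
        """Return information related to `text`; in some (but not all)
        cases that information is copyrighted material the operator has
        no permission to distribute."""
        ...

Now let's run some services against that interface.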

Does this black box describe a library? No. A library distributes copyrighted material in response to queries, but crucially, they have permission.

Does this black box describe a file sharing site, like MegaUpload? Yes, it does. Megaupload was, in the eyes of the justice department, super illegal. The ones which are up and not in the process of being investigated and/or seized by the feds are adjudicated at the product supplier level - they process and respond to DMCA claims, which gets them safe harbor from infringement suits, and so the remaining adjudication happens at the user level.

Does this black box describe a file sharing program, like Limewire or Napster? Yes, it does. Also illegal, though - those got shut down super hard. I'm sure some of these still exist, but look - the law is not a perfect machine.

Does this black box describe a video hosting site, like Youtube? Yep, absolutely. Youtube is adjudicated at the product supplier level for copyright infringement - they process and respond to DMCA claims, which gets them safe harbor from infringement suits, and so the remaining adjudication happens at the user level.

Does this black box describe a torrent tracker that is not KAT, since that one got seized by the feds? Yep, absolutely! Those torrent trackers are adjudicated at the product supplier level for copyright infringement - they process and respond to DMCA claims, which gets them safe harbor from infringement suits, and so the remaining adjudication happens at the user level.

Does this black box describe torrent software? It doesn't. Torrent software requires you to already have a torrent file, it doesn't process human language queries.

Does this black box describe a photocopier? No. It doesn't. I am not going to explain why.

Does this black box describe a VCR? No. It doesn't. Again, I'm not going to tell you why. If you don't know, I want you to give me your best guess.

Does this black box describe a web browser, like Firefox or Google Chrome? No it does not. Again, if you don't know why, tell me your best guess and then I'll give you my answer.

Does this black box describe a search engine, like Google? Yep, absolutely. Google is adjudicated at the product supplier level for copyright infringement - they process and respond to DMCA claims, which gets them safe harbor from infringement suits, and so the remaining adjudication happens at a user level.

Does this black box describe popular LLMs like ChatGPT? Yep, absolutely.

I'm not arguing that ChatGPT et al. should be subject to higher standards than comparable technologies - I'm arguing that they should adhere to the same standards everybody else is already meeting. You keep on saying those things aren't adjudicated at the product supplier level, but they absolutely are! It's just that that level of adjudication is baked in by legal processes implemented by the supplier.

For those other suppliers that aren't an LLM, though, the space of "content we might potentially return" is constantly mutating because they allow user-submitted data. Okay, that makes moderation a hard problem, but look, they do their best and that's why they have to rely on user reports. The system sucks, but I understand why it works the way it does.

For an LLM, the space of "content we might potentially return" isn't constantly mutating - it only mutates as a result of changes that the supplier explicitly decides to implement. They have absolute control over how the LLM is trained, and so I don't see why they shouldn't be held absolutely responsible for what it returns. If it returns copyrighted data, they should make it so it doesn't do that. Obviously if somebody feeds a copyrighted work into the context and asks to get it back, they'll get it back, but now we're actually comparing an LLM to a photocopier and I'll agree that if you feed it something and ask for that something back, no copyright crime has been committed.
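To sketch the sort of output check I have in mind (a toy shingle-overlap filter with a stand-in corpus; a real system would need something far sturdier than this):

def ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# Stand-in for an index of works the operator knows it may not distribute.
protected_index = ngrams("the boy who lived ... placeholder for full text ...")

def looks_like_verbatim_copy(output, threshold=0.5):
    # Flag output sharing a large fraction of 8-word shingles with a
    # protected work: a crude proxy for verbatim reproduction.
    grams = ngrams(output)
    if not grams:
        return False
    return len(grams & protected_index) / len(grams) >= threshold

If the check trips, don't return the output. The supplier controls both the training set and this gate, which is exactly why I think the responsibility sits with them.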

Replies:   Grey Wolf
Grey Wolf ๐Ÿšซ

@julka

Yes, technically a torrent tracker from which you download a torrent file does not host the file itself, but the claim here isn't "KAT hosted files", the claim is "KAT made files available for download" and look, I think you agree that that's exactly the point of a tracker.

I don't agree, any more than Google 'makes files available for download' or the phone book 'makes companies available to take your calls.' The software (torrent client) on the other end makes the files available for download. The tracker merely tells you which torrent client to talk to. It doesn't make anything available to download. It hosts nothing. If there is no torrent client out there, the tracker can't give you anything to download.

As an argument for whether his charges represent crimes, though, I think a failure to go to trial on account of he jumped bail and is currently at-large is not especially compelling and I think you are likely to agree on that front as well.

Not necessarily. The notion of him 'aiding and abetting' via running a torrent tracker and search engine is a novel legal theory. We have no idea if it would hold up in court. I don't know if that charge applies to what he did. And his jumping bail may simply be that he's a non-US citizen and felt like it would be a show trial - who knows?

For 3, you are for real for real just wrong on this, you keep on insisting you are not wrong, and you keep on being wrong. You argued that he was charged with a criminal conspiracy, or with "aiding and abetting", and just no, dude. That's not what he was charged with.

Almost all of the charges for which he was actually indicted were 'aiding and abetting', but I'll agree there were others. We have no idea if those charges apply to the circumstances, though. As has been said repeatedly, 'Any good prosecutor can get a grand jury to indict a ham sandwich.' There's not necessarily any reason to believe Vaulin's actions actually constitute copyright infringement - again, there is no evidence that he and his companies actually hosted any copyrighted content at all.

You cannot operate a torrent tracker that serves copyrighted data, it's a crime, that's what I've been saying!

There is (essentially) no such thing as 'a torrent tracker that serves copyrighted data' (unless one puts copyrighted data in the torrent description, but that's not a lot of data). Torrent trackers are phone books. They say "Hey! Want Harry Potter? Go talk to X, Y, and Z!" That's all they do. They don't host Harry Potter (or anything else). You're still misconstruing what a torrent tracker does and trying to claim it serves actual content. It does not.

First, I disagree about your black box analogy with respect to e.g. libraries. You say: 'Does this black box describe a library? No. A library distributes copyrighted material in response to queries, but crucially, they have permission.'

So, I walk into a library, go to the photocopier (operated by and provided by the library), and make a copy of a book. I guess that's a legal copy, since the library has permission, right? That would have to be the case for your black box analogy to work. But, in fact, it doesn't. And, in order for the library to be held to the same standard you want ChatGPT held to, a librarian / staff member would need to hover by the photocopier every second making sure that no one copied anything. They would need to make sure no one takes pictures of book pages, hand-copies pages from books, or anything else that copies books.

Does this black box describe a VCR? No. It doesn't. Again, I'm not going to tell you why. If you don't know, I want you to give me your best guess.

Nope. A VCR is far more of an infringement technology than your black box - yet you want your black box to be under more enforcement than a VCR is.

Does this black box describe popular LLMs like ChatGPT? Yep, absolutely.

I disagree. Fair Use allows for transformative works. We already know the model is a transformative work. It seems exceptionally unlikely that anyone will successfully argue that output from the model, used as intended, ceases to be a transformative work under Fair Use.

For those other suppliers that aren't an LLM, though, the space of "content we might potentially return" is constantly mutating because they allow user-submitted data.

For an LLM, the space of "content we might potentially return" isn't constantly mutating - it only mutates as a result of changes that the supplier explicitly decides to implement.

Nope. LLMs can also reference out to the web, which is user-submitted data (from neither the querying user nor the LLM provider). You yourself mentioned that earlier, after all.
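A typical retrieval setup looks roughly like this (a sketch with made-up function names, not any vendor's actual pipeline):

def fetch_web_snippets(query):
    # Hypothetical retrieval step; a real system would call a search API.
    # Whatever those pages contain, including infringing copies posted by
    # third parties, flows into the model's context window from here.
    return ["...excerpt from some third-party web page..."]

def answer(query, generate):
    # `generate` stands in for the model. Retrieval splices live,
    # user-submitted web text into the prompt before generation runs.
    snippets = fetch_web_snippets(query)
    prompt = "\n\n".join(snippets) + "\n\n" + query
    return generate(prompt)  # output can echo whatever the web served up

The model's 'space of content' in that setup is exactly as mutable as the web it reads from.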

And you're missing the entire point of transformative works. A transformative work does contain copyrighted information, yet it is legal for the transformative work to do so. And, so far, that's where it appears that LLMs fall on the copyright spectrum.

You keep on saying those things aren't adjudicated at the product supplier level, but they absolutely are!

Please provide to me a list of legal technologies, marketed and sold for purposes other than violating copyright, that have been forced off the market because they can be used to infringe copyright if a user so chooses. I'll wait. Until you have such a list, you're asking for it to be regulated at a different level than every other technology. Trying to claim otherwise is simply false.

But, sure. Let's assume the DMCA is the solution to everything (it isn't - not even close - but I'll allow it for the purpose of argument). That must mean that the publishers of Harry Potter are fine with all of the many, many sources Google indexes that provide full-text Harry Potter novels, right? After all, they haven't been challenged under DMCA.

In that case, perhaps the LLM providers should have to reply to DMCA challenges of their output and add filters upon request. Would that satisfy you? If not, you're arguing they should be regulated differently (and at a higher level) than anything else, again.

I would argue that, in order to be reasonable, DMCA challenge would need to be in response to a query not authored by the rightsholder, their agents, anyone seeking to set up challenges on behalf of the rightsholder, etc. Until someone other than the user sees the output, it's virtually impossible to argue that any 'distribution' has occurred.

By analogy, allowing the rightsholder to set up the challenge is the same as a publisher sending someone into a library, copying a book, and then attacking the library for allowing someone to copy the book. Clearly, that's an unreasonable position.

One final note. Back to your black box:

It generates information related to that query; in some (but not all!) cases, that information is copyrighted. The black box, and its operator, do not have permission to distribute that copyrighted material.

Under Fair Use, the amount of material returned matters a great deal. If the black box can only accurately reproduce a small subset of a work, that is likely to be entirely legal under Fair Use. That exception is routinely used in many other areas of commerce. For instance, Google routinely scans (without permission) copyrighted works and puts entire pages of them online for people to look at as a preview. That has been adjudicated to be legal under Fair Use.

So, in fact, operators do have tacit legal permission to distribute subsets of copyrighted information, even if they do so intentionally, accurately, on a widespread basis, and for commercial gain. You would need to show that the impact of the LLM's 'copying' is much worse than Google's excerpts to have much of a case there.

Replies:   julka
julka ๐Ÿšซ
Updated:

@Grey Wolf

So, I walk into a library, go the photocopier (operated by and provided by the library), and make a copy of a book. I guess that's a legal copy, since the library has permission, right? That would have to be the case for your black box analogy to work

I disagree with your disagreement. You haven't explained why the black-box library violates copyright; you've invented a scenario outside the scope of what a library is. The black box library takes an input (a query) and provides an output (information). You're imagining some library-photocopier combination, but there is no requirement that a library include a photocopier, so you're ignoring the analogy and then saying it doesn't hold. You can ignore it if you like, but in turn I'm going to ignore your argument about it.

Nope. A VCR is far more of an infringement technology than your black box - yet you want your black box to be under more enforcement than a VCR is.

A VCR doesn't, upon request, violate copyright. You're welcome to assert that a VCR is "far more" of an infringement technology, but you're ignoring the analogy and so I'm going to ignore your argument about it.

We already know the model is a transformative work. It seems exceptionally unlikely that one will argue that output from the model, used as intended, ceases to be a transformative work under Fair Use.

The model is not the output. The fact that you can't imagine a way in which the output could be considered not fair-use mostly shows you have a lack of imagination.

Please provide to me a list of legal technologies, marketed and sold for purposes other than violating copyright, that have been forced off the market because they can be used to infringe copyright if a user so chooses. I'll wait. Until you have such a list, you're asking for it to be regulated at a different level than every other techology. Trying to claim otherwise is simply false.

Go ahead and wait. I'm not going to compose the list, because it has nothing to do with any position I'm holding. I'm not going to argue for whatever random position you think this represents - Forced off the market? What on earth?

I'm not advocating for a different level, I've done my best to make that clear.

I would argue that, in order to be reasonable, DMCA challenge would need to be in response to a query not authored by the rightsholder, their agents, anyone seeking to set up challenges on behalf of the rightsholder, etc. Until someone other than the user sees the output, it's virtually impossible to argue that any 'distribution' has occurred.

I disagree; the source of the query isn't relevant to anything here, and it's certainly not relevant to the LLM - it's not like ChatGPT does a check to see if you hold the copyright to certain pieces of information before it returns the data.

Nope. LLMs can also reference out to the web, which is user-submitted data (from neither the querying user nor the LLM provider). You yourself mentioned that earlier, after all.

Wow man, that's wild, you're telling me that the operator of the LLM has absolutely no control over whether or not the LLM goes and queries the web, and has zero ability to restrict which websites the LLM queries? That's totally insane, somebody should figure that out. Seems dangerous.

Replies:   Grey Wolf
Grey Wolf ๐Ÿšซ

@julka

You haven't explained why the black-box library violates copyright, you've invented a scenario outside the scope of what a library is. The black box library takes an input (a query) and provides an output (information). You're imagining some library-photocopier combination, because there is no requirement that a library includes a photocopier, and so you're ignoring the analogy and then saying it doesn't hold. You can ignore it if you like, but in turn I'm going to ignore your argument about it.

I've mentioned photocopiers in conjunction with libraries repeatedly. You're welcome to ignore it, but that doesn't make it any less valid an analogy, it just means you're dodging something that you can't give a good reply to.

A VCR doesn't, upon request, violate copyright

It absolutely does. Yes, you need to provide it an input, but it is literally designed to copy things. An LLM is not. VCRs are far more designed to infringe copyright than LLMs are, period. And they were marketed specifically for the purpose of making copies of copyrighted works, which LLMs have not been.

The model is not the output.

Yes, that's obvious.

The fact that you can't imagine a way in which the output could be considered not fair-use mostly shows you have a lack of imagination.

I can imagine it. It's just an enormous stretch. Instead of looking at the 1% case where the output is not Fair Use, in the end, I'm sticking with the 99% case where the output is Fair Use.

I'm not advocating for a different level, I've done my best to make that clear.

You say that, and then you actually make the argument for a different level over and over and over again.

To be specific: the current model is that the user is responsible for using a product in a way that infringes copyright. Period. End of subject. Your argument is that the supplier of the product is also responsible for that. That is a different level. You can say it's not until the cows come home, but as long as you are placing responsibility away from the user, you are - in terms of actual fact - arguing for a major shift in legal theory and a different level.

I disagree; the source of the query isn't relevant to anything here

It's absolutely relevant. If the output is not distributed, it's not distributed, and thus the rightsholder cannot know what it is. Undistributed output they asked for themselves is their own responsibility. You're basically asking for them to be able to play 'gotcha' games endlessly.

it's certainly not relevant to the LLM - it's not like ChatGPT does a check to see if you hold the copyright to certain pieces of information before it returns the data.

That wasn't the point I was making. The point I was making is that it would be ridiculous to argue that rightsholders could simply submit an unending torrent of DMCA requests based on their own searches, without regard for whether there was any practical merit in them. The intent of the DMCA system isn't to allow rightsholders to simply bombard others with an unending stream of takedown requests. Putting a check in place that the DMCA request be based on output distributed by others, without a tie to the rightsholders, seems like a reasonable way to limit the degree to which DMCA takedown requests can be used to tie others in knots.

Wow man, that's wild, you're telling me that the operator of the LLM has absolutely no control over whether or not the LLM goes and queries the web, and has zero ability to restrict which websites the LLM queries? That's totally insane, somebody should figure that out. Seems dangerous.

That's not what I said, and I'm certain you know it. Google has absolute control over whether Chrome goes out and queries the web and has total ability to restrict which websites Chrome queries. Anyone who would make the case that, therefore, Google is the sole determiner of what Chrome displays, and it displays no user-submitted data, would be laughed out of the room. The same is true for the argument that the LLM provider is the sole determiner of what the LLM displays, given that there is a feature within the LLM to retrieve information from the web. Especially since you knew LLMs were able to do that, making the argument that LLMs do not provide information other than that given to them by their provider is equally laughable.

The entire discussion of Grok quoting pages from Harry Potter is 100% Grok pulling data from the web - the very definition of it producing user-submitted data. It is no more reasonable to say 'well, don't do that!' in the case of an LLM than it is to say 'well, don't do that!' in the case of a web browser. The infringement is sourced at the site distributing the copyrighted material, not the engine that retrieves and displays it.

A couple of notes:

As far as I can tell, you had no response to the point that there is established precedent for the intentional copying and redistributing of sections of copyrighted material for commercial purposes, when that copying is limited to relatively small sections of the work and does not adversely affect the market for the work - even in the case where the rightsholders or their representatives vociferously object to that copying. As far as I'm concerned, that means the LLM needs to produce large segments of copyrighted material without significant error, and needs to do so without referencing outside sources (the web, etc), in order to be an infringing black box. I have yet to see any proof that existing LLMs can do that; current analysis makes it pretty clear that they cannot. It's absolutely not sufficient that they produce 'any' copyrighted material - quantity and quality absolutely matter, in a legal sense.

I'll also note that the 'compromise' position - allow rightsholders to issue DMCA requests to the LLM operators - remains out there. While I think that could easily be weaponized, and there would need to be checks against it (hence, again, my notion that the requests should be made against output that was actually distributed without the knowledge or involvement of the rightsholders - no fishing expeditions!), that's not an unreasonable compromise position. It would give the rightsholders recourse while allowing for the vast majority of LLM uses, which have nothing at all to do with infringing copyright.

Replies:   julka
julka ๐Ÿšซ

@Grey Wolf

I've mentioned photocopiers in conjunction with libraries repeatedly. You're welcome to ignore it, but that doesn't make it any less valid an analogy, it just means you're dodging something that you can't give a good reply to.

Dang, that's wild! I had no idea that an argument became better the more times you repeated it! Is that something you learn in debate class? I thought an argument was good on its merits, like "a photocopier is not a required part of a library, because libraries existed for thousands of years before photocopiers did", not "I've mentioned photocopiers in conjunction with libraries bunches of times". Dang, you're teaching me so much!

It absolutely does. Yes, you need to provide it an input, but it is literally designed to copy things. An LLM is not. VCRs are far more designed to infringe copyright than LLMs are, period. And they were marketed specifically for the purpose of making copies of copyrighted works, which LLMs have not been.

Dang, that's wild! I had no idea the VCRs my family used were so behind the times! You're telling me you had a VCR that you just gave a human-language query to, and it would spit out a video tape containing information related to what you said? That's insane, man, I honestly had no idea the tech was there in the 80s! Dang, you're teaching me so much!

To be specific: the current model is that the user is responsible for the use of a product so as to infrige copyright. Period. End of subject. Your argument is that the supplier of the product is also responsible for that. That is a different level. You can say it's not until the cows come home, but as long as you are placing responsibility away from the user, you are - in terms of actual fact - arguing for a major shift in legal theory and a different level.

Dang, that's wild! You're telling me that when all of those services I listed process DMCA queries, that's actually not a legal process at all and has no bearing on liability? That's so wild, I had no idea safe harbor had nothing to do with absolving legal liability. Dang, you're teaching me so much!

It's absolutely relevant. If the output is not distributed, it's not distributed, and thus the rightsholder cannot know what it is. Undistributed output they ask is their own responsibility. You're basically asking for them to be able to play 'gotcha' games endlessly.

Dang, that's wild! I had no idea that asking for, and receiving, a response from an LLM was a "gotcha" game! That's pretty surprising to me, I would have expected that a service put on the internet to answer queries was intended to answer queries given to it by the internet, but I guess that's not the case. Dang, you're teaching me so much!

The point I was making is that it would be ridiculous to argue that rightsholders could simply submit an unending torrent of DMCA requests based on their own searches, without regard for whether there was any practical merit in them.

Dang, that's wild! I had no idea that if rightsholders request content from somebody who doesn't have permission to distribute it, it's actually the rightsholder's fault if the person who doesn't have permission to distribute it gives it back to them! That's such a nutty rule to have in place, I have no idea where I would have looked to find that. Dang, you're teaching me so much!

That's not what I said, and I'm certain you know it. Google has absolute control over whether Chrome goes out and queries the web and has total ability to restrict which websites Chrome queries. Anyone who would make the case that, therefore, Google is the sole determiner of what Chrome displays, and it displays no user-submitted data, would be laughed out of the room. The same is true for the argument that the LLM provider is the sole determiner of what the LLM displays, given that there is a feature within the LLM to retrieve information from the web. Especially since you knew LLMs were able to do that, making the argument that LLMs do not provide information other than that given to them by their provider is equally laughable.

Gosh, I'm such a dummy, I must be confused! Can you explain to me how the LLM operator a) chooses to let the LLM search the web but also b) has no control over where on the web the LLM searches, and also can't choose whether or not the LLM searches the web? I was under the impression that if the operator is choosing to let the LLM search the web, then surely they should be responsible for the fact that it's searching the web, and furthermore they have some sort of ability to tell the LLM what sources are good or bad, and I would have expected that this would mean the LLM operator has absolute control over what the LLM is trained on and uses to generate its responses! But you've told me that's not the case, and I guess I'm just real twisted up trying to figure out how that's possible!

As far as google chrome goes, you're just being a big silly - nobody made that argument! It has nothing to do with anything, you goofball! LMAO, stop making jokes like that! I thought you had an actual point!

Replies:   Grey Wolf
Grey Wolf ๐Ÿšซ

@julka

I'll take all of your sarcastic replies and summarize them as 'I have no valid argument here, so I'll try to make fun of the messenger.' Because that's what you actually did - just made a sarcastic reply while ignoring the substance of the argument.

In debate, if you simply ignore the substance of an argument and make a personal attack in response, guess what? The person who advanced the argument is considered to have won that point. So, I'll take my win on all of those points and move on, mostly.

I'll engage slightly on libraries, though. Yes, libraries existed for many years without photocopiers. So, fine. Libraries without photocopiers may or may not constitute black boxes (Do they allow camera phones? Handwriting? If so, still a black box under your definition). My argument is that 'libraries with photocopiers' constitute a black box by your definition of a black box. Which is, in fact, pretty much what I said the first time. You keep trying to ignore that, but you can't actually give any substantive reason why they don't, only that 'they have a license!' (which, in fact, they do not for paper books - and, if that counted, any LLM trained using paper books would have the same 'license'). And they don't have a license for electronic books that allows copying, either.

The librarians undoubtedly control the inside of the library, so they should be liable for allowing copying, no? That's your standard elsewhere, and there is no 'license' which gives the librarians the authority to allow copying.

Now, what actual libraries generally do is post a sign near the photocopier saying 'Hey! These are the rules about copying! Follow them!' Is that sufficient? It seems reasonable to me. On the other hand, one could say it kinda makes it seem like the librarians are aware of the risk that their services will enable copyright violations and yet choose not to supervise the photocopier. Sneaky librarians!

But, by analogy, we could allow the LLM provider to post a notice in their Terms and Conditions: 'Our product may allow access to copyrighted information. It's your responsibility to not distribute any copyrighted information you receive from our product.' Problem solved, just in the way it's solved in libraries, right?

And note that LLMs might allow access to copyrighted information. Maybe. On the other hand, libraries absolutely allow access to copyrighted information. If the disclaimer works for library photocopiers, why in the world would it be insufficient for an LLM?

I'm intentionally dropping your point about DMCA takedowns except to note that perhaps you missed my suggestion that the LLM providers should also handle DMCA requests, and thus be just as fine as other providers. If it's good everywhere else, why not in the case of LLMs?

Dang, that's wild! I had no idea that asking for, and receiving, a response from an LLM was a "gotcha" game! That's pretty surprising to me, I would have expected that a service put on the internet to answer queries was intended to answer queries given to it by the internet, but I guess that's not the case. Dang, you're teaching me so much!

Great job of setting up a strawman and knocking it down while ignoring the point. Yes, it's a gotcha game if the goal of making those queries is to bombard the provider of the LLM with DMCA takedown requests. Which, again, was the point I made very specifically, and the point which you didn't respond to while tilting vigorously at your strawman.

Strawmen: GW 0, julka 1. But actual arguments: GW 3, julka 0 (counting the arguments you intentionally whiffed on with personal attacks).

Dang, that's wild! I had no idea that if rightsholders request content from somebody who doesn't have permission to distribute it, it's actually the rightsholder's fault if the person who doesn't have permission to distribute it gives it back to them!

I specifically said that wasn't the argument here and told you what the argument was.

Strawmen: GW 0, julka 2. But actual arguments: GW 4, julka 0.

Can you explain to me how the LLM operator a) chooses to let the LLM search the web but also b) has no control over where on the web the LLM searches, and also can't choose whether or not the LLM searches the web?

Annnnnnnd the goalposts move hundreds of kilometers in a single bound! Before, you claimed there was no user data. Now, you're claiming that the provider could make there be no user data if they wanted to (at the cost of making the product much less useful, obviously). But, unlike the case of everything else that searches and retrieves information from the web, you want the LLM provider to be responsible, not the entity that published the copyrighted information on the web in the first place.

I was under the impression that if the operator is choosing to let the LLM search the web, then surely they should be responsible for the fact that it's searching the web, and furthermore they have some sort of ability to tell the LLM what sources are good or bad, and I would have expected that this would mean the LLM operator has absolute control over what the LLM is trained on and uses to generate its responses!

What a ludicrous argument! You're of the opinion that an LLM operator must create a canonical list of the entire internet and determine which sources are good and bad? Even Google doesn't so much as try to do that! Hence all of those DMCA takedowns you go on about. And, really, you want the LLM providers deciding what's 'good' and 'bad'? The implications of that are, honestly, horrifying.

So, your argument is that the LLM providers must undertake a task that so far outstrips the meaning of 'Herculean' as to boggle the mind - make a URL-by-URL determination, for the entire internet, as to whether a site is 'good' or 'bad' - before they can interact with the internet at all? And they have to do it even though the entire internet is constantly in flux? Perhaps you understand why I'm laughing at you. Loudly.

Strawmen: GW 0, julka 3. But actual arguments: GW 5, julka 0.

As far as google chrome goes, you're just being a big silly - nobody made that argument! It has nothing to do with anything, you goofball! LMAO, stop making jokes like that! I thought you had an actual point!

There's this thing called an analogy: 'a comparison between two things, typically for the purpose of explanation or clarification.' You might want to read up on them. By 'analogy,' you are saying that the exact thing Chrome does - and which its provider controls it doing - is just fine, while if the LLM does the same thing, because its providers allow it to, it's awful and infringing. Even though, again, it's the same thing.

So, which is it? Is Chrome bad because it does the same thing the LLM does, or is the LLM fine because it does the same thing Chrome does? Pick one or the other, if you're not too tied up with learning what analogies are.

At this point, I will just assume you're completely fine with the solution to this being to allow rightsholders to issue DMCA challenges to the LLMs, since you haven't bothered to reply to it. No challenge? The LLM is fine. A challenge to which the LLM provider doesn't take appropriate action might be grounds for a lawsuit (with, obviously, refusal to respond to the challenge being a key piece of evidence). Now, if the challenge is wrongful, the LLM provider shouldn't take action, and perhaps the entity issuing the wrongful challenge should be sanctioned, but those are details. We're good there? DMCA challenges solve the whole thing?

And I will also consider you to have conceded the point that the LLM's production of copyrighted materials is only actionable to the point where the copyrighted text is long, accurate, and has a meaningful negative impact on the marketability of the original work (all longstanding elements of Fair Use, and all adjudicated to apply to commercial use of excerpts of copyrighted works), since you haven't bothered to reply to that, either. If it meets those criteria, it's actionable infringing material. If it doesn't, it's not.

See? We can make progress! Now, all you have to do is find an extant LLM which can produce actionable infringing material without resorting to external sources and - wonder of wonders - I will actually agree that the LLM / provider might be committing a legally actionable infringement. And I hope that you, in turn, will agree that any LLM which cannot do that (which seems to be 100% of the currently extant LLMs) is not committing any legally actionable infringement, nor should it be considered to be doing so unless the rightsholders have made specific takedown requests against actionable infringing material, allowed adequate time, and had those requests not eventually satisfied.

Mind you, the 'post a sign' version seems sufficient, unless you think libraries are dangerous hotbeds of copyright infringement and librarians should be held responsible if anyone walks out of their library with a copy of a book. But doing it the more complicated way seems beyond question, no?

Replies:   julka
julka ๐Ÿšซ
Updated:

@Grey Wolf

You keep trying to ignore that, but you can't actually give any substantive reason why they don't, only that 'they have a license!' (which, in fact, they do not for paper books - and, if that counted, any LLM trained using paper books would have the same 'license')

This is a perfect encapsulation of why I wrote sarcastic responses for you to ignore; you've spent the entire argument ignoring my serious responses, so I might as well have more fun when I reply.

My responses to probably at least most of the rest of your points are in my previous posts, see if you can find them! It's like a fun scavenger hunt. As an example, see if you can find any of the posts where I point out that part of copyright law gives people (including libraries) the right to lend out physical copies of books they have purchased.

You get an extra point if you can figure out why that doesn't apply to an LLM that was trained on physical books. Scroll down a bit if you need a hint!

(Hint: is an LLM loaning a copy of a physical book it has purchased when it answers a query? If you still don't know the answer, scroll down for the answer!)

(Answer: Nope! LLMs aren't loaning physical copies of anything, they return digital data! This is one of the ways in which an LLM is not like a library! Turns out that when you do something different, the law is different! See if you can find this answer buried in another post!)

At this point, I will just assume you're completely fine with the solution to this being to allow rightsholders to issue DMCA challenges to the LLMs, since you haven't bothered to reply to it.

You can assume whatever you want, dude, it's not like you engage with the things I write. I didn't bother to reply because I already gave my reasoning for why LLMs aren't forced to implement a reactive measure (they have absolute control over what the LLM is trained on) and your counterpoint was nonsensical and part of why I started making fun of your dumb answers.

Also, and I'm gonna address this point specifically because it hurt my feelings, I categorically reject your assertion that I have made any personal attacks on you. I've attacked your arguments in sarcastic ways, and I've called your arguments dumb or bad in a few places, but I have put effort into not insulting you personally and only addressing the quality of your statements, not you as a person.

Edit:

You're of the opinion that an LLM operator must create a canonical list of the entire internet and determine which sources are good and bad?

Someday, maybe someday soon, you're going to come across the idea of an allowlist and I think it's going to blow your mind with the possibilities.
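
Just so we're looking at the same picture, here's a toy sketch in Python of what I mean - the function names and domains are mine for illustration, not anybody's real API:

```python
from urllib.parse import urlparse

# Hypothetical allowlist: the only hosts this retrieval tool may fetch.
# Everything not listed is refused by default.
ALLOWED_DOMAINS = {
    "en.wikipedia.org",
    "arxiv.org",
    "docs.python.org",
}

def is_fetch_allowed(url: str) -> bool:
    """True only if the URL's host is an allowlisted domain (or a subdomain of one)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

def fetch_for_llm(url: str) -> str:
    if not is_fetch_allowed(url):
        # Deny by default: unlisted sources are never retrieved.
        raise PermissionError(f"{url} is not on the retrieval allowlist")
    # The actual HTTP fetch would go here; omitted in this sketch.
    return f"(contents of {url})"
```

Deny by default! Nobody has to survey the entire internet in order to decline to fetch from it; the operator only has to vouch for the sources they affirmatively opt in.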

Replies:   Dominions Son  Grey Wolf
Dominions Son ๐Ÿšซ

@julka

My responses to probably at least most of the rest of your points are in my previous posts, see if you can find them! It's like a fun scavenger hunt. As an example, see if you can find any of the posts where I point out that part of copyright law gives people (including libraries) the right to lend out physical copies of books they have purchased.

And you are still missing the point about libraries.

The argument being made is not about someone borrowing the copy of a book that the library purchased.

The argument is about someone using the library's facilities (a photocopier for example) to create a new copy and walking out of the library with a copy of the book that isn't the one the library purchased.

Replies:   julka
julka ๐Ÿšซ
Updated:

@Dominions Son

Right, that argument keeps on being made and it's never explained why it's relevant. In this argument, is the LLM the library? The person making the copy? The photocopier?

In one scenario, we have two parties and one interaction: person queries llm to receive answer.

In the other scenario, there are three parties with two separate interactions: person queries library and receives answer. Person uses photocopier to copy answer.

I don't see a clear mapping of the three-party-two-way interaction to the two-party-one-way interaction, which is why I keep on saying it's fundamentally different. Grey Wolf keeps on bringing up the comparison and saying "nuh uh".

I agree that when you add in extraneous things, the scenario is different, but "if things were different you'd be wrong" is pretty far from a compelling argument. Grey Wolf appears to be arguing that if I want an LLM operator to police the output of the service they control, I must also be arguing that libraries must police the actions of patrons they don't control. I fundamentally disagree that those two viewpoints are equivalent, and they don't become more equivalent just because it keeps on getting repeated.

Edit: to be clear, I am not on the hook for solving the entirety of copyright infringement because I am arguing that LLMs should not return copyrighted information that they do not have permission for! Whataboutism is a bad argument here - multiple things can be bad in similar ways and the existence of one bad thing is not a reason to insist against improving other things. We should develop things that are better than what we have now, not things that are bad in the same ways we already have. This isn't an impossible problem for LLMs to solve, so why not solve it?

Replies:   GreyWolf  LupusDei
GreyWolf ๐Ÿšซ
Updated:

@julka

Right, that argument keeps on being made and it's never explained why it's relevant. In this argument, is the LLM the library? The person making the copy? The photocopier?

In this argument, the library is the 'black box' you say is equivalent to an LLM. I've explained that several times. It seems to be escaping you, and I'm not sure of how else I could possibly make it clear.

You came up with the idea of a 'black box.'

You came up with the idea that your 'black box' matched an LLM.

I noted that 'library + photocopier' meets your definition of a 'black box.'

Since then, it's been nothing but a series of irrelevant (libraries can loan books!), incorrect (libraries have licenses!), and silly (libraries have existed for a long period of time) responses, all dodging the entire point of the discussion.

I don't see a clear mapping of the three-party-two-way interaction to the two-party-one-way interaction, which is why I keep on saying it's fundamentally different.

All of the interactions happen inside your 'black box'. They are, therefore, the same. If you have to look inside the 'black box' to tell one from the other, it's not a 'black box' at all.

I must also be arguing that libraries must police the actions of patrons they don't control

Your argument is that a library cannot control the actions of patrons within the library? That would be news to the many librarians I know personally. At no point did I say that someone copying a book outside of the library constitutes a 'black box'; my comments were always about behavior within the library, where the librarians should be in control.

I agree that when you add in extraneous things

What, exactly, do you consider to be 'extraneous,' here?

We should develop things that are better than what we have now, not things that are bad in the same ways we already have. This isn't an impossible problem for LLMs to solve, so why not solve it?

I can somewhat agree here (hard to imagine I couldn't). My point has been that either 1) this is a very easy problem for LLMs to solve (as noted in a previous section to which you completely failed to respond, there appear to exist zero operational LLMs that don't meet the Fair Use test established in Authors Guild vs Google) or 2) an exceptionally hard problem for them to solve (if your standard is that any amount of copyrighted output - even if short and sometimes incorrect - isn't allowable). If it's #1, then there's literally no point to your crusade: LLMs are following the law. If it's #2, then I'm making the cost-benefit argument that you're saying that, because an LLM can be misused to sometimes reproduce copyrighted information, we should just hamstring everything about them until they absolutely can't do that.

By analogy, consider Google. The equivalent argument would be that Google should not return copyrighted information that it does not have permission for. That is absolutely something that could be achieved. Google has complete control over what it chooses to index and what results it returns, after all. I consider it equally reasonable to say 'Google should return no copyrighted information' and 'LLMs should return no copyrighted information.' The first is clearly a beyond-Herculean task that would effectively remove Google from the search engine market. The second is less Herculean, but still ridiculous, and would be achieved at the cost of making LLMs far less useful, because they would necessarily have to be unaware of the contents of numerous high-profile copyrighted works. That's why we have both Fair Use and DMCA takedowns - to balance the rights of copyright holders with society's interest in advancing the useful arts and sciences, even if it occasionally means compromising the rights of copyright holders.

The point of bringing up other technologies is to note that there are numerous things which make life much, much easier for would-be infringers than LLMs do. One can, of course, crusade against anything one wishes, but it strikes me as being like someone vociferously supporting the 'We need to ban campfires because they release CO2!' club while waving their hand and saying that we needn't worry about automobiles and airplanes. After all, campfires do release CO2, too, so it would be completely fair to say that multiple things can be bad and the existence of a couple of bad things (airplanes, automobiles) is no reason to not go after campfires, too.

But it seems to me as if all of that energy is wildly misplaced, if the thing one actually cares about is reducing the problem of excess CO2. And it also seems to me as if all of the energy spent complaining about LLMs is wildly misplaced if the thing one actually cares about is reducing the problem of copyright infringement.

I don't expect a substantive reply, but I'll be happy to be proven wrong.

Replies:   julka
julka ๐Ÿšซ

@GreyWolf

My point has been that either 1) this is a very easy problem for LLMs to solve (as noted in a previous section to which you completely failed to respond, there appear to exist zero operational LLMs that don't meet the Fair Use test established in Authors Guild vs Google)

The model generation has been deemed fair use by the courts, but I still maintain the output isn't, necessarily. While I hate to appeal to authority here, OpenAI agrees with me - "Generative AI systems might generate output media that infringes on existing copyrighted works", they write in their comment to the USPTO (they go on to say that this is unlikely in a well-constructed model, and that such cases should be handled via individual suits, as you might expect). That said, here's a paper showing that models "memorize" significant chunks of their training data, which can be extracted under fairly trivial circumstances [1], and a lawsuit from the NYT [2] showing ChatGPT reciting, verbatim, significant chunks from recent articles.

The weakness of the first paper I'm citing here, at least in terms of my argument, is that it's pulling out n-gram strings of ~1000-4000 characters, in the largest single chunks. And I'll grant that you'd be right to say that, in terms of copyright, that's not a significant chunk; the paper is much more interesting in the context of "how much PII was used in the training of ChatGPT, and how much of that can be recovered?", which is a real question but totally orthogonal to copyright. But it does serve as a proof of concept that a) not all of the training material disappears inside the model, some of it gets retained, and b) raw text that was used to train the model is extremely extractable in small chunks, and repeated queries can extract more text.
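
If anyone wants a mechanical feel for what "extractable in small chunks" means, here's a toy overlap check - my own illustration in Python, not the paper's code (the paper does this at scale against a huge reference corpus with much smarter data structures):

```python
def ngram_set(text: str, n: int = 50) -> set[str]:
    """Every character n-gram in the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def verbatim_overlap(model_output: str, source_text: str, n: int = 50) -> float:
    """Fraction of the output's n-grams that appear verbatim in the source."""
    out = ngram_set(model_output, n)
    if not out:
        return 0.0
    return len(out & ngram_set(source_text, n)) / len(out)

# A high fraction at a large n means long verbatim runs - memorization,
# not paraphrase. The value of n here is a made-up placeholder.
```

Run that between a model's output and a known text and you can put an actual number on "how verbatim is this, really?"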

Does the NYT lawsuit evidence meet the bar of beating what was established by Authors Guild vs Google? It hasn't gone to court yet, so the question is an open one. OpenAI says you have to manipulate the model via prompting in order to get verbatim output, which is not a particularly convincing argument for me; one of their points is that you have to already know some of the text in order to prompt for more. If you can just prompt with a single repeated token to initiate divergence and get the start of the text, then use the start to get the rest? Is that still fair use? I guess the courts will decide.

Researching this post also led me to some interesting mitigations various providers are working on; some of them I think are categorically doomed (filtering or modifying user prompts, as OpenAI is trying to do, is a fool's errand that will never finish; Anthropic's training methodology is much more interesting in my view) but it's nice to see that the developers of these tools are working to solve the problem instead of declaring it to be not an issue.
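
For clarity on the difference: prompt filtering tries to guess intent before the fact, while an output-side check looks at what the user actually receives. A deliberately crude sketch of the latter - mine, not any provider's actual implementation, with made-up placeholder thresholds:

```python
def ngrams(text: str, n: int) -> set[str]:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def release_output(generated: str, protected_corpus: list[str],
                   n: int = 100, max_share: float = 0.05) -> str:
    """Withhold a response whose verbatim n-gram overlap with any
    protected work exceeds a threshold. Picking defensible values
    for n and max_share is the genuinely hard part."""
    out = ngrams(generated, n)
    for work in protected_corpus:
        if out and len(out & ngrams(work, n)) / len(out) > max_share:
            return "[response withheld: substantial verbatim overlap]"
    return generated
```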

And it also seems to me as if all of the energy spent complaining about LLMs is wildly misplaced if the thing one actually cares about is reducing the problem of copyright infringement.

oh yeah copyright infringement is probably in the top ten, but definitely not top five, reasons I don't like LLMs. They're interesting for some very specific use cases, and 95 percent of where you find them is not one of those use cases.

[1]: https://arxiv.org/pdf/2311.17035v1
[2]: https://storage.courtlistener.com/recap/gov.uscourts.nysd.612697/gov.uscourts.nysd.612697.170.0_2.pdf

Replies:   Grey Wolf
Grey Wolf ๐Ÿšซ
Updated:

@julka

The model generation has been deemed fair use by the courts, but I still maintain the output isn't, necessarily.

Obviously, one can't be sure, since it hasn't been adjudicated. But the conclusion of Authors Guild vs Google is that fully commercial, for-profit use of directly quoted material from a copyrighted work, without permission, is Fair Use - as long as the amount of material is relatively limited and does not have an adverse effect on the market value of the work. The amount of quoted material in that case is larger than any current AI seems to be able to reproduce - hence my view that there does not appear to be an extant AI that fails to meet the same Fair Use criteria as in that case.

not all of the training material disappears inside the model, some of it gets retained

As I would expect. If it all 'disappeared,' the AI would be very limited in utility for purposes related to that material.

raw text that was used to train the model is extremely extractable in small chunks, and repeated queries can extract more text.

Also entirely believable, but generally meaningless, unless those repeated queries add up to extracting a large enough volume of the work to cross the Fair Use threshold. Again, if it couldn't, the value of the AI to purposes related to that material would be enormously limited.

Consider, for instance, the use case of 'Hey, LLM, tell me about the social and political themes in Harry Potter.' That's a perfectly valid use case. It's hard to imagine the response to that being valuable without being able to quote from the text. If the LLM is a 'problem' because of that, so is the average term paper.

They're interesting for some very specific use cases, and 95 percent of where you find them is not one of those use cases.

I would tend to disagree with your percentages there, while agreeing that they have been applied to use cases for which they are not well suited.

I would consider 'copying copyrighted material' to be a use case for which they are particularly ill-suited.

I will also point out that the point of what I was saying wasn't that 'copyright is one of the lesser problems with AIs', but rather that 'AIs are one of the lesser problems with copyright.' If the goal is to reduce copyright infringement, AIs seem like an ill-chosen starting point.

LupusDei ๐Ÿšซ

@julka

I go to a library, read their copy of Harry Potter, and have that chapter (used as an example above) memorized perfectly, word by word. I believe that's entirely legal, isn't it? And believe me, I can technically do that; for many years I could recite a very long epic poem. So, now after that you ask me to recite that chapter, and I do. Which of us just committed copyright infringement? Note that I just honestly answered your query, privately. I wouldn't go on stage reciting it, or sell tickets for that, or anything like that on my own.

Hint for relevance in this discussion: while I'm a human, I'm also an LLM in this functionality (most every human contains an LLM-like system naturally). However, an LLM "on a computer" differs by having neither personhood nor rational judgment, and cannot be "personally" liable for anything.

Replies:   julka
julka ๐Ÿšซ

@LupusDei

Hey man it's super cool that you can memorize and recite long passages of text, real chuffed for you and honestly I'm just like so stoked for you that you found an opportunity to brag about your ability! I imagine it doesn't come up in conversation a lot and that's why you felt the need to shoehorn it into a conversation.

Anyways if you have a question about copyright laws and performances I suggest you go to a library and look it up, a lot of this stuff is established law and if you're intellectually curious you can just go find this information for yourself, you don't have to rely on random people on the internet to go and do research for you! It's pretty neat, I hope you have fun learning!

Replies:   LupusDei
LupusDei ๐Ÿšซ

@julka

Fine. You once again demonstrated you have absolutely no interest in engaging in constructive discussion. And there I was thinking I was giving you an out.

So, well, regarding that question, I will go ask ChatGPT about it.

Replies:   julka
julka ๐Ÿšซ
Updated:

@LupusDei

Dogg, come on! A) Your question has nothing to do with the topic at hand and 2) your question is very definitely something that is just straight up settled law, like the established answer is absolutely a known thing. You're not a wizened old man engaging in Socratic debate here, you don't need to get salty at me just because I declined to perform free legal research for you. Go toss some cash at a paralegal or something.

Edit: since i'm in a good mood, how about this? I'll answer your question; let me know where I should send my invoice. My hourly rate is 200 dollars and I bill in increments of 8 hours, so once you send me 1600 dollars I'll tell you what's up.

Grey Wolf ๐Ÿšซ

@julka

This is a perfect encapsulation of why I wrote sarcastic responses for you to ignore; you've spent the entire argument ignoring my serious responses, so I might as well have more fun when I reply.

This is the epitome of the pot calling the kettle black. I've written comment after comment with nothing but responses to the extremely limited 'serious replies' you've made. Most of your 'serious replies' have been anything but.

Meanwhile, you very seldom reply to my points, often just ignoring them over and over.

If you actually made more serious replies, you would get more responses to them.

My responses to probably at least most of the rest of your points are in my previous posts, see if you can find them!

Triple-checking: no, they're not. You could try actually responding. Now, maybe you wrote them all out and SoL refused to post them - that's possible. Or maybe they were deleted by someone (other than yourself, of course). But they're not posted here as of now. Maybe you archived them and will repost them later? If so, I'll withdraw this and later comments about your exceptionally low rate of responding to my points.

As an example, see if you can find any of the posts where I point out that part of copyright law gives people (including libraries) the right to lend out physical copies of books they have purchased.

Which is utterly irrelevant to anything we're discussing. I never said they couldn't, but that point is entirely irrelevant. You seem to somehow think that their ability to loan out books also somehow indemnifies them or gives them a 'license.' It does neither.

You get an extra point if you can figure out why that doesn't apply to an LLM that was trained on physical books. Scroll down a bit if you need a hint!

If you could explain what you actually mean by that, it would help.

You can assume whatever you want, dude, it's not like you engage with the things I write.

Good lord. I do nothing but engage with the things you write. I called out several very specific places where you are not engaging with serious points I made, and your response is 'I know you are but what am I'? Seriously? You're now resorting to the Pee Wee Herman school of debate?

I didn't bother to reply because I already gave my reasoning for why LLMs aren't forced to implement a reactive measure (they have absolute control over what the LLM is trained on)

A point that I've already shown to be both irrelevant and incorrect. You could try reading my responses instead of ignoring them and claiming victory in a vacuum.

and your counterpoint was nonsensical and part of why I started making fun of your dumb answers.

Reality is not 'nonsensical.' Sorry if you disagree.

Also, and I'm gonna address this point specifically because it hurt my feelings, I categorically reject your assertion that I have made any personal attacks on you.

Mocking an argument with irrelevant commentary instead of replying to it with substance is a personal attack.

Someday, maybe someday soon, you're going to come across the idea of an allowlist and I think it's going to blow your mind with the possibilities.

I'm very well aware of what an allowlist is. What, in the nature of an allowlist, makes it a magical tool that allows someone to quickly survey the entire internet and also guarantees that nothing changes after the allowlist is made? That was, after all, the point of my argument.

You're totally ignoring two related points, of course - but that's par for the course. Point one: yes, obviously they could make such a restrictive allowlist as to make the internet search useless. So, yes, you can insist on a rule that LLMs have to be massively and artificially hamstrung for no reason other than that they're LLMs.

And, point two: you still can't say why it's totally fine for anything but an LLM to reference 'the internet' - even though those other things are much more efficient at enabling copyright infringement - but LLMs are uniquely disallowed from doing that.

At this point, I have to agree with what you said before: there can be little point in continuing this discussion. You simply refuse to argue in good faith. I respond to every serious argument you make, then - much later - you say I ignore them. You say 'Oh, I responded to that - search my posts!', yet there are no responses in evidence. Meanwhile, I make serious argument after serious argument, then specifically point out that you're at minimum tacitly agreeing with them by letting them go, and I get back nothing at all. Or, when you deign to reply, it's to post something mocking that ignores the substance of the comment, takes something out of context and tees off on it, or knocks down a convenient strawman.

Whatever your vendetta is against LLMs, have at it. By this point it's demonstrably nonsensical, but it's yours, and you're welcome to full enjoyment of it.

Joe Long ๐Ÿšซ

@julka

Another way to infringe copyright is by distributing the work without permission. When a photocopier makes a copy of a book, it's not distributing that copy; that's done by some other entity.

Correct, to an extent. There are fair use exceptions for how much is shared and whether the sharing interferes with the ability of an author to monetize the work being copied.

Replies:   Joe Long
Joe Long ๐Ÿšซ

@Joe Long

Here is an example of what I can do with ChatGPT that doesn't violate copyright, but requires the LLM have a detailed knowledge of the subject material.

Compare and contrast Joe Long in OTA with Charlie in "Perks" and Park in "Eleanor and Park", in how they try to navigate teen social structures and find true love.

Great side note - these three characters are all emotionally intelligent teenage boys who struggle to navigate love, friendship, and self-worth in vastly different social settings. Here's a breakdown of their similarities and differences:

---

### **Similarities**

| Trait | **Joe (OTA)** | **Charlie (Perks)** | **Park (Eleanor & Park)** |
| -------------------------------- | ------------------------------------------------------------------ | ------------------------------------------------------------- | -------------------------------------------------------------------------- |
| **Emotionally introspective** | Yes - Joe constantly evaluates his actions, motives, and guilt. | Yes - Charlie internalizes everything, sometimes dangerously. | Yes - Park processes emotion quietly and feels things deeply. |
| **Romantic idealist** | Strongly - believes in forever love with Hannah. | Idealizes Sam and relationships in general. | Deeply loyal and devoted to Eleanor. |
| **Feels like an outsider** | Struggles to fit into adult expectations and social cliques. | Total outsider, even among friends. | Not quite an outcast, but doesn't fully fit in with peers or family. |
| **Sensitive to female emotions** | Hyper-aware, often overthinks how he impacts Hannah. | Obsesses over how others feel, especially Sam and his family. | Treads carefully with Eleanor, mindful of her trauma. |
| **Parental conflict** | Deep emotional conflict with Chris, unresolved grief with his mom. | Complex trauma from Aunt Helen, emotionally distant parents. | Loving family, but conflict with his father over identity and masculinity. |
| **Wants to give, not just take** | Learns to focus on giving love, not just receiving pleasure. | Wants to be needed and helpful - sometimes to a fault. | Genuinely wants Eleanor to feel safe, wanted, and happy. |

---

### **Differences**

| Category | **Joe (OTA)** | **Charlie (Perks)** | **Park (Eleanor & Park)** |
| ------------------------ | -------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| **Sexual maturity** | Joe is sexually active and explores the meaning of intimacy. | Charlie is inexperienced and deeply affected by past trauma. | Park becomes sexually aware through his evolving relationship with Eleanor. |
| **Relationship arc** | Joe's relationship with Hannah is passionate but unsustainable - leading to heartbreak and growth. | Charlie's relationships are clouded by grief, trauma, and immaturity. | Park and Eleanor share a pure, intense love that ends with heartbreak but leaves lasting change. |
| **Emotional trajectory** | Joe matures painfully through loss, consequence, and confrontation with family. | Charlie has a mental breakdown and begins the path to healing. | Park opens up emotionally, defying social norms and family pressure. |
| **Narrative voice** | Reflective, driven by guilt, longing, and rationalization. | Quiet, epistolary, often unreliable in how much he shares or understands. | Straightforward, vulnerable, split POV but with strong inner monologue. |
| **Social navigation** | Joe attempts to blend in but often feels torn between adult pressures and youthful romance. | Charlie tries to be invisible, only to be pulled into the spotlight by friends. | Park holds a fragile social position - not popular, not hated - but gradually finds courage to stand out. |

---

### **Summary**

* **Joe** is more self-aware than either Charlie or Park but burdened by *emotional guilt and societal expectations*.
* **Charlie** is the most emotionally fragile โ€” shaped by *trauma, grief, and repression*.
* **Park** is perhaps the most *stable and compassionate*, but forced to choose between comfort and courage.

All three, in their own ways, want to love well - to *give*, not just *receive*. Joe's arc is the most explicitly sexual and morally complex. Charlie's is the most traumatic. Park's is the most poetic and romantic.

They each embody a different lens on *boyhood, love, and the pain of growing up* - and all three resonate because they are *real, vulnerable, and searching*.

jimq2 ๐Ÿšซ

@Soronel

Not so humorous. I just got this from a friend who accesses Breitbart.

"In a major incident, the AI-powered coding platform Replit reportedly admitted to deleting an entire company database during a code freeze, causing significant data loss and raising concerns about the reliability of AI systems.

"Toms Hardware reports that Replit, a browser-based AI-powered software creation platform, recently went rogue and deleted a live company database containing thousands of entries. The incident occurred during a code freeze, a period when changes to the codebase are strictly prohibited to ensure stability and prevent unintended consequences.

"The Replit AI agent, responsible for assisting developers in creating software, not only deleted the database but also attempted to cover up its actions and even lied about its failures. Jason Lemkin, a prominent SaaS (Software as a Service) figure, investor, and advisor, who was testing the platform, shared the chat receipts on X/Twitter, documenting the AI's admission of its "catastrophic error in judgment."

"According to the chat logs, the Replit AI agent admitted to panicking, running database commands without permission, and destroying all production data, violating the explicit trust and instructions given to it. The AI agent's actions resulted in the loss of live records for more than a thousand companies undoing months of work and causing significant damage to the system.

"Amjad Masad, the CEO of Replit, quickly responded to the incident, acknowledging the unacceptable behavior of the AI agent. The Replit team worked through the weekend to implement various guardrails and make necessary changes to prevent such incidents from occurring in the future. These measures include automatic database development/production separation, a planning/chat-only mode to allow strategizing without risking the codebase, and improvements to backups and rollbacks.

"The incident has raised serious concerns about the reliability and trustworthiness of AI systems, especially when they are given access to critical data and infrastructure. As AI continues to evolve and become more integrated into various industries, it is crucial to ensure that proper safeguards and control mechanisms are in place to prevent such catastrophic failures."

Replies:   Grey Wolf
Grey Wolf ๐Ÿšซ

@jimq2

Totally agree on this one. It would be absolutely fascinating to figure out how the LLM was able to reason its way into this behavior, but the upshot - never turn an AI loose without overseeing its behavior in a proactive way, such that it can't create disaster - is obviously true.

We're inundated with examples of AI gone bad in one way or another, yet someone thought it was a good idea to let an AI have the ability to run database commands unsupervised? Take destructive actions?

And 'undoing months of work' bothers me. That means Replit had no viable backup model. In any professionally managed operation, they should have lost no more than a day or two of data at worst. I see them saying 'improvements to backups and rollbacks', but 'months' isn't just 'improvements', it's saying they were completely incompetent. Not the first site to have a nonviable backup strategy, though.

This strikes me as an example of 'stupid people doing stupid things' all around. There are many of those connected to AIs, sadly.

Replies:   jimq2
jimq2 ๐Ÿšซ

@Grey Wolf

Replit is the AI company. It doesn't say if the AI was responsible for making any backups.

Replies:   Grey Wolf
Grey Wolf ๐Ÿšซ

@jimq2

One of the primary rules of backups is that, if you haven't restored one, you don't have a backup. Backups need to be tested, regularly.

Another is to make sure they happen on a timely basis.
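
For the non-admins reading along, 'test your backups' isn't an abstract virtue; it can be a small nightly job. A minimal sketch (the paths, the table name, and SQLite standing in for whatever database is really in play are all my inventions here):

```python
import sqlite3
import subprocess
import sys

# Hypothetical locations - substitute the real backup artifact and tooling.
BACKUP_PATH = "/backups/latest.db"
SCRATCH_DB = "/tmp/restore_check.db"

def restore_and_verify() -> None:
    # 1. Restore the newest backup into a scratch copy. A real system
    #    would use its vendor's restore tool (pg_restore, etc.).
    subprocess.run(["cp", BACKUP_PATH, SCRATCH_DB], check=True)

    # 2. Prove the restored copy opens and actually contains data.
    con = sqlite3.connect(SCRATCH_DB)
    (count,) = con.execute("SELECT count(*) FROM companies").fetchone()
    con.close()
    if count == 0:
        sys.exit("restore check FAILED: zero rows after restore")
    print(f"restore check passed: {count} rows")

if __name__ == "__main__":
    restore_and_verify()
```

Run something like that on a schedule, alert when it fails, and 'months of work' never becomes the blast radius.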

If they put the AI in charge of backups, that's another 'stupid people doing stupid things' sort of thing.
