Authors are escalating efforts to block artificial intelligence companies from using their copyrighted works to train artificial intelligence systems, this time taking aim at Meta and OpenAI in proposed class-action lawsuits.
Michael Chabon and other decorated writers of books and screenplays sued Meta on Tuesday in California federal court, accusing the company of copyright infringement for harvesting mass quantities of books from across the web, which were then used to produce allegedly infringing works. OpenAI was sued on Sept. 8 in an identical class action. The suits allege the firms “benefit commercially and profit handsomely from their unauthorized and illegal” collection of the authors’ books. The authors seek a court order that would require the companies to destroy AI systems that were trained on copyright-protected works.
The lawsuit is the latest volley from creators in a barrage of court challenges over the legality of the way large language models are trained. OpenAI is facing a proposed class action from author Paul Tremblay, in addition to a suit filed by Sarah Silverman, which also names Meta. Artists have similarly sued AI art generators Stability AI, Midjourney and DeviantArt for copyright infringement.
As evidence that AI systems were fed authors’ books, the suit points to ChatGPT generating summaries and in-depth analyses of the themes in the novels when prompted. It says that’s “only possible if the underlying GPT model was trained using” their works.
“If ChatGPT is prompted to generate a writing in the style of a certain author, GPT would generate content based on patterns and connections it learned from analysis of that author’s work within its training dataset,” states the complaint, which largely borrows from the suit filed by Tremblay.
And because the large language models can’t operate without the information extracted from the copyright-protected material, the answers that ChatGPT produces are “themselves infringing derivative works,” the lawsuit against Meta says.
The authors allege that OpenAI and Meta built the datasets they use to train their AI systems by “scraping the internet for text data.” In June 2018, OpenAI revealed that it fed GPT-1 — the first iteration of its large language model — a collection of over 7,000 novels on BookCorpus, according to the complaint.
“BookCorpus is a controversial dataset, assembled in 2015 by a team of AI researchers funded by Google and Samsung for the sole purpose of training language models like GPT by copying written works from a website called Smashwords, which hosts self-published novels, making them available to readers at no cost,” the lawsuit states. “Despite those novels being largely under copyright, they were copied into the BookCorpus dataset without consent, credit, or compensation to the authors.”
The complaint says that later versions of OpenAI’s large language models were also trained on illicitly obtained books. The company disclosed in a 2020 paper introducing GPT-3 that the training dataset came from “two internet-based book corpora,” which it referred to as “Books1” and “Books2.” While OpenAI never disclosed the books in the dataset, the authors say that “Books1” is based on the Project Gutenberg archive, an online collection of books whose copyrights have expired, which has gained popularity among AI companies. They allege “Books2” is derived from shadow library sites, including Library Genesis, Z-Library and Bibliotik, because “those are the sources of trainable books most similar in nature and size to OpenAI’s description” of the dataset.
OpenAI no longer discloses information about the sources of its dataset, “[g]iven both the competitive landscape and the safety implications of large-scale models like GPT-4,” the company said last year.
Meta similarly doesn’t disclose the origin of the books in its dataset used to train LLaMA, according to the complaint. While it said that the works came from the “Books3 section of The Pile,” a publicly available dataset for large language models, it doesn’t further describe the contents.
“But that information is available elsewhere,” reads the complaint, which alleges Books3 is composed of books obtained from Bibliotik. “The person who assembled the ‘Books3’ dataset has confirmed in public statements that it represents ‘all of Bibliotik’ and contains 196,640 books.”
The class actions, which seek to represent a nationwide class of U.S. authors whose work was used to train AI systems, were brought by Chabon (known for The Mysteries of Pittsburgh, Wonder Boys and The Amazing Adventures of Kavalier & Clay), David Henry Hwang and Matthew Klam, among other writers of books and screenplays. They allege direct copyright infringement, vicarious copyright infringement, violations of the Digital Millennium Copyright Act, unjust enrichment and negligence.
The courts will have to wrestle with two cases that legal experts consider likely to dictate the outcome of the litigation. On one hand, there’s precedent greenlighting the copying of works to generate noninfringing text responses: In 2005, the Authors Guild sued Google for digitizing millions of books to create a search function. A federal judge in that case rejected the copyright infringement claims, finding that the company’s use of copyrighted works amounted to fair use. Central to the ruling was that Google only allowed users to view snippets of text without providing the full book.
On the other hand, the authors can point to the Supreme Court’s recent decision rejecting a fair use defense in Andy Warhol Foundation for the Visual Arts v. Goldsmith. The justices stressed that potentially overlapping commercial exploitation is a key consideration in the analysis, finding that fair use is likely to be rejected when an original work and a derivative share the “same or highly similar purpose” and the secondary use is commercial.
“Between the two Supreme Court cases, it looks like the courts are going to focus on the nature of the use,” says Ed Klaris, an intellectual property lawyer and professor at Columbia Law School.
Notably, users can direct ChatGPT to generate screenplays in the style of a specific book or author. “When prompted to produce a screenplay in the style of The Dance and The Railroad, ChatGPT produced a script written in Plaintiff Hwang’s style, which generated a screenplay involving a Chinese laborer toiling on the Central Pacific Railroad that ‘believe[s] in the power of art to keep [their] spirits alive,’” the complaint says.
Depending on whether the Copyright Office recognizes the copyrightability of works generated by AI, with companies listing themselves as the owners under the work-for-hire doctrine, studios could turn to optioning a book and having AI write the screenplay. This would likely undermine the market prospects of authors. Stephen Chbosky, author of The Perks of Being a Wallflower, Emma Donoghue, author of Room, and Gillian Flynn, author of Gone Girl, all adapted their own novels into screenplays.
Klaris predicts that the courts will “rule in favor of creators” if they get to analyzing fair use. He points to arguments from authors and artists that AI firms are actively hurting their economic interests by creating competing works on the backs of their material. This will force AI companies’ hands in creating a licensing framework, he says.
OpenAI didn’t respond to a request for comment. Meta declined to comment.