This Is How Meta AI Staffers Deemed More Than 7 Million Books to Have No “Economic Value”

April 15, 2025

As more than a dozen lawsuits churn ahead, newly unsealed case files reveal the company’s stance: The pirated books Meta used to train its AI, including ones by Beverly Cleary, Jacqueline Woodson, and Andrew Sean Greer, are individually worthless.

Mark Zuckerberg during the Acquired LIVE event at the Chase Center in San Francisco, California, US, on Tuesday, Sept. 10, 2024. By David Paul Morris/Bloomberg/Getty Images.

Last month, I prompted Meta AI with a question: “Is it okay to use pirated books to train AI?” The machine’s response was unequivocal. “Using pirated books to train AI is not okay,” it admonished. “In fact, it’s a serious violation of copyright laws and can have significant consequences for authors and publishers. Recently, it’s been reported that Meta used millions of pirated books to develop its AI programs, which has sparked widespread outrage and condemnation from the writing community.”

A few days after my original query, for reasons best known to itself, the algorithm lawyered up. In response to the same prompt, it hedged, “Using pirated books to train AI is a complex issue, and the answer isn’t straightforward. While some companies, like Meta, argue that their use of copyrighted material falls under fair use, others claim it infringes on copyright laws.” (In a 2024 statement, Meta described such varying responses as “hallucinations,” a challenge for all generative AI systems.)

Lawyers for Meta are indeed invoking that very “fair use” defense in a copyright suit that has been wending its way through the US District Court for the Northern District of California for nearly two years. Richard Kadrey et al. v. Meta Platforms—with its reams of confidential Meta communications, newly in the public record as exhibits for the plaintiffs—offers an unprecedented look at the internal maneuverings behind the company’s decision to train its model on a database containing more than 7 million pirated books.

Just last week, plaintiffs—including Pulitzer Prize winners Andrew Sean Greer and Junot Díaz, and comedian Sarah Silverman—filed a summary judgment motion that reads, “It is now undisputed that Meta torrented tens of millions of pirated books and other copyrighted works, including over 650 copies of Plaintiffs’ Books, for free and without consent from the rightsholders because it did not want to pay for them.” The plaintiffs, fronted by Richard Kadrey, the bestselling author of, among other books, the Sandman Slim series, claim that Meta’s “unlawful conduct,” used in service of training its large language model (LLM), infringed on their work. In its own motion, filed last month, Meta claims, as it has since its first motion to dismiss, filed in September 2023, that its Llama (Large Language Model Meta AI) project is “highly transformative” and therefore fair use. (Asked for comment, a Meta spokesperson provided a statement that said, in part, that fair use of copyrighted materials is vital to the development of the company’s open-source AI models. “We disagree with Plaintiffs’ assertions, and the full record tells a different story.” An amicus brief filed last week by the Association of American Publishers on behalf of the plaintiffs argues against this claim: “There is nothing transformative about the systematic copying and encoding of textual works, word by word, into an LLM. It does not involve criticism or commentary, provision of a search or indexing utility, software interoperability, or any other purpose recognized as transformative under fair use precedents.”)

The lawsuit is one of more than 16 copyright cases currently rippling through the US court system concerning generative AI tools and the multibillion-dollar entities that create them, ranging from musicians suing Anthropic for using lyrics to train its AI, to visual artists suing Stability AI, to The New York Times suing Microsoft, to Authors Guild v. OpenAI, which is being heard in the Southern District of New York and is expected to go to summary judgment this fall. (Condé Nast, Vanity Fair’s parent company, is a plaintiff in one class action lawsuit against the enterprise AI platform Cohere.) The cases raise existential questions about art and literature—their inherent worth and what it means to commodify them—and arrive at a time when generative AI tools are making technical strides.

Kadrey et al. has attracted particular attention. One of Meta’s most prominent lawyers, Mark Lemley, quit the case earlier this year—not because he doesn’t believe in its merit, but because of what he described in a LinkedIn post as the company and its CEO Mark Zuckerberg’s “descent into toxic masculinity and Neo-Nazi madness.” Then, last month, Meta tried to block the promotion of a tell-all memoir by a former employee, which did not further endear the corporation to the literary community. Perhaps most important, the et al. plaintiffs are a cadre of big names—besides Greer, Silverman, and Díaz, they include the satirist Matthew Klam and the National Book Award winners Ta-Nehisi Coates and Jacqueline Woodson. (Coates and Woodson are VF contributors.)

A court case, like a work of literature, relies on a good story told persuasively. An interesting wrinkle in this case is that part of the story Meta needs to tell is how little individual books and authors meant in the creation of Llama. (“Is it—do you pronounce it ‘Llama’?” the judge wondered early on.) Accordingly, a noteworthy argument from the defense was revealed in a court filing last week: “There is no allegation or evidence that the copies Meta made were used for reading Plaintiffs books by Meta employees or anyone else.”

Commodifying books is intrinsic to commercial publishing, but there’s something particularly stunning about seeing how thoroughly Meta researchers reduced literature to a pure asset, devoid of meaning. “Fiction is great” for training the language model, one researcher wrote, but noted that there was only about “700GB” of it in the LibGen database. The same researcher gives Hemingway a run for his money, describing the fiction database as “mostly novels, easy to parse, what we use.” In an internal memo, researchers point out the problems with the converted pirated data: page numbers found their way into the body text, the line breaks were incorrect, and there was missing “whitespace” between words. To illustrate, a Meta employee quoted these lines: “Now be nice to Willa Jean, said Mrs. .nQuimby, as…” and “Ramona, 33nnwould you like…,” which were unmistakably ripped from Beverly Cleary’s beloved 1981 children’s book, Ramona Quimby, Age 8, a book, notably, that remains under copyright. “Goals: get as much long form writing as possible in the next 4–6 weeks,” reads one directive. Articles, movie scripts, magazines, and “books—all genres.”

Kadrey et al. argues that Meta “torrented at least 81.7 terabytes of data across multiple shadow libraries through the site Anna’s Archive, including at least 35.7 terabytes of data from Z-Library and LibGen”—illegal databases of pirated books, the latter of which was permanently enjoined by a federal court in September 2024 for copyright infringement, and which has allegedly also been used by OpenAI and others. (In response to complaints filed by prominent authors, OpenAI has said its “models are trained on publicly available data, grounded in fair use.”) Last month, Alex Reisner at The Atlantic, who has reported extensively on the use of pirated libraries in AI training, published a tool to search the titles in LibGen. (My novel, The Mythmakers, appears in the database, along with books by other VF staff. Not all books included in the database were necessarily used to train Llama; Meta has said that its training tool utilized “a fraction of LibGen,” and Reisner notes that the search tool uses a snapshot taken in January 2025, more than a year after Meta accessed its content.)

For authors like Carmen Maria Machado, who isn’t a named plaintiff in the cases but whose books—including In the Dream House and Her Body and Other Parties in their original English and in translation—appear to be among those pirated by LibGen, the titles in the database represent countless hours of work. “A decade of my life. That’s my creative labor. That’s my mind,” she tells me. “I just felt—I mean, violated is a really strong word, but it’s like, I sign a lot of contracts. I am in tremendous control of the rights that I have over my books, and my work, and my translations, and my film rights. All of that stuff is so carefully managed, but the idea that some company can just, with zero consequences, feed it into a machine is so insane to me that I can’t fully wrap my head around it.”

Speaking to VF, Lemley, the former Meta lawyer, says that the pirated books are “one of those things that sounds bad but actually shouldn’t matter at all in the law. Fair use is always about uses the plaintiff doesn’t approve of; that’s why there is a lawsuit.” He, like the current Meta legal team, cites Google Books, which scanned millions of books without permission—“and all search engines crawl the full internet, including plenty of pirated content,” he argues. “We want to reduce the chance that AI generates infringing output. But regulating what AI trains on is likely to have unexpected consequences.” To his mind, “copyright law should focus on the output rather than how the AI is trained.” That is, if AI trains on Harry Potter books and then spits out a Harry Potter book, that’s a copyright problem. If it spits out its own sequel, “that, too, might be a copyright problem.” But, he says, “The vast majority of what people are using AI for is not, Give me a Harry Potter book. It’s, Give me something new.”

Meta did conduct preliminary discussions with publishers about potential licensing fees, but the numbers it received were, according to court documents, “wildly off” in the company’s estimation. In the transcript of a taped deposition that has been made public, the defense describes potential negotiations over licensing material as “a bit of a song and dance” that “takes a bunch of their time; it takes our time,” and says that because of book publishing rights structures, “absent fair use, Meta would have to initiate individualized negotiations with millions of authors,” which would entail “identifying individual books and their authors; determining how to contact them; ascertaining whether they own rights clear of encumbrances,” etc. (While Meta describes its AI platform as “open source,” the company requires that developers who use Llama enter into a community license agreement, the terms of which range from the requirement that users “prominently” display the phrase “Built with Llama” on accompanying websites to including “Llama” at the beginning of a new AI model name.) The company claims “this process would be onerous for even a few authors; it is practically impossible for hundreds of thousands or millions.”

But the amount of data required to build the models was enormous and, according to in-house company correspondence, couldn’t be amassed without using books, which kicked off internal debate and discussion that lasted years.

In October 2022, a senior researcher, Melanie Kambadur, wrote in a message to teammates, “I don’t think we should use pirated material. I really need to draw a line there.” An internal slide deck outlines the policy risks of using LibGen, including US legislators’ concern “about AI developers using pirated websites for training,” and the possibility that “if there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues.” The same deck noted, “In no case would we disclose publicly that we had trained on libgen, however there is practical risk external parties could deduce our use of this dataset.”

Researchers, meanwhile, seemed to be adopting a don’t-ask, don’t-tell policy. In a redacted exhibit from the plaintiffs, which contains internal messages sent in November 2022 among Meta generative AI researchers concerning the use of LibGen, Kambadur asks, “Did someone in legal confirm that? Or are we just trying to not ask too many qs?” To which Guillaume Lample responds, “I didn’t ask questions 😀 but this is what OpenAI does with GPT3, what Google does with PALM, and what Deepmind does with Chinchilla so we will do it to[o].” (When reached for comment, a spokesperson from OpenAI said that the models powering ChatGPT and its current API were not developed using LibGen: “These datasets, created by former employees who are no longer with OpenAI, were last used in 2021.” A representative for Google, which also owns DeepMind, did not respond to a request for comment.)

“Not sure we can use meta’s IPs to load through torrents pirate content,” an engineer wrote in a 2023 message. “Torrenting from a corporate laptop doesn’t feel right [laugh/cry emoji].” The same person later shared a webpage with his colleagues: “What is the probability of getting arrested for using torrents in the USA?” Other communications show that researchers worked to remove the copyright pages from the books they had downloaded. (Meta’s lawyers argue that this was merely to make the data more friendly to the training model by eliminating boilerplate text.) In a 2024 email chain with the subject line “FW: [A/C Priv] LibGen Approval for OneLLM,” a Meta employee states that she wants to “flag an issue that’s going to be significantly gnarly.”

A motion filed by the plaintiffs in February describes a case of collective amnesia: In a deposition, Zuckerberg, Meta’s CEO, “claimed to have no knowledge of LibGen or any involvement in its use,” though internal documents describe needing “zuck/cox/ahmad” approval to move forward with using Books3 data for training, and the decision to use LibGen as occurring “[a]fter a prior escalation to MZ.” Another witness who claimed not to know the specifics or legal concerns regarding LibGen had received a memo describing the dataset as one “we know to be pirated.”

Meta’s attorneys argue that under precedent, “it does not matter whether Meta downloaded datasets containing ‘pirated’ books from a third-party who lacked authorization to distribute them, or borrowed used books from the library and scanned them by hand to achieve the same result.”

But their defense also hinges on the argument that the individual books themselves are, essentially, worthless—one expert witness for Meta states that the influence of a single book in LLM pretraining “adjusted its performance by less than 0.06% on industry standard benchmarks, a meaningless change no different from noise.” Furthermore, Meta says, while the company “has invested hundreds of millions of dollars in LLM development,” it sees no market in paying authors to license their books because “for there to be a market, there must be something of value to exchange, but none of Plaintiffs works has economic value, individually, as training data.” (An argument essential to fair use, but one that also sounds like a scaled-up version of a scenario in which the New York Philharmonic board argues against paying individual members of the orchestra because the organization spent a lot of money on the upkeep of David Geffen Hall, and also, a solo bassoon cannot play every part in “The Rite of Spring.”)

“Would it kill these companies to shell out the measly price of 33 books?” Margaret Atwood wrote in a 2023 piece for The Atlantic. “They intend to make a lot of money off the entities they have reared and fattened on my words, so they could at least buy me a coffee.” (For the answer to this question, she need look no further than an email by a director of engineering at Meta, Sergey Edunov, in which he explains that “if we license one single book, we won’t be able to lean into fair use strategy.”)

Lemley addresses the idea of compulsory licensing for works that AI has trained on. “I’ll take a specific example. Stability AI trained on 2 billion images. The company itself is probably worth, I don’t know, maybe now a billion dollars, but say it was 2 billion. Even if you wanted to say half of the value of the company should be attributed to [the work it used for training], we should give the money to the people you trained on. Everybody gets 50 cents. I think what authors have in mind when they see this is not, I’ll get paid 50 cents.”

Meta argues that the end justifies the means. That “Oracle, ScaleAI, and Lockheed Martin are all using Llama to develop national security programs and to supplement existing data analysis and code generation functions,” that Yale’s medical school is building an open-source LLM designed to improve clinical decision-making, that the nonprofit Jacaranda Health is using it “to provide personalized health support in Swahili to Kenyan mothers.”

“Nevertheless,” the opposition statement coyly concedes, “Meta acknowledges that it is a commercial enterprise, that Llama is used for both commercial and non-commercial purposes, and that Meta hopes one day to recoup its significant investment in this important new technology.”

In the years since AI companies began unveiling their generative AI tools, a consensus among authors has solidified. In a survey that the Authors Guild sent to its members in November 2023, 96% of the 2,431 respondents said that writers’ consent should be required to train AI, and that writers should be paid for their work. “The obvious threat,” says Mary Rasenberger, CEO of the Authors Guild, is that “this unlicensed use is being used to create these machines that authors rightfully fear could replace them, or at least will replace some of their work.” AI-generated genre books, she points out, have already flooded Amazon. The other concern is that “authors can’t enforce things like, you can’t allow outputs that include my work.” Some AI companies say that they’re putting filters on their tools that prevent the large language model from duplicating work verbatim—it’s one of the arguments at the heart of Meta’s fair use case. “But you can copy style, you can get excerpts from authors’ works, you can do sequels or mashups,” argues Rasenberger.

On X last month, Sam Altman announced one of OpenAI’s recent ventures: His team had trained a new model that was “good at creative writing” and “got the vibe of metafiction so right.” In response to Altman’s prompt, “Please write a metafictional literary short story about AI and grief,” the algorithm output 1,100 words of purple prose (“she lost him on a Thursday—that liminal day that tastes of almost-Friday”) narrated by an LLM. In one line, the narrator describes AI as “a democracy of ghosts”—the story’s best phrase. An avid reader might recognize it from Nabokov’s grief-ridden and oh-so-human 1957 novel, Pnin.

Reading through the emails by Meta employees that reduce literature to mineable assets and seeing title after favorite title rise from LibGen’s depths brings to mind a different Nabokov line altogether, this one from the perspective of Lolita’s Humbert Humbert recalling a drive with Dolores Haze. “It was quite special, that feeling: an oppressive, hideous constraint as if I were sitting with the small ghost of somebody I had just killed.”