Zuckerberg Appeared to Know Meta Trained AI on Pirated Library

January 15, 2025

The AI rush has brought with it thorny questions of copyright and ownership of data as tech companies train bots like ChatGPT on existing texts, but it seems Meta largely brushed these aside as it worked to integrate such tools into Facebook and Instagram.

As first revealed in a motion filed by attorneys for novelists Christopher Golden and Richard Kadrey and comedian Sarah Silverman, who are pursuing a class-action suit against Meta for allegedly using their copyrighted work without permission, employees at the tech giant had candid conversations about the potential for scandal that would arise from leveraging a risky resource: Library Genesis, or LibGen, a massive so-called “shadow library” of free downloadable ebooks and PDFs that includes otherwise paywalled research and academic articles. In these exchanges, Meta’s engineers identified LibGen as “a dataset we know to be pirated,” but indicated that CEO Mark Zuckerberg had approved its use for training the next iteration of its large language model, Llama.

Now, under a court order from Judge Vince Chhabria of the U.S. District Court for the Northern District of California, the records of those previously confidential internal dialogues have been unsealed, and appear to confirm Zuckerberg’s decision to greenlight the transfer of pirated, copyrighted LibGen data to improve Llama — despite concerns about a backlash. In an email to Joelle Pineau, vice president of AI research at Meta, Sony Theakanath, director of product management, wrote, “After a prior escalation to MZ [Mark Zuckerberg], GenAI has been approved to use LibGen for Llama 3 […] with a number of agreed upon mitigations.” The note observed that including the LibGen material would help them reach certain performance benchmarks, and alluded to industry rumors that other AI companies, including OpenAI and Mistral AI, are “using the library for their models.” In the same email, Theakanath wrote that under no circumstances would Meta publicly disclose its use of LibGen.