Issues With Using Books for AI Training


Summary

Using books for AI training raises significant challenges, particularly around copyright concerns and ethical data usage. While some courts have ruled that training AI on copyrighted books can fall under 'fair use,' this remains a contentious issue with potential legal, ethical, and market implications.

  • Address copyright risks: Always ensure any datasets used for AI training are legally obtained, avoiding materials from pirated or unlicensed sources to minimize liability.
  • Promote transparency: Implement clear mechanisms to verify the source of training data and adhere to copyright compliance, as this builds trust and reduces legal uncertainty.
  • Respect creator rights: Support opt-out tools, advocate for updated copyright laws, and use explicit licensing agreements to ensure fair treatment of original authors and content creators.
Summarized by AI based on LinkedIn member posts
  • Andrew Ng

    Founder of DeepLearning.AI; Managing General Partner of AI Fund; Exec Chairman of LandingAI

    2,311,151 followers

    On Monday, a United States District Court ruled that training LLMs on copyrighted books constitutes fair use. A number of authors had filed suit against Anthropic for training its models on their books without permission. Just as we allow people to read books and learn from them to become better writers, but not to regurgitate copyrighted text verbatim, the judge concluded that it is fair use for AI models to do so as well. Indeed, Judge Alsup wrote that the authors’ lawsuit is “no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works.” While it remains to be seen whether the decision will be appealed, this ruling is reasonable and will be good for AI progress. (Usual caveat: I am not a lawyer and am not giving legal advice.)

    AI has massive momentum, but a few things could put progress at risk:
    - Regulatory capture that stifles innovation, including especially open source
    - Loss of access to cutting-edge semiconductor chips (the most likely cause would be war breaking out in Taiwan)
    - Regulations that severely impede access to data for training AI systems

    Access to high-quality data is important. Even though the mass media tends to talk about the importance of building large data centers and scaling up models, when I speak with friends at companies that train foundation models, many describe a very large amount of their daily challenges as data preparation. Specifically, a significant fraction of their day-to-day work follows the usual Data-Centric AI practices of identifying high-quality data (books are one important source), cleaning data (the ruling describes Anthropic taking steps like removing book pages' headers, footers, and page numbers; a sketch of such a cleanup step follows this post), carrying out error analyses to figure out what types of data to acquire more of, and inventing new ways to generate synthetic data. I am glad that a major risk to data access just decreased.

    Appropriately, the ruling further said that Anthropic’s conversion of books from paper format to digital — a step that’s needed to enable training — also was fair use. However, in a loss for Anthropic, the judge indicated that, while training on data that was acquired legitimately is fine, using pirated materials (such as texts downloaded from pirate websites) is not fair use. Thus, Anthropic still may be liable on this point. Other LLM providers, too, will now likely have to revisit their practices if they use datasets that may contain pirated works.

    Overall, the ruling is positive for AI progress. Perhaps the biggest benefit is that it reduces ambiguity with respect to AI training and copyright and (if it stands up to appeals) makes the roadmap for compliance clearer.... [Truncated due to length limit. Full text: https://lnkd.in/gAmhYj3k ]
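    A quick aside on the cleaning step Ng mentions: the ruling only says Anthropic removed headers, footers, and page numbers, so the sketch below is a minimal illustration of what such a pass might look like, not Anthropic's actual pipeline. The edge-line heuristic and both regexes are assumptions made for the example.

    ```python
    import re

    # Illustrative cleanup of one page of digitized book text: drop lines at
    # the page edges that look like running headers or bare page numbers.
    # These heuristics are assumptions, not a description of a real pipeline.
    PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")  # a line that is only a page number
    RUNNING_HEAD = re.compile(r"^\s*(CHAPTER\s+\w+|[A-Z][A-Z .,'-]{3,})\s*$")  # all-caps running header

    def clean_page(page_text: str) -> str:
        """Drop likely header/footer/page-number lines from one page."""
        lines = page_text.splitlines()
        kept = []
        for i, line in enumerate(lines):
            at_edge = i == 0 or i == len(lines) - 1  # headers/footers sit at page edges
            if at_edge and (PAGE_NUMBER.match(line) or RUNNING_HEAD.match(line)):
                continue
            kept.append(line)
        return "\n".join(kept).strip()

    page = "A GAME OF CHANCE\nIt was a dark and stormy night.\nThe rain fell without pause.\n214"
    print(clean_page(page))  # body text only; header and page number removed
    ```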

  • Leonard Rodman, M.Sc. PMP® LSSBB® CSM® CSPO®

    Follow me and learn about AI for free! | AI Consultant and Influencer | API Automation Developer/Engineer | DM me for promotions

    53,167 followers

    Can Authors Keep Their Work from Being Used to Train AI Without Permission? ✍️📚🤖

    If you're a writer, there's a good chance your work has already been absorbed into an AI model—without your knowledge or consent. Books, blogs, fanfiction, forums, articles… All of it has been scraped, indexed, and used to teach machines how to mimic human language. So what can authors actually do to protect their work? Here’s what’s possible (and what isn’t—yet):

    🛑 Use “noAI” Clauses in Your Copyright/Terms
    Clearly state that your work may not be used for AI training. It won’t stop everyone, but it helps establish legal boundaries—and could matter in future lawsuits.

    🔍 Avoid Platforms That Allow AI Scraping
    Before publishing, check the terms of service. Some platforms explicitly allow your content to be used for training; others are more protective.

    🖋️ Push for Legal Reform
    The law hasn’t caught up to generative AI. Supporting copyright advocacy groups and legislation can help tip the scales back toward creators.

    🤝 Join Opt-Out Registries
    Tools like haveibeentrained.com let creators see if their work was used—and request removal from certain datasets. It's not a perfect fix, but it's a start. (For site owners, a robots.txt sketch follows this post.)

    📣 Speak Out
    When authors make noise, platforms listen. Just ask the comic book artists, novelists, and journalists who’ve already triggered investigations and lawsuits.

    Right now, the balance of power favors the AI companies. But that doesn’t mean authors are powerless. We need visibility. Transparency. Fair compensation. And most of all—respect for the written word. Have you found your writing in an AI training dataset? What did you do?

    #AuthorsRights #EthicalAI #AIandWriters #GenerativeAI #Copyright #ResponsibleAI #WritingCommunity #AITrainingData #FairUseOrAbuse
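    One concrete lever worth adding to Rodman's list for creators who run their own sites: a robots.txt that names the published AI-crawler user agents. A minimal sketch; the four tokens below are publicly documented by their vendors, but honoring robots.txt is voluntary and any hard-coded list goes stale, so treat this as a signal of intent, not protection.

    ```python
    # Generate a robots.txt asking known AI training crawlers to stay away.
    # Compliance is voluntary; the token list below will need maintenance.
    AI_CRAWLERS = [
        "GPTBot",           # OpenAI's training crawler
        "CCBot",            # Common Crawl
        "Google-Extended",  # Google's AI-training opt-out token
        "anthropic-ai",     # Anthropic
    ]

    rules = "\n\n".join(f"User-agent: {bot}\nDisallow: /" for bot in AI_CRAWLERS)

    with open("robots.txt", "w") as f:
        f.write(rules + "\n")

    print(rules)
    ```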

  • Pradeep Sanyal

    Enterprise AI Leader | Former CIO & CTO | Chief AI Officer (Advisory) | Data & AI Strategy → Implementation | 0→1 Product Launch

    19,123 followers

    The era of “train now, ask forgiveness later” is over.

    The U.S. Copyright Office just made it official: the use of copyrighted content in AI training is no longer legally ambiguous - it’s becoming a matter of policy, provenance, and compliance. This report won’t end the lawsuits. But it reframes the battlefield.

    What it means for LLM developers:
    • The fair use defense is narrowing: “Courts are likely to find against fair use where licensing markets exist.”
    • The human analogy is rejected: “The Office does not view ingestion of massive datasets by a machine as equivalent to human learning.”
    • Memorization matters: “If models reproduce expressive elements of copyrighted works, this may exceed fair use.”
    • Licensing isn’t optional: “Voluntary licensing is likely to play a critical role in the development of AI training practices.”

    What it means for enterprises:
    • Risk now lives in the stack: “Users may be liable if they deploy a model trained on infringing content, even if they didn’t train it.”
    • Trust will be technical: “Provenance and transparency mechanisms may help reduce legal uncertainty.” (A sketch of such a provenance record follows this post.)
    • Safe adoption depends on traceability: “The ability to verify the source of training materials may be essential for downstream use.”

    Here’s the bigger shift:
    → Yesterday: Bigger models, faster answers
    → Today: Trusted models, traceable provenance
    → Tomorrow: Compliant models, legally survivable outputs

    We are entering the age of AI due diligence. In the future, compliance won’t slow you down. It will be what allows you to stay in the race.
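    To make Sanyal's "trust will be technical" point concrete, here is a minimal sketch of a per-document provenance record of the kind his quoted passages gesture at. The field names, license allow-list, and file paths are illustrative assumptions, not a standard schema.

    ```python
    import datetime
    import hashlib
    import json
    from pathlib import Path

    # Sketch: one auditable provenance record per training document.
    # The schema and license policy below are illustrative assumptions.
    ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "publisher-deal-2025"}  # hypothetical policy

    def manifest_entry(path: Path, source_url: str, license_id: str) -> dict:
        if license_id not in ALLOWED_LICENSES:
            raise ValueError(f"{path}: license {license_id!r} not cleared for training")
        return {
            "file": str(path),
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),  # content fingerprint
            "source_url": source_url,
            "license": license_id,
            "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }

    # Toy usage: a locally created file stands in for a licensed book.
    doc = Path("book_0001.txt")
    doc.write_text("Chapter 1. It was a bright cold day in April...")
    print(json.dumps(manifest_entry(doc, "https://publisher.example/0001",
                                    "publisher-deal-2025"), indent=2))
    ```

    A manifest like this is what makes the "traceability" requirement checkable downstream: a deployer can recompute the hashes and confirm every document's license before accepting a model.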

  • Jillian B.

    AI + tech risk management for the legal industry | Co-Founder + Chief Risk Officer

    2,372 followers

    This is big. Although most headlines have focused on Meta “winning” this case, this order is a huge strike against the fair use argument on which AI companies rely. Here are some of the most notable points that Judge Chhabria made in his order:

    1️⃣ "There is certainly no rule that when your use of a protected work is "transformative," this automatically inoculates you from a claim of copyright infringement."

    2️⃣ "[I]n many circumstances it will be illegal to copy copyright-protected works to train generative AI models without permission. Which means that the companies, to avoid liability for copyright infringement, will generally need to pay copyright holders for the right to use their materials."

    3️⃣ "This case [...] involves a technology that can generate literally millions of secondary works, with a miniscule fraction of the time and creativity used to create the original works it was trained on. No other use—whether it’s the creation of a single secondary work or the creation of other digital tools—has anything near the potential to flood the market with competing works the way that LLM training does. And so the concept of market dilution becomes highly relevant."

    4️⃣ "Meta makes the mistake the Supreme Court instructs parties and courts to avoid: robotically applying concepts from previous cases without stepping back to consider context. Fair use is meant to be a flexible doctrine that takes account of “significant changes in technology.” Oracle, 593 U.S. at 19 (quoting Sony, 464 U.S. at 430). Courts can’t stick their heads in the sand to an obvious way that a new technology might severely harm the incentive to create, just because the issue has not come up before. Indeed, it seems likely that market dilution will often cause plaintiffs to decisively win the fourth factor—and thus win the fair use question overall—in cases like this."

    5️⃣ "Relatedly, Meta argues that the “public interest” would be “badly disserved” by preventing Meta (and other AI developers) from using copyrighted text as training data without paying to do so. Meta seems to imply that such a ruling would stop the development of LLMs and other generative AI technologies in its tracks. This is nonsense."

    Importantly, Chhabria notes that “as should now be clear, this ruling does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful. It stands only for the proposition that these plaintiffs made the wrong arguments and failed to develop a record in support of the right one.”

    One small step for Meta, one giant leap (backwards) for generative AI companies.

  • Paul Roetzer

    Founder & CEO, SmarterX & Marketing AI Institute | Co-Host of The Artificial Intelligence Show Podcast

    41,264 followers

    Investigative journalism remains the domain of humans, and this Atlantic article is a great example of how to do it. With so much talk around the data used to train large language models (LLMs), Alex Reisner explored the mystery of Books3 and its impact on today’s generative AI technology.

    “I recently obtained and analyzed a dataset used by Meta to train LLaMA. Its contents more than justify a fundamental aspect of the authors’ allegations: Pirated books are being used as inputs for computer programs that are changing how we read, learn, and communicate. The future promised by AI is written with stolen words.”

    “Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. . . . These books are part of a dataset called “Books3,” and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet.”

    Reisner interviewed the independent developer of Books3, Shawn Presser, who said he created the dataset to give independent developers “OpenAI-grade training data.” Presser claims he’s sympathetic to authors’ concerns, but he perceives a monopoly on generative AI by the biggest tech companies, “giving them total control of a technology that’s reshaping our culture: He created Books3 in the hope that it would allow any developer to create generative-AI tools.”

    The arguments of fair use are addressed from both sides, and Rebecca Tushnet, a law professor at Harvard, states that the law is “unsettled” when it comes to fair-use cases involving unauthorized material, with previous cases giving little indication of how a judge might rule in the future.

    This story is just beginning. Copyright law is sure to be at the center of generative AI conversations for years to come.

    #ai #copyright #technology https://lnkd.in/g7sWmpnm

  • From Practitioner to Author: 3 Things I Loved… and 3 That Left Me Speechless about Humanity!

    Writing “Your AI Survival Guide” was never about becoming an author. It was about turning 20+ years of executive scars, late-night whiteboard wars, and “we-need-it-by-yesterday” pivots into something others could learn from.

    What I LOVED 💗 in the process:
    1. Democratizing knowledge — I’ve led AI and data initiatives across 9 industries, 5 continents, and countless war rooms. Being able to share how we solved real problems? Invaluable.
    2. Codifying what actually worked — From frameworks to models to repeatable patterns, documenting it all helped me see what drove results and what caused failure. It was part therapy, part blueprint.
    3. Reflecting on the full arc — Not just the wins, but the detours, disasters, and hard-won lessons. Writing forced me to zoom out, connect dots, and find clarity I didn’t know I needed.

    But here’s where things took a turn… What shocked 😳 and disappointed me 💔:
    1. People stealing my frameworks (yes, the ones I copyrighted and published) and claiming them as their own. No license. No source. Just copy-paste and rebrand. 🤯
    2. Discovering someone converted my book into a PDF, plugged it into a custom GPT, and started selling AI services using my IP. Without a single line of credit or acknowledgment.
    3. I’m not a marketer or salesperson. But if you don’t promote your work, no one will. Having to “talk about the book” felt like pulling teeth—but I did it anyway. Because awareness matters and it was the only way to get it out there.

    🎯 So what did I ultimately learn:
    ✅ Protect your IP. Copyright it. Trademark it. Watermark it if you must. BUT EVEN WITH ALL THAT - there are no guarantees.
    ✅ Visibility ≠ vanity. It’s how you defend your work and ideas. I’m still learning this, to be honest, but getting over it (slowly).
    ✅ If you’re not comfortable promoting yourself, promote the value your work delivers.

    ***

    Did you know that 1 in 5 business authors now report IP misuse through AI tool integrations? That’s not just scary—it’s sad. I wonder if in a year or two it’s going to be worth publishing anymore 🤷🏻♀️.

    World's 1st Chief AI Officer for Enterprise, 10 patents, former Amazon & C-Suite Exec (5x), best-selling author, FORBES “AI Maverick & Visionary of the 21st Century”, Top ‘100 AI Thought Leaders’, helped IBM launch Watson in 2011. My job is not just to develop, but to create - I’m outfitting our new digital identity and ensuring security of our workforce and data in the age of AI! (And yes, that’s a band-aid on the cover. After all, I wrote about how to deploy AI and what mistakes to avoid.)

  • Nita Jain

    Founder & CEO, Timeless Biosciences | Microbiome therapeutics for GI oncology and beyond | NIH RECOVER Initiative

    15,140 followers

    If you're an author or researcher, odds are your work has been used to train LLMs... My work on the microbiome is in the database Meta used to train their AI models. And I'm far from alone.

    Meta leveraged the contents of millions of books, research papers, and academic articles to train their Llama models. No permission. No compensation. No acknowledgment. As the recent release of the LibGen database reveals, Meta chose expediency over ethics, piracy over permission. Because going through proper channels was deemed "too slow."

    "Move fast and break things." Unsurprisingly, Mark Zuckerberg was the first to popularize Silicon Valley's oft-repeated mantra. But now is the time for slow, thoughtful, methodical sustainability... Our words, research, and ideas aren't just raw materials to be mined like some sort of intellectual coal deposit. They're the product of human thought, reflection, and creativity. They represent years of study, experimentation, and iteration.

    Why are intellectual property violations such a big deal, you ask?
    1. Attribution and context: Research exists within a framework of citations, methodologies, and evolving understanding. AI strips this away, presenting findings as disembodied facts.
    2. Academic integrity: Research constitutes specialized knowledge that belongs to its stakeholders, which includes investors, builders, and community beneficiaries. Forcibly taking this work for other purposes constitutes theft.
    3. Future sustainability: If tech giants can just take whatever they want without giving back, who will fund tomorrow's breakthroughs?

    Meta's actions reveal their contempt for creators. They're not "democratizing knowledge"—they're monopolizing it. Building models on stolen work isn't innovation—it's appropriation. This is about who owns the future. Multi-trillion dollar companies are asking, "Why buy the rights to something if you can just steal it?"

    What we're asking for isn't radical: transparency about what content is being used, mechanisms to opt out, and fair compensation models. These requirements should be the bare minimum. Have you found your work in The Atlantic's LibGen database?

  • Stephen Klein

    Founder & CEO, Curiouser.AI | Berkeley Instructor | Building Values-Based, Human-Centered AI | LinkedIn Top Voice in AI

    67,221 followers

    Who Stole the Internet?

    In just a few years, a handful of companies built trillion-dollar AI models by scraping massive amounts of content from the internet. The five companies behind today's largest AI models (OpenAI, Meta, Google DeepMind, Anthropic, and Amazon) systematically scraped content from across the internet.

    OpenAI trained GPT models using datasets such as Common Crawl (scraped websites), WebText2 (Reddit-linked content), Books1 and Books2 (containing pirated books), Wikipedia dumps, and proprietary news archives¹. Meta used a dataset known as "Books3," which researchers identified as containing tens of thousands of pirated copyrighted books, to train LLaMA-2². Google DeepMind’s Gemini models were trained on scraped YouTube subtitles, GitHub code, books, and articles from the open web³. Anthropic's Claude models heavily relied on Common Crawl and undisclosed curated datasets, many of which include copyrighted works⁴. Amazon internally trained AI models using web-scraped material from third-party websites without clearly securing usage rights⁵.

    The sheer volume of content scraped is staggering. Common Crawl alone contains more than 500 billion web pages⁶. And contrary to common assumptions, “publicly available” does not mean “public domain”, a distinction many creators have emphasized.

    How much would it have cost if companies had paid for the content they used? Licensing news articles typically costs between $300 and $1,000 per article. Book rights can range from $10,000 to $100,000 per book, depending on exclusivity and use. Licensing a single high-quality image can cost anywhere from $25 to $5,000. Licensing professional code repositories can range from $5,000 to $50,000 per project. The fair market value of the content appropriated for AI training is conservatively estimated between $45 billion and $165 billion⁷.

    94% of major AI training datasets contain copyrighted material⁸. 70% of early GPT model data was scraped from websites that explicitly prohibited such use in their robots.txt files⁹ (a sketch of that robots.txt check follows this post). History may eventually record this as the largest uncompensated labor appropriation in human history.

    Stephen Klein is Founder and CEO of Curiouser.AI, the only Generative AI designed to augment human intelligence, not replace it. He also teaches AI Ethics at UC Berkeley. To sign up: curiouser.ai, or contact hubble: https://lnkd.in/gphSPv_e

    Footnotes:
    ¹ OpenAI GPT-3 paper, 2020
    ² "Meta used pirated books to train LLaMA-2," Ars Technica, July 2023
    ³ Google DeepMind Gemini Technical Report, 2024
    ⁴ Washington Post, "Secret websites behind Common Crawl," 2023
    ⁵ Amazon internal AI project disclosures, Business Insider, 2024
    ⁶ Common Crawl Foundation, dataset statistics
    ⁷ Financial Times, "OpenAI and Google scramble for content licensing deals," 2024
    ⁸ Stanford University AI Index Report, 2024
    ⁹ New York Times lawsuit filings, 2023
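    Klein's ninth footnote turns on robots.txt directives that crawlers allegedly ignored. For reference, the polite-crawler check is a few lines of the Python standard library; the site and user-agent strings below are placeholders, not any company's actual crawler configuration.

    ```python
    from urllib.robotparser import RobotFileParser

    # Consult a site's robots.txt before fetching, the step footnote 9 says
    # was skipped. A missing robots.txt (404) is treated as allow-all.
    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the live robots.txt

    for agent in ("GPTBot", "CCBot", "MyResearchBot"):
        ok = rp.can_fetch(agent, "https://example.com/articles/some-story")
        print(f"{agent}: {'may fetch' if ok else 'disallowed'}")
    ```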

  • Franklin Graves

    AI + Data @ LinkedIn | Shaping the legal landscape of the creator economy through emerging technologies – AI, IP, data, & privacy 🚀

    10,299 followers

    🚨 The Authors Guild and fiction authors including George R.R. Martin, John Grisham, and David Baldacci have filed a complaint against OpenAI alleging #copyright infringement 🚨

    What are the actions the Authors Guild alleges give rise to the claims?
    - Copying of the works
    - Training their LLMs on the works that were copied
    - The outputs are derivative works

    It's interesting in this one because the Authors Guild comes out swinging against any #FairUse argument, even noting the LLMs could have been trained on public domain works. But the commercial value was in the plaintiff authors' works. The Authors Guild uses the publicly published information about the datasets used by OpenAI to build their LLMs, including the books2 and books3 datasets, and the source materials traced back to Z-Library. They also use Sam Altman's testimony in Congress against him in this complaint, noting "Altman and Defendants have proved unwilling to turn these words into actions."

    ✍️ Here are the authors currently named: David Baldacci, Mary Bly, Michael Connelly, Sylvia Day, Jonathan Franzen, John Grisham, Elin Hilderbrand, Christina Baker Kline, Maya Shanbhag Lang, Victor LaValle, George R.R. Martin, Jodi Picoult, Douglas Preston, Roxana Robinson, George Saunders, Scott Turow, and Rachel Vail.

    Here are the claims brought:
    🫣 Direct copyright infringement - copying of the works as part of the datasets
    🫣 Vicarious copyright infringement - because the various OpenAI entities had control of, and a financial interest in, the other OpenAI entities and the alleged infringing activities
    🫣 Contributory copyright infringement - same rationale as vicarious, but with the contributory elements of the claim met

    #GenerativeAI #ArtificialIntelligence

  • Jerry Levine

    Award Winning Chief Evangelist & General Counsel @ ContractPodAi | Bringing Legal Tech to the World | AI, Privacy, Food, Contracts, and the Future of Law

    4,947 followers

    BIG AI (Litigation) News: Major decision out of the Northern District of California. AI training on copyrighted material is fair use under the Copyright Act, but obtaining the works by piracy (or from a pirate website) is a separate violation.

    1️⃣ Pirated copies used by Anthropic violated copyright, but to the extent that it had purchased copies, that use was fair.
    2️⃣ Storing the pirated copies indefinitely was also infringement.
    3️⃣ Buying the books after pirating them doesn't make everyone whole, but it might affect the statutory damages.
    4️⃣ Fair use includes the transformative use of the works to train AI models.

    The end result is that this is probably more beneficial to the authors who brought the claim, based on the statutory damages + class action potential (a back-of-envelope sketch of the per-work arithmetic follows this post), but it also sets a path forward for fair use and transformative use for training. (Link to the opinion in comments!)

    Mark Lemley James Gatto
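    Levine's statutory-damages point is easy to make concrete. Under 17 U.S.C. § 504(c), a plaintiff may elect statutory damages of $750 to $30,000 per infringed work, rising to $150,000 per work for willful infringement, so exposure scales with the size of the pirated corpus rather than with provable losses. A back-of-envelope sketch; the work count is a placeholder, not a figure from the case.

    ```python
    # Statutory-damages exposure under 17 U.S.C. § 504(c), which is assessed
    # per infringed work. The work count is hypothetical, not a case figure.
    works = 100_000  # placeholder count of pirated books in a training corpus

    per_work = {
        "statutory minimum ($750/work)": 750,
        "statutory maximum ($30,000/work)": 30_000,
        "willful maximum ($150,000/work)": 150_000,
    }

    for label, amount in per_work.items():
        print(f"{label}: ${works * amount:,}")
    # -> $75,000,000 at the minimum, $15,000,000,000 at the willful maximum:
    #    the per-work arithmetic is why class certification matters so much.
    ```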
