NVIDIA’s Copyright Defense Crumbles in AI Training Lawsuit

By Craig Nash
Tech writer at All Things Geek. Covers artificial intelligence, semiconductors, and computing hardware.

The NVIDIA copyright infringement lawsuit represents a critical moment for how courts will treat large language model training practices. In late April 2026, a federal judge in the Northern District of California refused to dismiss the case, rejecting NVIDIA’s argument that its NeMo Megatron Framework has legitimate non-infringing uses and that the company bears no responsibility for piracy conducted by its users.

Key Takeaways

  • Federal judge rejected NVIDIA’s “ISP piracy defense” in copyright infringement case filed March 2024
  • Lawsuit alleges NeMo Framework scripts were designed specifically to speed up piracy rather than serve legitimate purposes
  • NeMo Megatron models were allegedly trained on approximately 197,000 pirated books from the Books3 dataset, which was derived from the Bibliotik archive
  • Judge ordered NVIDIA to produce discovery documents regarding shadow library datasets within one month
  • Amended complaint filed in January 2026 names five author-plaintiffs, including Abdi Nazemian and Brian Keene

How NVIDIA Allegedly Used Pirated Books for AI Training

The lawsuit, formally titled Nazemian v. NVIDIA Corporation, centers on NVIDIA’s alleged use of pirated material to train its NeMo Megatron large language models. According to the complaint, NVIDIA copied copyrighted works without consent, extracted protected expression from those works, and transformed that expression into numerical weights stored within the models. The training dataset came from “The Pile,” a collection curated by EleutherAI that includes a subcollection called Books3: approximately 108 gigabytes of fiction and nonfiction books derived directly from Bibliotik, a pirated e-book archive.

NVIDIA hosted model cards on Hugging Face identifying “The Pile” as the source dataset, effectively attracting customers to use the NeMo Megatron Framework by providing quick access to this data. Books3 remained available on Hugging Face until October 2023, when it was removed with a notice stating it was “defunct and no longer accessible due to reported copyright infringement.” The NeMo Megatron Framework itself launched in September 2022, with multiple model sizes including the 1.3B, 5B, 20B, and T5 3B variants.

Why NVIDIA’s Legal Defense Failed at the Dismissal Stage

NVIDIA’s primary defense strategy invoked what legal experts call the “ISP piracy defense”—the argument that a tool or platform cannot be held liable for infringing uses by third parties, and that the framework itself has substantial non-infringing uses. The judge rejected this argument entirely. According to the complaint, scripts within the NeMo Framework “have no other purpose” than to speed up infringement, meaning the tools were allegedly designed specifically to facilitate the copying and use of pirated material rather than to enable legitimate applications.

This distinction matters enormously. If a tool has legitimate uses independent of piracy, courts traditionally give it more legal leeway. But if evidence suggests the tool was designed primarily to facilitate copyright infringement, the defendant loses that protection. The judge’s refusal to dismiss the case signals that the court found the authors’ allegations plausible—that NVIDIA knowingly incorporated pirated material into its training pipeline and built tools specifically to exploit that data.

In August 2024, NVIDIA had argued that copyrighted books are merely “statistical correlations” and invoked fair use defenses. The judge’s refusal to dismiss suggests these arguments did not succeed at the motion-to-dismiss stage, though fair use remains an unresolved question that could still be litigated.

The Discovery Phase and Shadow Libraries

The case is now advancing toward substantive proceedings. In late April 2026, the judge ordered NVIDIA to produce discovery documents regarding “shadow library” datasets used to train NeMo models within one month. This order is significant because the complaint alleges that NVIDIA obtained copyrighted material from shadow libraries—clandestine digital archives—partly due to internal pressure to keep pace with AI rivals. The discovery process will likely reveal whether NVIDIA’s internal records show knowledge of the pirated nature of the training data and deliberate decisions to use it anyway.

The named plaintiffs in the lawsuit include five authors: Abdi Nazemian, Brian Keene, Stewart O’Nan, Susan Orlean, and Andre Dubus III, represented by the Joseph Saveri Law Firm. The class definition extends to anyone who owns a copyright in any work used as training data for NeMo Megatron-GPT large language models, potentially affecting thousands of authors.

How This Case Differs From Earlier AI Copyright Disputes

Earlier AI copyright litigation focused on whether model outputs were substantially similar to copyrighted works—a difficult legal standard to meet. The Nazemian case reframes the infringement question entirely. Instead of asking whether the model’s output copies protected expression, the lawsuit argues that infringement occurs at the point of copying inputs into the model. In other words, NVIDIA infringed copyright by downloading and retaining copyrighted works, regardless of whether the subsequent training process was transformative.

This framing sidesteps the “transformative use” defense that has protected some AI companies. If courts accept this reasoning, it could reshape how AI companies approach training data acquisition. The lawsuit treats the copying and retention of pirated books as direct infringement, not merely as a step in a larger transformative process.

What Happens Next

The case is now in the discovery phase, where both sides must produce documents and evidence. The judge’s April 2026 order demanding shadow library documentation suggests the court takes seriously the allegation that NVIDIA deliberately sourced pirated material. An amended complaint filed January 16, 2026, indicates the authors refined their legal theories after the initial March 2024 filing.

NVIDIA’s next opportunity to defeat the case comes at the summary judgment stage, where it can argue that the evidence gathered in discovery leaves no genuine dispute of material fact and that it is entitled to judgment as a matter of law. But the judge’s refusal to dismiss suggests that argument will face an uphill battle. If the case reaches trial, a jury will decide whether NVIDIA infringed copyright and, if so, what damages are owed.

Why This Matters Beyond NVIDIA

The Nazemian ruling could have implications across the entire AI industry. Other tech giants, including Meta, OpenAI, and Google, face similar copyright lawsuits. If courts accept the theory that copying and retaining pirated books for training constitutes direct infringement, companies will face enormous pressure to audit their training datasets and either license copyrighted material or exclude it entirely. The ruling also signals that this court is skeptical of arguments that AI tools have “non-infringing uses” when the allegations suggest they were designed specifically to exploit pirated content.

Did NVIDIA know the Books3 dataset contained pirated books?

NVIDIA claims it created NeMo in full compliance with copyright law and respected the rights of content creators. However, the lawsuit alleges that NVIDIA obtained material from shadow libraries due to competitive pressure, and that Books3 was derived from Bibliotik, a known pirated archive. The discovery process will likely reveal what NVIDIA knew and when.

What is the Books3 dataset and why does it matter?

Books3 is a subcollection of “The Pile” containing approximately 108 gigabytes of fiction and nonfiction books. It was derived from Bibliotik, a pirated e-book archive. According to the lawsuit, NVIDIA used this dataset to train multiple NeMo Megatron models, incorporating roughly 197,000 pirated books into the training pipeline.

Could NVIDIA win this case on fair use grounds?

NVIDIA has invoked fair use defenses, arguing that its use of copyrighted material for training transformative AI models falls within fair use protections. However, the judge’s refusal to dismiss the case suggests this defense did not succeed at the motion-to-dismiss stage. Fair use remains an unresolved legal question that could determine the case’s outcome if it proceeds to trial.

The NVIDIA copyright infringement lawsuit represents a watershed moment for AI training practices. By refusing to dismiss the case, the judge signaled that courts will scrutinize how companies source training data and whether tools are designed to facilitate or prevent piracy. For NVIDIA, the path forward involves producing discovery documents about shadow libraries and defending its practices against detailed allegations of intentional infringement. For the broader AI industry, the ruling serves as a warning: relying on pirated datasets and building tools to exploit them may not survive judicial scrutiny.

Edited by the All Things Geek team.

Source: Tom's Hardware
