One day in 2023, someone asked ChatGPT: “Give me the opening paragraph of this New York Times article.” When the answer came back, they asked, “And the next sentence?” Then again. And again. Sentence by sentence, ChatGPT reproduced a Pulitzer-winning piece that had taken a reporter eighteen months to investigate.
Most people believe that generative models like ChatGPT and Gemini learn from existing data and produce something of their own. So how does an AI come to memorize an entire NYT article? Was it just an accidental collision of phrasings during training?
Billions of dollars — and one company’s continued existence — hinge on the answer. The first lawsuit that may set the precedent is now playing out in New York.
The Lawsuit
In December 2023, the New York Times filed a copyright infringement suit against OpenAI and Microsoft in the U.S. District Court for the Southern District of New York. The complaint alleges that OpenAI ingested millions of NYT articles into its training data without permission, and that ChatGPT, the product built on that data, regurgitates the originals nearly verbatim and eats into the paper’s subscription market.
The Times is asking for two things: billions of dollars in damages, and the destruction of any model and dataset trained on its copyrighted work. If the latter is granted, OpenAI would have to rebuild its models essentially from scratch. That is why this case is treated not as a routine cost dispute but as a fight over the company’s survival.
Timeline
April 2023: Quiet negotiations. The Times discovered that OpenAI had used decades of its articles for AI training without authorization. It raised the issue with OpenAI and Microsoft, asked for proper licensing fees, and proposed technical measures to keep ChatGPT from reproducing its articles verbatim. The talks eventually broke down.
December 2023: The lawsuit (NYT strikes first). When negotiations failed, the Times sued for copyright infringement, arguing that OpenAI was copying its articles wholesale and undermining its subscription business. As key evidence, it prompted GPT-4 to continue specific articles sentence by sentence and submitted more than one hundred screenshots showing the model reproducing entire Pulitzer-winning pieces, word for word.
April 2025: Ruling on the motion to dismiss (the court sets the ring). OpenAI, which maintains that AI training is fair use, moved to dismiss. After more than a year of briefs, the court denied the motion as to the central copyright-infringement claims and let them proceed to the merits. A denial at this stage is not a finding on the evidence; it means only that the Times’ claims, taken as pleaded, are legally plausible enough to go forward. Around the same time, sixteen similar AI copyright suits across the country were consolidated, putting NYT at the head of the plaintiffs.
March 2026: OpenAI counterattacks (the prompt wars begin). With the case fully under way, OpenAI pushed back hard. It argued that the Times’ reproduction evidence did not reflect normal usage but was manufactured through an adversarial attack on the system’s weaknesses. To prove this, OpenAI demanded to depose the NYT expert who had extracted the evidence and to inspect the specific prompts and surrounding context used to coax the articles out.
The Core Question: Are the Weights a Product of “Understanding” or of “Storage”?
A ruling on the merits has not yet come down. The two sides are still fighting over evidence, and the final word on fair use remains open. But when the court ordered OpenAI to hand over twenty million anonymized ChatGPT conversation logs to the plaintiffs, and then moved to enforce that order, the case’s real question surfaced.
What is encoded in the weights of a large language model — the result of understanding and decomposing the source data, or simply the source data in a different format?
NYT and OpenAI hold two opposing views of what happens to training data inside a model.
OpenAI argues that the model decomposes the source. In training, the original text is broken down into pieces, and only statistical relationships among words and sentences are absorbed into the weights. It is like reading hundreds of cookbooks and ending up not with the recipes themselves but with a sense for which ingredients follow which. On this view, the weights are not copies of the originals but an abstracted form of knowledge — a wholly new artifact distinct from the source. This is the conceptual foundation of OpenAI’s transformative use argument.
NYT argues that the model stores the source. It points to the phenomenon in which a model has encoded a specific passage somewhere in its weights nearly verbatim, and spits it back when given the right prompt. On this view, “weights” are just another name for a database, and the output is a copy. The hundred-plus reproduction examples that NYT brought to court are meant to support exactly this claim.
Both Sides Have a Point
The complication is that both views are, to some degree, true.
Work by Nicholas Carlini, then at Google, and colleagues (2020) already showed that language models can emit passages of their training data nearly verbatim, and that text appearing many times in the training data is the most likely to be memorized this way. NYT articles, copied hundreds of times across the web and recurring throughout the training corpus, therefore have a high probability of surviving inside the weights with little loss. Sentences that appeared only once, by contrast, are far less likely to be recoverable.
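The frequency effect can be illustrated with a toy sketch. This is not how a transformer works, and it is not the method of the cited paper: it is a minimal word-count (trigram) model, assumed here purely for illustration, in which a sentence duplicated fifty times crowds out a rival sentence seen once. The two sentences, the function names, and the duplication count are all hypothetical.

```python
from collections import Counter, defaultdict


def train_trigram(docs):
    """Count which word follows each two-word context across all documents."""
    counts = defaultdict(Counter)
    for doc in docs:
        words = doc.split()
        for a, b, c in zip(words, words[1:], words[2:]):
            counts[(a, b)][c] += 1
    return counts


def greedy_continue(counts, prompt, max_words=10):
    """Extend the prompt by always picking the most frequent next word."""
    words = prompt.split()
    for _ in range(max_words):
        followers = counts.get((words[-2], words[-1]))
        if not followers:
            break  # no known continuation for this context
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical corpus: one sentence duplicated 50 times, a rival seen once.
duplicated = "the senate passed the bill on tuesday"
seen_once = "the senate passed the budget after midnight"
model = train_trigram([duplicated] * 50 + [seen_once])

# Both sentences share the prefix; the counts decide which one comes back.
output = greedy_continue(model, "the senate passed")
print(output)
```

Greedy decoding from the shared prefix returns the duplicated sentence word for word, while the single-occurrence sentence cannot be reached from that prompt. Real models are far more complex, but the sketch suggests why duplication count is a reasonable first thing to look at.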
A recent paper by Conklin and colleagues at Princeton (2026) frames the same mechanism as lossy compression. Just as MP3 discards frequencies the ear cannot detect and keeps only what matters for the listening experience, an LLM keeps only the information useful for next-token prediction and discards the rest. From this angle, the model is not a lossless copy of the source but a lossy compression of it — purpose-built. The catch is that the amount of loss is not uniform; it depends on how often the data appeared.
Human memory sits on the same spectrum. A poem you have read dozens of times you can recite word for word, but a news article you skimmed once leaves only a vague sense of the gist. The first is closer to storage; the second is closer to understanding. LLMs live somewhere on the same spectrum — and the difficulty is that where they live varies from one piece of data to the next.
This technical gray zone is exactly the battlefield.
The NYT frame: ordinary prompts are enough to reproduce articles verbatim. This is not an accident; it is a structural defect baked into the model. The weights effectively store the Times’ articles, which makes the model a compressed copy.
The OpenAI frame: the evidence the Times submitted is the result of deliberate manipulation of the system — exceptional outputs squeezed out by adversarial prompts over thousands of attempts, almost never seen in normal user behavior. The model is fundamentally a transformative tool that decomposes and recomposes its inputs.
OpenAI’s demand that the Times turn over its prompts and context, and the court’s order that twenty million real conversation logs be produced, target the same question: is verbatim reproduction an everyday occurrence, or an edge case that appears only under extreme conditions? Both sides want that settled statistically. If a meaningful fraction of those twenty million sessions show reproduced source text, the storage frame is strengthened. If only a trivial fraction do, the understanding frame gains ground.
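What "settling it statistically" might look like can be sketched in a few lines. The sketch below is an assumption about one plausible approach, not the method either party's experts are known to use: flag a response when it shares a run of at least eight consecutive words with an article, then report the flagged fraction with a Wilson score interval. The eight-word threshold, the sample texts, and the function names are all hypothetical.

```python
import math


def shares_verbatim_run(response, article, min_words=8):
    """True if `response` contains any run of `min_words` consecutive words from `article`."""
    art = article.split()
    runs = {tuple(art[i:i + min_words]) for i in range(len(art) - min_words + 1)}
    resp = response.split()
    return any(
        tuple(resp[i:i + min_words]) in runs
        for i in range(len(resp) - min_words + 1)
    )


def wilson_interval(hits, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = hits / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Made-up article text and made-up chat responses, for illustration only.
article = "the committee found that the agency had ignored repeated safety warnings for years"
responses = [
    "The report says regulators missed warning signs.",
    "According to the piece, the committee found that the agency had ignored repeated safety warnings for years",
    "I cannot reproduce that article, but I can summarize it.",
]

hits = sum(shares_verbatim_run(r, article) for r in responses)
low, high = wilson_interval(hits, len(responses))
print(f"{hits} of {len(responses)} responses flagged; interval {low:.3f} to {high:.3f}")
```

With only three samples the interval is very wide, which is the point of asking for millions of logs: the larger the sample, the harder it becomes for either side to argue that the measured rate is a fluke. The choice of overlap threshold would itself be contested, since a short threshold flags common phrases and a long one misses lightly altered copies.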
In the end, understanding versus storage is not a binary choice. Within a single model, different pieces of data sit at different points on the spectrum. Where the law draws the line on that spectrum is the meaningful conclusion this trial has to deliver.
What to Watch From Here
The case is ongoing, and a ruling on the merits is still some distance away. Five points are worth tracking.
First, the statistics the twenty million logs will reveal. The court rejected OpenAI’s delay tactics and finally ordered twenty million anonymized ChatGPT conversation logs turned over to plaintiffs. NYT had originally asked for 120 million; OpenAI counter-offered twenty million and then tried to hand-pick only favorable search results, which the court refused. When the analysis of those logs lands, the storage-versus-understanding debate will, for the first time, have empirical evidence in front of a court.
Second, how the court weighs the four factors of fair use. Section 107 of the U.S. Copyright Act evaluates fair use along four factors. The two most decisive in this case are the purpose and character of the use (whether it is transformative) and the effect on the market for the original work. The Times argues that ChatGPT substitutes for its articles and is eating its subscription market; OpenAI counters that search and summarization are an entirely different, transformative purpose.
Third, the ripple effect of recent rulings. In June 2025, the Northern District of California issued two pivotal decisions two days apart. In Bartz v. Anthropic, the court held that using legitimately acquired books to train an LLM was “exceedingly transformative” and constituted fair use, though it held that keeping a library of pirated copies was not. Kadrey v. Meta also came out in the defendant’s favor on fair use, but on the narrower ground that the plaintiffs had not put forward evidence of market harm, and the court warned that in future cases a market-dilution theory could favor plaintiffs. NYT differs from these precedents in one important respect: it is positioned to put actual evidence of market displacement on the record.
Fourth, the worst-case scenario. If the court sides with NYT and goes as far as ordering the destruction of the training dataset, OpenAI would have to rebuild its models from scratch with only licensed material. U.S. copyright law allows statutory damages of up to $150,000 per infringed work for willful infringement, and with millions of articles in play in this case, the totals could become astronomical. The consolidated proceeding — In re: OpenAI, Inc. Copyright Infringement Litigation (case no. 1:25-md-03143) — now bundles sixteen suits, which would magnify the fallout of a loss.
Fifth, the possibility of settlement. Roughly twenty news organizations, including the Associated Press, News Corp, Vox Media, and Condé Nast, have already signed content-licensing deals with OpenAI instead of suing. The News Corp deal is reportedly worth more than $250 million over five years. Bartz v. Anthropic itself ended in a settlement despite the favorable fair-use ruling, because the pirated-copies claims still carried the threat of astronomical statutory damages. NYT could, at some point, pivot from seeking a judgment to negotiating an outcome, in which case the precedent the industry has been waiting for may never arrive.
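To see why statutory damages weigh so heavily on both the worst-case scenario and the settlement calculus, it helps to run the arithmetic. The per-work figures below come from 17 U.S.C. § 504(c): a range of $750 to $30,000 per infringed work, rising to a ceiling of $150,000 for willful infringement (the lower two figures are not quoted in this article). The count of one million works is a hypothetical round number chosen for illustration, not a figure from the case, and a court may award any amount within the range.

```python
# Statutory range per infringed work under 17 U.S.C. § 504(c), in US dollars.
STATUTORY_MIN = 750
STATUTORY_MAX = 30_000
WILLFUL_MAX = 150_000


def damages_range(num_works):
    """Return the total exposure at each statutory tier for `num_works` works."""
    if num_works < 0:
        raise ValueError("num_works must be non-negative")
    return {
        "minimum": num_works * STATUTORY_MIN,
        "ordinary_max": num_works * STATUTORY_MAX,
        "willful_max": num_works * WILLFUL_MAX,
    }


# Assumption for illustration only: one million registered works found infringed.
hypothetical_works = 1_000_000
exposure = damages_range(hypothetical_works)

for label, amount in exposure.items():
    print(f"{label}: ${amount:,}")
```

Under that assumption, even the statutory minimum comes to $750 million, and the willful ceiling reaches $150 billion. The spread between those figures is one reason a negotiated outcome can look attractive to a defendant even after a partly favorable ruling.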
Closing
New technologies bring new disputes. Disputes give rise to conflict, and in resolving conflict, the law is forced to ask what the technology actually is.
When an AI learns from data, is it decomposing and understanding the originals, or is it storing them in a different form? That the answer changes from one piece of data to the next, within a single model, only makes the question harder. The landscape of AI we live with from here will depend on where, on that spectrum, the courts decide to draw the line.
References
- NYT v. OpenAI — order denying motion to dismiss (S.D.N.Y., April 4, 2025)
- OpenAI ordered to produce 20 million ChatGPT logs (Bloomberg Law)
- OpenAI’s response page on the NYT lawsuit
- Bartz v. Anthropic fair-use ruling (June 2025)
- Kadrey v. Meta fair-use ruling (FindLaw)
- Bartz v. Anthropic settlement analysis (Kluwer Copyright Blog)
- AI copyright litigation 2026 update (Norton Rose Fulbright)
- Conklin et al., “Learning is Forgetting: LLM Training As Lossy Compression” (2026)
- Carlini et al., “Extracting Training Data from Large Language Models” (2020)
- U.S. Copyright Act § 107 (fair use)
- U.S. Copyright Act § 504 (statutory damages)
- Timeline of OpenAI content-licensing deals (Digiday)