Online only

Once more, with feeling: ‘AI’ does make copies

THE MAKERS of large language models (LLMs) - marketed as “artificial intelligence” - have long claimed that when their systems ingest your work they do not actually make a copy. They have persuaded courts of this when arguing that they do not infringe copyright.

The claim is false. As the Freelance observed in August 2023:

...researchers are increasingly finding that it is possible to prompt ML systems to spit out the originals on which they were trained... (An analogy is a "concordance": a work that lists every word that occurs in another work, typically the Bible, and where. A biblical concordance is not the Bible; but it is possible to reconstruct the Bible from it.)
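The concordance analogy can be made concrete with a few lines of code. The sketch below (function names are illustrative, not from the research) builds a concordance - a map from each word to the positions where it occurs - and then reconstructs the original text from that index alone, showing that an index which "is not the text" can nonetheless contain a complete copy of it.

```python
def build_concordance(text):
    """Map each word to the list of positions where it appears."""
    concordance = {}
    for position, word in enumerate(text.split()):
        concordance.setdefault(word, []).append(position)
    return concordance

def reconstruct(concordance):
    """Recover the original word sequence from the concordance alone."""
    word_at = {}
    for word, positions in concordance.items():
        for position in positions:
            word_at[position] = word
    return " ".join(word_at[i] for i in range(len(word_at)))

original = "in the beginning was the word"
index = build_concordance(original)
# The index is just a lookup table - yet it round-trips perfectly:
assert reconstruct(index) == original
```

The parallel with an LLM is loose but real: the model's weights, like the concordance, do not store the text in readable form, yet the original can be recovered from them.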

Now we have yet more research that ought to settle the issue. Alignment Whack-a-Mole: finetuning activates verbatim recall of copyrighted books in large language models by Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg and Tuhin Chakrabarty shows that it is possible to reconstruct verbatim quotations from books by feeding LLMs descriptions of their content.

“Alignment” in this context is the stage in “training” an LLM that encourages it to spit out words that are closer to what its makers want users to see. “Finetuning” is the process of taking a model that has already been trained and training it further on a specific task.

One thing that sets this study apart is that the researchers did not prompt the LLMs with extracts of the books they were looking for: they fed them plot summaries and got verbatim extracts back.

An unexpected finding was that once they had “finetuned” an LLM to spit out the works of one author, it was ready to do so for other authors as well.

An exception was Gemini-2.5-Pro from Google. The researchers note that it “often resists extraction of verbatim content and returns an empty response with a stop reason of RECITATION... The existence of such a filter implies that Google retains internal copies of these works not only in the model’s weights but also in its deployment infrastructure for real-time detection.”

That could be very bad for Google in court.