Probing for Copyrighted Material in AI Training Data

September 24, 2025

As courts and regulators continue to wrestle with the copyright conundrums that artificial intelligence has unleashed, companies find themselves walking a precarious legal tightrope. Building AI that is both commercially viable and ethically responsible is no small feat.

In response, a new category of platforms has emerged, promising “responsible” and “copyright-safe” AI. — “All the AI, with none of the copyright calories!” But beneath this appealing promise often lies a tangled web of legal ambiguity and ethical gray zones.

Some companies genuinely invest in responsible AI practices, auditing data pipelines and designing systems with compliance in mind. Others, however, merely wrap opaque models in marketing-friendly labels. In the most superficial cases, “safety” comes down to crude regex filters that block certain terms—a cosmetic fix that provides more illusion than protection. This raises a critical question: how can we test whether training data is truly copyright-compliant and ethically sourced?

I often draw upon seemingly unrelated subjects to tackle complex problems. In this case, my literary hero, Sherlock Holmes, will provide the basis for our strategy. Meanwhile, Spider-Man—my niece’s favorite—will serve as our test subject.

In “The Adventure of Silver Blaze,” written in 1892 by Sir Arthur Conan Doyle, Sherlock Holmes investigates the mysterious disappearance of a racehorse. The master detective makes a crucial observation about “the curious incident of the dog in the night-time.”

The key insight wasn’t what happened during the crime—it was what didn’t happen. The dog’s silence during the theft revealed that it recognized the perpetrator. Holmes used this absence of expected behavior as the thread that unraveled the entire mystery.

When investigating potentially unauthorized copyrighted material in AI training data, we can apply the same principle: look for what shouldn’t be there, and let that absence (or presence) guide us to the truth.

This investigative approach isn’t revolutionary—creators have used similar techniques for decades to protect their intellectual property:

  • Trap Streets: Cartographers would add fictitious streets, towns, or landmarks—known as trap streets—to their maps. If these fake details appeared in another map, it was clear evidence that their work had been copied without permission.
  • Mountweazels: Similarly, dictionary and encyclopedia publishers included fake words or entries (called mountweazels) to detect unauthorized reproductions. For example, the word “esquivalience” was included in the New Oxford American Dictionary as a copyright trap. Its definition – “the willful avoidance of one’s official responsibilities.” clever
  • Watermark Lyrics: Genius.com secretly watermarked song lyrics with patterns of apostrophes that spelled “Red-handed” in Morse code. This allowed them to catch Google allegedly copying their lyrics and displaying them in search results without permission.

To demonstrate this principle in action, consider this carefully crafted prompt designed to test AI image generators:

A fictional superhero: He wears a tight-fitting suit that covers his entire body from head to toe. Color Scheme: The costume primarily features two bold colors—These colors are arranged in a specific pattern across the suit. His entire head is encased in a mask that leaves no skin exposed. Concealed beneath or incorporated into his wrist area are devices that allow him to shoot web-like strands, though these are not always visible. He does not wear a cape, belt, or external armor.

Notice what’s missing from this description: any explicit mention of Spider-Man, Marvel, web-slinging, or spider imagery.

When this prompt was tested across different AI image generation models, the results were revealing. Despite never explicitly naming Spider-Man, multiple AI systems generated images that were unmistakably the iconic Marvel character—complete with the distinctive red and blue color scheme, web patterns, and characteristic pose.

The AI models filled in details that weren’t specified in the prompt, drawing from training data that clearly included copyrighted Spider-Man imagery. Like Holmes’s silent dog, what the AI didn’t need to be told revealed everything about what it had learned.

These images were generated by different AI models, WITHOUT explicitly mentioning Spider-Man in ANY of the prompts.

The Spider-Man experiment exposes the fragility of “copyright-safe” claims. If AI systems can recreate iconic characters from vague prompts, it suggests they were trained on copyrighted material, regardless of assurances to the contrary.

For businesses, regulators, and investors, this methodology offers a practical way to probe beyond marketing claims. By crafting prompts that describe copyrighted content without naming it directly, you can assess whether an AI system has been trained on potentially problematic data.

The lesson from Baker Street remains relevant in our digital age: sometimes the most revealing evidence is found not in what’s explicitly present, but in what shouldn’t be there at all. Never accept “safe AI” branding at face value. Demand transparency, test aggressively, because though the dog may not bark—the lawsuits surely will.

What do you think?

More notes