NMA Demands Common Crawl Stop Unauthorized Scraping and Establish AI-Use Protections

News

The News/Media Alliance (NMA) sent a formal letter to Common Crawl demanding that the archive site cease unauthorized scraping and storage of publisher content and institute additional safeguards to prevent AI companies from accessing publisher material in its database. The letter is a direct escalation of publisher grievances against one of the internet’s largest crawled-content repositories, which supplies training data to numerous AI systems.

Why it matters

The letter signals intensifying publisher pushback against open-access training datasets, with pressure on the datasets themselves emerging as a primary lever for controlling the flow of AI training data. The demand targets Common Crawl’s foundational role in the AI supply chain: if enforceable, removal from Common Crawl would meaningfully constrain AI developers’ access to recent published content for model training, which opt-in licensing alone does not do. The move follows the NMA’s recent partnership with Bria on AI content licensing, suggesting a dual-track strategy of striking licensing agreements with some AI vendors while demanding restrictions on public crawlers. The legal viability of the demand is unclear, since Common Crawl operates internationally and scraping legality varies by jurisdiction. Even so, it reflects a growing publisher consensus that public datasets should be gated for AI use, not simply licensed downstream.