News outlets are limiting the Internet Archive’s access to their journalism
More than 340 local news organizations have begun limiting the Internet Archive's access to their online journalism, preventing the Wayback Machine from archiving their content. The restriction is enforced primarily through changes to their `robots.txt` files, the standard mechanism for telling web crawlers which parts of a website they may access and which they should avoid. As a result, articles from these outlets, many of them smaller local publications, will no longer be captured and preserved by the Internet Archive, a non-profit dedicated to building a digital library of internet sites. The move follows a broader trend of content owners asserting more control over how automated systems access and use their data, often citing concerns about copyright, data monetization, or the use of their content for AI training.

The implications for digital preservation and public access to information are significant. Researchers, historians, and the general public rely on the Internet Archive to read historical news articles, especially when the original sources become unavailable, paywalled, or removed from the live web. By blocking the archive, these outlets are effectively removing their past reporting from a crucial public record, which could hinder future research, fact-checking, and efforts to trace how local events and narratives evolved. The trend highlights an ongoing tension between content creators seeking control over their intellectual property and organizations working to preserve the digital commons, and it raises questions about how accessible historical web content will remain.
Developers who rely on historical web data or public archives for research or model training will find this news content harder to access.
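
For readers who want to see how a `robots.txt` rule of this kind works in practice, the following is a minimal sketch using Python's standard `urllib.robotparser` module. The file contents, the `news.example.com` URL, and the helper `is_allowed` are all hypothetical. The user-agent token `ia_archiver` is also an assumption made for illustration; the article does not say which tokens these outlets actually block.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body for illustration only; real outlets' files will differ.
# "ia_archiver" is an assumed crawler token, not one confirmed by the article.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article_url = "https://news.example.com/2024/01/local-story"
    for agent in ("ia_archiver", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, agent, article_url) else "blocked"
        print(f"{agent}: {verdict}")
```

In this sample, the archival token is blocked from the whole site while other crawlers are only kept out of `/admin/`. Note that `robots.txt` is advisory: it expresses the publisher's wishes, and it is up to each crawler to honor them.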