It’s too early to say how the flurry of deals between AI companies and publishers will shake out. But OpenAI has already scored one clear victory: its web crawler is no longer being blocked by top news outlets at the rate it once was.
The generative AI boom sparked a data gold rush, and it was followed by a data protection rush as publishers blocked AI crawlers to keep their work from becoming training data without consent. When Apple debuted a new AI agent this summer, for example, many top news organizations swiftly opted out of Apple’s web scraping by using the Robots Exclusion Protocol, or robots.txt, the file that lets webmasters control bots’ access to their sites. With so many new AI bots on the scene, keeping up can feel like playing whack-a-mole.
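As a concrete illustration, an opt-out is just a few lines in the robots.txt file at a site’s root. The sketch below blocks OpenAI’s and Apple’s crawlers site-wide by their published user-agent names; which bots a site actually lists is up to its webmaster:

```
# Hypothetical robots.txt denying two AI crawlers access to the whole site
User-agent: GPTBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

A `Disallow: /` rule covers every path on the site; narrower rules (for example, `Disallow: /archive/`) can fence off only part of it.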
OpenAI’s GPTBot is the most well-known of these crawlers, and it is blocked more often than competitors such as Google AI. An analysis of 1,000 popular news outlets by Ontario-based AI detection startup Originality AI shows that the number of top media websites using robots.txt to “disallow” GPTBot rose dramatically from its launch in August 2023 through that fall, then climbed steadily (but more gradually) from November 2023 to April 2024. At its peak, just over a third of the websites blocked it; that figure has since fallen to roughly a quarter. Within a smaller group of the most prominent news organizations, the block rate remains above 50 percent, but it is down from a high of almost 90 percent earlier this year.
That number began dropping noticeably after Dotdash Meredith announced a licensing deal with OpenAI last May. It fell again in late May when Vox announced its own arrangement, and again this August, when WIRED’s parent company, Condé Nast, struck a deal. At least for now, the trend of ever-increasing blocks appears to be over.
These dips make obvious sense. When a company partners with OpenAI and agrees to let its data be used, it no longer has an incentive to barricade that data, so it updates its robots.txt file to permit crawling. With enough deals struck, the overall percentage of sites blocking the crawler will almost certainly fall. Some outlets, like The Atlantic, unblocked OpenAI’s crawlers the same day their deal was announced. Others took days or weeks: Vox, which announced its partnership at the end of May, unblocked GPTBot on its properties at the end of June.
Although robots.txt is not legally binding, it has long functioned as the de facto standard governing web crawler behavior, and for most of the internet’s existence, people running websites have expected one another to abide by it. When a WIRED investigation this summer found that AI startup Perplexity was likely choosing to ignore robots.txt commands, Amazon’s cloud division launched an inquiry into whether Perplexity had violated its rules. It’s not a good look to ignore robots.txt, which may explain why so many prominent AI companies, including OpenAI, explicitly state that they use it to decide what to crawl. Jon Gillham, CEO of Originality AI, believes this adds further urgency to OpenAI’s push to strike deals. “It’s clear that OpenAI sees being blocked as a threat to its future ambitions,” Gillham says.
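A compliant crawler consults a site’s robots.txt before fetching any page; Python’s standard library even ships a parser for the format. The minimal sketch below parses a hypothetical robots.txt (the bot names and URL are illustrative) and shows how a blocked and an unblocked crawler would see the same page:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows OpenAI's crawler site-wide
robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot matches the Disallow rule, so a compliant crawler must skip the page
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False

# No rule mentions Googlebot, so access is permitted by default
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

The key point is that enforcement is entirely on the crawler’s side: `can_fetch` only reports what the file asks for, and nothing stops a bot that never performs this check.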