Cloudflare has released a new free tool that prevents AI companies’ bots from scraping its customers’ websites for content to train large language models. The cloud service provider is making the tool available to its entire customer base, including those on free plans. The company said the feature will update automatically over time as it sees new fingerprints from offending bots that it has identified as widely scraping the web for model training.
In announcing this update, the Cloudflare team also shared some insights into how its customers are responding to the boom in bots scraping content to train generative AI models. According to the company’s internal data, 85.2 percent of customers have chosen to block even those AI bots that properly identify themselves from accessing their sites.
Cloudflare has also identified the most active bots of the past year. Bytedance’s Bytespider bot attempted to access 40 percent of the websites on Cloudflare’s network, and OpenAI’s GPTBot tried on 35 percent. Together with Amazonbot and ClaudeBot, they made up the top four AI crawlers by number of requests on the Cloudflare network.
It is proving very difficult to block AI bots from accessing content completely and consistently. The arms race to build models faster has led companies to skirt or outright break the existing rules on blocking scrapers, with some accused of scraping websites without the required permissions. But having a backend company on the scale of Cloudflare get serious about trying to put the kibosh on this behavior could produce some results.
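To see why name-based blocking alone falls short, consider a minimal sketch of the simplest approach: refusing requests whose User-Agent header names a known crawler. This is an illustration only, not Cloudflare’s method, which the company describes as relying on bot fingerprints and machine learning models. The function names and the sample header strings are hypothetical, and the token list simply reuses crawler names mentioned in this article.

```python
# Hypothetical sketch of User-Agent based blocking.
# This is NOT Cloudflare's implementation; names and sample headers are assumptions.

# Crawler names taken from this article, lowercased for case-insensitive matching.
DECLARED_AI_BOT_TOKENS = ("bytespider", "amazonbot", "claudebot")


def is_declared_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header names one of the listed crawlers."""
    ua = user_agent.lower()
    return any(token in ua for token in DECLARED_AI_BOT_TOKENS)


def decide(user_agent: str) -> int:
    """Return an HTTP status: 403 for self-identified AI crawlers, 200 otherwise."""
    return 403 if is_declared_ai_bot(user_agent) else 200


if __name__ == "__main__":
    # A crawler that announces itself (made-up header string).
    honest = "Mozilla/5.0 (compatible; Bytespider; example)"
    # A request presenting a browser-like header (made-up header string).
    disguised = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/125.0 Safari/537.36"

    print(decide(honest))     # the self-identified crawler is refused
    print(decide(disguised))  # the browser-like header is let through
```

A check like this only stops bots that identify themselves. A scraper that sends a browser-like header passes straight through, which is why detection has to keep adapting.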
“We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection,” the company said. “We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models to help keep the Internet a place where content creators can thrive and keep full control over which models their content is used to train or run inference on.”