Cloudflare shares free tool to stop AI bots scraping websites

5 Jul 2024

The company also shared details of the top bots attempting to scrape data, which include bots allegedly managed by ByteDance, Amazon, Anthropic and OpenAI.

As bots continue to scrape the internet to help train AI models, Cloudflare has released a new tool to let customers block all AI bots at once.

The tool aims to tackle scraping, the process of extracting content and data from websites. This practice has become more common with the rise of generative AI. Cloudflare also highlighted web crawling, in which bots roam the web to index content from various sites.

Cloudflare announced an option to let its customers block certain types of AI bots last year, but the new tool lets users block all AI bot types at once.

The company also performed an analysis of its traffic to monitor the prevalence of scraping bots and claimed the value of “original content in bulk has never been higher”.

“While our analysis identified the most popular crawlers in terms of request volume and number of internet properties accessed, many customers are likely not aware of the more popular AI crawlers actively crawling their sites,” Cloudflare said in a blogpost.

The IT giant also warned that not all AI companies are being transparent about their data scraping practices. Cloudflare claims it spotted bot operators attempting to appear as though they are “a real browser by using a spoofed user agent”.

“We will continue to keep watch and add more bot blocks to our AI scrapers and crawlers rule and evolve our machine learning models to help keep the internet a place where content creators can thrive,” Cloudflare said.

The biggest bots

Cloudflare shared insights about some of the most prominent AI bots scraping sites on its network. The company claims to be connected to around 20pc of the web.

The company said the top AI crawler bots making requests to Cloudflare sites are Bytespider, Amazonbot, ClaudeBot, and GPTBot. Cloudflare claimed these bots are being used to train AI models for ByteDance, Amazon, Anthropic and OpenAI respectively.

Bytespider, GPTBot and ClaudeBot were the top three bots by share of websites accessed, according to the Cloudflare data.

Data scraping has been a concern for various sectors recently with the growth of generative AI. In May, Sony Music Group wrote to more than 700 tech companies asking them to refrain from using its content to train AI models.

Leigh Mc Gowran is a journalist with Silicon Republic