CCBot UA string clarified with full URL, new "How does CCBot fetch a web page?" section added, ZStandard compression support added
Source: https://commoncrawl.org/faq
What changed
Three substantive changes on the Common Crawl FAQ: (1) The current UA string is now explicitly stated as CCBot/2.0 (https://commoncrawl.org/faq/) — the previous version only said the bot identifies as CCBot/2.0 with contact info “sent along” but did not spell out the full string. (2) A new FAQ entry “How does CCBot fetch a web page?” documents that CCBot uses HTTP GET, supports HTTP/1.1 and HTTP/2 (HTTPS only for H2), IPv4 and IPv6, follows up to 4 redirects (5 for robots.txt per RFC 9309), does not execute JavaScript, and does not use cookies. (3) ZStandard (zstd) is added as a supported compression encoding alongside gzip and Brotli.
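The ZStandard addition means servers can start preferring zstd when the crawler offers it. A minimal sketch of server-side Accept-Encoding negotiation under that assumption — the exact header CCBot sends is not documented, so the `"gzip, br, zstd"` input and the `pick_encoding` helper are illustrative, not official behavior (q-values are ignored for brevity):

```python
from typing import Optional


def pick_encoding(accept_encoding: str,
                  supported=("zstd", "br", "gzip")) -> Optional[str]:
    """Pick a response encoding from a simple Accept-Encoding header.

    `supported` is the server's preference order; zstd is tried first
    on the assumption it compresses best for this workload. q-values
    are ignored for brevity.
    """
    offered = {token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",")}
    for enc in supported:
        if enc in offered:
            return enc
    return None  # fall back to an unencoded (identity) response


# Hypothetical header a zstd-capable crawler might send:
print(pick_encoding("gzip, br, zstd"))  # zstd
print(pick_encoding("gzip"))            # gzip
```
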
Implication
Publishers and bot-detection operators should update their UA-matching rules: the canonical CCBot/2.0 string now includes a trailing URL (https://commoncrawl.org/faq/), which differs from bare CCBot/2.0. The new fetch-behavior section is the first official documentation that CCBot does not run JavaScript and does not send cookies — relevant for server-side detection and for understanding what content CCBot will actually index. ZStandard support means servers may now negotiate zstd encoding with the crawler.
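One way to keep UA rules robust to the new trailing URL is to match on the `CCBot` product token rather than the full string, so both `CCBot/2.0` and `CCBot/2.0 (https://commoncrawl.org/faq/)` are caught. A minimal sketch; `is_ccbot` and the regex are illustrative, not an official matching rule:

```python
import re

# Match the CCBot product token with an optional version, so both the
# bare "CCBot/2.0" and the new canonical
# "CCBot/2.0 (https://commoncrawl.org/faq/)" forms are detected.
CCBOT_RE = re.compile(r"\bCCBot(?:/\d+\.\d+)?\b")


def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent string identifies as CCBot."""
    return CCBOT_RE.search(user_agent) is not None


print(is_ccbot("CCBot/2.0 (https://commoncrawl.org/faq/)"))   # True
print(is_ccbot("CCBot/1.0 (+https://commoncrawl.org/bot.html)"))  # True
print(is_ccbot("Mozilla/5.0 (compatible; SomeOtherBot/1.0)"))  # False
```

Matching the token also covers the older `CCBot/1.0` form and any future version bump the FAQ says may happen.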
Raw diff
--- prev
+++ curr
@@ -28,17 +28,34 @@
to process and extract crawl candidates from our crawl database.
This candidate list is sorted by host (domain name) and then distributed to a set of crawler servers.
How does the Common Crawl CCBot identify itself?
+CCBot identifies itself via its
+UserAgent
+string as:
+
+CCBot/2.0 (https://commoncrawl.org/faq/)
Our older bot identified itself with the
-User-Agent
-string
+UserAgent
+string:
+
CCBot/1.0 (+https://commoncrawl.org/bot.html)
-, and the current version identifies itself as
-CCBot/2.0
-. We may increment the version number in the future.
-
-Contact information (a link to the FAQs) is sent along with the
-User-Agent
-string.
+We may increment the version number in the future.
+How does CCBot fetch a web page?
+CCBot is an automated crawler, checking first the
+robots.txt
+, and if crawling a page is allowed, fetches pages using
+HTTP
+GET
+requests.
+It supports both
+HTTP/1.1
+and
+HTTP/2
+, the latter only over TLS (
+https://
+). Connections over IPv4 and IPv6 are supported.
+CCBot follows up to four consecutive HTTP redirects, or up to five when fetching robots.txt in line with
+RFC 9309
+. Currently, JavaScript is not executed and Cookies are not used.
Will the Common Crawl CCBot make my website slow for other users?
The CCBot crawler has a number of algorithms designed to prevent undue load on web servers for a given domain.
We have taken great care to ensure that our crawler will never cause web servers to slow down or be inaccessible to other users.
@@ -56,21 +73,19 @@
For instance, to limit our crawler from request pages more than once every 2 seconds, add the following to your
robots.txt
file:
-
User-agent: CCBot
Crawl-delay: 2
How can I block the Common Crawl CCBot?
You configure your
robots.txt
file which uses the Robots Exclusion Protocol to block the crawler. Our bot’s exclusion
-User-Agent
+UserAgent
string is:
CCBot
.
Add these lines to your
robots.txt
file and our crawler will stop crawling your website:
-
User-agent: CCBot
Disallow: /
We will periodically continue to check if the
@@ -96,7 +111,7 @@
wait 24 hours
before trying again.
Please sleep between calls to our API (including if you run your script repeatedly in a loop), don't run multiple threads at once on the same IP, and don't use proxy networks. You should also ensure that you are using a properly formulated
-User-Agent
+UserAgent
string (
see RFC 9110
).
@@ -124,8 +139,10 @@
GET
requests. We also currently support the
gzip
-and
+,
Brotli
+, and
+ZStandard
encoding formats.
Why is the Common Crawl CCBot crawling pages I don’t have links to?
The bot may have found your pages by following links from other sites.
@@ -168,6 +185,8 @@
Get in touch
The Data
Overview
+CDXJ Index
+Columnar Index
Web Graphs
Latest Crawl
Crawl Stats
@@ -178,10 +197,9 @@
AI Agent
Blog
Examples
-Use Cases
CCBot
Infra Status
-Opt-out Registry
+Opt-Out Registry
FAQ
Community
Research Papers
@@ -190,12 +208,11 @@
Discord
Collaborators
About
+About
Team
Jobs
-Mission
-Impact
Privacy Policy
Terms of Use
©
-2025
+2026
Common Crawl