Crawler type: html_page · commoncrawl.org

Common Crawl

stable for 2 days · 2 material events tracked · 7 snapshots in history

Documented user-agents (2)

Each distinct UA this vendor publishes on its docs page, extracted by Haiku from the latest snapshot. New UAs appearing or scope changes here are the high-signal events to watch.

| User-agent | Purpose | Scope / when it fires | Opt-out |
| CCBot/2.0 | Web crawling for Common Crawl dataset creation | General web crawl that follows robots.txt and the Sitemap Protocol; samples a random subset of each website | Add "User-agent: CCBot" followed by "Disallow: /" to robots.txt |
| CCBot/1.0 | Web crawling for Common Crawl dataset creation (older version) | General web crawl that follows robots.txt; legacy crawler version | Add "User-agent: CCBot" followed by "Disallow: /" to robots.txt |
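
The opt-out rule in the table can be checked locally before it is deployed. The following is a minimal sketch using Python's standard `urllib.robotparser`; the function name `ccbot_allowed`, the constant `BLOCK_CCBOT`, and the `example.com` URL are hypothetical, and Python's parser only approximates how CCBot itself interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The two opt-out lines from the table above.
BLOCK_CCBOT = """\
User-agent: CCBot
Disallow: /
"""


def ccbot_allowed(robots_txt: str, url: str, agent: str = "CCBot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    This uses Python's stdlib parser, which is an approximation of the
    crawler's own robots.txt handling, not a copy of it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical URL, used only for illustration.
print(ccbot_allowed(BLOCK_CCBOT, "https://example.com/page"))  # False
```

Because the rule names only the `CCBot` token, other crawlers are unaffected by it, and the same two lines cover both documented versions of the user-agent.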
Change timeline — diffs over time with insights

Each block is a detected change: the new-vs-prior snapshot diff and the LLM-written insight. Newest first.

2026-05-29 to 2026-06-04 · 6 days apart
+4 −2
Index: ccbot
===================================================================
--- ccbot	2026-05-29
+++ ccbot	2026-06-04
@@ -128,7 +128,9 @@
 string (
 see RFC 9110
 ).
-You may wish to use our columnar index via Amazon Athena or Apache Spark if your query involves broad or large-scale filtering. These tools are better suited to high-volume access patterns and provide more flexibility for complex queries.
+You may wish to use our
+URL Index
+(previously called the "Columnar Index") via Amazon Athena or Apache Spark if your query involves broad or large-scale filtering. These tools are better suited to high-volume access patterns and provide more flexibility for complex queries.
 We also provide an official downloader client
 cc-downloader
 which is robust and polite. The
@@ -201,7 +203,7 @@
 The Data
 Overview
 CDXJ Index
-Columnar Index
+URL Index
 Web Graphs
 Latest Crawl
 Crawl Stats
2026-05-16 to 2026-05-28 · 12 days apart
+20 −6
Index: ccbot
===================================================================
--- ccbot	2026-05-16
+++ ccbot	2026-05-28
@@ -1,6 +1,9 @@
 Frequently Asked Questions
 Everything you need to know regarding general and technical questions about
 Common Crawl.
+For general information regarding our crawler, please see the
+CCBot
+page.
 General Questions
 What is Common Crawl?
 Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the Internet to Internet researchers, companies and individuals at no cost for the purpose of research and analysis.
@@ -20,7 +23,8 @@
 page.
 Technical Questions
 What is the Common Crawl CCBot crawler?
-CCBot is a
+CCBot
+is a
 Nutch-based
 web crawler that makes use of the Apache Hadoop project.
 We use
@@ -28,7 +32,8 @@
 to process and extract crawl candidates from our crawl database.
 This candidate list is sorted by host (domain name) and then distributed to a set of crawler servers.
 How does the Common Crawl CCBot identify itself?
-CCBot identifies itself via its
+CCBot
+identifies itself via its
 UserAgent
 string as:
 ‍
@@ -40,7 +45,8 @@
 CCBot/1.0 (+https://commoncrawl.org/bot.html)
 We may increment the version number in the future.
 How does CCBot fetch a web page?
-CCBot is an automated crawler, checking first the
+CCBot
+is an automated crawler, checking first the
 robots.txt
 , and if crawling a page is allowed, fetches pages using
 HTTP
@@ -57,7 +63,9 @@
 RFC 9309
 . Currently, JavaScript is not executed and Cookies are not used.
 Will the Common Crawl CCBot make my website slow for other users?
-The CCBot crawler has a number of algorithms designed to prevent undue load on web servers for a given domain.
+The
+CCBot
+crawler has a number of algorithms designed to prevent undue load on web servers for a given domain.
 We have taken great care to ensure that our crawler will never cause web servers to slow down or be inaccessible to other users.
 The crawler uses an adaptive back-off algorithm that slows down requests to your website if your web server is responding with a
 HTTP 429
@@ -69,7 +77,9 @@
 Crawl-delay
 parameter for
 robots.txt
-. By increasing that number, you will indicate to CCBot to slow down the rate of crawling.
+. By increasing that number, you will indicate to
+CCBot
+to slow down the rate of crawling.
 For instance, to limit our crawler from request pages more than once every 2 seconds, add the following to your
 robots.txt
 file:
@@ -91,6 +101,9 @@
 We will periodically continue to check if the
 robots.txt
 file has been updated.
+You may also wish to be added to our opt-out registry. Please read
+this blog post
+for further information.
 Can I add my website to Common Crawl?
 Common Crawl's dataset is a sample of the web, and we do not generally archive any entire website but a randomly selected subset of it. Our crawler supports the Sitemap Protocol and utilizes any Sitemap announced in the
 robots.txt
@@ -160,7 +173,8 @@
 This information is also provided as JSON at
 https://index.commoncrawl.org/ccbot.json
 .
-CCBot is now run on dedicated IP address ranges with reverse DNS. This allows webmasters to verify whether a logged request stems from the real CCBot, for example:
+CCBot
+is now run on dedicated IP address ranges with reverse DNS. This allows webmasters to verify whether a logged request stems from the real CCBot, for example:
 $> host 18.97.14.84
 84.14.97.18.in-addr.arpa domain name pointer 18-97-14-84.crawl.commoncrawl.org.
 $> host 18-97-14-84.crawl.commoncrawl.org
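
The `host` commands at the end of this diff suggest a reverse lookup followed by a forward lookup. A minimal sketch of that check in Python follows, assuming the usual forward-confirmed reverse DNS pattern. The function name `is_real_ccbot` and the injectable `reverse` and `forward` resolvers are hypothetical; the hostname suffix is taken from the example output above and is not a documented guarantee.

```python
import socket

# Suffix seen in the `host` output above; treated here as an assumption.
CRAWL_SUFFIX = ".crawl.commoncrawl.org"


def _reverse(ip: str) -> str:
    """Reverse (PTR) lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> list:
    """Forward lookup: hostname to IPv4 addresses (gethostbyname_ex is IPv4-only)."""
    return socket.gethostbyname_ex(hostname)[2]


def is_real_ccbot(ip: str, reverse=_reverse, forward=_forward) -> bool:
    """Forward-confirmed reverse DNS check for a logged client IP.

    The resolvers are parameters so the logic can be exercised offline.
    """
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(CRAWL_SUFFIX):
            return False
        # The hostname must resolve back to the same address.
        return ip in forward(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

Calling `is_real_ccbot("18.97.14.84")` performs live DNS queries; passing stub resolvers keeps the check testable without network access. An IPv6 client would need a forward lookup based on `socket.getaddrinfo` instead.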
2026-04-19 to 2026-05-16 · 27 days apart
+1 −0
Index: ccbot
===================================================================
--- ccbot	2026-04-19
+++ ccbot	2026-05-16
@@ -156,6 +156,7 @@
 18.97.14.80/29
 18.97.14.88/30
 98.85.178.216/32
+3.41.188.32/29
 This information is also provided as JSON at
 https://index.commoncrawl.org/ccbot.json
 .
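
The ranges visible in this diff can be used for a quick offline membership check. A minimal sketch with Python's standard `ipaddress` module follows; `LISTED_RANGES` and `in_listed_ranges` are hypothetical names, and the list holds only the four CIDR blocks shown in this hunk, which is likely a partial view of the published list. The JSON file referenced in the diff is the authoritative source; its schema is not shown here, so the sketch does not try to parse it.

```python
import ipaddress

# Only the CIDR blocks visible in the diff hunk above (probably incomplete).
LISTED_RANGES = [
    "18.97.14.80/29",
    "18.97.14.88/30",
    "98.85.178.216/32",
    "3.41.188.32/29",  # added between 2026-04-19 and 2026-05-16
]

NETWORKS = [ipaddress.ip_network(cidr) for cidr in LISTED_RANGES]


def in_listed_ranges(ip: str) -> bool:
    """Return True if `ip` falls inside any of the listed CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in NETWORKS)


print(in_listed_ranges("18.97.14.84"))  # True: inside 18.97.14.80/29
print(in_listed_ranges("3.41.188.40"))  # False: just past 3.41.188.32/29
```

A hard-coded list goes stale whenever a range is added, as this change shows, so a production check would refresh the ranges from the published source on a schedule.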
2025-10-22 to 2026-04-19 · 179 days apart
+36 −19
Index: ccbot
===================================================================
--- ccbot	2025-10-22
+++ ccbot	2026-04-19
@@ -28,17 +28,34 @@
 to process and extract crawl candidates from our crawl database.
 This candidate list is sorted by host (domain name) and then distributed to a set of crawler servers.
 How does the Common Crawl CCBot identify itself?
+CCBot identifies itself via its
+UserAgent
+string as:
+‍
+CCBot/2.0 (https://commoncrawl.org/faq/)
 Our older bot identified itself with the
-User-Agent
-string
-CCBot/1.0 (+https://commoncrawl.org/bot.html)
-, and the current version identifies itself as
-CCBot/2.0
-. We may increment the version number in the future.
+UserAgent
+string:
 ‍
-Contact information (a link to the FAQs) is sent along with the
-User-Agent
-string.
+CCBot/1.0 (+https://commoncrawl.org/bot.html)
+We may increment the version number in the future.
+How does CCBot fetch a web page?
+CCBot is an automated crawler, checking first the
+robots.txt
+, and if crawling a page is allowed, fetches pages using
+HTTP
+GET
+requests.
+It supports both
+HTTP/1.1
+and
+HTTP/2
+, the latter only over TLS (
+https://
+). Connections over IPv4 and IPv6 are supported.
+CCBot follows up to four consecutive HTTP redirects, or up to five when fetching robots.txt in line with
+RFC 9309
+. Currently, JavaScript is not executed and Cookies are not used.
 Will the Common Crawl CCBot make my website slow for other users?
 The CCBot crawler has a number of algorithms designed to prevent undue load on web servers for a given domain.
 We have taken great care to ensure that our crawler will never cause web servers to slow down or be inaccessible to other users.
@@ -56,21 +73,19 @@
 For instance, to limit our crawler from request pages more than once every 2 seconds, add the following to your
 robots.txt
 file:
-‍
 User-agent: CCBot
 Crawl-delay: 2
 How can I block the Common Crawl CCBot?
 You configure your
 robots.txt
 file which uses the Robots Exclusion Protocol to block the crawler. Our bot’s exclusion
-User-Agent
+UserAgent
 string is:
 CCBot
 .
 Add these lines to your
 robots.txt
 file and our crawler will stop crawling your website:
-‍
 User-agent: CCBot
 Disallow: /
 We will periodically continue to check if the
@@ -96,7 +111,7 @@
 wait 24 hours
 before trying again.
 Please sleep between calls to our API (including if you run your script repeatedly in a loop), don't run multiple threads at once on the same IP, and don't use proxy networks. You should also ensure that you are using a properly formulated
-User-Agent
+UserAgent
 string (
 see RFC 9110
 ).
@@ -124,8 +139,10 @@
 GET
 requests. We also currently support the
 gzip
-and
+,
 Brotli
+, and
+ZStandard
 encoding formats.
 Why is the Common Crawl CCBot crawling pages I don’t have links to?
 The bot may have found your pages by following links from other sites.
@@ -168,6 +185,8 @@
 Get in touch
 The Data
 Overview
+CDXJ Index
+Columnar Index
 Web Graphs
 Latest Crawl
 Crawl Stats
@@ -178,10 +197,9 @@
 AI Agent
 Blog
 Examples
-Use Cases
 CCBot
 Infra Status
-Opt-out Registry
+Opt-Out Registry
 FAQ
 Community
 Research Papers
@@ -190,12 +208,11 @@
 Discord
 Collaborators
 About
+About
 Team
 Jobs
-Mission
-Impact
 Privacy Policy
 Terms of Use
 ©
-2025
+2026
 Common Crawl
\ No newline at end of file
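
This change documents two user-agent strings, `CCBot/2.0 (https://commoncrawl.org/faq/)` and the older `CCBot/1.0 (+https://commoncrawl.org/bot.html)`. A minimal sketch for recognizing them in access logs follows. The regular expression and the name `parse_ccbot_ua` are hypothetical and assume the `CCBot/<major>.<minor> (<contact>)` shape seen in those two examples; a user-agent string can be spoofed, so a match identifies a claim, not the real crawler.

```python
import re
from typing import Optional, Tuple

# Assumed shape, based only on the two strings quoted in the diff above.
CCBOT_UA = re.compile(r"^CCBot/(\d+)\.(\d+)(?:\s+\((.*)\))?$")


def parse_ccbot_ua(ua: str) -> Optional[Tuple[int, int, Optional[str]]]:
    """Return (major, minor, contact) for a CCBot user-agent, else None."""
    match = CCBOT_UA.match(ua.strip())
    if match is None:
        return None
    major, minor, contact = match.groups()
    return int(major), int(minor), contact


print(parse_ccbot_ua("CCBot/2.0 (https://commoncrawl.org/faq/)"))
# (2, 0, 'https://commoncrawl.org/faq/')
```

Parsing the version as integers leaves room for the future increments the FAQ mentions without changing the pattern.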