Index: ccbot
===================================================================
--- ccbot 2026-05-16
+++ ccbot 2026-05-28
@@ -1,6 +1,9 @@
Frequently Asked Questions
Everything you need to know regarding general and technical questions about
Common Crawl.
+For general information regarding our crawler, please see the
+CCBot
+page.
General Questions
What is Common Crawl?
Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the Internet to Internet researchers, companies and individuals at no cost for the purpose of research and analysis.
@@ -20,7 +23,8 @@
page.
Technical Questions
What is the Common Crawl CCBot crawler?
-CCBot is a
+CCBot
+is a
Nutch-based
web crawler that makes use of the Apache Hadoop project.
We use
@@ -28,7 +32,8 @@
to process and extract crawl candidates from our crawl database.
This candidate list is sorted by host (domain name) and then distributed to a set of crawler servers.
How does the Common Crawl CCBot identify itself?
-CCBot identifies itself via its
+CCBot
+identifies itself via its
UserAgent
string as:
@@ -40,7 +45,8 @@
CCBot/1.0 (+https://commoncrawl.org/bot.html)
We may increment the version number in the future.
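Because the version number may change, a log filter that matches the exact string above could stop working after a version bump. The following is a minimal sketch, not part of Common Crawl's tooling, of a version-agnostic check. The names `CCBOT_UA` and `ccbot_version` are hypothetical, and the two-part `major.minor` version format is an assumption based on the single example above.

```python
import re

# Assumption: the token always looks like "CCBot/<major>.<minor>".
# Matching the token rather than the full string tolerates version bumps.
CCBOT_UA = re.compile(r"\bCCBot/(\d+)\.(\d+)\b")


def ccbot_version(user_agent):
    """Return (major, minor) if the User-Agent claims to be CCBot, else None."""
    match = CCBOT_UA.search(user_agent or "")
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Any client can send this User-Agent string, so a match only shows that a request claims to come from CCBot.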
How does CCBot fetch a web page?
-CCBot is an automated crawler, checking first the
+CCBot
+is an automated crawler, checking first the
robots.txt
, and if crawling a page is allowed, fetches pages using
HTTP
@@ -57,7 +63,9 @@
RFC 9309
. Currently, JavaScript is not executed and Cookies are not used.
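To preview how a set of robots.txt rules might apply to CCBot, a webmaster could run them through a parser locally. The sketch below uses Python's standard-library `urllib.robotparser`, which is not CCBot's own parser and may differ from it in edge cases. The helper name `ccbot_may_fetch`, the sample rules, and the `example.com` URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration: keep CCBot out of /private/ only.
EXAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /private/
"""


def ccbot_may_fetch(robots_txt, url):
    """Return True if these robots.txt rules allow the CCBot agent to fetch url.

    Uses Python's stdlib parser as a stand-in; CCBot's real parser may
    resolve some edge cases differently.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)
```

With the sample rules, `ccbot_may_fetch(EXAMPLE_ROBOTS, "https://example.com/private/page.html")` is disallowed, while a URL outside `/private/` is allowed.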
Will the Common Crawl CCBot make my website slow for other users?
-The CCBot crawler has a number of algorithms designed to prevent undue load on web servers for a given domain.
+The
+CCBot
+crawler has a number of algorithms designed to prevent undue load on web servers for a given domain.
We have taken great care to ensure that our crawler will never cause web servers to slow down or be inaccessible to other users.
The crawler uses an adaptive back-off algorithm that slows down requests to your website if your web server is responding with an
HTTP 429
@@ -69,7 +77,9 @@
Crawl-delay
parameter for
robots.txt
-. By increasing that number, you will indicate to CCBot to slow down the rate of crawling.
+. By increasing that number, you will indicate to
+CCBot
+to slow down the rate of crawling.
For instance, to limit our crawler from requesting pages more than once every 2 seconds, add the following to your
robots.txt
file:
@@ -91,6 +101,9 @@
We will periodically continue to check if the
robots.txt
file has been updated.
+You may also wish to be added to our opt-out registry. Please read
+this blog post
+for further information.
Can I add my website to Common Crawl?
Common Crawl's dataset is a sample of the web, and we do not generally archive any entire website but a randomly selected subset of it. Our crawler supports the Sitemap Protocol and utilizes any Sitemap announced in the
robots.txt
@@ -160,7 +173,8 @@
This information is also provided as JSON at
https://index.commoncrawl.org/ccbot.json
.
-CCBot is now run on dedicated IP address ranges with reverse DNS. This allows webmasters to verify whether a logged request stems from the real CCBot, for example:
+CCBot
+is now run on dedicated IP address ranges with reverse DNS. This allows webmasters to verify whether a logged request stems from the real CCBot, for example:
$> host 18.97.14.84
84.14.97.18.in-addr.arpa domain name pointer 18-97-14-84.crawl.commoncrawl.org.
$> host 18-97-14-84.crawl.commoncrawl.org