mirror of https://github.com/ai-robots-txt/ai.robots.txt.git, synced 2026-06-16 05:26:56 +02:00
Update from Dark Visitors
parent f420408eee
commit 3091ad0a23
8 changed files with 74 additions and 4 deletions
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} (AddSearchBot|AgentTimes|AI2Bot|AI2Bot\-DeepResearchEval|Ai2Bot\-Dolma|aiHitBot|amazon\-kendra|Amazonbot|AmazonBuyForMe|Amzn\-SearchBot|Amzn\-User|Andibot|Anomura|anthropic\-ai|ApifyBot|ApifyWebsiteContentCrawler|Applebot|Applebot\-Extended|Aranet\-SearchBot|atlassian\-bot|Awario|AzureAI\-SearchBot|bedrockbot|bigsur\.ai|Bravebot|Brightbot|Brightbot\ 1\.0|BuddyBot|Bytespider|CCBot|Channel3Bot|ChatGLM\-Spider|ChatGPT\ Agent|ChatGPT\-User|Claude\-Code|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|Cloudflare\-AutoRAG|CloudVertexBot|Code|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawl4AI|Crawlspace|Datenbank\ Crawler|DeepSeekBot|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|ExaBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Gemini\-Deep\-Research|Google\-Agent|Google\-CloudVertexBot|Google\-Extended|Google\-Firebase|Google\-Gemini\-CLI|Google\-NotebookLM|GoogleAgent\-Mariner|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|HenkBot|iAskBot|iaskspider|iaskspider/2\.0|IbouBot|ICC\-Crawler|ImagesiftBot|imageSpider|img2dataset|ISSCyberRiskCrawler|kagi\-fetcher|Kangaroo\ Bot|KlaviyoAIBot|KunatoCrawler|laion\-huggingface\-processor|LAIONDownloader|LCC|LinerBot|Linguee\ Bot|LinkupBot|Manus\-User|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|meta\-webindexer|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|NagetBot|netEstate\ Imprint\ Crawler|newsai|NotebookLM|NovaAct|OAI\-SearchBot|omgili|omgilibot|OpenAI|opencode|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poggio\-Citations|Poseidon\ Research\ Crawler|QualifiedBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Shap\-User|ShapBot|Sidetrade\ indexer\ bot|Spider|TavilyBot|Terra\ Cotta|TerraCotta|Thinkbot|TikTokSpider|Timpibot|Trae|TwinAgent|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|webzio\-extended|wpbot|WRTNBot|YaK|YandexAdditional|YandexAdditionalBot|YouBot|ZanistaBot) [NC]
+RewriteCond %{HTTP_USER_AGENT} (AddSearchBot|AgentTimes|AI2Bot|AI2Bot\-DeepResearchEval|Ai2Bot\-Dolma|aiHitBot|amazon\-kendra|Amazonbot|AmazonBuyForMe|Amzn\-SearchBot|Amzn\-User|Andibot|Anomura|anthropic\-ai|ApifyBot|ApifyWebsiteContentCrawler|Applebot|Applebot\-Extended|Aranet\-SearchBot|atlassian\-bot|Awario|AzureAI\-SearchBot|bedrockbot|bigsur\.ai|Bravebot|Brightbot|Brightbot\ 1\.0|BuddyBot|Bytespider|CCBot|Channel3Bot|ChatGLM\-Spider|ChatGPT\ Agent|ChatGPT\-User|Claude\-Code|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|Cloudflare\-AutoRAG|CloudVertexBot|Code|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|CragCrawler|Crawl4AI|Crawlspace|Datenbank\ Crawler|DeepSeekBot|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|ExaBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|GeistHaus\-PageFetcher|Gemini\-Deep\-Research|Google\-Agent|Google\-CloudVertexBot|Google\-Extended|Google\-Firebase|Google\-Gemini\-CLI|Google\-NotebookLM|GoogleAgent\-Mariner|GoogleAgent\-URLContext|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|HenkBot|iAskBot|iaskspider|iaskspider/2\.0|IbouBot|ICC\-Crawler|ImagesiftBot|imageSpider|img2dataset|ISSCyberRiskCrawler|kagi\-fetcher|Kangaroo\ Bot|Kimi\-User|KlaviyoAIBot|KunatoCrawler|laion\-huggingface\-processor|LAIONDownloader|LCC|LinerBot|Linguee\ Bot|LinkupBot|Manus\-User|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|meta\-webindexer|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|NagetBot|netEstate\ Imprint\ Crawler|newsai|NotebookLM|NovaAct|OAI\-SearchBot|omgili|omgilibot|OpenAI|opencode|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poggio\-Citations|Poseidon\ Research\ Crawler|QualifiedBot|Querit\-SearchBot|QueritBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Shap\-User|ShapBot|Sidetrade\ indexer\ bot|Spider|TavilyBot|Terra\ Cotta|TerraCotta|Thinkbot|TikTokSpider|Timpibot|Trae|TwinAgent|UseAI|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|webzio\-extended|wpbot|WRTNBot|YaK|YandexAdditional|YandexAdditionalBot|YouBot|ZanistaBot) [NC]
 RewriteRule !^/?robots\.txt$ - [F]
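The rewrite rule above denies any request whose User-Agent matches the alternation, except requests for robots.txt. As a rough illustration only, the Python sketch below builds a similar alternation from a plain list of names and applies the same decision. It is not the repository's generator: the six-name sample, `build_alternation`, and `is_blocked` are hypothetical, and the escaping relies on `re.escape` as it behaves in Python 3.7 or later (hyphens, spaces, and dots are escaped; underscores and slashes are not).

```python
import re

# Hypothetical sample; the real alternation in the hunk above is far longer.
BOT_NAMES = [
    "Cotoyogi",
    "CragCrawler",
    "Crawl4AI",
    "Datenbank Crawler",
    "Kimi-User",
    "quillbot.com",
]


def build_alternation(names):
    """Join the escaped names into one group, e.g. (Kimi\\-User|quillbot\\.com)."""
    return "(" + "|".join(re.escape(name) for name in names) + ")"


# re.IGNORECASE stands in for the [NC] flag; re.search is unanchored,
# so a name matches anywhere inside the User-Agent string.
BLOCKED = re.compile(build_alternation(BOT_NAMES), re.IGNORECASE)


def is_blocked(user_agent, path):
    """Deny matching agents everywhere except /robots.txt (assumed path form)."""
    if path == "/robots.txt":
        return False
    return BLOCKED.search(user_agent) is not None
```

Because the match is unanchored and case-insensitive, short entries in the full list such as `Code` or `Spider` would also match any User-Agent that merely contains those letters.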
@@ -1,3 +1,3 @@
 @aibots {
-    header_regexp User-Agent "(AddSearchBot|AgentTimes|AI2Bot|AI2Bot\-DeepResearchEval|Ai2Bot\-Dolma|aiHitBot|amazon\-kendra|Amazonbot|AmazonBuyForMe|Amzn\-SearchBot|Amzn\-User|Andibot|Anomura|anthropic\-ai|ApifyBot|ApifyWebsiteContentCrawler|Applebot|Applebot\-Extended|Aranet\-SearchBot|atlassian\-bot|Awario|AzureAI\-SearchBot|bedrockbot|bigsur\.ai|Bravebot|Brightbot|Brightbot\ 1\.0|BuddyBot|Bytespider|CCBot|Channel3Bot|ChatGLM\-Spider|ChatGPT\ Agent|ChatGPT\-User|Claude\-Code|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|Cloudflare\-AutoRAG|CloudVertexBot|Code|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawl4AI|Crawlspace|Datenbank\ Crawler|DeepSeekBot|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|ExaBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Gemini\-Deep\-Research|Google\-Agent|Google\-CloudVertexBot|Google\-Extended|Google\-Firebase|Google\-Gemini\-CLI|Google\-NotebookLM|GoogleAgent\-Mariner|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|HenkBot|iAskBot|iaskspider|iaskspider/2\.0|IbouBot|ICC\-Crawler|ImagesiftBot|imageSpider|img2dataset|ISSCyberRiskCrawler|kagi\-fetcher|Kangaroo\ Bot|KlaviyoAIBot|KunatoCrawler|laion\-huggingface\-processor|LAIONDownloader|LCC|LinerBot|Linguee\ Bot|LinkupBot|Manus\-User|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|meta\-webindexer|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|NagetBot|netEstate\ Imprint\ Crawler|newsai|NotebookLM|NovaAct|OAI\-SearchBot|omgili|omgilibot|OpenAI|opencode|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poggio\-Citations|Poseidon\ Research\ Crawler|QualifiedBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Shap\-User|ShapBot|Sidetrade\ indexer\ bot|Spider|TavilyBot|Terra\ Cotta|TerraCotta|Thinkbot|TikTokSpider|Timpibot|Trae|TwinAgent|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|webzio\-extended|wpbot|WRTNBot|YaK|YandexAdditional|YandexAdditionalBot|YouBot|ZanistaBot)"
+    header_regexp User-Agent "(AddSearchBot|AgentTimes|AI2Bot|AI2Bot\-DeepResearchEval|Ai2Bot\-Dolma|aiHitBot|amazon\-kendra|Amazonbot|AmazonBuyForMe|Amzn\-SearchBot|Amzn\-User|Andibot|Anomura|anthropic\-ai|ApifyBot|ApifyWebsiteContentCrawler|Applebot|Applebot\-Extended|Aranet\-SearchBot|atlassian\-bot|Awario|AzureAI\-SearchBot|bedrockbot|bigsur\.ai|Bravebot|Brightbot|Brightbot\ 1\.0|BuddyBot|Bytespider|CCBot|Channel3Bot|ChatGLM\-Spider|ChatGPT\ Agent|ChatGPT\-User|Claude\-Code|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|Cloudflare\-AutoRAG|CloudVertexBot|Code|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|CragCrawler|Crawl4AI|Crawlspace|Datenbank\ Crawler|DeepSeekBot|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|ExaBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|GeistHaus\-PageFetcher|Gemini\-Deep\-Research|Google\-Agent|Google\-CloudVertexBot|Google\-Extended|Google\-Firebase|Google\-Gemini\-CLI|Google\-NotebookLM|GoogleAgent\-Mariner|GoogleAgent\-URLContext|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|HenkBot|iAskBot|iaskspider|iaskspider/2\.0|IbouBot|ICC\-Crawler|ImagesiftBot|imageSpider|img2dataset|ISSCyberRiskCrawler|kagi\-fetcher|Kangaroo\ Bot|Kimi\-User|KlaviyoAIBot|KunatoCrawler|laion\-huggingface\-processor|LAIONDownloader|LCC|LinerBot|Linguee\ Bot|LinkupBot|Manus\-User|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|meta\-webindexer|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|NagetBot|netEstate\ Imprint\ Crawler|newsai|NotebookLM|NovaAct|OAI\-SearchBot|omgili|omgilibot|OpenAI|opencode|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poggio\-Citations|Poseidon\ Research\ Crawler|QualifiedBot|Querit\-SearchBot|QueritBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Shap\-User|ShapBot|Sidetrade\ indexer\ bot|Spider|TavilyBot|Terra\ Cotta|TerraCotta|Thinkbot|TikTokSpider|Timpibot|Trae|TwinAgent|UseAI|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|webzio\-extended|wpbot|WRTNBot|YaK|YandexAdditional|YandexAdditionalBot|YouBot|ZanistaBot)"
 }
@@ -43,6 +43,7 @@ Code
 cohere-ai
 cohere-training-data-crawler
 Cotoyogi
+CragCrawler
 Crawl4AI
 Crawlspace
 Datenbank Crawler
@@ -58,6 +59,7 @@ facebookexternalhit
 Factset_spyderbot
 FirecrawlAgent
 FriendlyCrawler
+GeistHaus-PageFetcher
 Gemini-Deep-Research
 Google-Agent
 Google-CloudVertexBot
@@ -66,6 +68,7 @@ Google-Firebase
 Google-Gemini-CLI
 Google-NotebookLM
 GoogleAgent-Mariner
+GoogleAgent-URLContext
 GoogleOther
 GoogleOther-Image
 GoogleOther-Video
@@ -82,6 +85,7 @@ img2dataset
 ISSCyberRiskCrawler
 kagi-fetcher
 Kangaroo Bot
+Kimi-User
 KlaviyoAIBot
 KunatoCrawler
 laion-huggingface-processor
@@ -120,6 +124,8 @@ PhindBot
 Poggio-Citations
 Poseidon Research Crawler
 QualifiedBot
+Querit-SearchBot
+QueritBot
 QuillBot
 quillbot.com
 SBIntuitionsBot
@@ -138,6 +144,7 @@ TikTokSpider
 Timpibot
 Trae
 TwinAgent
+UseAI
 VelenPublicWebCrawler
 WARDBot
 Webzio-Extended
@@ -1 +1 @@
-$HTTP["url"] != "/robots.txt" { $HTTP["user-agent"] =~ "(AddSearchBot|AgentTimes|AI2Bot|AI2Bot\-DeepResearchEval|Ai2Bot\-Dolma|aiHitBot|amazon\-kendra|Amazonbot|AmazonBuyForMe|Amzn\-SearchBot|Amzn\-User|Andibot|Anomura|anthropic\-ai|ApifyBot|ApifyWebsiteContentCrawler|Applebot|Applebot\-Extended|Aranet\-SearchBot|atlassian\-bot|Awario|AzureAI\-SearchBot|bedrockbot|bigsur\.ai|Bravebot|Brightbot|Brightbot\ 1\.0|BuddyBot|Bytespider|CCBot|Channel3Bot|ChatGLM\-Spider|ChatGPT\ Agent|ChatGPT\-User|Claude\-Code|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|Cloudflare\-AutoRAG|CloudVertexBot|Code|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawl4AI|Crawlspace|Datenbank\ Crawler|DeepSeekBot|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|ExaBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Gemini\-Deep\-Research|Google\-Agent|Google\-CloudVertexBot|Google\-Extended|Google\-Firebase|Google\-Gemini\-CLI|Google\-NotebookLM|GoogleAgent\-Mariner|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|HenkBot|iAskBot|iaskspider|iaskspider/2\.0|IbouBot|ICC\-Crawler|ImagesiftBot|imageSpider|img2dataset|ISSCyberRiskCrawler|kagi\-fetcher|Kangaroo\ Bot|KlaviyoAIBot|KunatoCrawler|laion\-huggingface\-processor|LAIONDownloader|LCC|LinerBot|Linguee\ Bot|LinkupBot|Manus\-User|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|meta\-webindexer|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|NagetBot|netEstate\ Imprint\ Crawler|newsai|NotebookLM|NovaAct|OAI\-SearchBot|omgili|omgilibot|OpenAI|opencode|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poggio\-Citations|Poseidon\ Research\ Crawler|QualifiedBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Shap\-User|ShapBot|Sidetrade\ indexer\ bot|Spider|TavilyBot|Terra\ Cotta|TerraCotta|Thinkbot|TikTokSpider|Timpibot|Trae|TwinAgent|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|webzio\-extended|wpbot|WRTNBot|YaK|YandexAdditional|YandexAdditionalBot|YouBot|ZanistaBot)" { url.access-deny = ( "" ) } }
+$HTTP["url"] != "/robots.txt" { $HTTP["user-agent"] =~ "(AddSearchBot|AgentTimes|AI2Bot|AI2Bot\-DeepResearchEval|Ai2Bot\-Dolma|aiHitBot|amazon\-kendra|Amazonbot|AmazonBuyForMe|Amzn\-SearchBot|Amzn\-User|Andibot|Anomura|anthropic\-ai|ApifyBot|ApifyWebsiteContentCrawler|Applebot|Applebot\-Extended|Aranet\-SearchBot|atlassian\-bot|Awario|AzureAI\-SearchBot|bedrockbot|bigsur\.ai|Bravebot|Brightbot|Brightbot\ 1\.0|BuddyBot|Bytespider|CCBot|Channel3Bot|ChatGLM\-Spider|ChatGPT\ Agent|ChatGPT\-User|Claude\-Code|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|Cloudflare\-AutoRAG|CloudVertexBot|Code|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|CragCrawler|Crawl4AI|Crawlspace|Datenbank\ Crawler|DeepSeekBot|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|ExaBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|GeistHaus\-PageFetcher|Gemini\-Deep\-Research|Google\-Agent|Google\-CloudVertexBot|Google\-Extended|Google\-Firebase|Google\-Gemini\-CLI|Google\-NotebookLM|GoogleAgent\-Mariner|GoogleAgent\-URLContext|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|HenkBot|iAskBot|iaskspider|iaskspider/2\.0|IbouBot|ICC\-Crawler|ImagesiftBot|imageSpider|img2dataset|ISSCyberRiskCrawler|kagi\-fetcher|Kangaroo\ Bot|Kimi\-User|KlaviyoAIBot|KunatoCrawler|laion\-huggingface\-processor|LAIONDownloader|LCC|LinerBot|Linguee\ Bot|LinkupBot|Manus\-User|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|meta\-webindexer|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|NagetBot|netEstate\ Imprint\ Crawler|newsai|NotebookLM|NovaAct|OAI\-SearchBot|omgili|omgilibot|OpenAI|opencode|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poggio\-Citations|Poseidon\ Research\ Crawler|QualifiedBot|Querit\-SearchBot|QueritBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Shap\-User|ShapBot|Sidetrade\ indexer\ bot|Spider|TavilyBot|Terra\ Cotta|TerraCotta|Thinkbot|TikTokSpider|Timpibot|Trae|TwinAgent|UseAI|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|webzio\-extended|wpbot|WRTNBot|YaK|YandexAdditional|YandexAdditionalBot|YouBot|ZanistaBot)" { url.access-deny = ( "" ) } }
@@ -1,6 +1,6 @@
 set $block 0;
 
-if ($http_user_agent ~* "(AddSearchBot|AgentTimes|AI2Bot|AI2Bot\-DeepResearchEval|Ai2Bot\-Dolma|aiHitBot|amazon\-kendra|Amazonbot|AmazonBuyForMe|Amzn\-SearchBot|Amzn\-User|Andibot|Anomura|anthropic\-ai|ApifyBot|ApifyWebsiteContentCrawler|Applebot|Applebot\-Extended|Aranet\-SearchBot|atlassian\-bot|Awario|AzureAI\-SearchBot|bedrockbot|bigsur\.ai|Bravebot|Brightbot|Brightbot\ 1\.0|BuddyBot|Bytespider|CCBot|Channel3Bot|ChatGLM\-Spider|ChatGPT\ Agent|ChatGPT\-User|Claude\-Code|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|Cloudflare\-AutoRAG|CloudVertexBot|Code|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawl4AI|Crawlspace|Datenbank\ Crawler|DeepSeekBot|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|ExaBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Gemini\-Deep\-Research|Google\-Agent|Google\-CloudVertexBot|Google\-Extended|Google\-Firebase|Google\-Gemini\-CLI|Google\-NotebookLM|GoogleAgent\-Mariner|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|HenkBot|iAskBot|iaskspider|iaskspider/2\.0|IbouBot|ICC\-Crawler|ImagesiftBot|imageSpider|img2dataset|ISSCyberRiskCrawler|kagi\-fetcher|Kangaroo\ Bot|KlaviyoAIBot|KunatoCrawler|laion\-huggingface\-processor|LAIONDownloader|LCC|LinerBot|Linguee\ Bot|LinkupBot|Manus\-User|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|meta\-webindexer|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|NagetBot|netEstate\ Imprint\ Crawler|newsai|NotebookLM|NovaAct|OAI\-SearchBot|omgili|omgilibot|OpenAI|opencode|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poggio\-Citations|Poseidon\ Research\ Crawler|QualifiedBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Shap\-User|ShapBot|Sidetrade\ indexer\ bot|Spider|TavilyBot|Terra\ Cotta|TerraCotta|Thinkbot|TikTokSpider|Timpibot|Trae|TwinAgent|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|webzio\-extended|wpbot|WRTNBot|YaK|YandexAdditional|YandexAdditionalBot|YouBot|ZanistaBot)") {
+if ($http_user_agent ~* "(AddSearchBot|AgentTimes|AI2Bot|AI2Bot\-DeepResearchEval|Ai2Bot\-Dolma|aiHitBot|amazon\-kendra|Amazonbot|AmazonBuyForMe|Amzn\-SearchBot|Amzn\-User|Andibot|Anomura|anthropic\-ai|ApifyBot|ApifyWebsiteContentCrawler|Applebot|Applebot\-Extended|Aranet\-SearchBot|atlassian\-bot|Awario|AzureAI\-SearchBot|bedrockbot|bigsur\.ai|Bravebot|Brightbot|Brightbot\ 1\.0|BuddyBot|Bytespider|CCBot|Channel3Bot|ChatGLM\-Spider|ChatGPT\ Agent|ChatGPT\-User|Claude\-Code|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|Cloudflare\-AutoRAG|CloudVertexBot|Code|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|CragCrawler|Crawl4AI|Crawlspace|Datenbank\ Crawler|DeepSeekBot|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|ExaBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|GeistHaus\-PageFetcher|Gemini\-Deep\-Research|Google\-Agent|Google\-CloudVertexBot|Google\-Extended|Google\-Firebase|Google\-Gemini\-CLI|Google\-NotebookLM|GoogleAgent\-Mariner|GoogleAgent\-URLContext|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|HenkBot|iAskBot|iaskspider|iaskspider/2\.0|IbouBot|ICC\-Crawler|ImagesiftBot|imageSpider|img2dataset|ISSCyberRiskCrawler|kagi\-fetcher|Kangaroo\ Bot|Kimi\-User|KlaviyoAIBot|KunatoCrawler|laion\-huggingface\-processor|LAIONDownloader|LCC|LinerBot|Linguee\ Bot|LinkupBot|Manus\-User|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|meta\-webindexer|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|NagetBot|netEstate\ Imprint\ Crawler|newsai|NotebookLM|NovaAct|OAI\-SearchBot|omgili|omgilibot|OpenAI|opencode|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poggio\-Citations|Poseidon\ Research\ Crawler|QualifiedBot|Querit\-SearchBot|QueritBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Shap\-User|ShapBot|Sidetrade\ indexer\ bot|Spider|TavilyBot|Terra\ Cotta|TerraCotta|Thinkbot|TikTokSpider|Timpibot|Trae|TwinAgent|UseAI|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|webzio\-extended|wpbot|WRTNBot|YaK|YandexAdditional|YandexAdditionalBot|YouBot|ZanistaBot)") {
     set $block 1;
 }
 
robots.json (49 lines changed)
@@ -314,6 +314,13 @@
         "frequency": "No information provided.",
         "description": "Scrapes data for AI training in Japanese language."
     },
+    "CragCrawler": {
+        "operator": "Unclear at this time.",
+        "respect": "Unclear at this time.",
+        "function": "AI Data Providers",
+        "frequency": "Unclear at this time.",
+        "description": "Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/cragcrawler"
+    },
     "Crawl4AI": {
         "operator": "Unclear at this time.",
         "respect": "Unclear at this time.",
@@ -419,6 +426,13 @@
         "operator": "Unknown",
         "respect": "[Yes](https://imho.alex-kunz.com/2024/01/25/an-update-on-friendly-crawler)"
     },
+    "GeistHaus-PageFetcher": {
+        "operator": "Unclear at this time.",
+        "respect": "Unclear at this time.",
+        "function": "AI Assistants",
+        "frequency": "Unclear at this time.",
+        "description": "Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/geisthaus-pagefetcher"
+    },
     "Gemini-Deep-Research": {
         "operator": "Unclear at this time.",
         "respect": "Unclear at this time.",
@@ -475,6 +489,13 @@
         "frequency": "Unclear at this time.",
         "description": "GoogleAgent-Mariner is an AI agent created by Google that can use a web browser. It can intelligently navigate and interact with websites to complete multi-step tasks on behalf of a human user. More info can be found at https://darkvisitors.com/agents/agents/googleagent-mariner"
     },
+    "GoogleAgent-URLContext": {
+        "operator": "Unclear at this time.",
+        "respect": "Unclear at this time.",
+        "function": "AI Assistants",
+        "frequency": "Unclear at this time.",
+        "description": "Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/googleagent-urlcontext"
+    },
     "GoogleOther": {
         "operator": "Google",
         "respect": "[Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers)",
@@ -587,6 +608,13 @@
         "frequency": "Unclear at this time.",
         "description": "Kangaroo Bot is used by the company Kangaroo LLM to download data to train AI models tailored to Australian language and culture. More info can be found at https://darkvisitors.com/agents/agents/kangaroo-bot"
     },
+    "Kimi-User": {
+        "operator": "Unclear at this time.",
+        "respect": "Unclear at this time.",
+        "function": "AI Assistants",
+        "frequency": "Unclear at this time.",
+        "description": "Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/kimi-user"
+    },
     "KlaviyoAIBot": {
         "operator": "[Klaviyo](https://www.klaviyo.com)",
         "respect": "[Yes](https://help.klaviyo.com/hc/en-us/articles/40496146232219)",
@@ -853,6 +881,20 @@
         "frequency": "No explicit frequency provided.",
         "description": "Operated by Qualified as part of their suite of AI product offerings."
     },
+    "Querit-SearchBot": {
+        "operator": "Unclear at this time.",
+        "respect": "Unclear at this time.",
+        "function": "AI Data Providers",
+        "frequency": "Unclear at this time.",
+        "description": "Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/querit-searchbot"
+    },
+    "QueritBot": {
+        "operator": "Unclear at this time.",
+        "respect": "Unclear at this time.",
+        "function": "AI Data Providers",
+        "frequency": "Unclear at this time.",
+        "description": "Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/queritbot"
+    },
     "QuillBot": {
         "description": "Operated by QuillBot as part of their suite of AI product offerings.",
         "frequency": "No explicit frequency provided.",
@@ -979,6 +1021,13 @@
         "frequency": "Unclear at this time.",
         "description": "Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/twinagent"
     },
+    "UseAI": {
+        "operator": "Unclear at this time.",
+        "respect": "Unclear at this time.",
+        "function": "AI Assistants",
+        "frequency": "Unclear at this time.",
+        "description": "Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/useai"
+    },
     "VelenPublicWebCrawler": {
         "operator": "[Velen Crawler](https://velen.io)",
         "respect": "[Yes](https://velen.io)",
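Every entry added to robots.json in the hunks above carries the same five keys. A reviewer who wants to check that mechanically could use something like the sketch below. It is only an illustration: treating these five keys as required is an assumption drawn from the added entries, and `REQUIRED_FIELDS`, `missing_fields`, and the `Incomplete-Example` entry are hypothetical.

```python
import json

# Assumed schema, inferred from the entries added in this commit.
REQUIRED_FIELDS = ("operator", "respect", "function", "frequency", "description")

# One entry copied from the diff plus one deliberately incomplete, made-up entry.
SAMPLE = """
{
    "Kimi-User": {
        "operator": "Unclear at this time.",
        "respect": "Unclear at this time.",
        "function": "AI Assistants",
        "frequency": "Unclear at this time.",
        "description": "Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/kimi-user"
    },
    "Incomplete-Example": {
        "operator": "Unknown"
    }
}
"""


def missing_fields(entries):
    """Map each agent name to the required fields its entry lacks."""
    report = {}
    for name, entry in entries.items():
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            report[name] = missing
    return report


entries = json.loads(SAMPLE)
problems = missing_fields(entries)
print(problems)
```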
@@ -43,6 +43,7 @@ User-agent: Code
 User-agent: cohere-ai
 User-agent: cohere-training-data-crawler
 User-agent: Cotoyogi
+User-agent: CragCrawler
 User-agent: Crawl4AI
 User-agent: Crawlspace
 User-agent: Datenbank Crawler
@@ -58,6 +59,7 @@ User-agent: facebookexternalhit
 User-agent: Factset_spyderbot
 User-agent: FirecrawlAgent
 User-agent: FriendlyCrawler
+User-agent: GeistHaus-PageFetcher
 User-agent: Gemini-Deep-Research
 User-agent: Google-Agent
 User-agent: Google-CloudVertexBot
@@ -66,6 +68,7 @@ User-agent: Google-Firebase
 User-agent: Google-Gemini-CLI
 User-agent: Google-NotebookLM
 User-agent: GoogleAgent-Mariner
+User-agent: GoogleAgent-URLContext
 User-agent: GoogleOther
 User-agent: GoogleOther-Image
 User-agent: GoogleOther-Video
@@ -82,6 +85,7 @@ User-agent: img2dataset
 User-agent: ISSCyberRiskCrawler
 User-agent: kagi-fetcher
 User-agent: Kangaroo Bot
+User-agent: Kimi-User
 User-agent: KlaviyoAIBot
 User-agent: KunatoCrawler
 User-agent: laion-huggingface-processor
@@ -120,6 +124,8 @@ User-agent: PhindBot
 User-agent: Poggio-Citations
 User-agent: Poseidon Research Crawler
 User-agent: QualifiedBot
+User-agent: Querit-SearchBot
+User-agent: QueritBot
 User-agent: QuillBot
 User-agent: quillbot.com
 User-agent: SBIntuitionsBot
@@ -138,6 +144,7 @@ User-agent: TikTokSpider
 Timpibot
 User-agent: Trae
 User-agent: TwinAgent
+User-agent: UseAI
 User-agent: VelenPublicWebCrawler
 User-agent: WARDBot
 User-agent: Webzio-Extended
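The robots.txt hunks above only show `User-agent:` lines; the rule that closes the group is outside the diff context. The sketch below assumes, purely for illustration, that the group ends with `Disallow: /`, and uses Python's standard `urllib.robotparser` to show how a compliant crawler would interpret such a group for two of the newly added names.

```python
from urllib.robotparser import RobotFileParser

# User-agent lines copied from the diff; the Disallow line is an assumption,
# since the rule that closes the group is not visible in these hunks.
ROBOTS_TXT = """\
User-agent: Kimi-User
User-agent: UseAI
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Listed agents are refused every path; agents not listed fall through to "allowed".
kimi_allowed = parser.can_fetch("Kimi-User", "/any/page")
useai_allowed = parser.can_fetch("UseAI", "/any/page")
browser_allowed = parser.can_fetch("Mozilla", "/any/page")
print(kimi_allowed, useai_allowed, browser_allowed)
```

robots.txt is advisory, which is presumably why the same commit also updates the server-side blocking rules for agents that ignore it.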
@ -45,6 +45,7 @@
|
|||
| cohere\-ai | [Cohere](https://cohere.com) | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
|
||||
| cohere\-training\-data\-crawler | Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products | Unclear at this time. | AI Data Scrapers | Unclear at this time. | cohere-training-data-crawler is a web crawler operated by Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products. More info can be found at https://darkvisitors.com/agents/agents/cohere-training-data-crawler |
|
||||
| Cotoyogi | [ROIS](https://ds.rois.ac.jp/en_center8/en_crawler/) | Yes | AI LLM Scraper. | No information provided. | Scrapes data for AI training in Japanese language. |
|
||||
| CragCrawler | Unclear at this time. | Unclear at this time. | AI Data Providers | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/cragcrawler |
|
||||
| Crawl4AI | Unclear at this time. | Unclear at this time. | Undocumented AI Agents | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/crawl4ai |
|
||||
| Crawlspace | [Crawlspace](https://crawlspace.dev) | [Yes](https://news.ycombinator.com/item?id=42756654) | Scrapes data | Unclear at this time. | Provides crawling services for any purpose, probably including AI model training. |
|
||||
| Datenbank Crawler | Datenbank | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Datenbank Crawler is an AI data scraper operated by Datenbank. It's not currently known to be artificially intelligent or AI-related. If you think that's incorrect or can provide more detail about its purpose, please contact us. More info can be found at https://darkvisitors.com/agents/agents/datenbank-crawler |
|
||||
|
|
@ -60,6 +61,7 @@
|
|||
| Factset\_spyderbot | [Factset](https://www.factset.com/ai) | Unclear at this time. | AI model training. | No information provided. | Scrapes data for AI training. |
|
||||
| FirecrawlAgent | [Firecrawl](https://www.firecrawl.dev/) | Yes | AI scraper and LLM training | No information provided. | Scrapes data for AI systems and LLM training. |
|
||||
| FriendlyCrawler | Unknown | [Yes](https://imho.alex-kunz.com/2024/01/25/an-update-on-friendly-crawler) | We are using the data from the crawler to build datasets for machine learning experiments. | Unclear at this time. | Unclear who the operator is; but data is used for training/machine learning. |
|
||||
| GeistHaus\-PageFetcher | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/geisthaus-pagefetcher |
|
||||
| Gemini\-Deep\-Research | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Gemini-Deep-Research is the agent responsible for collecting and scanning resources used in Google Gemini's Deep Research feature, which acts as a personal research assistant. More info can be found at https://darkvisitors.com/agents/agents/gemini-deep-research |
|
||||
| Google\-Agent | Unclear at this time. | Unclear at this time. | AI Agents | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/google-agent |
|
||||
| Google\-CloudVertexBot | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Build and manage AI models for businesses employing Vertex AI | No information. | Google-CloudVertexBot crawls sites on the site owners' request when building Vertex AI Agents. |
|
||||
|
|
@ -68,6 +70,7 @@
|
|||
| Google\-Gemini\-CLI | Unclear at this time. | Unclear at this time. | AI Coding Agents | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/google-gemini-cli |
|
||||
| Google\-NotebookLM | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Google-NotebookLM is an AI-powered research and note-taking assistant that helps users synthesize information from their own uploaded sources, such as documents, transcripts, or web content. It can generate summaries, answer questions, and highlight key themes from the materials you provide, acting like a personalized research companion built on Google's Gemini model. Google-NotebookLM fetches source URLs when users add them to their notebooks, enabling the AI to access and analyze those pages for context and insights. More info can be found at https://darkvisitors.com/agents/agents/google-notebooklm |
|
||||
| GoogleAgent\-Mariner | Google | Unclear at this time. | AI Agents | Unclear at this time. | GoogleAgent-Mariner is an AI agent created by Google that can use a web browser. It can intelligently navigate and interact with websites to complete multi-step tasks on behalf of a human user. More info can be found at https://darkvisitors.com/agents/agents/googleagent-mariner |
|
||||
| GoogleAgent\-URLContext | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/googleagent-urlcontext |
|
||||
| GoogleOther | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
|
||||
| GoogleOther\-Image | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
|
||||
| GoogleOther\-Video | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
|
||||
|
|
@@ -84,6 +87,7 @@
|
|||
| ISSCyberRiskCrawler | [ISS-Corporate](https://iss-cyber.com) | No | Scrapes data to train machine learning models. | No information. | Used to train machine learning based models to quantify cyber risk. |
|
||||
| kagi\-fetcher | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/kagi-fetcher |
|
||||
| Kangaroo Bot | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Kangaroo Bot is used by the company Kangaroo LLM to download data to train AI models tailored to Australian language and culture. More info can be found at https://darkvisitors.com/agents/agents/kangaroo-bot |
|
||||
| Kimi\-User | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/kimi-user |
|
||||
| KlaviyoAIBot | [Klaviyo](https://www.klaviyo.com) | [Yes](https://help.klaviyo.com/hc/en-us/articles/40496146232219) | AI Search Crawlers | Indexes based on 'change signals' and user configuration. | Indexes content to tailor AI experiences, generate content, answers and recommendations. |
|
||||
| KunatoCrawler | Unclear at this time. | Unclear at this time. | Undocumented AI Agents | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/kunatocrawler |
|
||||
| laion\-huggingface\-processor | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/laion-huggingface-processor |
|
||||
|
|
@@ -122,6 +126,8 @@
|
|||
| Poggio\-Citations | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/poggio-citations |
|
||||
| Poseidon Research Crawler | [Poseidon Research](https://www.poseidonresearch.com) | Unclear at this time. | AI research crawler | No explicit frequency provided. | Lab focused on scaling the interpretability research necessary to make better AI systems possible. |
|
||||
| QualifiedBot | [Qualified](https://www.qualified.com) | Unclear at this time. | Company offers AI agents and other related products; usage can be assumed to support said products. | No explicit frequency provided. | Operated by Qualified as part of their suite of AI product offerings. |
|
||||
| Querit\-SearchBot | Unclear at this time. | Unclear at this time. | AI Data Providers | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/querit-searchbot |
|
||||
| QueritBot | Unclear at this time. | Unclear at this time. | AI Data Providers | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/queritbot |
|
||||
| QuillBot | [Quillbot](https://quillbot.com) | Unclear at this time. | Company offers AI detection, writing tools and other services. | No explicit frequency provided. | Operated by QuillBot as part of their suite of AI product offerings. |
|
||||
| quillbot\.com | [Quillbot](https://quillbot.com) | Unclear at this time. | Company offers AI detection, writing tools and other services. | No explicit frequency provided. | Operated by QuillBot as part of their suite of AI product offerings. |
|
||||
| SBIntuitionsBot | [SB Intuitions](https://www.sbintuitions.co.jp/en/) | [Yes](https://www.sbintuitions.co.jp/en/bot/) | Uses data gathered in AI development and information analysis. | No information. | AI development and information analysis |
|
||||
|
|
@@ -140,6 +146,7 @@
|
|||
| Timpibot | [Timpi](https://timpi.io) | Unclear at this time. | Scrapes data for use in training LLMs. | No information. | Makes data available for training AI models. |
|
||||
| Trae | Unclear at this time. | Unclear at this time. | AI Coding Agents | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/trae |
|
||||
| TwinAgent | Unclear at this time. | Unclear at this time. | AI Agents | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/twinagent |
|
||||
| UseAI | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Description unavailable from darkvisitors.com More info can be found at https://darkvisitors.com/agents/agents/useai |
|
||||
| VelenPublicWebCrawler | [Velen Crawler](https://velen.io) | [Yes](https://velen.io) | Scrapes data for business data sets and machine learning models. | No information. | "Our goal with this crawler is to build business datasets and machine learning models to better understand the web." |
|
||||
| WARDBot | WEBSPARK | Unclear at this time. | AI Data Scrapers | Unclear at this time. | WARDBot is an AI data scraper operated by WEBSPARK. It's not currently known to be artificially intelligent or AI-related. If you think that's incorrect or can provide more detail about its purpose, please contact us. More info can be found at https://darkvisitors.com/agents/agents/wardbot |
|
||||
| Webzio\-Extended | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Webzio-Extended is a web crawler used by Webz.io to maintain a repository of web crawl data that it sells to other companies, including those using it to train AI models. More info can be found at https://darkvisitors.com/agents/agents/webzio-extended |
|
||||
|
|
|
|||