mirror of
https://github.com/ai-robots-txt/ai.robots.txt.git
synced 2025-12-29 12:18:33 +01:00

| Name | Operator | Respects robots.txt | Data use | Visit regularity | Description |
|---|---|---|---|---|---|
| AdsBot-Google | Google | Yes (Exceptions for Dynamic Search Ads) | Analyzes website content for ad relevancy, improves ad serving for Google Ads. Data anonymized according to Google's Privacy Policy (https://policies.google.com/privacy?hl=en-US). Unclear on data retention or use by other products. | Varies depending on campaign activity and website updates. Crawls optimized to minimize impact, specific frequency not public. | Web crawler by Google Ads to analyze websites for ad effectiveness and ensure ad relevancy to webpage content. |
| Amazonbot | Amazon | Yes | Service improvement and enabling answers for Alexa users. | No information provided. | Includes references to crawled website when surfacing answers via Alexa; does not clearly outline other uses. |
| anthropic-ai | Anthropic | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
| Applebot | Apple | Yes | Indexes sites to provide answers and search results for Siri users. | Irregular and may be prompted by user queries. | Used to answer queries from users; may include references to the indexed site. |
| AwarioRssBot | |||||
| AwarioSmartBot | |||||
| Bytespider | ByteDance | No | LLM training. | Unclear at this time. | Downloads data to train LLMs, including ChatGPT competitors. |
| CCBot | Common Crawl | Yes | Provides crawl data for an open source repository that has been used to train LLMs. | Unclear at this time. | Sources data that is made openly available and is used to train AI models. |
| ChatGPT-User | OpenAI | Yes | Takes action based on user prompts. | Only when prompted by a user. | Used by plugins in ChatGPT to answer queries based on user input. |
| ClaudeBot | Anthropic | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
| Claude-Web | Anthropic | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
| cohere-ai | Cohere | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
| DataForSeoBot | |||||
| Diffbot | |||||
| FacebookBot | |||||
| Google-Extended | |||||
| GoogleOther | |||||
| GPTBot | OpenAI | Yes | Scrapes data to train OpenAI's products. | No information provided. | Data is used to train current and future models; paywalled data, PII, and data that violates the company's policies are removed. |
| img2dataset | |||||
| ImagesiftBot | |||||
| magpie-crawler | |||||
| Meltwater | |||||
| omgili | |||||
| omgilibot | |||||
| peer39_crawler | |||||
| peer39_crawler/1.0 | |||||
| PerplexityBot | |||||
| PiplBot | |||||
| scoop.it | |||||
| Seekr | |||||
| YouBot | |||||
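
The entries in the Name column are user-agent tokens, so they can be placed in `User-agent` lines of a `robots.txt` file. The sketch below is illustrative only: the `BLOCKED_AGENTS` selection, the helper names `build_robots_txt` and `is_allowed`, and the `example.com` URL are assumptions made for this example, not part of the table. It builds a block-all rule group for a few listed names and checks it with Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical selection of user-agent names taken from the Name column.
BLOCKED_AGENTS = ["GPTBot", "CCBot", "ClaudeBot", "Bytespider"]


def build_robots_txt(agents):
    """Return robots.txt text with one group that disallows all paths."""
    lines = [f"User-agent: {name}" for name in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, agent, url):
    """Report whether the given rules permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(BLOCKED_AGENTS)
print(robots_txt)

# A listed agent versus an agent that no rule group mentions.
print(is_allowed(robots_txt, "GPTBot", "https://example.com/page"))
print(is_allowed(robots_txt, "SomeOtherBot", "https://example.com/page"))
```

A check like this only confirms what the rules say. Whether a crawler actually obeys them is what the "Respects robots.txt" column records: Bytespider, for example, is listed as "No", so a `Disallow` rule alone may not stop it.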