AI Crawlers & Robots.txt 2026: GPTBot, ClaudeBot — Allow or Block?
A new generation of web crawlers is indexing your content — not for search engine results pages, but for AI model training and real-time AI answers. GPTBot, ClaudeBot, PerplexityBot, and Google-Extended are the most prominent, and how you handle them in your robots.txt has real consequences for your visibility in AI-powered search. This guide covers every known AI crawler, the exact robots.txt syntax, and a clear framework for deciding whether to allow or block each one.
TL;DR — Quick Summary
- ✓ There are 8+ known AI crawlers active in 2026, each with a unique User-agent string
- ✓ For most websites: allow all AI crawlers to maximize visibility in AI-powered search
- ✓ OpenAI uses two crawlers: GPTBot (training) and ChatGPT-User (real-time) — you can block one and allow the other
- ✓ Blocking AI crawlers does NOT affect your Google Search rankings (Googlebot is separate)
- ✓ Robots.txt is voluntary, not legally binding — but all major AI companies currently honor it
Complete AI Crawler Directory 2026
| Crawler | Operator |
|---|---|
| GPTBot | OpenAI |
| ChatGPT-User | OpenAI |
| ClaudeBot | Anthropic |
| Claude-Web | Anthropic |
| PerplexityBot | Perplexity AI |
| Google-Extended | Google |
| CCBot | Common Crawl |
| Applebot-Extended | Apple |
The AI Crawler Landscape in 2026
Until 2023, the only web crawlers most site owners worried about were search engine crawlers: Googlebot, Bingbot, and maybe Yandexbot. Then OpenAI launched GPTBot in August 2023, and a new era of web crawling began. Today, there are at least 8 distinct AI crawlers actively indexing the web, each operated by a different company with a different purpose.
These AI crawlers serve two distinct functions: training data collection (crawling content to include in the next model training run) and real-time retrieval (fetching web pages on-the-fly when a user asks the AI a question). The distinction matters because you might want to allow real-time retrieval (so your content appears in AI answers) while blocking training data collection (so your content is not used to train the model without compensation).
As of February 2026, the robots.txt protocol (standardized in RFC 9309) remains the primary mechanism for controlling AI crawler access. All major AI companies — OpenAI, Anthropic, Google, Perplexity, and Apple — have publicly committed to honoring robots.txt directives for their AI crawlers. However, the Common Crawl dataset (CCBot) is a notable edge case because it is an open dataset used by many LLMs, including some that do not operate their own crawlers.
Every AI Crawler Explained
GPTBot (OpenAI)
User-agent: GPTBot
Purpose: Collects data for training OpenAI models (GPT-4, GPT-5, and successors).
Respects robots.txt: Yes (since August 2023).
IP range: Published by OpenAI in their official documentation.
GPTBot is OpenAI's primary training data crawler. Content crawled by GPTBot may be incorporated into future model training runs. OpenAI states they filter out content behind paywalls and known PII (personally identifiable information) from training data.
ChatGPT-User (OpenAI)
User-agent: ChatGPT-User
Purpose: Real-time web browsing when ChatGPT users invoke the search tool.
Respects robots.txt: Yes.
Key distinction: This crawler fetches pages in real-time to answer user queries. Content is not stored for training.
This is the crawler that matters for appearing in ChatGPT's real-time answers. When a user asks ChatGPT a question and it searches the web, ChatGPT-User fetches the pages, extracts relevant information, and cites the source in the response. Blocking ChatGPT-User means your content will never appear in ChatGPT's web-based answers.
ClaudeBot (Anthropic)
User-agent: ClaudeBot
Purpose: Training data collection and potentially real-time retrieval.
Respects robots.txt: Yes (since mid-2024).
Note: Anthropic previously used Claude-Web, which is now deprecated. Use ClaudeBot for current rules.
Unlike OpenAI, Anthropic does not currently separate training and real-time crawlers into distinct user-agents. ClaudeBot handles both functions. This means you cannot selectively allow real-time answering while blocking training for Claude — it is all or nothing with a single robots.txt rule.
PerplexityBot (Perplexity AI)
User-agent: PerplexityBot
Purpose: Real-time search and indexing for Perplexity's AI search engine.
Respects robots.txt: Yes (after controversy in mid-2024).
Note: Perplexity faced criticism in 2024 for reportedly ignoring some robots.txt directives. They have since committed to full compliance.
Perplexity is a dedicated AI search engine that fetches and cites web sources in real-time for every query. Blocking PerplexityBot means your content will not appear in Perplexity's answers; the service processes over 100 million queries per month as of 2026.
Google-Extended (Google)
User-agent: Google-Extended
Purpose: Training data for Google's Gemini AI models and AI Overviews.
Respects robots.txt: Yes.
Critical distinction: Blocking Google-Extended does NOT affect Googlebot or your search rankings. It only affects AI training and AI Overviews.
Google-Extended is separate from Googlebot (the main search crawler). Your Google Search rankings are entirely unaffected by Google-Extended rules. However, blocking Google-Extended may prevent your content from being used in Google AI Overviews — the AI-generated summaries that appear at the top of Google search results for many queries. Given that AI Overviews are becoming an increasingly important traffic source, blocking Google-Extended has real visibility costs.
CCBot (Common Crawl)
User-agent: CCBot
Purpose: Builds an open web dataset used by researchers and many LLM companies.
Respects robots.txt: Yes.
Common Crawl is a nonprofit that crawls the web and makes the data freely available. Its dataset is used by many AI companies for training, including those that do not operate their own crawlers. Blocking CCBot reduces the chance that your content appears in the Common Crawl dataset, but companies can also license web data from other sources. Blocking CCBot is the broadest signal you can send, but it is not a complete solution.
Other Notable Crawlers
- Applebot-Extended (Apple): Used for Apple Intelligence features. Separate from the standard Applebot, which powers Siri and Spotlight search.
- FacebookBot (Meta): Primarily for link previews, but Meta has AI training interests.
- Bytespider (ByteDance/TikTok): Used for content indexing and potentially AI training.

Each has its own User-agent string and robots.txt behavior.
Should You Block AI Crawlers? Arguments For and Against
Arguments FOR Blocking
1. Content protection — Your content is used for model training without direct compensation or attribution
2. Bandwidth concerns — AI crawlers can be aggressive, consuming significant server resources
3. Zero-click problem — AI answers may reduce click-throughs to your site
4. Competitive data — Proprietary research or analysis you do not want competitors accessing via AI
Arguments AGAINST Blocking
1. AI visibility loss — Your content will not appear in AI-generated answers or citations
2. Growing traffic source — AI search is becoming a major referral channel
3. Cannot stop all LLMs — Your content is likely already in training data; blocking now is closing the door after the fact
4. Brand authority — Being cited by AI engines builds brand trust and awareness
Key Insight
For most businesses and content publishers, the AI visibility benefit outweighs the content protection concern. The traffic from being cited in AI answers is high-quality (high intent, high trust) and growing rapidly. Unless you have specific reasons to protect content (premium paywalled content, proprietary data, or legal obligations), allowing AI crawlers is the stronger business decision.
Robots.txt Syntax: Exact Rules for Every AI Crawler
The robots.txt syntax for AI crawlers follows the same rules as any other crawler (RFC 9309). Each AI crawler has a specific User-agent string that you reference in your robots.txt file. Here are the exact rules for every scenario.
# Allow all AI crawlers for maximum AI search visibility
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Allow: /
# Block all known AI crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
Important: Default Behavior
If your robots.txt does not mention an AI crawler at all, the default behavior depends on your User-agent: * rule. If you have User-agent: * / Allow: / (or no robots.txt at all), all AI crawlers are allowed by default. If you have specific disallow rules under User-agent: *, those rules apply to AI crawlers too unless you explicitly override them with a specific User-agent rule.
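You can verify how these fallback rules resolve for a given crawler with Python's standard library. The sketch below uses urllib.robotparser against a hypothetical robots.txt in which GPTBot has its own group while ClaudeBot, unmentioned, falls back to the wildcard rules:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a wildcard disallow plus one crawler-specific group.
rules = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot matches its own group, so the wildcard disallow does not apply to it.
print(rp.can_fetch("GPTBot", "https://example.com/private/page"))    # True
# ClaudeBot has no group of its own and inherits the "*" rules.
print(rp.can_fetch("ClaudeBot", "https://example.com/private/page")) # False
```

The same check works against a live site by calling `set_url(...)` and `read()` instead of `parse(...)`.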
Granular Control: Block Training, Allow Answering
The most nuanced approach is to block model training while allowing real-time answering. This way, your content appears in AI-generated answers (with citations and links) but is not used to train the next model version. Currently, only OpenAI supports this distinction with two separate crawlers.
# Block OpenAI training but allow real-time ChatGPT answers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Allow: /
# For Anthropic, Perplexity, Google: no separate training/answering crawlers
# You must decide: allow both or block both
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
This nuanced approach gives you the best of both worlds for OpenAI: your content appears in ChatGPT answers with citations and links back to your site, but it is not incorporated into future training runs. For Anthropic and Perplexity, you must decide between full access or no access until they introduce separate crawlers for training and real-time use.
| Platform | Training Crawler | Answering Crawler | Granular Control? |
|---|---|---|---|
| OpenAI | GPTBot | ChatGPT-User | Yes |
| Anthropic | ClaudeBot | ClaudeBot | No (single crawler) |
| Perplexity | PerplexityBot | PerplexityBot | No (single crawler) |
| Google | Google-Extended | Googlebot* | Partial** |
* Google AI Overviews use Googlebot for retrieval, so you cannot block AI Overviews retrieval without blocking regular search crawling. ** Blocking Google-Extended prevents training only; AI Overviews retrieval depends on Googlebot access.
The Legal Landscape: Is Robots.txt Enforceable?
Robots.txt is a voluntary protocol, not a legal contract. RFC 9309 (published in September 2022) standardized the protocol but does not impose legal obligations. Crawlers can technically ignore robots.txt rules without violating a law — though doing so can expose them to other legal claims.
Several high-profile lawsuits have been filed against AI companies for web scraping, including cases by the New York Times, publishers, and individual content creators. While robots.txt itself is not legally binding, courts have considered it as evidence of a site owner's clear intent. Combined with Terms of Service that explicitly prohibit automated scraping and AI training, robots.txt strengthens a content owner's legal position.
In practice, all major AI companies currently honor robots.txt. The reputational cost of ignoring it is too high — Perplexity faced significant backlash in 2024 when reports emerged that it was bypassing robots.txt directives, and the company quickly committed to full compliance. The social contract around robots.txt remains strong even without legal enforcement.
Impact on LLM Optimization: The Visibility Trade-Off
Blocking AI crawlers has a direct and measurable impact on your visibility in AI-powered search. If you block GPTBot and ChatGPT-User, your content will not appear in any ChatGPT answer — not in real-time search results, not in citations, and not in browsing mode. The same applies to other platforms: blocking their crawlers means complete invisibility on those platforms.
For a detailed guide on optimizing your content for AI search (beyond just crawler access), see our comprehensive LLM Optimization Guide, which covers all 13 optimization parameters including content format, structured data, brand authority signals, and measurement.
Critical Warning
If you are investing in content marketing, SEO, or brand building, blocking AI crawlers actively undermines your investment. AI search is growing at 40%+ year-over-year, and brands that are invisible in AI answers today will have a compounding disadvantage as the channel matures. Unless you have specific content protection requirements (premium paywalled content, legally sensitive data), blocking AI crawlers is a strategic mistake for most businesses.
AI Crawler Decision Matrix by Site Type
| Site Type | Recommendation | Reasoning |
|---|---|---|
| Business / SaaS | Allow All | Maximum AI visibility drives brand awareness and leads |
| Blog / Content Site | Allow All | AI citations drive high-quality referral traffic |
| E-commerce | Allow All | Product recommendations in AI answers drive sales |
| News Publisher | Selective | Allow answering, consider blocking training (revenue protection) |
| Premium / Paywalled | Block Training | Protect premium content from free AI access |
| Legal / Compliance | Case by Case | Data sensitivity may require blocking; consult legal team |
| Portfolio / Personal | Allow All | Visibility and brand building outweigh any risk |
Recommended Approach for Most Websites
For the majority of websites — businesses, blogs, e-commerce sites, SaaS companies, and personal brands — we recommend allowing all AI crawlers by default. The visibility benefit of appearing in AI-powered search outweighs the content protection concern for most use cases.
1. Audit your current robots.txt: use InstaRank SEO's Robots.txt checker to see if you have any rules blocking AI crawlers. Many sites inadvertently block them through overly restrictive wildcard rules.
2. Add explicit Allow rules for all AI crawlers: even if your default rule allows everything, explicit Allow rules for each AI crawler make your intent clear and prevent accidental blocking from future rule changes.
3. Consider selective blocking for sensitive paths: if you have premium content, admin areas, or sensitive data, block AI crawlers from those specific paths while allowing the rest of your site.
4. Monitor AI crawler activity in server logs: track which AI crawlers are visiting, how often, and which pages they access. This data informs your ongoing strategy.
5. Review quarterly: the AI crawler landscape is evolving. New crawlers appear, companies change policies, and your business needs may shift. Review your robots.txt AI rules every quarter.
Monitoring AI Crawler Activity
Once you have configured your robots.txt rules, you need to verify that AI crawlers are actually visiting your site and accessing the content you want them to see. Server logs are the primary data source for this monitoring.
Server Log Analysis
Look for these User-agent strings in your server access logs:
2026-02-23 08:14:22  GPTBot/1.2         GET /blog/seo-guide         200  45.2KB
2026-02-23 08:15:01  ClaudeBot/1.0      GET /blog/seo-guide         200  45.2KB
2026-02-23 08:16:45  PerplexityBot/1.0  GET /blog/ai-crawlers       200  38.1KB
2026-02-23 08:17:33  Google-Extended    GET /blog/robots-txt-guide  200  42.7KB
2026-02-23 08:19:12  ChatGPT-User/1.0   GET /tools/seo-audit        200  12.3KB
You can filter your server logs using standard tools. For Apache, use grep -E "GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|Google-Extended" access.log. For Nginx, the same pattern works with your access log file. Cloud providers like Cloudflare, Vercel, and AWS CloudFront also log User-agent strings in their analytics dashboards.
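If you prefer scripting over one-off grep commands, the same filtering can be done in a few lines of Python. This is a minimal sketch: `count_ai_hits` is a hypothetical helper, the sample lines are made up, and in practice you would read lines from your actual access-log file:

```python
from collections import Counter

AI_CRAWLERS = ("GPTBot", "ChatGPT-User", "ClaudeBot",
               "PerplexityBot", "Google-Extended", "CCBot")

def count_ai_hits(lines):
    # Count hits per AI crawler by matching the User-agent substring
    # anywhere in the raw log line (good enough for common log formats).
    hits = Counter()
    for line in lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                hits[bot] += 1
                break
    return hits

# Made-up sample lines in a combined-log-like format:
sample = [
    '1.2.3.4 - - [23/Feb/2026:08:14:22] "GET /blog/seo-guide HTTP/1.1" 200 46285 "-" "GPTBot/1.2"',
    '5.6.7.8 - - [23/Feb/2026:08:15:01] "GET /blog/seo-guide HTTP/1.1" 200 46285 "-" "ClaudeBot/1.0"',
    '9.9.9.9 - - [23/Feb/2026:08:15:30] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
print(count_ai_hits(sample))  # GPTBot: 1, ClaudeBot: 1; the browser hit is ignored
```

In a real pipeline you would swap `sample` for `open("/var/log/nginx/access.log")` (path varies by server) and perhaps also bucket hits by day to spot crawl-frequency trends.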
What to Look For
- Crawl frequency — How often each AI crawler visits. Increasing frequency suggests your content is being prioritized for indexing.
- Pages accessed — Which pages AI crawlers hit most. This tells you what content is most interesting to AI retrieval systems.
- Status codes — Ensure crawlers are getting 200 responses, not 403 or 404. A 403 means your server is blocking the crawler despite robots.txt allowing it (check firewall rules, CDN settings, and WAF rules).
- Crawl depth — Are crawlers exploring your site deeply or only hitting top-level pages? Deep crawling indicates strong site authority and internal linking.
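One way to test for the firewall-level blocking described above (a 403 despite a permissive robots.txt) is to request a page while presenting a crawler's User-agent header and compare the status code with a normal browser request. The sketch below only builds the request with Python's urllib; the actual network call is left commented out, and the URL and version string are placeholders:

```python
import urllib.request

def make_crawler_request(url: str, user_agent: str) -> urllib.request.Request:
    # Build a request that presents an AI crawler's User-agent, so you can
    # check whether a firewall, WAF, or CDN rule returns 403 for that agent.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = make_crawler_request("https://example.com/", "GPTBot/1.2")  # placeholder URL
# with urllib.request.urlopen(req) as resp:  # uncomment to actually probe
#     print(resp.status)  # a 403 here, while robots.txt allows GPTBot,
#                         # points at a firewall/CDN rule, not robots.txt
print(req.get_header("User-agent"))  # urllib stores header keys capitalized
```

Note this only approximates a crawler visit: sophisticated WAF rules also verify the requester's IP against the published crawler ranges, so a 200 from your own machine does not guarantee the real crawler gets through.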
Best Practice
Use InstaRank SEO's Robots.txt Checker to verify that your robots.txt correctly allows or blocks AI crawlers. The tool checks for syntax errors, conflicting rules, and common misconfigurations that might inadvertently block crawlers you want to allow (or vice versa).
Check Your AI Crawler Configuration
- → Verify which AI crawlers your robots.txt allows or blocks
- → Detect conflicting rules that may inadvertently block crawlers
- → Get specific recommendations for your site type
- → Audit all technical SEO factors in one free scan
Run a free robots.txt audit on your website:
Run Free Site Audit →