Your robots.txt is probably blocking AI bots without you knowing it. According to an Originality.ai study from 2025, 73% of websites block at least one AI crawler — often by default, through an overly broad Disallow rule inherited from a migration or a WordPress template.
In 2026, the landscape has changed. Seven major AI bots crawl the web constantly, and the distinction between those that train models and those that power real-time citations has become strategic. Block the wrong bot, and you disappear from AI responses. Allow them all without thinking, and you give away your training data for free.
This article gives you the reference configuration — robots.txt, sitemap.xml and llms.txt — for 2026.
AI bots in 2026: who crawls what
The first mistake is treating all AI bots the same way. There are two fundamentally different categories:
- Training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, Bytespider): they collect data to improve models. Blocking them has no immediate impact on your visibility in AI responses.
- Citation crawlers (OAI-SearchBot, PerplexityBot, Googlebot for AI Overviews): they power real-time responses. Blocking them means disappearing from AI citations.
| Bot | Owner | Type | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Training + search | Yes |
| OAI-SearchBot | OpenAI | Real-time citation | Yes |
| ClaudeBot | Anthropic | Training | Yes |
| PerplexityBot | Perplexity | Real-time citation | Yes |
| Google-Extended | Google | Gemini training | Yes |
| Bytespider | ByteDance | Training | Partial |
| CCBot | Common Crawl | Open corpus | Yes |
The strategic recommendation: always allow citation crawlers (OAI-SearchBot, PerplexityBot). For training crawlers, the decision depends on your strategy — some sites choose to block training while remaining citable. For a deeper dive, see our technical guide on AI crawlability.
robots.txt: strategic configuration for AI bots
Your robots.txt file is the first control lever. Here are two configurations based on your strategy, plus the classic trap to avoid.
"Allow all" configuration (recommended for GEO)
If your goal is to maximize AI visibility — citations, recommendations, appearing in responses — allow all bots:
```
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://your-site.com/sitemap.xml
```
Why list each bot individually when `User-agent: *` already covers them? Because under the robots.txt standard, a bot that finds a group matching its own user agent follows that group and ignores `*`. By explicitly listing each bot with `Allow: /`, you eliminate any ambiguity.
"Citation yes, training no" configuration
If you want to be cited by AI without your data being used to train models:
```
User-agent: *
Allow: /

# Citation crawlers — ALLOW
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training crawlers — BLOCK
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

Sitemap: https://your-site.com/sitemap.xml
```
Warning: this configuration has a limitation. AI models are trained on past snapshots. If you block GPTBot today, future GPT versions will know less about your site, which can indirectly reduce your citations long-term. It's a trade-off to evaluate.
Classic trap: the global Disallow: / rule
The most common trap: a `User-agent: *` group with `Disallow: /` and no exceptions, which blocks all AI bots at once. You often see this on sites migrated from WordPress with a misconfigured SEO plugin, or on staging sites whose blocking robots.txt was accidentally shipped to production.
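For reference, the trap looks like this (three lines that hide your entire site from every crawler, AI bots included):

```
# The classic trap: one rule, every bot blocked
User-agent: *
Disallow: /
```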
Is your robots.txt blocking AI bots? The Detekia GEO audit checks your site's crawlability.
Analyze my website for free →
Sitemap.xml: signals that help AI
The sitemap isn't just for Google. AI bots like PerplexityBot and OAI-SearchBot also read it to discover your pages. Three signals are particularly important.
<lastmod> — The freshness signal
AI engines value fresh content. The `<lastmod>` tag tells their crawlers when a page was last updated. According to Growth Memo observations (2026), pages with a recent `<lastmod>` are crawled more frequently by AI bots.
Rule: only update `<lastmod>` when the content actually changes. Not on every build, not automatically. AI bots (and Google) detect artificial `<lastmod>` values and ignore them.
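One way to honor this rule in a static-site build is to derive `<lastmod>` from each content file's last Git commit rather than the build timestamp. A minimal sketch in Python, assuming your pages live as Markdown files in a Git repository (the `content/` directory and output format are illustrative):

```python
import subprocess
from pathlib import Path

def lastmod_from_git(path: Path) -> str:
    """Return the ISO 8601 date of the last commit that touched this file."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%cI", "--", str(path)],
        capture_output=True, text=True, check=True,
    )
    # %cI is the committer date in strict ISO 8601,
    # e.g. 2026-04-15T09:30:00+02:00
    return out.stdout.strip()

# Illustrative usage: emit one <lastmod> per Markdown source file
for page in sorted(Path("content").glob("*.md")):
    print(f"<lastmod>{lastmod_from_git(page)}</lastmod>  <!-- {page} -->")
```

Because the date only moves when a commit actually changes the file, redeploys leave `<lastmod>` untouched.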
<priority> and structure
The `<priority>` tag is ignored by most engines, but sitemap structure matters. Split your sitemaps if you have more than 100 URLs: a `sitemap-pages.xml` for marketing pages and a `sitemap-blog.xml` for articles. This helps bots prioritize.
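Both files are then declared in a sitemap index at the root, which is what you reference from robots.txt. A minimal example using the file names above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://your-site.com/sitemap-pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://your-site.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```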
hreflang in the sitemap
For multilingual sites, `xhtml:link` tags with `hreflang` in the sitemap help AI associate the correct language versions. This is particularly important for Perplexity, which adapts responses to the user's language.
```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://your-site.com/geo-guide</loc>
    <lastmod>2026-04-15</lastmod>
    <xhtml:link rel="alternate" hreflang="fr"
                href="https://your-site.com/geo-guide" />
    <xhtml:link rel="alternate" hreflang="en"
                href="https://your-site.com/en/geo-guide" />
    <xhtml:link rel="alternate" hreflang="x-default"
                href="https://your-site.com/geo-guide" />
  </url>
</urlset>
```
Note the `xmlns:xhtml` declaration on `<urlset>`: without it, the `xhtml:link` elements are invalid XML.
llms.txt: the emerging standard for AI
The llms.txt file is an initiative launched in late 2024 to give LLMs a structured summary of your site. Unlike robots.txt (which tells bots what they can crawl), llms.txt tells them what your site is and which pages are most important.
As of April 2026, llms.txt isn't yet an official standard, but it's read by some crawlers and can influence how AI understands your site. It's an optional but increasingly recommended signal.
```markdown
# Detekia

> Detekia is a GEO audit tool that analyzes website visibility
> on AI engines (ChatGPT, Gemini, Perplexity).

## Main pages

- [Home](https://detekia.fr): Free GEO audit
- [Methodology](https://detekia.fr/methodologie): 8 GEO criteria
- [Pricing](https://detekia.fr/pricing): 1-page and complete audit
- [Blog](https://detekia.fr/blog): GEO and SEO guides

## Expertise

- AI visibility auditing
- GEO scoring out of 100
- Technical recommendations with code
- Methodology based on Princeton/KDD 2024
```
Best practices: keep it concise (under 500 words), structure it in Markdown sections, and update it when your offering evolves. The file goes at the root: your-site.com/llms.txt.
For a complete guide on llms.txt and AI crawlability, see our dedicated article: llms.txt, robots.txt and AI crawlability.
5 traps to avoid
- Blocking GPTBot thinking you're only blocking training. OpenAI uses `OAI-SearchBot` for real-time citations, separate from `GPTBot`. If you only block GPTBot, your ChatGPT citations are preserved. But if your rule blocks both, you disappear.
- Forgetting the `Sitemap:` directive in robots.txt. It's the simplest way for bots to discover your sitemap. Without this line, some AI crawlers don't find it automatically.
- Automatic `lastmod` on every deploy. If all your pages have today's date as `lastmod`, bots end up ignoring this signal. Only update pages whose content has actually changed.
- Overly aggressive `Crawl-delay`. Some sites add `Crawl-delay: 10` to limit server load. AI bots like PerplexityBot respect this directive — a 10-second delay between pages means crawling 100 pages takes about 17 minutes. For content sites, that's a bottleneck for AI indexing.
- Not testing after migrations. CMS migrations, CDN changes and reverse proxy updates can silently overwrite your robots.txt. Always test after any infrastructure change with a simple `curl https://your-site.com/robots.txt`, or script the check as shown below.
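To script that last check, here is a minimal sketch using Python's standard `urllib.robotparser`; the site URL is a placeholder to adapt, and the bot list mirrors the table above:

```python
from urllib.robotparser import RobotFileParser

SITE = "https://your-site.com"
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot",
           "PerplexityBot", "Google-Extended", "CCBot", "Bytespider"]

# Fetch and parse the live robots.txt
parser = RobotFileParser(f"{SITE}/robots.txt")
parser.read()

# Report which AI user agents may fetch the homepage
for bot in AI_BOTS:
    allowed = parser.can_fetch(bot, f"{SITE}/")
    print(f"{bot:16} {'allowed' if allowed else 'BLOCKED'}")
```

Run it in CI after every deploy: any unexpected `BLOCKED` line means an infrastructure change silently rewrote your rules.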
2026 reference configuration
Here's the complete configuration to copy-paste and adapt. It allows all citation bots and all training bots, and declares your sitemap.
```
# robots.txt — Optimal GEO configuration 2026
# Documentation: detekia.fr/blog/sitemap-robots-txt-bots-ia-2026

User-agent: *
Allow: /

# AI bots — Real-time citation
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# AI bots — Training (allow for maximum visibility)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

# Sitemap
Sitemap: https://your-site.com/sitemap.xml
```
For the "citation only" setup (block training), replace `Allow: /` with `Disallow: /` for GPTBot, ClaudeBot, Google-Extended, CCBot and Bytespider.
Verify your configuration is correct: the Detekia GEO audit analyzes your site's AI bot crawlability.
Analyze my website for free →
Final checklist
- ✓ robots.txt explicitly allows citation bots (OAI-SearchBot, PerplexityBot)
- ✓ The `Sitemap:` directive points to your sitemap.xml
- ✓ The sitemap contains `<lastmod>` values updated only when content actually changes
- ✓ hreflang tags are present in the sitemap for multilingual sites
- ✓ An llms.txt file is present at the root with a structured site summary
- ✓ No excessive `Crawl-delay` (or none at all)
- ✓ robots.txt is tested after every migration or infrastructure change
- ✓ The 8 GEO criteria are verified, including AI crawlability