Your robots.txt is probably blocking AI bots without you knowing it. According to an Originality.ai study from 2025, 73% of websites block at least one AI crawler — often by default, through an overly broad Disallow rule inherited from a migration or a WordPress template.
In 2026, the landscape has changed. Seven major AI bots crawl the web constantly, and the distinction between those that train models and those that power real-time citations has become strategic. Block the wrong bot, and you disappear from AI responses. Allow them all without thinking, and you give away your training data for free.
This article gives you the reference configuration — robots.txt, sitemap.xml and llms.txt — for 2026.
AI bots in 2026: who crawls what
The first mistake is treating all AI bots the same way. There are two fundamentally different categories:
- Training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, Bytespider): they collect data to improve models. Blocking them has no immediate impact on your visibility in AI responses.
- Citation crawlers (OAI-SearchBot, PerplexityBot, Googlebot for AI Overviews): they power real-time responses. Blocking them means disappearing from AI citations.
| Bot | Owner | Type | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Training + search | Yes |
| OAI-SearchBot | OpenAI | Real-time citation | Yes |
| ClaudeBot | Anthropic | Training | Yes |
| PerplexityBot | Perplexity | Real-time citation | Yes |
| Google-Extended | Google | Gemini training | Yes |
| Bytespider | ByteDance | Training | Partial |
| CCBot | Common Crawl | Open corpus | Yes |
The strategic recommendation: always allow citation crawlers (OAI-SearchBot, PerplexityBot). For training crawlers, the decision depends on your strategy — some sites choose to block training while remaining citable. For a deeper dive, see our technical guide on AI crawlability.
robots.txt: strategic configuration for AI bots
Your robots.txt file is the first control lever. Here are two configurations based on your strategy, plus the classic trap to avoid.
"Allow all" configuration (recommended for GEO)
If your goal is to maximize AI visibility — citations, recommendations, appearing in responses — allow all bots:
```
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://your-site.com/sitemap.xml
```
Why list each bot individually when `User-agent: *` already covers them? Because under the robots.txt standard, a bot that finds a group matching its own user agent follows that group and ignores `*`. By explicitly listing each bot with `Allow: /`, you eliminate any ambiguity.
"Citation yes, training no" configuration
If you want to be cited by AI without your data being used to train models:
```
User-agent: *
Allow: /

# Citation crawlers — ALLOW
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training crawlers — BLOCK
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

Sitemap: https://your-site.com/sitemap.xml
```
Warning: this configuration has a limitation. AI models are trained on past snapshots. If you block GPTBot today, future GPT versions will know less about your site, which can indirectly reduce your citations long-term. It's a trade-off to evaluate.
Classic trap: the global Disallow: / rule
The most common trap: a `User-agent: *` group with `Disallow: /` and no exceptions, which blocks all AI bots at once. You often see this on sites migrated from WordPress with a misconfigured SEO plugin, or on staging sites whose blocking robots.txt was accidentally shipped to production.
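For reference, the trap looks like this (three lines that hide your entire site from every crawler, AI bots included):

```
# The classic trap: one rule, every bot blocked
User-agent: *
Disallow: /
```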
Is your robots.txt blocking AI bots? The Detekia GEO audit checks your site's crawlability.
Analyze my website for free →
Sitemap.xml: signals that help AI
The sitemap isn't just for Google. AI bots like PerplexityBot and OAI-SearchBot also read it to discover your pages. Three signals are particularly important.
<lastmod> — The freshness signal
AI engines value fresh content. The `<lastmod>` tag tells their crawlers when a page was last updated. According to Growth Memo observations (2026), pages with a recent `<lastmod>` are crawled more frequently by AI bots.
Rule: only update `<lastmod>` when the content actually changes. Not on every build, not automatically. AI bots (and Google) detect artificial `<lastmod>` values and ignore them.
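One way to honor this rule in a static-site build is to derive `<lastmod>` from each content file's last Git commit rather than the build timestamp. A minimal sketch in Python, assuming your pages live as Markdown files in a Git repository (the `content/` directory and output format are illustrative):

```python
import subprocess
from pathlib import Path

def lastmod_from_git(path: Path) -> str:
    """Return the ISO 8601 date of the last commit that touched this file."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%cI", "--", str(path)],
        capture_output=True, text=True, check=True,
    )
    # %cI is the committer date in strict ISO 8601,
    # e.g. 2026-04-15T09:30:00+02:00
    return out.stdout.strip()

# Illustrative usage: emit one <lastmod> per Markdown source file
for page in sorted(Path("content").glob("*.md")):
    print(f"<lastmod>{lastmod_from_git(page)}</lastmod>  <!-- {page} -->")
```

Because the date only moves when a commit actually changes the file, redeploys leave `<lastmod>` untouched.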
<priority> and structure
The `<priority>` tag is ignored by most engines, but sitemap structure matters. Split your sitemaps if you have more than 100 URLs: a `sitemap-pages.xml` for marketing pages and a `sitemap-blog.xml` for articles. This helps bots prioritize.
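Both files are then declared in a sitemap index at the root, which is what you reference from robots.txt. A minimal example using the file names above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://your-site.com/sitemap-pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://your-site.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```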
hreflang in the sitemap
For multilingual sites, `xhtml:link` tags with `hreflang` in the sitemap help AI associate the correct language versions. This is particularly important for Perplexity, which adapts responses to the user's language.
```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://your-site.com/geo-guide</loc>
    <lastmod>2026-04-15</lastmod>
    <xhtml:link rel="alternate" hreflang="fr"
                href="https://your-site.com/geo-guide" />
    <xhtml:link rel="alternate" hreflang="en"
                href="https://your-site.com/en/geo-guide" />
    <xhtml:link rel="alternate" hreflang="x-default"
                href="https://your-site.com/geo-guide" />
  </url>
</urlset>
```
Note the `xmlns:xhtml` declaration on `<urlset>`: without it, the `xhtml:link` elements are invalid XML.
llms.txt: the emerging standard for AI
The llms.txt file is an initiative launched in late 2024 to give LLMs a structured summary of your site. Unlike robots.txt (which tells bots what they can crawl), llms.txt tells them what your site is and which pages are most important.
As of April 2026, llms.txt isn't yet an official standard, but it's read by some crawlers and can influence how AI understands your site. It's an optional but increasingly recommended signal.
```markdown
# Detekia

> Detekia is a GEO audit tool that analyzes website visibility
> on AI engines (ChatGPT, Gemini, Perplexity).

## Main pages

- [Home](https://detekia.fr): Free GEO audit
- [Methodology](https://detekia.fr/methodologie): 8 GEO criteria
- [Pricing](https://detekia.fr/pricing): 1-page and complete audit
- [Blog](https://detekia.fr/blog): GEO and SEO guides

## Expertise

- AI visibility auditing
- GEO scoring out of 100
- Technical recommendations with code
- Methodology based on Princeton/KDD 2024
```
Best practices: keep it concise (under 500 words), structure it in Markdown sections, and update it when your offering evolves. The file goes at the root: your-site.com/llms.txt.
For a complete guide on llms.txt and AI crawlability, see our dedicated article: llms.txt, robots.txt and AI crawlability.
5 traps to avoid
- Blocking GPTBot thinking you're only blocking training. OpenAI uses `OAI-SearchBot` for real-time citations, separate from `GPTBot`. If you only block GPTBot, your ChatGPT citations are preserved. But if your rule blocks both, you disappear.
- Forgetting the `Sitemap:` directive in robots.txt. It's the simplest way for bots to discover your sitemap. Without this line, some AI crawlers don't find it automatically.
- Automatic `lastmod` on every deploy. If all your pages have today's date as `lastmod`, bots end up ignoring this signal. Only update pages whose content has actually changed.
- Overly aggressive `Crawl-delay`. Some sites add `Crawl-delay: 10` to limit server load. AI bots like PerplexityBot respect this directive — a 10-second delay between pages means crawling 100 pages takes about 17 minutes. For content sites, that's a bottleneck for AI indexing.
- Not testing after migrations. CMS migrations, CDN changes and reverse proxy updates can silently overwrite your robots.txt. Always test after any infrastructure change with a simple `curl https://your-site.com/robots.txt`, or script the check as shown below.
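To script that last check, here is a minimal sketch using Python's standard `urllib.robotparser`; the site URL is a placeholder to adapt, and the bot list mirrors the table above:

```python
from urllib.robotparser import RobotFileParser

SITE = "https://your-site.com"
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot",
           "PerplexityBot", "Google-Extended", "CCBot", "Bytespider"]

# Fetch and parse the live robots.txt
parser = RobotFileParser(f"{SITE}/robots.txt")
parser.read()

# Report which AI user agents may fetch the homepage
for bot in AI_BOTS:
    allowed = parser.can_fetch(bot, f"{SITE}/")
    print(f"{bot:16} {'allowed' if allowed else 'BLOCKED'}")
```

Run it in CI after every deploy: any unexpected `BLOCKED` line means an infrastructure change silently rewrote your rules.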
2026 reference configuration
Here's the complete configuration to copy-paste and adapt. It allows all citation bots and all training bots, and declares your sitemap.
```
# robots.txt — Optimal GEO configuration 2026
# Documentation: detekia.fr/blog/sitemap-robots-txt-bots-ia-2026

User-agent: *
Allow: /

# AI bots — Real-time citation
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# AI bots — Training (allow for maximum visibility)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

# Sitemap
Sitemap: https://your-site.com/sitemap.xml
```
For the "citation only" setup (block training), replace `Allow: /` with `Disallow: /` for GPTBot, ClaudeBot, Google-Extended, CCBot and Bytespider.
Verify your configuration is correct: the Detekia GEO audit analyzes your site's AI bot crawlability.
Analyze my website for free →
Final checklist
- ✓ robots.txt explicitly allows citation bots (OAI-SearchBot, PerplexityBot)
- ✓ The `Sitemap:` directive points to your sitemap.xml
- ✓ The sitemap contains `<lastmod>` values updated only when content actually changes
- ✓ hreflang tags are present in the sitemap for multilingual sites
- ✓ An llms.txt file is present at the root with a structured site summary
- ✓ No excessive `Crawl-delay` (or none at all)
- ✓ robots.txt is tested after every migration or infrastructure change
- ✓ The 8 GEO criteria are verified, including AI crawlability