Before asking whether AI engines understand your content, there's a more fundamental question: can they even access it? A surprising number of well-optimized sites inadvertently block AI bots in their robots.txt file. Others let bots in but serve them unrendered JavaScript — gibberish to crawlers.
This technical guide covers everything you need to know about AI crawlability: the user-agents to know, the correct robots.txt configuration, the new llms.txt standard, and the checks to run to make sure your site is actually indexable by LLMs.
→ AI Crawlability is one of the 8 GEO score criteria — see the full methodology →
AI bots: who are they?
Every major AI platform deploys its own bots to crawl the web. Just like Googlebot for SEO, these bots identify themselves with a specific "user-agent" in their HTTP requests.
| AI Platform | Primary User-Agent | Secondary User-Agent |
|---|---|---|
| ChatGPT / OpenAI | GPTBot | OAI-SearchBot |
| Claude / Anthropic | ClaudeBot | anthropic-ai |
| Perplexity | PerplexityBot | — |
| Google Gemini | Google-Extended | Googlebot |
| Meta AI | FacebookBot | — |
| Common Crawl | CCBot | — |
The critical point: if your robots.txt contains a Disallow: / directive under User-agent: * (all bots), it blocks every AI bot that isn't given its own explicit group. This is often a legacy configuration, originally meant to block scrapers, that unintentionally shuts out LLM crawlers as well.
Configure robots.txt correctly
A GEO-compatible robots.txt must explicitly allow the major AI bots. Here's the recommended configuration:
# robots.txt — GEO-compatible configuration
# Traditional search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# OpenAI / ChatGPT
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
# Anthropic / Claude
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
# Perplexity
User-agent: PerplexityBot
Allow: /
# Google Gemini / AI Overviews
User-agent: Google-Extended
Allow: /
# Common Crawl (training data)
User-agent: CCBot
Allow: /
# Default rule
User-agent: *
Allow: /
Sitemap: https://www.your-website.com/sitemap.xml

If you want to allow AI bots while blocking certain scrapers, you can combine specific directives with a restrictive default rule:
# Explicitly allow AI bots
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
# Block unidentified scrapers
User-agent: *
Disallow: /

Important: Specific directives take priority over general ones. A bot listed explicitly with Allow: / will be allowed even if the User-agent: * rule blocks it.
Check your current configuration
To test your current robots.txt, go directly to https://your-website.com/robots.txt. Look for the user-agents listed above and check whether they're allowed or blocked.
Three situations to identify:
- User-agent not mentioned → the bot inherits the User-agent: * rule. If that rule is Allow: /, you're fine. If it's Disallow: /, the bot is blocked.
- User-agent with Disallow: / → the bot is explicitly blocked. Fix this immediately.
- User-agent with Allow: / → correct, the bot can crawl your site.
Google Search Console includes a robots.txt report. For non-Google AI bots, you can use the "robots.txt viewer" Chrome extension or a service like sitechecker.pro.
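If you'd rather script the check, here is a minimal sketch using Python's built-in urllib.robotparser; the domain is a placeholder and the bot list mirrors the table above:

```python
from urllib.robotparser import RobotFileParser

# User-agents of the main AI crawlers (see the table above)
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Google-Extended", "CCBot",
]

SITE = "https://www.your-website.com"  # placeholder: replace with your domain

parser = RobotFileParser(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the live robots.txt

for bot in AI_BOTS:
    # can_fetch() applies the most specific matching group, falling back to *
    verdict = "allowed" if parser.can_fetch(bot, f"{SITE}/") else "BLOCKED"
    print(f"{bot}: {verdict}")
```

Any bot reported as BLOCKED here will inherit either an explicit Disallow or the default * rule, exactly as described in the list above.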
The llms.txt file: the new standard
The llms.txt file is an initiative proposed in 2024 (by Jeremy Howard, creator of fast.ai) to create a standard that allows websites to communicate directly with LLMs. It sits at the site root, like robots.txt, and contains a structured summary of the site and its key content.
What llms.txt contains
The proposed format is simple — structured Markdown with defined sections:
# Site Name
> A concise description of what your site does (2-3 sentences max).
> Ideally: who you are, what you offer, who you serve.
## Main Content
- [Complete GEO Guide](https://your-website.com/blog/geo-guide-complete): The definitive guide to optimizing AI visibility in 2026
- [Free GEO Audit](https://your-website.com/): Automated GEO score analysis tool, 8 criteria measured
- [Methodology](https://your-website.com/methodology): Detailed explanation of each GEO score criterion
## About
- Founded: 2025
- Expertise: GEO (Generative Engine Optimization), SEO, AI visibility
- Contact: contact@your-website.com
## What We Don't Do
- We don't do paid search advertising (PPC)
- We don't offer content creation services
## Useful Links
- [Blog](https://your-website.com/blog): Technical articles on GEO
- [Pricing](https://your-website.com/pricing): Our plans
- [Contact](https://your-website.com/contact): Contact form

llms.txt vs llms-full.txt
The standard proposes two variants:
- llms.txt — short summary, index of important pages. Ideal for LLMs doing quick scans.
- llms-full.txt — complete version with the full content of key pages. Intended for LLMs that want to index content in depth.
Start with llms.txt. The full version is optional and mainly useful for content-rich sites (documentation, knowledge bases).
Is llms.txt already adopted by AI engines?
As of 2026, the standard is recognized and used by Perplexity and some academic crawlers. OpenAI and Anthropic follow the robots.txt standard and have indicated they will take llms.txt into account going forward. Google has not officially commented.
The pragmatic recommendation: create an llms.txt now. The cost is minimal (20 minutes), the potential benefit is real as adoption grows, and it won't hurt your SEO.
Other crawlability obstacles
Client-side JavaScript
This is the most underestimated problem. If your content is rendered in client-side JavaScript (React, Vue, Angular without SSR), basic AI bots won't see that content — they receive the initial HTML without waiting for JS execution.
<!-- ❌ Content invisible to basic bots -->
<div id="app"></div>
<script>
// Content is loaded after the fact via JS
document.getElementById('app').innerHTML = '<h1>My content</h1>';
</script>
<!-- ✅ Content visible immediately -->
<h1>My content</h1>

Fix: Use Server-Side Rendering (SSR) or Static Site Generation (SSG). Next.js, Nuxt.js, and Gatsby are built for this. If you're running a pure SPA, implement pre-rendering.
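One quick way to see a page the way a non-rendering bot does is to fetch the raw HTML and search it for a phrase that should sit in your main content. A minimal sketch using the Python requests library (URL and phrase are placeholders):

```python
import requests

URL = "https://www.your-website.com/blog/geo-guide-complete"  # placeholder
PHRASE = "Generative Engine Optimization"  # text that should be server-rendered

# A plain GET returns the initial HTML only: no JavaScript is executed,
# which is what a basic AI crawler sees.
html = requests.get(URL, timeout=10).text

if PHRASE in html:
    print("Phrase found in the raw HTML: visible to basic bots")
else:
    print("Phrase missing: it is probably injected client-side by JavaScript")
```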
Content behind authentication
Pages that require a login are inaccessible to bots by definition. If your valuable content is behind a login wall, consider making a public version (or a preview) available without authentication.
4xx and 5xx errors
Pages that return HTTP errors will be ignored. Regularly check that your important pages return a 200 using Google Search Console or a crawl tool like Screaming Frog.
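Between full crawls, a small script can monitor this. The sketch below, assuming the Python requests library and placeholder URLs, prints the status code of each key page:

```python
import requests

# Placeholder list of key pages that should all return HTTP 200
PAGES = [
    "https://www.your-website.com/",
    "https://www.your-website.com/pricing",
    "https://www.your-website.com/blog/geo-guide-complete",
]

for url in PAGES:
    status = requests.get(url, timeout=10).status_code
    flag = "OK" if status == 200 else "CHECK"
    print(f"{flag}  {status}  {url}")
```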
Canonicals and redirects
If your important pages have redirect chains or canonical tags pointing to a different URL, bots may not follow through to the final content. Simplify your URL structure and limit redirects to a single hop.
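The same library exposes redirect history, which makes chains easy to spot. In this sketch (placeholder URL), more than one entry in response.history indicates a chain worth collapsing to a single hop:

```python
import requests

url = "https://www.your-website.com/old-page"  # placeholder

response = requests.get(url, allow_redirects=True, timeout=10)

# response.history holds one entry per redirect followed;
# several entries mean a chain that should be reduced to one hop.
hops = [step.url for step in response.history] + [response.url]
print(f"{len(response.history)} redirect(s): " + " -> ".join(hops))
```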
AI crawlability checklist
- ✓ robots.txt tested — GPTBot, ClaudeBot, PerplexityBot, and Google-Extended allowed
- ✓ XML sitemap present and referenced in robots.txt
- ✓ llms.txt created at the site root
- ✓ Main content rendered server-side (SSR or SSG)
- ✓ No important content behind client-side JavaScript only
- ✓ Key pages return HTTP 200
- ✓ No redirect chains to key pages
- ✓ Response time under 3 seconds
→ Check your site's AI crawlability with the GEO audit →
Frequently asked questions
Does blocking AI bots hurt my regular SEO?
No, as long as you only block AI bots and let Googlebot access your site normally. The two crawl systems are independent. Google-Extended is separate from Googlebot — you can block one without affecting the other.
Should I allow Common Crawl?
Common Crawl is a public web-crawl dataset, collected by CCBot, that many open-source LLMs are trained on. Allowing it increases your chances of appearing in the training data of future models. Unless you have a specific reason to block it (sensitive content, paywall), allow it.
What happens if I block AI bots after being indexed?
Information already in LLM training data stays — you can't "erase" from the training corpus. However, for real-time AI search systems (Perplexity, SearchGPT), blocking the bot will prevent future citations.