Before asking whether AI engines understand your content, there's a more fundamental question: can they even access it? A surprising number of well-optimized sites inadvertently block AI bots in their robots.txt file. Others let bots in but serve them unrendered JavaScript — gibberish to crawlers.
This technical guide covers everything you need to know about AI crawlability: the user-agents to know, the correct robots.txt configuration, the new llms.txt standard, and the checks to run to make sure your site is actually indexable by LLMs.
→ AI Crawlability is one of the 8 GEO score criteria — see the full methodology →
AI bots: who are they?
Every major AI platform deploys its own bots to crawl the web. Just like Googlebot for SEO, these bots identify themselves with a specific "user-agent" in their HTTP requests.
| AI Platform | Primary User-Agent | Secondary User-Agent |
|---|---|---|
| ChatGPT / OpenAI | GPTBot | OAI-SearchBot |
| Claude / Anthropic | ClaudeBot | anthropic-ai |
| Perplexity | PerplexityBot | — |
| Google Gemini | Google-Extended | Googlebot |
| Meta AI | FacebookBot | — |
| Common Crawl | CCBot | — |
The critical point: if your robots.txt contains a Disallow: / directive under User-agent: * (all bots), it blocks every AI bot that isn't given its own explicit group. This is often a legacy configuration, originally meant to block scrapers, that unintentionally shuts out LLM crawlers as well.
Configure robots.txt correctly
A GEO-compatible robots.txt must explicitly allow the major AI bots. Here's the recommended configuration:
# robots.txt — GEO-compatible configuration
# Traditional search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# OpenAI / ChatGPT
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
# Anthropic / Claude
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
# Perplexity
User-agent: PerplexityBot
Allow: /
# Google Gemini / AI Overviews
User-agent: Google-Extended
Allow: /
# Common Crawl (training data)
User-agent: CCBot
Allow: /
# Default rule
User-agent: *
Allow: /
Sitemap: https://www.your-website.com/sitemap.xml

If you want to allow AI bots while blocking certain scrapers, you can combine specific directives with a restrictive default rule:
# Explicitly allow AI bots
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
# Block unidentified scrapers
User-agent: *
Disallow: /

Important: Specific directives take priority over general ones. A bot listed explicitly with Allow: / will be allowed even if the User-agent: * rule blocks it.
Check your current configuration
To test your current robots.txt, go directly to https://your-website.com/robots.txt. Look for the user-agents listed above and check whether they're allowed or blocked.
Three situations to identify:
- User-agent not mentioned → the bot inherits the User-agent: * rule. If that rule is Allow: /, you're fine. If it's Disallow: /, the bot is blocked.
- User-agent with Disallow: / → the bot is explicitly blocked. Fix this immediately.
- User-agent with Allow: / → correct, the bot can crawl your site.
Google Search Console includes a robots.txt report. For non-Google AI bots, you can use the "robots.txt viewer" Chrome extension or a service like sitechecker.pro.
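If you'd rather script the check, here is a minimal sketch using Python's built-in urllib.robotparser; the domain is a placeholder and the bot list mirrors the table above:

```python
from urllib.robotparser import RobotFileParser

# User-agents of the main AI crawlers (see the table above)
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Google-Extended", "CCBot",
]

SITE = "https://www.your-website.com"  # placeholder: replace with your domain

parser = RobotFileParser(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the live robots.txt

for bot in AI_BOTS:
    # can_fetch() applies the most specific matching group, falling back to *
    verdict = "allowed" if parser.can_fetch(bot, f"{SITE}/") else "BLOCKED"
    print(f"{bot}: {verdict}")
```

Any bot reported as BLOCKED here will inherit either an explicit Disallow or the default * rule, exactly as described in the list above.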
The llms.txt file: the new standard
The llms.txt file is an initiative proposed in 2024 (by Jeremy Howard, creator of fast.ai) to create a standard that allows websites to communicate directly with LLMs. It sits at the site root, like robots.txt, and contains a structured summary of the site and its key content.
What llms.txt contains
The proposed format is simple — structured Markdown with defined sections:
# Site Name
> A concise description of what your site does (2-3 sentences max).
> Ideally: who you are, what you offer, who you serve.
## Main Content
- [Complete GEO Guide](https://your-website.com/blog/geo-guide-complete): The definitive guide to optimizing AI visibility in 2026
- [Free GEO Audit](https://your-website.com/): Automated GEO score analysis tool, 8 criteria measured
- [Methodology](https://your-website.com/methodology): Detailed explanation of each GEO score criterion
## About
- Founded: 2025
- Expertise: GEO (Generative Engine Optimization), SEO, AI visibility
- Contact: contact@your-website.com
## What We Don't Do
- We don't do paid search advertising (PPC)
- We don't offer content creation services
## Useful Links
- [Blog](https://your-website.com/blog): Technical articles on GEO
- [Pricing](https://your-website.com/pricing): Our plans
- [Contact](https://your-website.com/contact): Contact form

llms.txt vs llms-full.txt
The standard proposes two variants:
- llms.txt — short summary, index of important pages. Ideal for LLMs doing quick scans.
- llms-full.txt — complete version with the full content of key pages. Intended for LLMs that want to index content in depth.
Start with llms.txt. The full version is optional and mainly useful for content-rich sites (documentation, knowledge bases).
Is llms.txt already adopted by AI engines?
As of 2026, the standard is recognized and used by Perplexity and some academic crawlers. OpenAI and Anthropic follow the robots.txt standard and have indicated they will take llms.txt into account going forward. Google has not officially commented.
The pragmatic recommendation: create an llms.txt now. The cost is minimal (20 minutes), the potential benefit is real as adoption grows, and it won't hurt your SEO.
Other crawlability obstacles
Client-side JavaScript
This is the most underestimated problem. If your content is rendered in client-side JavaScript (React, Vue, Angular without SSR), basic AI bots won't see that content — they receive the initial HTML without waiting for JS execution.
<!-- ❌ Content invisible to basic bots -->
<div id="app"></div>
<script>
// Content is loaded after the fact via JS
document.getElementById('app').innerHTML = '<h1>My content</h1>';
</script>
<!-- ✅ Content visible immediately -->
<h1>My content</h1>

Fix: Use Server-Side Rendering (SSR) or Static Site Generation (SSG). Next.js, Nuxt.js, and Gatsby are built for this. If you're running a pure SPA, implement pre-rendering.
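One quick way to see a page the way a non-rendering bot does is to fetch the raw HTML and search it for a phrase that should sit in your main content. A minimal sketch using the Python requests library (URL and phrase are placeholders):

```python
import requests

URL = "https://www.your-website.com/blog/geo-guide-complete"  # placeholder
PHRASE = "Generative Engine Optimization"  # text that should be server-rendered

# A plain GET returns the initial HTML only: no JavaScript is executed,
# which is what a basic AI crawler sees.
html = requests.get(URL, timeout=10).text

if PHRASE in html:
    print("Phrase found in the raw HTML: visible to basic bots")
else:
    print("Phrase missing: it is probably injected client-side by JavaScript")
```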
Content behind authentication
Pages that require a login are inaccessible to bots by definition. If your valuable content is behind a login wall, consider making a public version (or a preview) available without authentication.
4xx and 5xx errors
Pages that return HTTP errors will be ignored. Regularly check that your important pages return a 200 using Google Search Console or a crawl tool like Screaming Frog.
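Between full crawls, a small script can monitor this. The sketch below, assuming the Python requests library and placeholder URLs, prints the status code of each key page:

```python
import requests

# Placeholder list of key pages that should all return HTTP 200
PAGES = [
    "https://www.your-website.com/",
    "https://www.your-website.com/pricing",
    "https://www.your-website.com/blog/geo-guide-complete",
]

for url in PAGES:
    status = requests.get(url, timeout=10).status_code
    flag = "OK" if status == 200 else "CHECK"
    print(f"{flag}  {status}  {url}")
```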
Canonicals and redirects
If your important pages have redirect chains or canonical tags pointing to a different URL, bots may not follow through to the final content. Simplify your URL structure and limit redirects to a single hop.
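The same library exposes redirect history, which makes chains easy to spot. In this sketch (placeholder URL), more than one entry in response.history indicates a chain worth collapsing to a single hop:

```python
import requests

url = "https://www.your-website.com/old-page"  # placeholder

response = requests.get(url, allow_redirects=True, timeout=10)

# response.history holds one entry per redirect followed;
# several entries mean a chain that should be reduced to one hop.
hops = [step.url for step in response.history] + [response.url]
print(f"{len(response.history)} redirect(s): " + " -> ".join(hops))
```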
AI crawlability checklist
- ✓ robots.txt tested — GPTBot, ClaudeBot, PerplexityBot, and Google-Extended allowed
- ✓ XML sitemap present and referenced in robots.txt
- ✓ llms.txt created at the site root
- ✓ Main content rendered server-side (SSR or SSG)
- ✓ No important content behind client-side JavaScript only
- ✓ Key pages return HTTP 200
- ✓ No redirect chains to key pages
- ✓ Response time under 3 seconds
→ Check your site's AI crawlability with the GEO audit →
Frequently asked questions
Does blocking AI bots hurt my regular SEO?
No, as long as you only block AI bots and let Googlebot access your site normally. The two crawl systems are independent. Google-Extended is separate from Googlebot — you can block one without affecting the other.
Should I allow Common Crawl?
Common Crawl is a public web-crawl dataset, collected by CCBot, that many open-source LLMs are trained on. Allowing it increases your chances of appearing in the training data of future models. Unless you have a specific reason to block it (sensitive content, paywall), allow it.
What happens if I block AI bots after being indexed?
Information already in LLM training data stays — you can't "erase" from the training corpus. However, for real-time AI search systems (Perplexity, SearchGPT), blocking the bot will prevent future citations.