What is a robots.txt file?
Written by Lawrence Hitches | AI SEO Consultant | May 03, 2026 | 7 min read

A robots.txt file is a plain text file at the root of your site (yourdomain.com/robots.txt) that tells web crawlers which URLs they can access. In 2026 it governs not just Googlebot and Bingbot, but also AI crawlers like GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot (Perplexity), plus control tokens like Google-Extended (used for Gemini training). A misconfigured robots.txt is the most common reason I see for brands having zero AI search visibility: they blocked the AI crawlers without realising. This guide covers what robots.txt does in 2026, how to configure it for both traditional and AI search engines, and the most common mistakes I see auditing client sites at StudioHawk.

What Is a Robots.txt File?

Robots.txt is a plain text file that lives at the root of your domain (yourdomain.com/robots.txt). It uses the Robots Exclusion Protocol, a convention dating from 1994 (formalised as RFC 9309 in 2022) that tells well-behaved web crawlers which paths on your site they're allowed to crawl. The file uses two main directives:

  • User-agent: specifies which crawler the rules apply to (e.g., Googlebot, GPTBot, *)
  • Allow / Disallow: specifies which URL paths the crawler can/cannot access

A minimal robots.txt looks like:

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Robots.txt is voluntary: it's an instruction, not enforcement. Well-behaved crawlers (Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot) honour it. Malicious scrapers ignore it entirely.
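As a rough sketch of what "honouring" the file means in practice, Python's standard-library urllib.robotparser can parse a robots.txt body and report whether a given user agent may fetch a URL. The sample rules and URLs below are illustrative assumptions. Note that the stdlib parser applies rules in file order and does not implement wildcards, so it only approximates how Googlebot evaluates a file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules (an assumption, not taken from a real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: https://yourdomain.com/sitemap.xml
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks this question before every request.
print(parser.can_fetch("Googlebot", "https://yourdomain.com/blog/post"))   # True
print(parser.can_fetch("Googlebot", "https://yourdomain.com/wp-admin/x"))  # False
print(parser.site_maps())  # ['https://yourdomain.com/sitemap.xml']
```

Nothing in this check stops a scraper that simply skips it, which is why robots.txt is an instruction rather than enforcement.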

The AI Crawler Era: Bots You Need to Know in 2026

The biggest 2026 change to robots.txt: AI crawlers. Most sites' robots.txt files were written before AI search was a thing, and many block AI crawlers either deliberately (early defensive moves to "keep AI out") or accidentally (theme defaults that block "unknown" bots). The result: zero AI search visibility.

The AI crawlers worth knowing:

| Crawler | Operator | Purpose | Block = no visibility in |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training data for OpenAI models | ChatGPT's built-in knowledge |
| ChatGPT-User | OpenAI | Real-time browsing when users ask ChatGPT a question | Live ChatGPT answers |
| OAI-SearchBot | OpenAI | ChatGPT Search index | ChatGPT Search results |
| ClaudeBot | Anthropic | Training data for Claude | Claude knowledge |
| Claude-User | Anthropic | Live browsing when users ask Claude | Live Claude answers |
| PerplexityBot | Perplexity | Index for Perplexity Answer Engine | Perplexity citations |
| Perplexity-User | Perplexity | Live retrieval for Perplexity queries | Live Perplexity answers |
| Google-Extended | Google | Training data for Gemini (a robots.txt control token; Google's existing crawlers do the fetching) | Gemini knowledge (separate from Google Search) |
| Bytespider | ByteDance | TikTok / Doubao AI training | TikTok AI search, Doubao |
| Meta-ExternalAgent | Meta | Llama training | Meta AI products |

Note the User vs Bot distinction: GPTBot crawls for training; ChatGPT-User browses in real-time when a user asks a question. Block GPTBot and you lose presence in ChatGPT's "knowledge"; block ChatGPT-User and you lose live browsing citations. For AI search visibility, you generally want to allow both.

How to Configure Robots.txt for Both Traditional and AI Search

The 2026 best-practice baseline robots.txt for most content sites:

User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /search/
Disallow: /*?

# Explicitly allow AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

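Why add explicit Allow groups for AI crawlers when the wildcard group already allows them? Two reasons: (1) clarity in audits, and (2) a future "block unknown bots" rule added to the wildcard group won't accidentally exclude them. One caveat: a crawler that matches a named group follows only that group and ignores the User-agent: * rules, so the AI crawlers above do not inherit the /wp-admin/, /search/, and /*? disallows. Repeat those Disallow lines inside each named group if you want them to apply.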
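One consequence of named groups is worth seeing in code: under the Robots Exclusion Protocol, a crawler with its own User-agent group obeys only that group and does not combine it with the wildcard rules. The sketch below uses Python's urllib.robotparser with a shortened, illustrative version of the baseline file; "SomeOtherBot" is a made-up crawler name standing in for any bot without its own group.

```python
from urllib.robotparser import RobotFileParser

# Shortened, illustrative version of the baseline file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://yourdomain.com/wp-admin/settings"

# "SomeOtherBot" (hypothetical) has no group of its own, so it falls back to *.
other_allowed = parser.can_fetch("SomeOtherBot", url)

# GPTBot matches its own group and never consults the * rules.
gptbot_allowed = parser.can_fetch("GPTBot", url)

print(other_allowed)   # False
print(gptbot_allowed)  # True
```

If /wp-admin/ should stay off limits for GPTBot too, the Disallow line has to be repeated inside the GPTBot group.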

If you want to opt out of AI training but stay in AI search

Some publishers want to block AI crawlers from using their content for training but still want to appear in real-time AI search citations. The pattern:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

This blocks training crawlers but allows live retrieval. Use case: news publishers and major content businesses negotiating licensing deals with AI companies.
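A quick way to sanity-check a pattern like this before deploying is to run each crawler name through a parser. The following is a minimal sketch using Python's urllib.robotparser; the access_report helper and the /article URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

OPT_OUT_RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /
"""

TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended"]
LIVE = ["ChatGPT-User", "Claude-User", "PerplexityBot"]

def access_report(robots_txt, agents, url="https://yourdomain.com/article"):
    """Map each crawler name to True (may fetch) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

training_report = access_report(OPT_OUT_RULES, TRAINING)
live_report = access_report(OPT_OUT_RULES, LIVE)
print(training_report)  # every value should be False
print(live_report)      # every value should be True
```

Crawlers the file does not name at all (OAI-SearchBot and Perplexity-User here) remain allowed, because this file has no User-agent: * group that blocks them.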

The Most Common Robots.txt Mistakes (I See These Weekly)

From auditing client sites at StudioHawk:

  • Disallow: / left in from staging. A developer pushed the staging robots.txt to production, blocking the entire site from being crawled. The classic and most catastrophic mistake. Always check your live robots.txt after a launch.
  • Blocking AI crawlers without realising. A Cloudflare default rule, a security plugin, or an over-eager developer added rules blocking GPTBot/ClaudeBot/PerplexityBot. Result: zero AI search citations. Audit any site that's invisible in ChatGPT/Perplexity for this first.
  • Disallowing /wp-content/ on WordPress. This blocks Google from accessing your CSS, JavaScript, and images, which Google needs to render the page properly. Modern best practice: don't block /wp-content/.
  • Trying to use robots.txt to hide pages from search results. Robots.txt blocks crawling, NOT indexing. A blocked URL can still appear in search results (with no description) if other sites link to it. Use noindex meta tags or password protection for true exclusion.
  • Forgetting to include a Sitemap directive. Pointing crawlers at your sitemap from robots.txt is a free win. Always include it.
  • Wildcard mistakes. Disallow: /*? blocks all URLs with parameters, which can accidentally block your search results page or filtered category URLs. Test wildcard rules before deploying.
  • Blocking by IP rather than user-agent. IP-based blocking happens at the firewall (Cloudflare, AWS WAF) layer, not in robots.txt. If you've "blocked AI crawlers" via Cloudflare WAF, robots.txt won't fix it.
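To test wildcard rules before deploying, a small matcher is enough. The sketch below is a simplified take on the matching described in RFC 9309 and Google's documentation: * matches any run of characters, a trailing $ anchors the end of the URL, the longest matching pattern wins, and Allow wins a tie. The function names are my own, and percent-encoding normalisation is deliberately left out.

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs from one User-agent group.

    The longest matching pattern wins; Allow wins when lengths tie.
    A path that matches no rule is allowed.
    """
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

# The wildcard group from the baseline file in this guide.
RULES = [
    ("Allow", "/"),
    ("Disallow", "/wp-admin/"),
    ("Disallow", "/search/"),
    ("Disallow", "/*?"),
]

print(is_allowed(RULES, "/blog/post"))                  # True
print(is_allowed(RULES, "/category/shoes?colour=red"))  # False
print(is_allowed(RULES, "/?s=robots"))                  # False
```

The second and third checks show the side effect described above: filtered category URLs and parameter-based search pages are blocked by /*? even though nothing names them explicitly.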

Robots.txt vs Meta Robots vs X-Robots-Tag (When to Use Which)

| Tool | What it does | Use when |
| --- | --- | --- |
| robots.txt | Blocks crawlers from accessing URLs (saves crawl budget) | You don't want a section crawled at all (admin, staging, internal search) |
| Meta robots tag | Controls indexing/following per page (noindex, nofollow) | Page should be crawled but not indexed (thank-you pages, thin content) |
| X-Robots-Tag header | Same as meta robots but at HTTP header level (works for non-HTML files) | You need to noindex PDFs, images, or other non-HTML content |
| Cloudflare WAF rules | Blocks at the network edge, crawler can't even reach the site | Hostile bots that ignore robots.txt; rate-limiting AI crawlers |
| Password protection | Blocks all access without credentials | True confidentiality required |

The biggest practical confusion: people use robots.txt to "hide" pages from Google. It doesn't work that way. Use noindex (via meta robots or X-Robots-Tag) for indexing control; use robots.txt for crawling control.
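To check which indexing signal a page actually sends, you can inspect both the response headers and the HTML. This is a minimal sketch using only the standard library; has_noindex is a hypothetical helper, and it ignores crawler-scoped forms such as "googlebot: noindex" in the header or crawler-specific meta names.

```python
from html.parser import HTMLParser

class _MetaRobots(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

def has_noindex(headers: dict, html: str) -> bool:
    """True if the X-Robots-Tag header or a meta robots tag says noindex."""
    header_value = next(
        (value for name, value in headers.items() if name.lower() == "x-robots-tag"),
        "",
    )
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    finder = _MetaRobots()
    finder.feed(html)
    return "noindex" in header_directives + finder.directives

# A PDF response can only carry the directive in a header.
print(has_noindex({"X-Robots-Tag": "noindex, nofollow"}, ""))                 # True
# An HTML page can carry it in the markup.
print(has_noindex({}, '<head><meta name="robots" content="noindex"></head>'))  # True
print(has_noindex({}, "<head><title>Pricing</title></head>"))                 # False
```

Either signal only works if the crawler can fetch the URL, so a page that is both noindexed and disallowed in robots.txt never gets its noindex read.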

How to Audit Your Robots.txt (5-Minute Check)

  1. Check the file exists: visit yourdomain.com/robots.txt directly. If it 404s, you don't have one (which is fine, but you're missing the Sitemap signal).
  2. Check Disallow: / isn't in there for User-agent: * (the catastrophic full-site block).
  3. Check AI crawlers aren't blocked: search the file for GPTBot, ClaudeBot, PerplexityBot. If you see Disallow rules without specific intent, fix them.
  4. Check the Sitemap directive is present and points to your actual sitemap URL.
  5. Use Google Search Console's robots.txt report to confirm Google fetched the file without errors, and the URL Inspection tool to confirm individual URLs aren't accidentally blocked. (The legacy robots.txt Tester was retired in late 2023.)
  6. Run Screaming Frog with the "Respect robots.txt" setting enabled. If Screaming Frog can't crawl your site, neither can Google.
  7. Check Cloudflare WAF rules if AI crawlers are still being blocked despite robots.txt allowing them. The block may be at the edge.
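Steps 2 to 4 can be scripted. The sketch below is one possible version using Python's urllib.robotparser; audit_robots_txt is a hypothetical helper, the crawler list comes from the table earlier in this guide, and the check works on a robots.txt body you have already downloaded. It cannot see firewall-level blocks (step 7), and the stdlib parser does not implement wildcards.

```python
from urllib.robotparser import RobotFileParser

# Crawler names from the table earlier in this guide.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended",
]

def audit_robots_txt(robots_txt, site="https://yourdomain.com"):
    """Return a list of findings; an empty list means these checks passed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    home = site.rstrip("/") + "/"
    findings = []

    # Step 2: full-site block for the wildcard group.
    if not parser.can_fetch("*", home):
        findings.append("User-agent: * cannot fetch the homepage")

    # Step 3: AI crawlers that cannot fetch the homepage.
    blocked = [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, home)]
    if blocked:
        findings.append("Blocked AI crawlers: " + ", ".join(blocked))

    # Step 4: site_maps() returns None when no Sitemap line is present.
    if not parser.site_maps():
        findings.append("No Sitemap directive")

    return findings

STAGING_LEFTOVER = "User-agent: *\nDisallow: /\n"
for finding in audit_robots_txt(STAGING_LEFTOVER):
    print(finding)
```

For a live site, RobotFileParser also offers set_url() and read() to fetch the file over HTTP before running the same checks.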

FAQ: Robots.txt in 2026

Should I block AI crawlers like GPTBot and ClaudeBot?

Generally no. Blocking AI crawlers means zero presence in ChatGPT, Claude, Perplexity, and other AI search engines. The exception is if you're a major publisher negotiating licensing deals with AI companies. For most content businesses, allowing AI crawlers is the right move because AI search is now a real (and growing) traffic channel.

What's the difference between GPTBot and ChatGPT-User?

GPTBot crawls your site to gather training data for ChatGPT. ChatGPT-User browses in real-time when a user asks ChatGPT a question that triggers a web search. Block GPTBot and ChatGPT loses your content from its base knowledge; block ChatGPT-User and you can't be cited in live ChatGPT answers. For maximum AI visibility, allow both.

Does robots.txt block pages from appearing in Google search results?

No. Robots.txt blocks crawling, not indexing. A URL blocked in robots.txt can still appear in Google's index with no description if other sites link to it. To actually keep a page out of search results, use a noindex meta tag (or X-Robots-Tag header), and make sure Google can crawl the page to see the noindex directive.

Where does the robots.txt file go?

At the root of your domain: yourdomain.com/robots.txt. Crawlers look for it there and only there. Subdomains need their own robots.txt at their respective root.

Can I have one robots.txt for staging and another for production?

Yes, and you should. Staging environments should typically have Disallow: / to prevent accidental indexing. The catch: developers regularly forget to update the file when promoting code from staging to production, accidentally blocking the live site. Always double-check your live robots.txt after a deploy.

Should robots.txt include the Sitemap directive?

Yes. Adding "Sitemap: https://yourdomain.com/sitemap.xml" at the bottom of robots.txt helps crawlers discover your sitemap automatically. This is most useful for search engines and AI crawlers where you haven't submitted the sitemap through a webmaster tool such as Google Search Console (for example Bing, Yandex, and AI crawlers).

How often should I review my robots.txt?

Quarterly at minimum, immediately after any site migration or major launch, and whenever a new AI search engine launches a crawler you might want to allow or block. The AI crawler landscape is evolving fast in 2026; staying current matters.

Final Takeaway: Robots.txt Is Now an AI Search Lever

Robots.txt used to be a defensive technical SEO file: block staging, save crawl budget, point at the sitemap, done. In 2026 it's an offensive AI search lever. Configure it correctly and you're allowing GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and the rest to crawl your content for AI training and live retrieval. Misconfigure it and you're invisible in AI search regardless of how good your content is.

The audit takes 5 minutes. Do it on every site you own or manage.

If you'd like this audited end-to-end across your site (including the Cloudflare WAF and meta robots layer), you can work with me directly as an AI SEO consultant.

Lawrence Hitches, AI SEO Consultant, Melbourne

Chief of Staff at StudioHawk, Australia's largest dedicated SEO agency. Specialising in AI search visibility, technical SEO, and organic growth strategy. Leading a team of 120+ across Melbourne, Sydney, London, and the US.