Ankore
Technical SEO · Free

Robots.txt Generator

Generate a syntactically correct robots.txt with multiple user-agent groups, allow/disallow rules, sitemap declarations, AI bot controls, and quick-start presets for common configurations.

  • No signup, no email required
  • Works entirely in your browser
  • Output you can copy and paste directly
  • Built by a working SEO team, not gated by upsells

Where to put this

Save the output as robots.txt at the root of your domain — it must be served at https://yoursite.com/robots.txt. Subdirectory placement does not work.

What it does

The robots.txt file is a plain-text directive that tells web crawlers which parts of your site they may or may not access. It lives at the root of your domain (https://yoursite.com/robots.txt) and is the first file most search engines and AI crawlers fetch when discovering your site.

This tool generates a syntactically correct robots.txt from a friendly form interface — no need to remember directive syntax, quote rules, or worry about ordering. Pick a preset to start, add your sitemap, and configure per-bot rules including the newer AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.).
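
For readers who have never seen one, the simplest useful robots.txt is an "allow everything" file plus a sitemap declaration. A minimal sketch (the sitemap URL is a placeholder):

    # Minimal robots.txt: let every crawler in and declare the sitemap
    User-agent: *
    Allow: /

    Sitemap: https://yoursite.com/sitemap.xml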

Why robots.txt matters more in 2026

Robots.txt has always been foundational, but developments in 2025–2026 have made it strategic:

  • AI training crawlers. GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's training data), CCBot (Common Crawl), PerplexityBot, and dozens of smaller AI-training crawlers now respect robots.txt directives. This is your control surface for whether your content gets used in AI training.
  • Crawl budget pressure. Larger sites compete for crawl budget across more bots than ever. Disallowing low-value paths (search results, faceted navigation, infinite scroll variants, internal staging) frees up budget for the content you actually want indexed.
  • Sitemap declaration. Including the Sitemap: line in robots.txt remains the most reliable way to ensure every search engine — not just Google — discovers your sitemap, regardless of whether you have submitted it via Search Console.

How to use this generator

  1. Pick a preset to start from. "Allow all" opens your site to every bot; "Block all" is for staging environments; "Block staging paths" keeps your production site indexable while excluding common non-prod paths; "WordPress default" is a sensible starting point for WP sites.
  2. Add your sitemap URL. This single line is the most consistently valuable thing in any robots.txt. Use the full https URL.
  3. Configure user-agent groups. Each group applies to a specific bot. * is the catch-all. Add separate groups to apply different rules to specific bots (e.g., allow Google but disallow GPTBot).
  4. Add Allow and Disallow paths. Paths are relative to the domain root and must start with /. Use trailing slashes for directories (/admin/) and no slash for specific files (/secret.html).
  5. Set Crawl-delay if needed. Optional and ignored by Google (use Search Console settings instead) but respected by Bing, Yandex, and many AI crawlers. Use 1–5 seconds for sites with crawl budget concerns.
  6. Copy the output and save it as robots.txt at the root of your domain. A worked example of the kind of file these steps produce follows below.
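
As a sketch of where those steps can land, here is a plausible result for a WordPress site that wants normal search crawling, a modest crawl rate for bots that honor Crawl-delay, and no AI training use. The paths and bot choices are illustrative, not recommendations:

    Sitemap: https://yoursite.com/sitemap.xml

    # Catch-all group: keep crawlers out of the admin area, but allow the
    # AJAX endpoint that front-end rendering depends on
    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php
    # Ignored by Google; respected by Bing, Yandex, and many others
    Crawl-delay: 2

    # Opt out of OpenAI training crawls (GPTBot ignores the * group above)
    User-agent: GPTBot
    Disallow: /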

Robots.txt best practices

  • Always include your Sitemap line. The single highest-value line in any robots.txt. Even if Google already knows your sitemap, every other crawler discovers it through this declaration.
  • Disallow is not deindex. Critical misconception. Disallow tells bots not to crawl the URL — but if the URL is linked from elsewhere, it can still be indexed (with no description). To remove a URL from the index, use noindex in the page's meta tags or HTTP header. Robots.txt is for crawl control, not index control.
  • Be specific with paths. Disallow: /search blocks any URL starting with "search", including /search-results/ and /searchable. Use trailing slashes for directories (/search/) when you mean directories specifically (see the sketch after this list).
  • Use Allow rules to override broader Disallow. If you disallow /wp-admin/ but need bots to access /wp-admin/admin-ajax.php for proper rendering, the Allow: rule overrides the broader Disallow: for that specific path.
  • Block AI training crawlers if you choose. Common syntax to opt out of AI training:
    User-agent: GPTBot
    Disallow: /
    Repeat for ClaudeBot, Google-Extended, CCBot, PerplexityBot, etc. Be aware: this only works if the bot respects robots.txt (most major ones now do, but enforcement is honor-system).
  • Test before deploying. A misconfigured robots.txt can cut off crawling of your entire site, and search visibility drops quickly. Use Google Search Console's robots.txt report or the URL Inspection tool to verify each important URL is still allowed.
  • Robots.txt is publicly readable. Anyone can visit yoursite.com/robots.txt and read it. Do not use it to hide sensitive paths — disallowing /admin-secret/ in robots.txt is essentially pointing a blinking arrow at your admin panel. For real access control, use HTTP authentication or proper authorization.
  • Keep it short and reviewable. A robots.txt longer than 30–50 lines usually means the rules belong in your CMS or page-level meta tags instead. Robots.txt is best kept simple.
  • Update when site structure changes. URL restructures, new sections, deleted directories — all reasons to revisit the file. A stale robots.txt is one of the most common causes of mysterious indexing problems on older sites.
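
To make the prefix-matching point above concrete (the /search paths are hypothetical):

    # "Disallow: /search" is a prefix rule: it also matches /search-results/
    # and /searchable. The trailing slash limits the rule to the directory.
    User-agent: *
    Disallow: /search/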

Common robots.txt mistakes

  • Disallow: / on production. A bare slash matches every path, so this one line blocks the entire site. It is easy to leave in place after pushing a staging config to production. Always check the live robots.txt immediately after deploys.
  • Confusing Disallow with noindex. A page disallowed in robots.txt but linked from other sites can still be indexed (just without metadata). To truly remove from index, use <meta name="robots" content="noindex"> in the page itself.
  • Using robots.txt as security. Disallow: /admin/ tells anyone reading the file exactly where to look. Use auth, not robots.txt.
  • Wildcard misuse. Wildcards (* and the end-of-URL anchor $) are non-standard but supported by major crawlers. Disallow: /*?print=true blocks any URL containing that query — useful but easy to misuse (see the sketch after this list).
  • Multiple sitemap entries with conflicts. Listing five sitemap files when only one is current confuses crawlers. Use a sitemap index file if you have multiple sitemaps.
  • Forgetting the file is at the root. yoursite.com/robots.txt works. yoursite.com/seo/robots.txt does not — crawlers only check the root.
  • Blocking CSS or JavaScript. Common in old robots.txt files: Disallow: /css/ or Disallow: /js/. Modern crawlers need to render pages with full assets to evaluate them. Blocking CSS/JS hurts ranking. Always allow these.
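
A sketch of both extensions in use (the paths and query parameter are hypothetical; * and $ are supported by Googlebot, Bingbot, and most large crawlers but are not part of the original standard):

    User-agent: *
    # * matches any sequence of characters: blocks print views site-wide
    Disallow: /*?print=true
    # $ anchors the end of the URL: blocks URLs ending in .pdf, not /pdf-guides/
    Disallow: /*.pdf$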

How to validate your robots.txt

  1. Visit yoursite.com/robots.txt in a browser. Confirm the file is served (200 status) and renders as plain text. Any HTTP error here means crawlers cannot read your rules.
  2. Use Google Search Console's URL Inspection tool. Test individual URLs to confirm Google sees them as allowed. The tool shows the exact rule blocking a URL if blocked.
  3. Check Search Console's Page indexing report. A spike in "Blocked by robots.txt" after a robots.txt change is the canonical "you broke something" signal. Investigate immediately.
  4. Fetch as crawler with cURL. curl -A "Googlebot" https://yoursite.com/robots.txt verifies the file is served the same to bots as to browsers.

Frequently asked questions

Where does robots.txt go?

At the root of your domain — yoursite.com/robots.txt. It must be served from the root; subdirectory placement (yoursite.com/seo/robots.txt) does not work because crawlers only check the root URL.

Will robots.txt remove pages from Google's index?

No. Disallow tells bots not to crawl a URL — it does not remove URLs already in the index, and a disallowed URL linked from elsewhere can still appear in search results without metadata. To remove from index, use noindex in the page's meta tags or HTTP headers.

Should I block AI training crawlers?

It is your call. Blocking GPTBot, ClaudeBot, Google-Extended, etc., prevents your content from being used to train future AI models — but it does not prevent the AI from citing your content in real-time search (e.g., Perplexity, Bing Chat). For most public content, allowing AI crawlers is the default. For original research, paid content, or competitive moats, blocking makes sense.
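
If you do decide to opt out, one way to express it is a single group that stacks the common training crawlers (the list is illustrative, changes over time, and relies on the bots honoring robots.txt):

    # AI training opt-out (honor-system; not an exhaustive list)
    User-agent: GPTBot
    User-agent: ClaudeBot
    User-agent: Google-Extended
    User-agent: CCBot
    User-agent: PerplexityBot
    Disallow: /

    # Regular search crawlers unaffected
    User-agent: *
    Allow: /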

What happens if I don't have a robots.txt?

Crawlers assume everything is allowed. Your site will be crawled normally. The downsides are: you lose the ability to declare your sitemap location, you cannot block low-value paths from crawl budget, and you cannot opt out of specific crawlers. Having a robots.txt is best practice; having none is not catastrophic.

Can I have different rules for different bots?

Yes. Each User-agent group applies its own Allow / Disallow rules. Common pattern: a permissive group for Googlebot, a restrictive group for GPTBot. The * (catch-all) group applies to bots not specifically named.
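
A sketch of that pattern (bot names are real; the paths are placeholders):

    # Googlebot gets full access
    User-agent: Googlebot
    Allow: /

    # GPTBot gets none
    User-agent: GPTBot
    Disallow: /

    # Bots not named above fall through to the catch-all group
    User-agent: *
    Disallow: /internal-search/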

Does robots.txt work for subdomains?

Each subdomain has its own robots.txt. blog.yoursite.com and yoursite.com are separate as far as robots.txt is concerned — each needs its own file at its respective root.

What's the difference between Disallow and noindex?

Disallow (in robots.txt) tells bots not to crawl the URL — but the URL can still be indexed if linked from elsewhere. Noindex (a meta tag or HTTP header on the page itself) tells bots not to include the URL in search results. For full removal: allow crawling (so bots can read the noindex tag) AND set noindex. For crawl-budget protection on low-value pages: disallow in robots.txt.

How big can robots.txt be?

Google parses up to 500KB. Larger files get truncated. In practice, well-formed robots.txt files are under 5KB — if yours is bigger, you probably need to consolidate rules or move them to page-level meta tags.