robots.txt
robots.txt controls crawling; noindex controls indexing. They solve different problems.
Definition
robots.txt is a plain text file at your site root that provides crawl directives to bots. It’s advisory and not a security mechanism.
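For illustration, here is a minimal sketch of how a well-behaved crawler consults the file before fetching pages, using Python's standard-library urllib.robotparser. The crawler name "MyCrawler" and the example.com URLs are placeholders, and nothing forces a bot to follow the rules it reads.

```python
# Sketch of a polite crawler honoring robots.txt: fetch the file, parse it,
# and ask before requesting a URL. "MyCrawler" and example.com are placeholders.
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # downloads and parses the live file

for url in ("https://example.com/blog/post", "https://example.com/admin/login"):
    if parser.can_fetch("MyCrawler", url):
        print(f"crawl: {url}")
    else:
        print(f"skip:  {url}")  # compliance is voluntary on the bot's side
```

Because compliance is voluntary, sensitive paths still need authentication rather than just a Disallow line.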
Why it matters
- Reduces crawl waste on low-value paths (e.g., admin)
- Careful rules avoid accidentally blocking the CSS/JS needed for rendering
- Works well with sitemaps to improve discovery
- Manages crawl budget for large sites
- Keeps dev/staging environments from being crawled (pair with noindex or authentication to keep them out of the index)
- Controls specific crawler access (e.g., blocking AI training bots)
- Often the first file search engines request; a wrong rule can affect the entire site
How to implement
- Place it at the site root: /robots.txt
- Don't block critical rendering resources (CSS/JS/images)
- Use noindex (a meta tag or X-Robots-Tag header), not robots.txt, to keep pages out of search results
- Use User-agent to specify rules for specific crawlers
- Add a Sitemap directive pointing to your sitemap location
- Regularly check crawl status reports in Search Console
- Validate rules with a robots.txt testing tool (a programmatic check is sketched below)
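As a lightweight alternative to a hosted tester, you can sanity-check rules locally with Python's standard-library parser. This is only a rough sketch: the rules and URLs below are illustrative, and urllib.robotparser does not understand wildcard path patterns such as /*.css.

```python
# Rough local check of robots.txt rules against specific crawlers.
# Rules and URLs are illustrative; wildcard patterns like /*.css are not supported.
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/

User-agent: GPTBot
Disallow: /
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

checks = [
    ("Googlebot", "https://example.com/admin/settings"),  # blocked by the * group
    ("Googlebot", "https://example.com/products/shoes"),  # allowed
    ("GPTBot",    "https://example.com/products/shoes"),  # blocked by its own group
]
for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent:10} {url} -> {verdict}")
```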
Examples
```txt
# Basic robots.txt
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /private/
# Allow CSS/JS crawling
Allow: /*.css
Allow: /*.js

Sitemap: https://example.com/sitemap.xml
```

```txt
# Block specific AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# But allow Googlebot
User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
```
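If you also want to confirm that a Sitemap line like the ones above is machine-readable, the standard-library parser exposes it via site_maps() on Python 3.8+. A small sketch with the same placeholder domain:

```python
# Sketch: read Sitemap directives out of robots.txt rules (Python 3.8+ for site_maps()).
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)
print(parser.site_maps())  # -> ['https://example.com/sitemap.xml']
```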