    robots.txt

    robots.txt controls crawling; noindex controls indexing. They solve different problems.

    Definition

    robots.txt is a plain text file at your site root that provides crawl directives to bots. It’s advisory and not a security mechanism.

    Why it matters

    • Reduces crawl waste on low-value paths (e.g., admin)
    • Misconfigured rules can block CSS/JS needed for rendering, which hurts how pages are evaluated
    • Works well with sitemaps to improve discovery
    • Manages crawl budget for large sites
    • Keeps dev/staging environments from being crawled (pair with noindex or authentication to actually keep them out of the index)
    • Controls specific crawler access (e.g., blocking AI training bots)
    • First touchpoint for search engines; wrong settings affect entire site

    How to implement

    • Place it at the site root: /robots.txt
    • Don't block critical rendering resources (CSS/JS/images)
    • To keep a page out of search results, use noindex (meta tag or X-Robots-Tag header), not robots.txt; a URL blocked by robots.txt can still be indexed if other pages link to it
    • Use User-agent to specify rules for specific crawlers
    • Add Sitemap directive pointing to sitemap location
    • Regularly check crawl status reports in Search Console
    • Validate rules with a robots.txt testing tool or programmatically, as in the sketch below
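
    For a quick programmatic check, Python's standard-library urllib.robotparser can parse a rules file and answer allow/deny questions per user agent. A minimal sketch, using illustrative rules, a hypothetical ExampleBot user agent, and placeholder example.com URLs; note that this parser does simple prefix matching and does not interpret wildcard patterns such as /*.css:

    python
    from urllib.robotparser import RobotFileParser

    # Illustrative rules -- a trimmed variant of the basic example below
    rules = [
        "User-agent: *",
        "Disallow: /admin/",
        "Disallow: /api/",
    ]

    parser = RobotFileParser()
    parser.parse(rules)  # parse in-memory lines instead of fetching /robots.txt

    # can_fetch(user_agent, url) reports whether that agent may crawl the URL
    print(parser.can_fetch("ExampleBot", "https://example.com/blog/post"))   # True
    print(parser.can_fetch("ExampleBot", "https://example.com/admin/users")) # False

    Parsers differ in how they resolve Allow/Disallow conflicts, so keep test rules unambiguous and confirm anything subtle with the search engine's own report.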

    Examples

    txt
    # Basic robots.txt
    User-agent: *
    Allow: /
    Disallow: /admin/
    Disallow: /api/
    Disallow: /private/
    
    # Allow CSS/JS crawling
    Allow: /*.css
    Allow: /*.js
    
    Sitemap: https://example.com/sitemap.xml
    txt
    # Block specific AI training crawlers
    User-agent: GPTBot
    Disallow: /
    
    User-agent: CCBot
    Disallow: /
    
    # But allow Googlebot
    User-agent: Googlebot
    Allow: /
    
    Sitemap: https://example.com/sitemap.xml
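
    As a sanity check on this example, the same standard-library parser confirms that GPTBot is refused while Googlebot is not (bot tokens as in the file above; the article URL is a placeholder):

    python
    from urllib.robotparser import RobotFileParser

    # The user-agent groups from the example above (Sitemap line omitted)
    rules = [
        "User-agent: GPTBot",
        "Disallow: /",
        "",
        "User-agent: Googlebot",
        "Allow: /",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
    print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True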
