
robots.txt is a plain-text file placed at the root of a website (/robots.txt) that instructs web crawlers which pages or directories they may or may not access. Part of the 📝Robots Exclusion Protocol (REP), first proposed by Martijn Koster in 1994, it is a convention — not a legal enforcement mechanism. Crawlers are expected to read and respect it, but nothing prevents a crawler from ignoring it.

How It Works

A robots.txt file contains directives addressed to specific user agents (crawler identifiers) with Allow and Disallow rules:

User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

The file must be served at the exact path /robots.txt at the root of the host; each subdomain needs its own copy. Crawlers fetch it before accessing other pages. A missing file (404) is interpreted as "everything is allowed."
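The lookup a crawler performs can be sketched with Python's standard-library robots.txt parser. This is a minimal illustration of the rules above, not any particular crawler's implementation; the URLs and the "MyCrawler" agent name are made up:

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, parsed locally instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A generic crawler falls through to the "*" group: public pages are
# allowed, anything under /private/ is not.
print(rp.can_fetch("MyCrawler", "https://example.com/page"))       # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))  # False

# GPTBot matches its own group and is disallowed from the whole site.
print(rp.can_fetch("GPTBot", "https://example.com/page"))          # False
```

Note that `can_fetch` only reports what the file says; honoring the answer is still up to the crawler, which is the protocol's central weakness.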

robots.txt and AI Training

The file was originally designed to manage search engine indexing — preventing Googlebot from crawling admin pages, staging environments, or duplicate content. With the rise of 📝Large Language Models (LLMs), it has been repurposed as the primary mechanism for website operators to signal whether their content should be used as 📝training data.

Major AI companies have introduced dedicated crawler user agents:

  • GPTBot — 📝OpenAI
  • ClaudeBot / anthropic-ai — 📝Anthropic
  • Google-Extended — Google (for Gemini training, separate from search indexing)
  • CCBot — Common Crawl (the open dataset many models train on)
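
Put together, an opt-out that blocks the training crawlers listed above while leaving ordinary search crawling untouched looks like this. The user-agent strings are the ones named in the list; the rest of the file is illustrative:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```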

Blocking these user agents in robots.txt is currently the most widely adopted opt-out method, but it has significant limitations:

  • Not legally binding — robots.txt is a voluntary convention, not a contract or regulation
  • Not retroactive — blocking a crawler today does not undo prior scraping
  • Granularity gap — site operators can block crawlers but cannot express nuanced permissions (e.g., "index for search but don't train on this")
  • No audit trail — there is no way to verify whether a crawler respected the directive

The Consent Infrastructure Gap

robots.txt is a 1994 solution being stretched to address a 2026 problem. Initiatives like the Do Not Train registry, the W3C's proposed machine-readable licensing standards, and contractual licensing programs represent attempts to build proper consent infrastructure for AI training. None has reached industry-wide adoption. Until something does, robots.txt remains the default — a voluntary signal in a space that increasingly demands enforceable boundaries.

MythOS serves its own robots.txt. When I configured it, the question was straightforward for search crawlers but genuinely complicated for AI training crawlers. Blocking them protects content from being ingested without consent. But the platform also distributes content through AI — via 📝MCP, via llms.txt, via the chat API. The tension is real: the same content I want to protect from unauthorized training, I also want to make available to authorized agents. robots.txt can't express that distinction. It's a binary in a world that needs permissions.

