Robots.txt

Technical & Infrastructure

A file telling search engine crawlers which pages on your site they may crawl and which to stay out of.

Definition

Robots.txt is a plain text file placed at the root of a website that tells search engine crawlers which pages or directories they may and may not access. It follows the Robots Exclusion Protocol, a standard that all major search engines respect. The file contains rules for specific user agents (crawler names) along with Allow and Disallow directives that control access to different URL paths. Robots.txt is a request rather than an access control mechanism, so crawlers can technically ignore it, but reputable crawlers, including those run by Google and Bing and the major AI crawlers, generally honor these directives.

Why It Matters

Without a properly configured robots.txt, search engines may crawl and index pages you never meant to surface, such as admin panels, staging environments, API endpoints, or duplicate content. Conversely, blocking the wrong paths can prevent your public flipbooks and publications from appearing in search results entirely. A well-maintained robots.txt helps search engines focus their crawl budget (the number of pages a crawler will visit on your site within a given time) on the content you actually want discovered. For publishers hosting flipbooks, this means ensuring that landing pages, [SEO metadata](/glossary/seo), and preview pages are fully accessible to crawlers.

How It Works in FlipLink

FlipLink's marketing site uses a robots.txt that allows all public pages, blog posts, feature pages, glossary entries, and guides to be crawled while blocking internal API routes and application paths. It also explicitly permits AI crawlers like GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and Applebot-Extended so that flipbook-related content appears in AI-powered search answers. When you publish flipbooks on a [Custom Domain](/features/custom-domains), you can configure your own robots.txt on that domain to control how search engines treat your hosted publications. The [SEO & Social Previews](/features/seo-and-social-previews) feature works alongside robots.txt to ensure indexed pages present optimized metadata to both traditional and AI search engines.
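
As a rough illustration of a file with this shape, the sketch below parses a sample robots.txt with Python's standard-library `urllib.robotparser`. It is not FlipLink's actual file: the `/api/` and `/app/` paths, the `example.com` host, and the sitemap URL are placeholders assumed for the example. Python's parser applies the first matching rule in file order, so the `Disallow` lines are listed before `Allow: /`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the paths and sitemap URL are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
Disallow: /app/

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: PerplexityBot
User-agent: Applebot-Extended
Disallow: /api/
Disallow: /app/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` on the placeholder host."""
    return parser.can_fetch(agent, "https://example.com" + path)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        for path in ("/blog/post", "/api/v1/books", "/app/dashboard"):
            print(agent, path, allowed(agent, path))
```

Listing several `User-agent` lines above one set of rules applies those rules to every named crawler, which keeps the AI crawler block short.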

Technical Details

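A robots.txt file uses a straightforward syntax. Each block starts with a `User-agent` line specifying which crawler the rules apply to, followed by `Disallow` and `Allow` directives:

- **User-agent: \*** applies the rules that follow to all crawlers
- **Disallow: /api/** prevents crawlers from accessing anything under /api/
- **Allow: /blog/** explicitly permits access to the blog directory
- **Sitemap:** declares the location of your XML [sitemap](/glossary/sitemap) for crawler discovery

When several rules match the same URL, crawlers that follow the current standard (RFC 9309), Google among them, do not simply take the first one listed: the most specific rule, meaning the longest matching path, takes priority, and `Allow` wins a tie. Some older parsers apply the first matching rule instead, so listing narrower rules first keeps behavior consistent across crawlers. The file must be accessible at the exact URL `https://yourdomain.com/robots.txt`; no other location works. Note that robots.txt does not prevent a URL from being indexed if other sites link to it. For that, you need a `noindex` meta tag or an `X-Robots-Tag: noindex` HTTP header, and the page must remain crawlable so the crawler can see that instruction.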
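
The idea that more specific paths take priority can be sketched as longest-prefix matching. The function below is a simplified illustration, not a full parser: it handles a single block of rules, ignores the `*` and `$` wildcards that some crawlers support, and the names `Rule` and `is_allowed` are invented for this example.

```python
from typing import List, Tuple

# (directive, path prefix), for example ("disallow", "/api/")
Rule = Tuple[str, str]


def is_allowed(rules: List[Rule], url_path: str) -> bool:
    """Return True if url_path may be crawled under one block of rules."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if not prefix or not url_path.startswith(prefix):
            continue  # skip empty or non-matching rules
        is_allow = directive.lower() == "allow"
        # The longer prefix wins; on equal length, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


if __name__ == "__main__":
    rules = [
        ("disallow", "/api/"),
        ("allow", "/api/public/"),
        ("allow", "/blog/"),
    ]
    for path in ("/api/users", "/api/public/status", "/blog/post", "/pricing"):
        print(path, is_allowed(rules, path))
```

Because the decision depends on prefix length rather than position, reordering the rules in this sketch does not change the result.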

Common Misconceptions

- **"Robots.txt blocks pages from appearing in search results."** Not entirely. While it prevents crawlers from visiting the page, search engines may still list the URL with limited information if other sites link to it. For pages you truly want excluded from search results, use a `noindex` meta tag and leave the page crawlable so that crawlers can read it.
- **"I only need rules for Googlebot."** Bing, Yandex, DuckDuckGo, and AI crawlers all read robots.txt. Ignoring them means missing traffic from alternative search engines and AI answer tools.
- **"Once I set it, I never need to update it."** Your robots.txt should evolve as your site grows. New sections, tools, and content paths need to be reviewed to ensure they are crawlable. The list of AI crawler user agents also keeps growing.
- **"Robots.txt is a security measure."** It is not. The file is publicly readable and provides no access control. Sensitive paths should be protected with authentication, not just a Disallow directive.
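
The first misconception has a practical consequence: a `noindex` instruction on a page that robots.txt blocks is never seen by the crawler. One way to catch that conflict is sketched below using Python's standard-library `urllib.robotparser`. The function name, the sample rules, and the `example.com` URLs are assumptions made for illustration.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def noindex_urls_hidden_from_crawlers(
    robots_txt: str, noindex_urls: List[str], agent: str = "Googlebot"
) -> List[str]:
    """Return the noindex URLs that `agent` is not allowed to crawl.

    A crawler has to fetch a page to read its noindex instruction,
    so these URLs may still be listed in search results.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Hypothetical rules and URLs.
    robots_txt = "User-agent: *\nDisallow: /drafts/\n"
    noindex_urls = [
        "https://example.com/drafts/launch-plan",
        "https://example.com/old-landing-page",
    ]
    print(noindex_urls_hidden_from_crawlers(robots_txt, noindex_urls))
```

Any URL the function returns needs either its `Disallow` rule lifted or a different way of keeping it out of search results.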

Setup Checklist

1. **Identify all public paths**: list every section of your site that should appear in search results (blog, features, glossary, landing pages).
2. **List all private paths**: API endpoints, admin routes, staging pages, embed endpoints, and internal tools.
3. **Write user-agent rules**: create a `User-agent: *` block with your Disallow directives for private paths.
4. **Add AI crawler permissions**: include explicit `User-agent` blocks for GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and other AI crawlers with appropriate Allow rules.
5. **Declare your sitemap**: add a `Sitemap:` line pointing to your XML sitemap URL.
6. **Test with Google Search Console**: use the robots.txt report (which replaced the retired robots.txt tester) together with URL Inspection to verify that important pages are accessible and private pages are blocked.
7. **Review quarterly**: as your site adds new sections or tools, update robots.txt to reflect the current structure.
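
Steps 2 through 6 can be approximated in a small script that builds the file from your lists and then checks it before deployment. This is a minimal sketch, assuming hypothetical paths, crawler names, and an `example.com` sitemap URL. The helper names `build_robots_txt` and `failed_checks` are invented for this example, and a local check like this complements testing in Search Console without replacing it.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser


def build_robots_txt(
    private_paths: List[str], ai_agents: List[str], sitemap_url: str
) -> str:
    """Build a robots.txt with a wildcard block, an AI crawler block, and a sitemap."""
    disallow = [f"Disallow: {path}" for path in private_paths]
    lines = ["User-agent: *", *disallow, ""]
    if ai_agents:
        lines += [f"User-agent: {agent}" for agent in ai_agents]
        # Disallow lines come first so first-match parsers agree with
        # longest-match parsers.
        lines += [*disallow, "Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


def failed_checks(
    robots_txt: str, expectations: Dict[str, bool], agent: str = "Googlebot"
) -> List[str]:
    """Return the URLs whose crawlability differs from what was expected."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url, should_allow in expectations.items()
        if parser.can_fetch(agent, url) != should_allow
    ]


if __name__ == "__main__":
    robots_txt = build_robots_txt(
        private_paths=["/api/", "/admin/"],
        ai_agents=["GPTBot", "ClaudeBot"],
        sitemap_url="https://example.com/sitemap.xml",
    )
    print(robots_txt)
    expectations = {
        "https://example.com/blog/": True,
        "https://example.com/api/v1/books": False,
    }
    print(failed_checks(robots_txt, expectations))
```

Running the same expectations for each crawler name you care about, for example `agent="GPTBot"`, confirms that the AI crawler block behaves as intended.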

Related Terms

Schema Markup

Structured data added to web pages to help search engines display rich results in SERPs.

SEO (Search Engine Optimization)

Practices that improve a website's visibility and ranking in search engine results pages.

Sitemap

An XML file listing all pages on a website to help search engines discover and index content.

SMTP (Simple Mail Transfer Protocol)

The standard protocol for sending emails between servers, used for notification delivery.

SPF (Sender Policy Framework)

An email authentication record that specifies which servers can send email for your domain.
