Sitemaps, Feeds and llms.txt

Master the technical deployment of llms.txt, XML sitemaps, and RSS/Atom feeds to prioritise critical content and context for AI crawlers and Large Language Models.

12 min read
Foundations

Introduction to AI-Centric Discoverability

In the traditional search landscape, XML sitemaps were primarily a tool to help Googlebot discover URLs on large or complex sites. In the era of Generative Search and AI visibility, we must expand our definition of 'discoverability'. It is no longer just about ensuring a URL is indexed; it is about ensuring the right content is prioritised, contextualised, and formatted for Large Language Model (LLM) ingestion.

This lesson explores the trifecta of machine-readable files: the evolving XML sitemap, the renewed role of RSS/Atom feeds for surfacing fresh content, and the emerging llms.txt proposal. By mastering these, you help AI agents spend their limited crawl budgets on your highest-value data rather than low-quality boilerplate.

The Emergence of llms.txt

The llms.txt file is a proposed community standard that functions as a parallel to robots.txt. While robots.txt tells a crawler where it cannot go, llms.txt provides a roadmap for where an LLM should go and how it should interpret the content.

Why llms.txt Matters

LLMs such as GPT-4 or Claude process information differently than traditional search engines. They benefit from highly condensed, markdown-formatted summaries and clear taxonomies. An llms.txt file located at the root of your domain (e.g., example.com/llms.txt) is intended to act as a 'fast-track' for any model or agent that chooses to read it.

Formatting and Syntax

The format is typically Markdown. The file should contain:

  1. A primary H1 title: The name of the project or site.
  2. A brief summary: 2-3 sentences, written as a Markdown blockquote, describing the site's purpose.
  3. Information blocks: H2 sections containing categorised lists of links to key documentation or content.

Optionally, a companion llms-full.txt file can sit alongside it as a secondary, more comprehensive file for deeper ingestion.

Example structure:

# Example SaaS Portfolio

> Comprehensive tools for AI-driven SEO and visibility management.

## Core Documentation
- [API Reference](https://example.com/docs/api): Integration guides for developers.
- [Visibility Framework](https://example.com/framework): Our proprietary scoring methodology.

## Case Studies
- [Retail Brand X](https://example.com/cases/brand-x): 40% growth in AI mentions.
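
A draft file can be sanity-checked before it is published. The following is a minimal sketch, assuming the structure described above (one H1, one blockquote summary, H2 sections of Markdown links); the function names parse_llms_txt and check_llms_txt are hypothetical and are not part of any official tooling.

```python
import re

# Sample file mirroring the structure described above (all URLs are placeholders).
EXAMPLE = """\
# Example SaaS Portfolio

> Comprehensive tools for AI-driven SEO and visibility management.

## Core Documentation
- [API Reference](https://example.com/docs/api): Integration guides for developers.
- [Visibility Framework](https://example.com/framework): Our proprietary scoring methodology.

## Case Studies
- [Retail Brand X](https://example.com/cases/brand-x): 40% growth in AI mentions.
"""

# Matches "- [Name](https://url): optional note"
LINK_RE = re.compile(r"^- \[([^\]]+)\]\((https?://[^)\s]+)\)(?::\s*(.*))?$")


def parse_llms_txt(text):
    """Extract the H1 title, blockquote summary, and H2 link sections."""
    result = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and result["title"] is None:
            result["title"] = line[2:].strip()
        elif line.startswith("> ") and result["summary"] is None:
            result["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            result["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                result["sections"][current].append(
                    {"name": match.group(1), "url": match.group(2), "note": match.group(3) or ""}
                )
    return result


def check_llms_txt(text):
    """Return a list of human-readable problems; an empty list means the draft looks complete."""
    parsed = parse_llms_txt(text)
    problems = []
    if not parsed["title"]:
        problems.append("missing H1 title")
    if not parsed["summary"]:
        problems.append("missing blockquote summary")
    if not parsed["sections"]:
        problems.append("no H2 sections")
    for name, links in parsed["sections"].items():
        if not links:
            problems.append(f"section '{name}' has no links")
    return problems


if __name__ == "__main__":
    print(check_llms_txt(EXAMPLE) or "no problems found")
```

The checker only looks at structure; it does not fetch the linked URLs or judge whether the summary is useful.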

Strategic XML Sitemaps for AI

Traditional SEO often involves submitting every 'canonical' URL to a sitemap. For AI visibility, we must be more surgical. AI crawlers like OAI-SearchBot (OpenAI) or PerplexityBot have different priorities than Google Search.

Prioritising Informational Depth

LLMs value information density. In your sitemaps, group URLs not just by site hierarchy but by 'knowledge value', for example by placing each group in its own sitemap file.

  • High Priority: Whitepapers, technical docs, long-form evergreen guides, FAQ schemas.
  • Low Priority: Product variants, basic category filters, thin blog posts.
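
As a rough illustration of this split, the sketch below sorts URLs into a 'knowledge' group and an 'other' group by path prefix. The prefixes in KNOWLEDGE_PREFIXES and the function name segment_urls are assumptions made for the example; a real site would derive the rule from its own URL structure or CMS content types.

```python
from urllib.parse import urlparse

# Hypothetical path prefixes that mark high-knowledge-value content.
KNOWLEDGE_PREFIXES = ("/docs/", "/whitepapers/", "/guides/", "/faq/")


def segment_urls(urls):
    """Split URLs into 'knowledge' and 'other' groups based on their path prefix."""
    segments = {"knowledge": [], "other": []}
    for url in urls:
        path = urlparse(url).path
        key = "knowledge" if path.startswith(KNOWLEDGE_PREFIXES) else "other"
        segments[key].append(url)
    return segments


if __name__ == "__main__":
    sample = [
        "https://example.com/docs/api",
        "https://example.com/whitepapers/ai-visibility",
        "https://example.com/shop/widget?colour=red",
        "https://example.com/blog/short-update",
    ]
    for group, members in segment_urls(sample).items():
        print(group, members)
```

Each group can then be written to its own sitemap file.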

Using Lastmod Correctly

The <lastmod> tag is critical for AI. LLMs are frequently criticised for 'hallucinating' based on outdated data. A sitemap cannot change what a model has already learned in training, but an accurate lastmod date tells AI crawlers that a fresher version of a page is available, prompting them to re-fetch it so that retrieval-based answers draw on the updated information rather than a stale cached copy.
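
A sitemap with lastmod values can be generated from whatever modification timestamps your CMS stores. This is a minimal sketch using Python's standard library; build_sitemap is a hypothetical helper, the page list is invented, and the timestamps are assumed to be timezone-aware datetimes that reflect real content changes.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML (as bytes) from (url, timezone-aware datetime) pairs."""
    ET.register_namespace("", SITEMAP_NS)  # serialise without an "ns0:" prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, modified in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        # W3C Datetime format in UTC, e.g. 2024-05-01T12:00:00Z
        stamp = modified.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = stamp
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    pages = [
        ("https://example.com/docs/api", datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)),
        ("https://example.com/whitepapers/ai-visibility", datetime(2024, 3, 15, 9, 30, tzinfo=timezone.utc)),
    ]
    print(build_sitemap(pages).decode("utf-8"))
```

The important design choice is that the timestamp comes from the stored modification time, not from the moment the sitemap is generated.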

RSS and Atom Feeds: The Real-Time Feed

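Sitemaps and feeds are both fetched by polling, but RSS and Atom feeds are compact and ordered by recency, so a crawler can check them frequently at little cost. Paired with a hub protocol such as WebSub, a feed can also push updates to subscribers. Generative AI search engines may use such feeds to populate 'New' or 'Trending' summaries.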

The Discovery Feed Strategy

To maximise AI visibility for breaking news or industry analysis:

  1. Full-text Feeds: Unlike traditional marketing feeds that show only snippets, AI-optimised feeds should provide the full text (or a substantial, information-rich summary) to allow for immediate context parsing without the bot having to render the full HTML page.
  2. Semantic Enrichment: Use feed extensions to include metadata such as author expertise, primary entities, and topical categories.
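
The two points above can be sketched with Atom's built-in elements: content carries the full text, while author and category carry basic metadata without any extension namespace. The helper names and the entry data below are invented for illustration; this is a sketch under those assumptions, not a complete feed generator.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def _el(parent, tag, text=None, **attrs):
    """Append a namespaced Atom child element with optional text and attributes."""
    child = ET.SubElement(parent, f"{{{ATOM_NS}}}{tag}", attrs)
    if text is not None:
        child.text = text
    return child


def build_feed(feed_id, title, updated, entries):
    """Build an Atom feed (as bytes) whose entries carry full-text content."""
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    _el(feed, "id", feed_id)
    _el(feed, "title", title)
    _el(feed, "updated", updated)
    for item in entries:
        entry = _el(feed, "entry")
        _el(entry, "id", item["url"])
        _el(entry, "title", item["title"])
        _el(entry, "updated", item["updated"])
        _el(entry, "link", href=item["url"])
        author = _el(entry, "author")
        _el(author, "name", item["author"])
        for term in item["categories"]:
            _el(entry, "category", term=term)  # topical categories / entities
        # Full article body; ElementTree escapes the HTML for type="html".
        _el(entry, "content", item["html"], type="html")
    return ET.tostring(feed, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_feed(
        "https://example.com/feed.atom",
        "Example Engineering Blog",
        "2024-05-01T12:00:00Z",
        [{
            "url": "https://example.com/blog/release-notes",
            "title": "Release notes",
            "updated": "2024-05-01T12:00:00Z",
            "author": "Example Author",
            "categories": ["release-notes", "cloud-infrastructure"],
            "html": "<p>Full text of the post goes here.</p>",
        }],
    )
    print(xml_bytes.decode("utf-8"))
```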

Worked Example: A B2B Software Provider

Imagine a company, 'TechStream', that provides cloud infrastructure. They want ChatGPT and Perplexity to draw on their latest technical specifications when answering user queries.

Step 1: The llms.txt Setup. They create techstream.com/llms.txt and link directly to their /docs/ and /whitepapers/ sections. They leave out the /billing/ and /support-tickets/ paths, which contain nothing worth citing. Omitting a path from llms.txt does not block it; access restrictions belong in robots.txt.

Step 2: Sitemap Segmentation. Instead of one giant sitemap.xml, they create a dedicated sitemap-knowledge-base.xml and declare it explicitly in robots.txt:

Sitemap: https://techstream.com/sitemap-knowledge-base.xml

This lets any crawler that reads robots.txt find the file that lists their high-density pages.
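
One way to confirm that the Sitemap directive is machine-readable is to parse the robots.txt content with Python's standard urllib.robotparser (its site_maps() method needs Python 3.8 or later). The robots.txt body below is an assumed example for the fictional TechStream site; only the Sitemap line comes from the scenario above.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content for the fictional TechStream site.
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://techstream.com/sitemap-knowledge-base.xml
"""


def declared_sitemaps(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []


if __name__ == "__main__":
    for sitemap_url in declared_sitemaps(ROBOTS_TXT):
        print(sitemap_url)
```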

Step 3: Feed Integration. They set up an Atom feed for their 'System Updates' and 'Engineering Blog'. When a new patch note is released, it appears at the top of the feed. AI bots that poll the feed can pick up the update soon after publication, making it more likely that a question such as 'What is the latest TechStream version?' is answered from current data.

Putting it into Practice

To audit your site's machine-readable discoverability, follow these steps:

  1. Map your 'Knowledge URLs': Identify the top 50 pages that contain the core facts an AI would need to explain your business.
  2. Create an llms.txt: Use Markdown to link these 50 pages. Host it at the root.
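  3. Validate your Sitemaps: Ensure your CMS updates lastmod automatically, and only when the content actually changes. A page whose lastmod is three years old might be treated as 'stale', lowering the AI's confidence in the data, while dates that change without real edits teach crawlers to ignore the signal.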
  4. Check robots.txt: Ensure you aren't accidentally blocking OAI-SearchBot or PerplexityBot if your goal is AI visibility.
  5. Monitor Access Logs: Use your server logs to see if bots are hitting your llms.txt or specific sitemaps. This confirms your discoverability signals are being received.
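
Steps 4 and 5 can be partly automated. The sketch below is an illustration under stated assumptions: the log lines follow the common 'combined' access-log format, bots are identified by a user-agent substring (which can be spoofed), and the function names blocked_bots and count_bot_hits are hypothetical.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

AI_BOTS = ("OAI-SearchBot", "PerplexityBot")

# Pulls the request path out of a combined-format log line, e.g. "GET /llms.txt HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def blocked_bots(robots_txt, url, bots=AI_BOTS):
    """Return the bots that the given robots.txt body forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


def count_bot_hits(log_lines, prefixes=("/llms.txt", "/sitemap"), bots=AI_BOTS):
    """Count requests per bot for paths that start with any of the given prefixes."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match or not match.group(1).startswith(prefixes):
            continue
        for bot in bots:
            if bot in line:  # substring match on the user-agent field
                hits[bot] += 1
    return hits


if __name__ == "__main__":
    robots = "User-agent: PerplexityBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
    print(blocked_bots(robots, "https://example.com/llms.txt"))

    # Invented log lines; IP addresses come from a documentation-only range.
    logs = [
        '203.0.113.7 - - [01/May/2024:12:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "PerplexityBot/1.0"',
        '203.0.113.9 - - [01/May/2024:12:01:00 +0000] "GET /pricing HTTP/1.1" 200 900 "-" "OAI-SearchBot/1.0"',
    ]
    print(count_bot_hits(logs))
```

A bot that appears in blocked_bots will never show up in the hit counts, so it makes sense to run the robots.txt check first.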

Visual diagram

A flowchart showing how an LLM crawler accesses a site, starting at robots.txt, then bifurcating to llms.txt for context and sitemap.xml for data, leading to the LLM Knowledge Base.

Exercise

Create a draft 'llms.txt' file for your current website or a client's site. Ensure it includes an H1, a brief summary, and at least three categories of links (e.g., Services, Case Studies, Documentation). Use valid Markdown syntax.

Key takeaways

  • Machine-readable files act as a prioritisation layer for LLMs.
  • The llms.txt file is an emerging standard for providing compressed context.
  • Modern XML sitemaps should focus on 'knowledge value' rather than just URL lists.
  • Accurate lastmod dates help prevent AI models from relying on stale data.
  • Full-text RSS feeds enable faster ingestion of real-time or trending content.
  • llms.txt should be written in Markdown for easy LLM parsing.
  • Segmenting sitemaps allows you to highlight high-density content to AI crawlers.
  • Robots.txt must be aligned with AI bot permissions to allow discoverability.
  • llms.txt belongs in the root directory of a domain.
  • Technical discoverability is the foundation of factual AI mentions.

Lesson Quiz

Pass at 70%.

1. Where should the llms.txt file be located on a website?
2. What is the primary format used for the content within an llms.txt file?
3. Which sitemap tag is most critical for reducing AI 'hallucinations' by signalling fresh data?
4. How does llms.txt differ from robots.txt?
5. Why are full-text RSS feeds preferred over snippet-only feeds for AI visibility?
6. What is 'llms-full.txt' used for?
7. Which bot is specifically associated with OpenAI's search capabilities?
8. What should be the 'High Priority' focus for AI-centric sitemaps?
9. In the context of AI, what is a 'polling-based' discovery mechanism?
10. If you want an AI to avoid including a page in its training data, where is the first place to check?