Audit Inputs and Data Collection

Master the methodology for identifying seed URLs, defining prompt libraries, and selecting competitors to create a robust data foundation for AI visibility audits.

Introduction to AI Visibility Data Collection

Transitioning from traditional SEO to AI Visibility (AIV) requires a shift in how we perceive 'data'. In a standard SEO audit, we rely on indexed pages and keyword rankings. In an AI Visibility audit, we must collect data that reflects how Large Language Models (LLMs) ingest, process, and cite information. This lesson focuses on the four pillars of data collection: URL selection, prompt engineering for auditing, competitor identification, and signal mapping.

To audit effectively, you cannot simply 'Google it'. You must simulate the user journey through AI-first interfaces like Perplexity, Gemini, and ChatGPT. This requires a structured approach to inputs to ensure that the audit results are reproducible and actionable for your clients.

1. Defining the Seed URL Set

An AI Visibility audit does not need to cover every page on a website. Instead, it focuses on high-impact 'knowledge hubs'. You should categorise your URLs into three buckets:

  1. Direct Answer Pages: High-authority content that answers specific 'How-to' or 'What is' queries.
  2. Product/Service Entities: Pages that define what the business offers and its unique value proposition.
  3. Third-Party Citations: URLs not on the client’s domain that speak about the brand (e.g., industry reviews, Wikipedia entries, niche directories), as these often serve as primary sources for LLM training and RAG (Retrieval-Augmented Generation) systems.

Actionable Step: Export your top 100 pages by organic traffic and keep those that target informational intent (how-to, definition, and comparison queries). Add 10-15 key industry press releases or review articles to this set.
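
The three buckets can be assigned with a first-pass script before manual review. The sketch below is a minimal illustration, not a definitive classifier: the path hints, the `SeedURL` record, and the `classify_seed` function are all hypothetical names, and the keyword lists are assumptions you would tune to the client's site structure.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical path hints (assumptions); adjust to the client's URL patterns.
ANSWER_HINTS = ("how-to", "what-is", "guide")
ENTITY_HINTS = ("product", "service", "features", "pricing")


@dataclass
class SeedURL:
    url: str
    bucket: str  # "direct_answer", "entity", "third_party", or "unclassified"


def classify_seed(url: str, client_domain: str) -> SeedURL:
    """Assign a URL to one of the three seed buckets using simple heuristics."""
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    path = parsed.path.lower()
    # Anything outside the client's domain (and its subdomains) is third-party.
    if host != client_domain and not host.endswith("." + client_domain):
        return SeedURL(url, "third_party")
    if any(hint in path for hint in ANSWER_HINTS):
        return SeedURL(url, "direct_answer")
    if any(hint in path for hint in ENTITY_HINTS):
        return SeedURL(url, "entity")
    # Leave ambiguous pages for a human to place.
    return SeedURL(url, "unclassified")
```

Pages that land in "unclassified" are worth reviewing by hand rather than discarding, since URL paths are only a rough proxy for what a page actually answers.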

2. Developing the Audit Prompt Library

The quality of your audit depends entirely on the prompts you use to test the AI. You need a mix of prompt types to see how the AI treats your brand across different stages of the funnel:

  • Zero-Shot Discovery: "Who are the best providers of [Service] in [Region]?"
  • Comparative Analysis: "Compare [Client Brand] with [Competitor A] and [Competitor B]."
  • Technical Deep-Dive: "How does [Client Product] handle [Specific Technical Problem]?"
  • Citation Request: "Provide three sources for information regarding [Client's Core Topic]."

Avoid 'leading the witness'. If you always include the brand name in the prompt, you aren't testing visibility; you are only testing what the AI already knows about the brand. You must also test for 'unbranded' discovery, where the brand has to surface without being named.
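
One way to keep the library honest about branded versus unbranded coverage is to store the four prompt types as templates and flag which ones name the client. This is a sketch under stated assumptions: `PromptTemplate`, `LIBRARY`, `render`, and `unbranded_share` are hypothetical names, and the placeholder fields (`client_brand`, `client_product`, and so on) are illustrative stand-ins for the bracketed slots above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    prompt_type: str
    template: str

    @property
    def branded(self) -> bool:
        # Assumption: any placeholder starting with "client" names the brand.
        return "{client" in self.template


LIBRARY = [
    PromptTemplate("zero_shot_discovery",
                   "Who are the best providers of {service} in {region}?"),
    PromptTemplate("comparative_analysis",
                   "Compare {client_brand} with {competitor_a} and {competitor_b}."),
    PromptTemplate("technical_deep_dive",
                   "How does {client_product} handle {problem}?"),
    PromptTemplate("citation_request",
                   "Provide three sources for information regarding {core_topic}."),
]


def render(template: PromptTemplate, **values: str) -> str:
    """Fill a template's placeholders with audit-specific values."""
    return template.template.format(**values)


def unbranded_share(library: list) -> float:
    """Fraction of templates that do not mention the client by name."""
    return sum(not t.branded for t in library) / len(library)
```

If `unbranded_share` drifts towards zero as the library grows, the audit is measuring brand recall rather than discovery.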

3. Selecting the Competitive Peer Group

AI competitors are often different from your SEO competitors. While you might compete with a blog for keywords, you might compete with a software aggregator (like G2 or Capterra) or a news outlet for AI citations.

Define your competitors in three categories:

  1. Direct Business Competitors: Those who sell the same product.
  2. Information Competitors: Non-commercial sites that the AI frequently cites for your industry terms (e.g., Investopedia for finance).
  3. Aggregate Entities: The directories and forums (like Reddit) that AI models currently over-index on for 'human-like' advice.

4. Signal Mapping: The External Data Points

Beyond the LLM response itself, you must collect data on the underlying signals. For each seed URL, record:

  • Schema Markup Completeness: Is the data structured for machine readability?
  • Entity Density: Are key industry entities mentioned clearly and linked to known knowledge bases (e.g., Wikidata)?
  • Sentence Complexity: Is the writing clear enough for basic NLP (Natural Language Processing) tools to parse?
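
Two of these signals can be approximated with the Python standard library alone. The sketch below is a rough starting point, not a full signal extractor: it collects only the top-level `@type` values from JSON-LD blocks (nested `@graph` structures are ignored) and uses mean words per sentence as a crude stand-in for sentence complexity. The names `JsonLdCollector`, `schema_types`, and `mean_sentence_length` are hypothetical.

```python
import json
import re
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def schema_types(html: str) -> set:
    """Return the top-level @type values declared in a page's JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    types = set()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed markup is itself worth noting in the audit
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict):
                declared = item.get("@type", [])
                types.update([declared] if isinstance(declared, str) else declared)
    return types


def mean_sentence_length(text: str) -> float:
    """Average words per sentence; a crude proxy for sentence complexity."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)
```

Recording the set of schema types per seed URL makes it easy to spot pages that answer a prompt in prose but declare no matching structured data.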

Worked Example: B2B SaaS Audit (Project Management Software)

Scenario: A client provides an AI-powered project management tool for creative agencies.

Data Collection Steps:

  1. Seed Selection: We select the homepage, three core feature pages, and four 'Ultimate Guide' blog posts. We also include their G2 profile and a recent TechCrunch feature.
  2. Prompt Development: We create a set of 20 prompts. One example: "Which project management tools are best for managing video production workflows and why?"
  3. Competitor Set: We include Monday.com (direct business competitor), Zapier (integration partner), and Reddit threads discussing 'Video production tools' (aggregate entity).
  4. Signal Capture: We use a tool to extract all Product and HowTo schema from these URLs to check for alignment with the prompts.

Putting it into Practice

To begin your audit data collection, follow this checklist:

  1. Select 20-50 high-value URLs that represent the 'brain' of the brand.
  2. Create a 'Prompt Matrix' categorised by user intent (Informational, Transactional, Navigational).
  3. Identify 3-5 'Hidden Competitors' by running your prompts through Perplexity and seeing who is cited most often.
  4. Log your baseline. Document the current ranking and citation status before making any changes. This serves as your 'Version 0' data point.
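
Steps 3 and 4 of the checklist can be combined into a single baseline record once you have the cited URLs for each prompt. The sketch below assumes you have already collected those citations by hand or from an export (no particular AI platform API is implied), and `domain_of` and `baseline_record` are hypothetical names.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Normalise a URL to its host, without a leading 'www.'."""
    return urlparse(url).netloc.lower().removeprefix("www.")


def baseline_record(citations_per_prompt: dict, client_domain: str,
                    top_n: int = 5) -> dict:
    """Summarise who gets cited across a prompt set as a 'Version 0' snapshot.

    citations_per_prompt maps each prompt to the list of URLs the AI cited.
    """
    counts = Counter()
    client_hits = 0
    for urls in citations_per_prompt.values():
        # Count each domain once per prompt so one long answer cannot dominate.
        domains = {domain_of(u) for u in urls}
        if client_domain in domains:
            client_hits += 1
        counts.update(domains - {client_domain})
    return {
        "version": 0,
        "prompts_tested": len(citations_per_prompt),
        "prompts_citing_client": client_hits,
        "most_cited_other_domains": counts.most_common(top_n),
    }
```

The domains at the top of `most_cited_other_domains` are candidates for the 'Hidden Competitors' list, and re-running the same function after changes gives a like-for-like comparison against the baseline.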

Figure: A workflow diagram showing four input streams (URLs, Prompts, Competitor List, Technical Signals) feeding into a central 'Audit Processing' block, resulting in an 'AI Visibility Scorecard'.

Exercise

Select a client or your own website and identify five 'Seed URLs' that contain the most authoritative information about your core service. Then, write three different prompts (Informational, Comparative, and Brand-specific) that you would use to test an LLM's knowledge of these pages.

Key takeaways

  • AI Visibility audits require a mix of internal and third-party URLs.
  • Data collection must include unbranded prompts to test for discovery.
  • Information competitors are often different from traditional SEO competitors.
  • The 'Knowledge Hub' approach prioritises quality of data over quantity of pages.
  • Citation requests in prompts reveal which domains the AI trusts as sources.
  • Schema markup analysis is a critical secondary data point for AI audits.
  • Reddit and niche forums are increasingly important 'Aggregate Entities' in the competitor set.
  • Entity density helps LLMs associate your brand with specific industry topics.
  • A Prompt Matrix ensures consistency across multiple testing sessions.
  • Documenting baseline citations is essential for measuring AI-SEO ROI.

Lesson Quiz

Pass at 70%.

1. Which of the following should be included in the 'Seed URL' set for an AI audit?
2. What is a 'Zero-Shot Discovery' prompt?
3. Why are 'Information Competitors' important in an AI audit?
4. How many prompts are typically recommended for a baseline AI visibility test?
5. Which signal is most relevant to how an LLM parses data from a webpage?
6. What is the primary risk of only using 'branded' prompts in an audit?
7. In the context of AI Visibility, what is a 'Knowledge Hub'?
8. Why should an auditor include third-party forum links (like Reddit) in their competitor set?
9. What does a 'Citation Request' prompt specifically help you measure?
10. When documenting a baseline for an audit, what is the 'Version 0' data point?