A GEO audit is a structured review of a website against the technical, content, and authority signals that decide whether ChatGPT, Perplexity, Claude, and Google AI Mode will cite the business in their answers. The framework that follows has seven sections and forty-seven specific checks. It is the same audit APEX runs on every paid GEO engagement, written down in full so an in-house team can execute the first pass before booking a call.
AI-referred website sessions grew 527 percent year over year in 2025. Over 844,000 sites have shipped an llms.txt file. AI Overviews now appear on more than half of all Google searches. The discovery layer is shifting in real time, and the businesses that get cited are the ones that have done the structural work, not the ones with the loudest content.
If GEO is new territory, start with the introduction to AI discoverability, which covers the why and the high-level stack. This audit is the implementation layer. By the end of it, you have a prioritised list of fixes and a way to measure whether the work is paying off.
What a "GEO audit" actually means (and what it does not)
A GEO audit is not a longer SEO audit. The two disciplines share half their foundations and diverge sharply at the top. SEO ranks a page against ten blue links on a results page. GEO decides whether a page is selected as part of a synthesised answer by a generative model. The retrieval mechanics, the weighting, and the failure modes are all different.
A clean GEO audit answers four questions in order. Can the AI crawler reach the page? Once it arrives, can it parse the content cleanly enough to extract the answer? Once the answer is extracted, does the surrounding entity graph give the model enough authority signal to cite the page over a competitor? And finally, once the citation lands, is the page instrumented to convert the AI-referred visitor into a measurable business outcome?
The audit is intentionally exhaustive. AI systems do not give partial credit. A page with thirteen excellent signals and one broken robots.txt directive can be invisible. Working through the full forty-seven checks below is the cheapest insurance against that single missing item.
Section 1: Crawl access and AI bot permissions (8 checks)
The first failure mode is the simplest. The AI system cannot read the page. Crawl access has three layers that all need to be aligned: robots.txt, the edge (Cloudflare, Fastly, AWS WAF), and any application-level bot blocking. Get one wrong and the rest of the audit is academic.
AI crawlers operate in three tiers, and blocking one tier does not block the others. Training bots (GPTBot, anthropic-ai, Google-Extended, Applebot-Extended) consume content to improve model weights. Search and citation bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Gemini-User in some flows) fetch pages to generate real-time answers. User-triggered bots (ChatGPT-User, Claude-User, Perplexity-User) retrieve pages when an end user pastes or clicks a link inside an AI conversation. For most businesses that want visibility, allowing all three tiers is the correct strategy.
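The three-tier policy can be expressed as a small Python sketch. The tier labels, the `render_robots_txt` helper, and the example.com sitemap URL are illustrative assumptions, not part of any standard; the user-agent tokens are the ones named in the paragraph above. Allowing or denying a whole tier emits the matching robots.txt groups.

```python
# Tiers and tokens as described in this section; the grouping keys are our own labels.
AI_BOT_TIERS = {
    "training": ["GPTBot", "anthropic-ai", "Google-Extended", "Applebot-Extended"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "user_triggered": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def render_robots_txt(allowed_tiers, sitemap_url):
    """Build robots.txt text that allows or denies each tier as a unit."""
    lines = []
    for tier, agents in AI_BOT_TIERS.items():
        directive = "Allow: /" if tier in allowed_tiers else "Disallow: /"
        lines.append(f"# {tier} bots")
        for agent in agents:
            # One group per agent keeps each permission explicit.
            lines += [f"User-agent: {agent}", directive, ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical policy: allow every tier, as recommended for visibility.
    print(render_robots_txt(set(AI_BOT_TIERS), "https://example.com/sitemap.xml"))
```

Passing a smaller set, for example `{"search", "user_triggered"}`, keeps citation and user-triggered fetches open while denying training crawlers.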
The checks for this section:
- 1.1 Explicit Allow rules in robots.txt for the 15 known AI user agents: GPTBot, OAI-SearchBot, ChatGPT-User, anthropic-ai, ClaudeBot, Claude-SearchBot, Claude-User, PerplexityBot, Perplexity-User, Google-Extended, Gemini-User, YouBot, Meta-ExternalAgent, Meta-ExternalFetcher, and Applebot-Extended. Silent permission is not enough for conservative citation bots that interpret ambiguity as a soft block.
- 1.2 Secondary crawlers acknowledged. Bytespider (ByteDance) and CCBot (Common Crawl) feed downstream AI ecosystems. Decide explicitly whether to allow or deny each one.
- 1.3 No conflicting Cloudflare AI toggle. The "Block AI Scrapers and Crawlers" switch in Cloudflare overrides robots.txt. Check the Bots dashboard at the edge level, not just the file.
- 1.4 No CAPTCHA or Bot Fight Mode on canonical content URLs. AI crawlers fail interactive challenges and quietly leave. Reserve protection for forms and admin paths only.
- 1.5 Server logs confirm actual AI bot hits. Grep access logs for the user-agent strings above over a thirty-day window. Zero hits means the permission is theoretical.
- 1.6 No JavaScript-only rendering gate. Some AI crawlers do not execute JS or execute it on a delay. Confirm the canonical content is in the initial HTML response, not injected client-side after a fetch.
- 1.7 Sitemap declared in robots.txt and reachable. A clean Sitemap directive is the secondary route AI crawlers use to discover canonical URLs. Validate the XML returns 200 and lists every page that matters.
- 1.8 No legacy noindex on canonical pages. A stale meta robots noindex on a high-value landing page silently removes it from AI retrieval. Crawl with Screaming Frog or a custom script and flag any noindex or nosnippet on canonical URLs.
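Checks 1.1 and 1.5 lend themselves to a quick script. The sketch below uses Python's standard `urllib.robotparser` to test whether a robots.txt body permits each agent, and a plain substring match to count log hits. The function names, the shortened agent list, and the example.com URL are assumptions for illustration; real access-log formats vary, so treat the matching as a starting point.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Subset of the user agents from check 1.1; extend with the full list as needed.
AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]


def audit_robots(robots_txt, agents=AI_AGENTS, url="https://example.com/"):
    """Report, per agent, whether it is named explicitly and whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Collect every token that appears on a User-agent line (lower-cased).
    named = {
        line.split(":", 1)[1].strip().lower()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("user-agent:")
    }
    return {
        agent: {
            "explicit": agent.lower() in named,
            "allowed": parser.can_fetch(agent, url),
        }
        for agent in agents
    }


def count_bot_hits(log_lines, agents=AI_AGENTS):
    """Count access-log lines whose text contains each agent token."""
    hits = Counter({agent: 0 for agent in agents})
    for line in log_lines:
        lowered = line.lower()
        for agent in agents:
            if agent.lower() in lowered:
                hits[agent] += 1
    return hits
```

An agent that comes back with `"allowed": True` but `"explicit": False` is relying on silent permission, which is the case check 1.1 asks you to close.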
Section 2: llms.txt and llms-full.txt (7 checks)
llms.txt is the single highest-leverage file APEX ships in every engagement. It is a Markdown summary of the site, placed at the root, that lets AI systems consume the structure in their native format without paying token cost on navigation menus and cookie banners. APEX shipped llms.txt on apexdigital.ro and detected AI fan-out queries in Google Search Console within twenty-four hours. The full timeline lives in our llms.txt case study.
- 2.1 llms.txt exists at the site root. The URL must be /llms.txt exactly. Subdirectories or alternate paths are not picked up.
- 2.2 llms-full.txt exists at the site root. The companion file contains the full site content as a single Markdown document for AI systems that prefer one fetch over many.
- 2.3 Correct MIME type. Both files served as text/markdown or text/plain with UTF-8 encoding. HTML responses break the contract.
- 2.4 ASCII-clean content. Curly quotes, em-dashes, and other smart-typography characters cause tokenizer issues across some AI systems. Stick to plain ASCII or simple UTF-8.
- 2.5 Structure matches the specification. H1 site name, one blockquote summary, then H2 sections of Markdown links with a one-sentence description after the colon on each line. Optional content goes under a final ## Optional H2 so AI systems can skip it under tight token budgets.
- 2.6 Freshness timestamp inside the file. A Last updated line at the top tells AI systems the document is maintained. Bump it whenever the site adds material content.
- 2.7 robots.txt links to the files. Optional but recommended: Sitemap: /llms.txt and Sitemap: /llms-full.txt declarations make the files discoverable by crawlers that do not check the root path automatically.
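The structural checks 2.4 to 2.6 can be automated with a short linter. This is a minimal sketch, assuming the stricter reading used in this audit (every link line carries a description after the colon); the function name and the messages are hypothetical, and it does not fetch the file or check the MIME type.

```python
import re

# A list item must be "- [Title](url): description" under this audit's reading.
LINK_LINE = re.compile(r"^- \[[^\]]+\]\([^)]+\): .+$")


def validate_llms_txt(text):
    """Return a list of human-readable problems; an empty list means the checks pass."""
    problems = []
    lines = [line.rstrip() for line in text.splitlines()]
    non_empty = [line for line in lines if line.strip()]

    if not non_empty or not re.match(r"^# \S", non_empty[0]):
        problems.append("first non-empty line is not an H1 site name")
    if not any(line.startswith("> ") for line in lines):
        problems.append("missing blockquote summary")
    if not any(line.startswith("## ") for line in lines):
        problems.append("no H2 sections found")
    if not any(line.lower().startswith("last updated") for line in lines):
        problems.append("no 'Last updated' line (check 2.6)")

    for number, line in enumerate(lines, start=1):
        if line.startswith("- ") and not LINK_LINE.match(line):
            problems.append(f"line {number}: list item is not a described Markdown link")

    if not text.isascii():
        problems.append("contains non-ASCII characters (check 2.4)")
    return problems
```

Run it against the deployed file's text after each content update; a non-empty result is a prompt to review the file, not proof that an AI system will reject it.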
Section 3: Content extractability (9 checks)
Once a crawler reaches a page, the next failure mode is parsing. AI systems excel at extracting clean factual answers from clean factual structure. They underperform on dense marketing prose where the answer is buried under three paragraphs of throat-clearing. Extractability is the highest-leverage content discipline in GEO.
- 3.1 Direct-answer paragraph in the first 200 words. Every page that targets a specific question should answer it explicitly inside the first viewport. Long-form follows. Inverted-pyramid structure mirrors how AI systems extract.
- 3.2 Single-fact sentences for key claims. "Performance Max accounts for 67 percent of Shopping spend in 2026" is extractable. "Performance Max has become quite significant recently across the Shopping landscape" is not.
- 3.3 Semantic HTML5 elements. article, section, aside, nav, header, footer, and proper heading hierarchy (one H1, multiple H2, no skipped levels). AI parsers use HTML5 structure as a navigation map.
- 3.4 FAQ blocks aligned with real prompts. Write FAQ questions in the exact phrasing a user would type into ChatGPT. Match Google Search Console query phrasing where possible. Then answer in a single complete paragraph below each question.
- 3.5 Numbered lists for procedural content. Anything stepwise (audits, setups, checklists) becomes a numbered list. AI systems route list-structured content into HowTo and stepwise summaries with measurably higher fidelity.
- 3.6 Tables for comparative content. When the answer is a comparison across two or more options, use a real HTML table with th and td cells. Markdown tables convert cleanly. Image screenshots of tables do not.
- 3.7 Speakable CSS selectors declared. The speakable property inside Article schema points AI systems and voice surfaces at the snippets worth reading aloud. Best practice is to point at article h2 and article h2 + p, which surfaces section headings and their immediately following definition paragraph.
- 3.8 No critical text inside images. Pricing, definitions, and headline claims in image form are invisible to most AI parsers. Alt text helps but does not substitute for extractable HTML.
- 3.9 No client-side-only rendered text. If a page returns an empty div in the initial HTML response and populates it via JS, AI crawlers that skip JS see nothing. Pre-render or server-render canonical content.
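Two of the checks above, heading hierarchy (3.3) and client-side-only text (3.9), can be tested against the raw HTML response without a browser. The sketch below uses Python's standard `html.parser`; the class name, the function name, and the 200-character threshold are arbitrary assumptions to tune per site.

```python
from html.parser import HTMLParser


class InitialHtmlAudit(HTMLParser):
    """Collect heading levels and count visible text in un-rendered HTML."""

    def __init__(self):
        super().__init__()
        self.levels = []
        self.text_chars = 0
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def audit_initial_html(html, min_text_chars=200):
    """Return problems found in the HTML as served, before any JavaScript runs."""
    audit = InitialHtmlAudit()
    audit.feed(html)
    problems = []
    h1_count = audit.levels.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one h1, found {h1_count}")
    for previous, current in zip(audit.levels, audit.levels[1:]):
        if current > previous + 1:
            problems.append(f"skipped level: h{previous} followed by h{current}")
    if audit.text_chars < min_text_chars:
        problems.append("little or no visible text in the initial HTML")
    return problems
```

Feed it the body of a plain HTTP fetch (no headless browser) so the result reflects what a crawler that skips JavaScript would see.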
Section 4: Structured data and the entity graph (8 checks)
Structured data is the single highest-leverage technical signal. A data.world study found GPT-4 accuracy on web content jumped from 16 percent to 54 percent when pages included schema markup. Pages with structured data get cited by AI systems at roughly 2.5 times the rate of unmarked pages. The schemas below are the production set APEX ships on every site.
- 4.1 Organization or LocalBusiness with canonical @id. One Org entity, one canonical @id, referenced sitewide. Fragmented Org IDs across pages dilute entity signal. The sameAs array should include Wikidata first, Google Business Profile CID, LinkedIn, and any owned media properties.
- 4.2 Article schema on every editorial page. Headline, description, datePublished, dateModified, author (Organization plus Person), publisher, mainEntityOfPage, articleSection, wordCount, keywords, and image. Include speakable with a real cssSelector array.
- 4.3 FAQPage schema on pages with FAQ blocks. Map each visible FAQ question to a Question entity with an acceptedAnswer. The text must match the visible page content exactly. Hidden FAQPage schema with no visible counterpart is a manual action risk.
- 4.4 HowTo schema on procedural content. Audit guides, setup tutorials, and stepwise playbooks deserve HowTo with named HowToStep entries pointing at heading anchors via url. This is the schema that maps cleanest to AI procedural summaries.
- 4.5 BreadcrumbList on every non-root page. Three or four levels deep where applicable. AI systems use breadcrumbs to reconstruct site taxonomy when the navigation is JavaScript-heavy.
- 4.6 Service plus Offer on commercial pages. Service schema with a nested Offer containing price, priceCurrency, priceValidUntil, and availability. Even on services where pricing varies, a clear price range signals concreteness that AI systems weight.
- 4.7 Person schema for named authors. Author bylines deserve a real Person entity with jobTitle, worksFor, alumniOf, knowsAbout, and sameAs pointing to LinkedIn and any public profiles. AI citation models favour content attributable to identifiable expertise.
- 4.8 Schema validator pass on every type. Paste rendered JSON-LD into the Schema Markup Validator and the Rich Results Test for every schema type used. Warnings count.
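Generating JSON-LD from the same data that renders the page is one way to keep schema and visible content in sync, as checks 4.1 and 4.3 require. The sketch below builds Organization and FAQPage objects with Python's `json` module. The helper names, the `#organization` fragment convention, and all example.com values are assumptions for illustration; the property names are standard schema.org vocabulary.

```python
import json


def organization_jsonld(site_url, name, same_as):
    """One Organization entity with a single canonical @id, reusable sitewide."""
    base = site_url.rstrip("/")
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "@id": f"{base}/#organization",  # assumed convention, not mandated by schema.org
        "name": name,
        "url": site_url,
        "sameAs": list(same_as),
    }


def faq_page_jsonld(faqs):
    """Map visible (question, answer) pairs to FAQPage markup, text unchanged."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in faqs
        ],
    }


if __name__ == "__main__":
    faqs = [("What is a GEO audit?", "A structured review of a website.")]
    # Embed the output inside a script tag of type application/ld+json.
    print(json.dumps(faq_page_jsonld(faqs), indent=2))
```

Because the FAQ text is passed through untouched, the markup cannot drift from the visible answer unless the page template renders something different from the same source.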
Section 5: Authority and citation signals (6 checks)
Once the page is accessible, parseable, and structured, the next question is whether the AI selects it over a competitor. Selection is an authority decision. AI systems triangulate E-E-A-T signals (Experience, Expertise, Authoritativeness, Trustworthiness) and weight third-party validation heavily.
- 5.1 Named author with a real bio page. Every editorial post bylined by a specific person, linked to a populated About page that includes credentials, experience, and external links. "The Team" or "Admin" attribution underperforms by a measurable margin.
- 5.2 Wikidata entity with bidirectional sameAs. A Wikidata entry is the most authoritative entity-disambiguation signal available. The site declares the Wikidata URL in Organization.sameAs, and the Wikidata entry references the site as its official URL.
- 5.3 Google Business Profile completeness for local-bias queries. Category, address, phone, hours, services, posts, and reviews all populated. The GBP CID URL referenced in LocalBusiness.sameAs on the site closes the loop.
- 5.4 Third-party review platforms aligned. Clutch, G2, Trustpilot, or industry-specific platforms describing the business consistently. AI systems resolve ambiguity by ignoring inconsistent sources, so identical NAP and offering descriptions across platforms is the baseline.
- 5.5 Factual claims sourced inline. Numbers and benchmarks linked to named sources via real a href tags. Content with sourced claims achieves 30 to 40 percent higher citation rates in independent GEO studies.
- 5.6 Original research or proprietary data published. Case studies with concrete metrics, benchmark reports, or survey data unique to the business. Original numbers are the single most AI-citeable content category because they cannot be sourced anywhere else.
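The NAP consistency requirement in check 5.4 can be spot-checked once the listings are collected by hand or exported. This is a rough sketch under simple assumptions: the record layout, the function names, and the normalisation rules (case, punctuation, phone digits) are all hypothetical, and it will not reconcile abbreviations such as "St" versus "Street".

```python
import re


def normalize_nap(record):
    """Reduce name, address, and phone to a comparable form."""
    name = " ".join(record["name"].lower().split())
    address = " ".join(re.sub(r"[.,]", " ", record["address"].lower()).split())
    phone = re.sub(r"\D", "", record["phone"])  # keep digits only
    return (name, address, phone)


def nap_mismatches(reference, listings):
    """Return the platforms whose listing differs from the reference record."""
    expected = normalize_nap(reference)
    return [
        platform
        for platform, record in listings.items()
        if normalize_nap(record) != expected
    ]
```

Use the site's own LocalBusiness data as the reference record so every platform is compared against the same source of truth.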
Section 6: Freshness and lead capture (5 checks)
GEO traffic is high-intent. The user has already filtered through the AI answer and chosen the citation worth clicking. Lead capture instrumentation matters more on AI-referred sessions than on cold organic traffic, because the conversion rate is meaningfully higher and the volume is harder to attribute. This is the section that turns visibility into pipeline.
- 6.1 article:modified_time meta tag on every editorial page. Pages updated within the last sixty days earn roughly 28 percent more AI citations than stale content. The mechanism is the Open Graph article:modified_time property, which AI systems read directly.
- 6.2 dateModified in Article schema matches visible updated stamp. Both must reflect a real edit, not a build-pipeline auto-bump. Date-faking is detectable and discounted.
- 6.3 utm_source tagging for AI referrers. Add a small script that captures document.referrer on form submissions and tags the lead with chatgpt.com, perplexity.ai, gemini.google.com, claude.ai, or copilot.microsoft.com as the source. Without this, GA4 and most CRMs lump AI traffic into "Direct" and the conversion rate is invisible.
- 6.4 Conversion-aware page templates. Lead-capture form, quote calculator, or contact CTA visible without scrolling on every editorial and service page. AI-referred users typically arrive deeper into the consideration funnel and the friction tolerance is lower.
- 6.5 Sitemap lastmod stamps accurate. XML sitemap lastmod values updated whenever a page changes materially. Stale sitemap dates undermine the freshness signal that article:modified_time carries.
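Check 6.3 describes a browser-side script; the same classification can also run server-side on the referrer value submitted with the form. The sketch below is one possible Python version, assuming the referrer arrives as a full URL string. The function name is hypothetical, and the host list is the one given in check 6.3.

```python
from urllib.parse import urlparse

# Referrer hosts named in check 6.3.
AI_REFERRER_HOSTS = (
    "chatgpt.com",
    "perplexity.ai",
    "gemini.google.com",
    "claude.ai",
    "copilot.microsoft.com",
)


def ai_source(referrer):
    """Return the matching AI host for a referrer URL, or None if it is not one."""
    host = (urlparse(referrer).hostname or "").lower()
    for known in AI_REFERRER_HOSTS:
        # Match the host itself or any subdomain, but not look-alike domains.
        if host == known or host.endswith("." + known):
            return known
    return None
```

Store the returned value on the lead record as its source; an empty or missing referrer returns `None`, so those sessions stay in whatever bucket the analytics stack already uses.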
Section 7: Monitoring and measurement (4 checks)
Visibility is not a one-time shipping milestone. AI training cycles, retrieval-index refreshes, and competitor updates move the position of any cited business on a monthly cadence. Measurement is what separates a one-time GEO project from a compounding GEO programme.
- 7.1 Google Search Console fan-out detection. Filter GSC for queries above ten words containing Boolean operators or -site: exclusion chains. These are machine-generated sub-queries from AI systems, and their appearance is the earliest measurable signal that the site is being retrieved.
- 7.2 AI visibility tracker subscribed. Otterly, Peec AI, Profound, or a similar tool that programmatically queries ChatGPT, Perplexity, and Google AI Mode against a tracked prompt list and reports the share of voice.
- 7.3 GA4 referral source captured. Custom dimensions in GA4 capturing the AI-referrer utm_source values from Section 6 so AI-referred sessions are visible in standard reports.
- 7.4 Monthly AI share-of-voice report. One document each month containing the prompt list, the AI surfaces tracked, the citation rate per surface, and the delta against the previous month. Includes a list of new fan-out queries detected in GSC.
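Checks 7.1 and 7.4 reduce to a filter and a subtraction, sketched below in Python. The function names and data shapes are assumptions. The fan-out filter also assumes Boolean operators appear in upper case in the query text; if the export lowercases queries, rely on the -site: test or adapt the pattern.

```python
import re

# Upper-case operators only: lower-case "and"/"or" appear in ordinary long queries.
BOOLEAN_OPERATOR = re.compile(r"\b(AND|OR|NOT)\b")


def is_fan_out_query(query, min_words=10):
    """Flag queries longer than min_words that contain an operator or -site: exclusion."""
    has_many_words = len(query.split()) > min_words
    has_operator = bool(BOOLEAN_OPERATOR.search(query)) or "-site:" in query
    return has_many_words and has_operator


def share_of_voice(current, previous):
    """Compute citation rate per surface and its change against the previous month.

    Both arguments map a surface name to (cited_prompts, total_prompts).
    """
    def rate(counts):
        cited, total = counts
        return cited / total if total else 0.0

    report = {}
    for surface, counts in current.items():
        now = rate(counts)
        before = rate(previous.get(surface, (0, 0)))
        report[surface] = {"rate": round(now, 4), "delta": round(now - before, 4)}
    return report
```

Applied to a month of exported queries, the filter yields the list of new fan-out queries for the report, and `share_of_voice` supplies the per-surface rate and month-over-month delta.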
A worked example: APEX's own GEO audit results
APEX ran this exact framework on apexdigital.ro between March and May 2026. The technical implementation took eleven working days. Within twenty-four hours of deploying llms.txt and updating robots.txt, AI fan-out queries began appearing in Google Search Console. The full timeline and the original screenshots live in the case study.
Six measurable outcomes followed in the first ninety days. AI fan-out queries grew from zero to dozens per week. Branded SERP recovery on "apex digital alliance" moved from position 8.68 back to 1.18 after the Wikidata entity anchor was added and the Organization schema consolidated. The case study page itself reached 1,709 impressions in twenty-eight days at an average position of 8.12, and its click-through rate of 0.35 percent is the highest of any non-branded page on the site. New AI search query clusters surfaced ("llms.txt geo", "geo llms.txt", "ai search visibility audit") that did not exist in GSC before the implementation. The site has not yet been cited inside ChatGPT or Perplexity for branded queries in the tracked prompt set, which is consistent with the published twelve-week minimum for citation appearance in those tools.
The lesson from running the audit on a live site: the order matters. Sections one and two ship fastest and produce the earliest signal. Section four (structured data) is the heaviest lift but delivers the largest accuracy gain. Section seven (monitoring) is the discipline that compounds, and skipping it turns the rest of the audit into a one-time spend rather than a programme.
What APEX hands clients on the AI Search Optimization engagement
The forty-seven checks above are the framework. The full AI Search Optimization (GEO) service at 750 euros per month delivers the framework as an executed programme: an initial audit and remediation pass over the first thirty days, ongoing monthly implementation against the seven sections, and a monthly AI share-of-voice report that tracks citation rate across ChatGPT, Perplexity, Claude, and Google AI Mode.
The engagement is built for businesses that have already invested in SEO and want a parallel layer targeted at the AI surfaces specifically. It pairs cleanly with the Google Ads e-commerce playbook for stores where AI Overviews and Shopping ads overlap, and with the static website services for any business rebuilding from the ground up where extractability is easier to ship correctly the first time than retrofit later.
Frequently asked questions
What is a GEO audit?
A GEO audit is a structured review of a website against the technical, content, and authority signals that determine whether AI systems like ChatGPT, Perplexity, Claude, and Google AI Mode cite the site in their answers. It covers seven layers: crawl access permissions for AI bots, the llms.txt and llms-full.txt files, content extractability, structured data, authority and citation signals, freshness and lead capture, and ongoing AI visibility measurement. A complete audit takes around six hours for a small site and yields a prioritised remediation plan.
What is llms.txt for GEO and AEO?
llms.txt is a Markdown file placed at the root of a website that helps large language models understand the site structure, content, and authority. Proposed by Jeremy Howard of Answer.AI in September 2024, it is the de facto standard for both Generative Engine Optimization (GEO) and Answer Engine Optimization (AEO). Over 844,000 sites have adopted it, including Anthropic, Cloudflare, and Stripe. The companion file llms-full.txt contains the complete site content in a single AI-ingestible document.
How does llms.txt affect lead capture from AI search?
llms.txt does not directly create leads. It increases the probability that AI systems cite the site accurately, which compounds two stages downstream into lead capture. Sites cited by ChatGPT, Perplexity, and Google AI Mode receive AI-referred traffic that arrives pre-qualified: the user already trusts the source enough to follow the citation. APEX measured this by tagging contact-form submissions with utm_source values like chatgpt.com and perplexity.ai. AI-referred sessions converted at roughly twice the rate of organic Google traffic in the first cohort, primarily because the AI answer pre-screens for intent before the click.
What is the llms.txt specification for website AI crawlers in 2026?
The llms.txt specification is a single UTF-8 Markdown file at the site root. The structure is fixed: one H1 with the site name, one blockquote summarising the business, then H2 sections containing Markdown link lists pointing to canonical pages with a one-sentence description after the colon. Optional sections may live under a final "Optional" H2 for content that AI systems can skip in tight context windows. The companion file, llms-full.txt, holds the complete site content concatenated into one Markdown document. Both files must be served as text/markdown or text/plain, must be reachable without authentication, and must use ASCII or plain UTF-8 without curly quotes or smart-typography characters that confuse some AI tokenizers.
How do I run an AI search visibility audit?
Run the audit in seven sections. Section one is crawl access: verify robots.txt explicitly allows the fifteen known AI user agents and check server logs for actual hits. Section two is the llms.txt and llms-full.txt files at the site root. Section three is content extractability: answer-first paragraphs, semantic HTML, FAQ blocks, no critical text inside JavaScript-only rendering. Section four is structured data: Organization, Article, FAQPage, HowTo, BreadcrumbList, Service with Offer, Person, and speakable. Section five is authority signals: named authors, third-party citations, Wikidata, Google Business Profile. Section six is freshness and lead-capture instrumentation including utm_source tagging for AI referrers. Section seven is monitoring: GSC fan-out query detection, Otterly or Peec AI tracking, and monthly visibility reports.
How long does generative engine optimization take to show results?
Technical implementation takes one to three weeks. APEX detected AI fan-out queries in Google Search Console within twenty-four hours of deploying llms.txt and updating robots.txt, which is the earliest measurable signal that AI systems have begun retrieving the site. Citation appearances in ChatGPT, Perplexity, and Google AI Mode typically follow over the next four to twelve weeks as the AI training and retrieval cycles refresh. Steady-state visibility takes ninety to one hundred eighty days and requires consistent publishing and freshness maintenance, not just a one-time technical pass.
What is the difference between GEO, AEO, and SEO?
SEO (Search Engine Optimization) targets traditional ranked search results: Google blue links. AEO (Answer Engine Optimization) targets featured snippets, People Also Ask, and direct-answer boxes inside traditional search results. GEO (Generative Engine Optimization) targets being cited and recommended by AI systems like ChatGPT, Perplexity, Claude, and Google AI Overviews and AI Mode. All three share a foundation: structured data, semantic HTML, authority signals, and content quality. GEO adds new requirements: llms.txt files, explicit AI crawler permissions in robots.txt, answer-first content format, freshness signals via article modified time, and ongoing visibility tracking across multiple AI surfaces. A complete strategy runs all three layers.
Want the 47-point audit run on your site?
APEX's AI Search Optimization (GEO) service at 750 euros per month delivers this framework as an executed programme: initial audit, monthly implementation, and a monthly AI share-of-voice report across ChatGPT, Perplexity, Claude, and Google AI Mode. See the case study for the timeline and the GSC fan-out evidence.