
Layer 1 - Crawl Access

Weight: 25% of open score

Layer 1 checks whether AI crawlers can reach the site and whether the site has explicitly declared its stance.


robots-present - robots.txt present

Possible statuses: pass / fail / warn

Fetches {domain}/robots.txt with Accept: text/plain. Passes if the file is present and contains recognisable directives (User-agent, Disallow, Allow, Sitemap, Crawl-delay). Content takes priority over Content-Type - some servers misconfigure the MIME type but serve valid content.

Fails if the URL returns HTTP 4xx/5xx, a network error, or an HTML page (indicating a WAF or SPA fallback intercept).
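The pass and fail rules above can be sketched as a pure classification function. This is a minimal illustration, not the checker's actual code: `classify_robots` and both regexes are hypothetical names, and because the text does not say when `warn` applies, the sketch assumes it covers a reachable, non-HTML file with no recognisable directives.

```python
import re

# Directive names the check treats as recognisable (taken from the description above).
DIRECTIVE_RE = re.compile(
    r"^\s*(user-agent|disallow|allow|sitemap|crawl-delay)\s*:",
    re.IGNORECASE | re.MULTILINE,
)
# Heuristic for "the body is an HTML page" (assumption: doctype or <html> at the start).
HTML_RE = re.compile(r"\s*(<!doctype html|<html)", re.IGNORECASE)


def classify_robots(status_code, body):
    """Return 'pass', 'warn' or 'fail' for a robots.txt response.

    status_code is None when the request failed at the network level.
    """
    if status_code is None or status_code >= 400:
        return "fail"  # network error or HTTP 4xx/5xx
    if HTML_RE.match(body):
        return "fail"  # WAF challenge page or SPA fallback
    if DIRECTIVE_RE.search(body):
        return "pass"  # content decides, whatever Content-Type said
    return "warn"  # assumption: reachable but no recognisable directives
```

Note that the Content-Type header is never consulted, which mirrors the "content takes priority" rule.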

Why it matters: robots.txt is the foundational crawl contract. Without it, AI crawlers have no explicit guidance and must make assumptions. A missing file doesn’t block crawling, but whether the file exists signals how deliberate a site’s AI stance is.


sitemap-in-robots - Sitemap linked from robots.txt

Possible statuses: pass / warn

Checks for a Sitemap: directive in robots.txt.

Why it matters: Linking to the sitemap from robots.txt is standard practice and ensures crawlers can discover all content from a single starting point, without relying on link traversal.


robots-{agent} - Per-agent access

Possible statuses: pass (allowed) / warn (blocked)

One check per tracked AI agent. In open mode, blocked agents score as warn (half point); in blocking mode the same data feeds the blocking score formula separately.

Tracked agents

| Agent ID | Operator |
|---|---|
| GPTBot | OpenAI |
| ClaudeBot | Anthropic |
| anthropic-ai | Anthropic |
| PerplexityBot | Perplexity |
| Googlebot-Extended | Google |
| CCBot | Common Crawl |
| omgilibot | Omgili/Webz.io |
| FacebookBot | Meta |

Blocking logic

An agent is considered blocked if it has a specific User-agent: block with Disallow: / or Disallow: /*. If no specific block exists, the wildcard User-agent: * rules apply. If neither exists, the agent is treated as allowed.
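A minimal sketch of this lookup, assuming a simplified robots.txt parser that only tracks `Disallow` values per user-agent group. The function names `parse_groups` and `is_blocked` are illustrative, not the checker's own.

```python
def parse_groups(robots_txt):
    """Map each lower-cased user-agent token to its list of Disallow values."""
    groups = {}
    current_agents = []
    seen_rule = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("disallow", "allow", "crawl-delay"):
            seen_rule = True
            if field == "disallow":
                for agent in current_agents:
                    groups[agent].append(value)
    return groups


def is_blocked(robots_txt, agent):
    """Apply the agent's own group if present, otherwise the wildcard group."""
    groups = parse_groups(robots_txt)
    rules = groups.get(agent.lower())
    if rules is None:
        rules = groups.get("*", [])  # no group at all means allowed
    return any(rule in ("/", "/*") for rule in rules)
```

For example, with `User-agent: *` followed by `Disallow: /` and a separate `User-agent: ClaudeBot` group containing only `Allow: /`, `is_blocked(txt, "ClaudeBot")` is `False` while `is_blocked(txt, "GPTBot")` is `True`, because the specific group takes precedence over the wildcard.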

Why it matters: Many sites block all bots via User-agent: * intending to block scrapers, but this also blocks legitimate AI crawlers. Each agent check makes this visible individually.


llms-txt-present - llms.txt present

Possible statuses: pass / fail

Fetches {domain}/llms.txt with Accept: text/plain. Fails if the file is missing, returns an error status, or the response body is HTML.

Why it matters: llms.txt is an emerging standard for sites to communicate their AI stance and capabilities to LLM-based crawlers - analogous to robots.txt but purpose-built for AI agents. Its presence signals a deliberate, forward-looking approach to AI readiness.


llms.txt is validated against the official spec. The table below maps each criterion to its check:

| Criterion | Spec level | Check ID |
|---|---|---|
| H1 title present | Required | llms-txt-format |
| Blockquote summary | Recommended | llms-txt-blockquote |
| H2 sections present | Recommended | llms-txt-sections |
| File is concise | Implied | llms-txt-concise |
| Contains links | Expected | llms-txt-links |
| Links have descriptive anchor text | Recommended | llms-txt-link-quality |

llms-txt-format - llms.txt has H1 title

Possible statuses: pass / fail

Checks that the file begins with an H1 heading (# Title). This is the only element the spec marks as required. A missing H1 is a fail — there is no partial credit for omitting the one mandatory field.


llms-txt-blockquote - llms.txt has summary blockquote

Possible statuses: pass / warn

Checks for a blockquote (> ...) after the H1 title. The spec recommends a brief summary here containing key information about the site.


llms-txt-sections - llms.txt has H2 sections

Possible statuses: pass / warn

Checks for H2 headings (## Section Name). The spec uses H2 sections to group related file lists (e.g. ## Docs, ## API, ## Optional).
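The three structural checks (H1 title, summary blockquote, H2 sections) could be combined into one pass over the file. The sketch below is an assumption about how such a check might look: `check_llms_structure` is a hypothetical name, "begins with" is read as "first non-empty line", and any `>` line after the H1 counts as the summary blockquote.

```python
import re


def check_llms_structure(text):
    """Return a status per structural check ID for an llms.txt body."""
    non_empty = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # H1: the first non-empty line must look like "# Title" (not "## ...").
    has_h1 = bool(non_empty) and re.match(r"#\s+\S", non_empty[0]) is not None

    # Blockquote: any "> ..." line after the title line.
    body = non_empty[1:] if has_h1 else non_empty
    has_blockquote = any(ln.startswith(">") for ln in body)

    # H2: at least one "## Section" heading (not "### ...").
    has_h2 = any(re.match(r"##\s+\S", ln) for ln in non_empty)

    return {
        "llms-txt-format": "pass" if has_h1 else "fail",
        "llms-txt-blockquote": "pass" if has_blockquote else "warn",
        "llms-txt-sections": "pass" if has_h2 else "warn",
    }
```

Only the required element (the H1) can produce a fail; the two recommended elements degrade to warn.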


llms-txt-concise - llms.txt is appropriately concise

Possible statuses: pass / warn / fail

The spec explicitly calls for “concise, clear language.” A file containing thousands of product SKUs, every page URL, or other bulk content defeats the purpose — it cannot be efficiently consumed by an LLM and signals a misunderstanding of the format.

| Condition | Status |
|---|---|
| ≤ 500 lines and ≤ 50 KB | pass |
| 501–2,000 lines or more than 50 KB (up to 500 KB) | warn |
| > 2,000 lines or > 500 KB | fail |
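The thresholds can be sketched as follows, checking the fail condition first so that either limit alone is enough to trigger it. The function name is hypothetical, and the sketch assumes size is measured as UTF-8 bytes with 1 KB = 1,024 bytes, which the text does not specify.

```python
def classify_concise(text):
    """Return 'pass', 'warn' or 'fail' based on line count and size."""
    line_count = len(text.splitlines())
    size_kb = len(text.encode("utf-8")) / 1024  # assumption: 1 KB = 1024 bytes

    if line_count > 2000 or size_kb > 500:
        return "fail"
    if line_count > 500 or size_kb > 50:
        return "warn"
    return "pass"
```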

llms-txt-links - llms.txt contains links

Possible statuses: pass / warn / fail

Counts markdown links ([text](url)) in the file. An excessively high link count is treated as a quality failure — the file should curate key resources, not enumerate entire sitemaps or product catalogues.

| Links | Status |
|---|---|
| 1–500 | pass |
| 501–2,000 | warn |
| > 2,000 | fail |
| 0 | warn |
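A minimal sketch of the link count, assuming a simple regex for inline markdown links. `LINK_RE` and `classify_link_count` are illustrative names, and the pattern is a simplification: it does not handle reference-style links or nested brackets.

```python
import re

# Inline markdown links of the form [text](url).
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def classify_link_count(text):
    """Return 'pass', 'warn' or 'fail' based on the number of markdown links."""
    count = len(LINK_RE.findall(text))
    if count == 0:
        return "warn"  # no links at all
    if count <= 500:
        return "pass"
    if count <= 2000:
        return "warn"
    return "fail"
```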

llms-txt-link-quality - llms.txt links have descriptive anchor text

Possible statuses: pass / warn / fail / info

Checks that link anchor text is meaningful. The primary pattern caught is bare URL anchor text — links where the anchor is the raw URL itself (e.g. [https://example.com/page](https://example.com/page)). These provide no context to an LLM.

| Poor anchor ratio | Status |
|---|---|
| 0% | pass |
| 1–20% | warn |
| > 20% | fail |
| No links | info |
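The ratio check might look like the sketch below. It assumes a "poor" anchor is one whose text starts with `http://` or `https://`, which covers the bare-URL pattern described above; any other patterns the real check catches are not modelled. It also assumes that any non-zero ratio up to 20% counts as warn. All names are hypothetical.

```python
import re

# Inline markdown links of the form [text](url).
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")
BARE_URL_RE = re.compile(r"https?://", re.IGNORECASE)


def classify_link_quality(text):
    """Return 'pass', 'warn', 'fail' or 'info' from the share of poor anchors."""
    links = LINK_RE.findall(text)
    if not links:
        return "info"  # nothing to judge

    # Assumption: an anchor is poor when its text is itself a raw URL.
    poor = sum(1 for anchor, _url in links if BARE_URL_RE.match(anchor.strip()))

    if poor == 0:
        return "pass"
    if poor * 100 <= 20 * len(links):  # integer form of ratio <= 20%
        return "warn"
    return "fail"
```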

sitemap-present - Sitemap present

Possible statuses: pass / fail

Probes candidates in this order:

  1. URL from the Sitemap: directive in robots.txt (if found)
  2. {domain}/sitemap.xml
  3. {domain}/sitemap_index.xml

Validates that the response contains <urlset or <sitemapindex XML structure. Content takes priority over Content-Type.
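The probe order and the content check can be sketched without any network code. In this illustration `sitemap_candidates` and `looks_like_sitemap` are hypothetical helpers, `domain` is assumed to include the scheme (for example `https://example.com`), and only the first `Sitemap:` directive is used.

```python
import re


def sitemap_candidates(domain, robots_txt=""):
    """Return candidate sitemap URLs in the order they would be probed."""
    candidates = []
    for line in robots_txt.splitlines():
        match = re.match(r"\s*sitemap\s*:\s*(\S+)", line, re.IGNORECASE)
        if match:
            candidates.append(match.group(1))
            break  # assumption: only the first directive is probed
    base = domain.rstrip("/")
    for url in (base + "/sitemap.xml", base + "/sitemap_index.xml"):
        if url not in candidates:  # avoid probing the same URL twice
            candidates.append(url)
    return candidates


def looks_like_sitemap(body):
    """Content check: the body must contain a sitemap root element."""
    return "<urlset" in body or "<sitemapindex" in body
```

A caller would fetch each candidate in turn and stop at the first response for which `looks_like_sitemap` returns `True`.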

Why it matters: A sitemap gives AI crawlers a complete, structured index of all pages, avoiding reliance on link discovery which may miss deep or orphaned content.


sitemap-lastmod - Sitemap has lastmod dates

Possible statuses: pass / warn

Checks for at least one <lastmod> element in the sitemap XML.

Why it matters: lastmod dates let crawlers prioritise recently changed content and avoid re-crawling unchanged pages. Without them, crawlers must either crawl everything on every run or use heuristics.


waf-detected - CDN or WAF detected

Possible statuses: warn (detected) / info (not detected)

Detects the following providers by their response headers:

| Provider | Header signals |
|---|---|
| Cloudflare | cf-ray, cf-cache-status |
| Akamai | x-akamai-transformed, akamai-origin-hop |
| Fastly | x-served-by containing cache- |
| Sucuri | x-sucuri-id, x-sucuri-cache |
| Imperva/Incapsula | x-iinfo, incap_ses cookie |
| AWS CloudFront | x-amz-cf-id, x-amz-cf-pop |

When detected, the status is warn because default WAF bot-management rules frequently block AI crawlers without the site owner realising.

Because “not detected” is not meaningful (unknown CDNs may still be in use), a clean result is info rather than pass and does not contribute to the score.
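A minimal sketch of header-based detection using the signals listed above. `detect_waf` and `waf_status` are hypothetical names, `headers` is assumed to be a plain dict of response headers, and the Imperva cookie is assumed to appear in `Set-Cookie`.

```python
def detect_waf(headers):
    """Return the providers whose header signals appear in the response."""
    h = {name.lower(): value for name, value in headers.items()}
    cookies = h.get("set-cookie", "")  # assumption: cookie signal arrives here
    found = []

    if "cf-ray" in h or "cf-cache-status" in h:
        found.append("Cloudflare")
    if "x-akamai-transformed" in h or "akamai-origin-hop" in h:
        found.append("Akamai")
    if "cache-" in h.get("x-served-by", ""):
        found.append("Fastly")
    if "x-sucuri-id" in h or "x-sucuri-cache" in h:
        found.append("Sucuri")
    if "x-iinfo" in h or "incap_ses" in cookies:
        found.append("Imperva/Incapsula")
    if "x-amz-cf-id" in h or "x-amz-cf-pop" in h:
        found.append("AWS CloudFront")
    return found


def waf_status(headers):
    """'warn' when any provider is detected, otherwise the unscored 'info'."""
    return "warn" if detect_waf(headers) else "info"
```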