Source: airops-fan-out-effect-citation-study-2026-04-13.md — AirOps Team in partnership with Kevin Indig (Growth Memo). Published 2026-04-13 on airops.com/report.

AirOps and Kevin Indig ran an unusually large observational study of ChatGPT’s citation behavior: 16,851 unique queries, 50,553 ChatGPT responses (3 runs each), 353,799 pages scraped, 815,484 scoring rows. The headline finding is that retrieval rank dominates everything else: rank-1 pages are cited 58.4% of the time, rank-10 pages 14.2% (a 4.1× gap). On-page signals (schema markup, heading match, word count, Flesch-Kincaid (FK) readability) all matter, but at magnitudes of +5 to +11 percentage points: meaningful, but secondary to “did the retrieval system put your page in the candidate set?” Domain authority (DA) shows no positive correlation in this ChatGPT-only data. This is the deepest public dataset on ChatGPT citation mechanics to date.
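The headline comparison mixes two units, a ratio (4.1×) and percentage points (pp). A minimal sketch of the arithmetic, using only the two citation rates quoted above, puts the rank gap in the same unit as the on-page effects:

```python
# Citation rates quoted in the summary above (percent).
rank_1_rate = 58.4
rank_10_rate = 14.2

# Relative gap: how many times more often rank-1 pages are cited.
ratio = rank_1_rate / rank_10_rate

# Absolute gap in percentage points (pp), the unit used for on-page signals.
gap_pp = rank_1_rate - rank_10_rate

print(f"rank-1 vs rank-10: {ratio:.1f}x, {gap_pp:.1f}pp")
```

Expressed in percentage points, the rank-1 to rank-10 gap is several times larger than any single on-page effect in the +5 to +11pp range.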

Key Takeaways

  • Retrieval rank is the dominant signal, by roughly 4×. Rank 1: 58.4% citation rate. Rank 10: 14.2%. Winning the retrieval candidate set matters more than any individual on-page signal.
  • Query fan-out is real but bounded. 88.6% of ChatGPT queries trigger exactly 2 fan-out sub-queries. 8.8% trigger zero. Only 2.5% trigger 4+. The popular “1 query → 6-10 fan-out queries” narrative is overstated for ChatGPT’s actual behavior in this dataset.
  • Heading-to-query similarity matters (+10.8pp). Pages where H1-H4 headings match the user’s query at cosine similarity ≥0.90 are cited 41.0% vs 30.2% for pages at <0.50.
  • JSON-LD schema shows a +6.5pp citation advantage in AirOps’s stratified analysis (38.5% vs 32.0%, independent of word count / heading count / DA / query-match score). Top types: MedicalWebPage (47.0%), BreadcrumbList (46.2%), FAQPage (45.6%).
  • The schema finding is correlational and should be read alongside Ahrefs’s contradicting causal result. AirOps’s stratification controls don’t isolate schema the way Ahrefs’s matched difference-in-differences design does. The two findings reconcile if schema is a marker of editorial/technical maturity rather than a causal lever. See the contradiction callout below.
  • Focused beats exhaustive coverage (+4.2pp). Pages covering 26-50% of fan-out sub-queries beat pages covering 100%, controlling for primary similarity ≥0.8. Over-covering the topic is a drag on citation, not a help.
  • Word count sweet spot 500-2,000. Pages over 5,000 words underperform pages under 500.
  • FK readability 16-17 (college) wins (+6.3pp over FK <8). College-level writing beats both simple and overly academic text.
  • Domain authority shows no positive correlation in ChatGPT, with a slight inverse trend in the highest DA quartile. This runs opposite to Digital Applied’s Google AI Overviews (AIO) finding of a DA Pearson correlation of +0.61. The likely explanation is that the two engines weight authority differently: ChatGPT leans heavily on retrieval-system signals, while AIO gives more weight to traditional Google ranking signals, which include DA-correlated factors.
  • Page age: fresh content wins (+5.2pp). 30-89-day-old pages cited 32.8%, 5+-year-old pages cited 27.6%.
  • ChatGPT-only scope. Authors explicitly flag that findings may not transfer to Google AI Mode, Perplexity, or Gemini. The retrieval-rank dominance is system-specific.
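The readability and word-count takeaways above can be checked on your own pages. The sketch below uses the standard Flesch-Kincaid grade formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The vowel-run syllable counter is a rough heuristic, and the band labels between 2,000 and 5,000 words are assumptions for illustration; the study's own tooling is not described in this summary.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level of a block of plain text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def word_count_band(n_words: int) -> str:
    """Bands around the 500-2,000 sweet spot (middle band is an assumption)."""
    if n_words < 500:
        return "under 500"
    if n_words <= 2000:
        return "500-2,000"
    if n_words <= 5000:
        return "2,001-5,000"
    return "over 5,000"


simple = "The cat sat. The dog ran."
dense = "Organizational infrastructure considerations necessitate comprehensive methodological evaluation."
print(round(fk_grade(simple), 1), round(fk_grade(dense), 1), word_count_band(1200))
```

Heuristic syllable counts drift from dictionary-based ones, so treat the grade as a relative indicator across your own pages rather than an exact match to the FK values reported in the study.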

Schema effect: AirOps says +6.5pp, Ahrefs says no causal lift

  • AirOps says (this article; correlational, ChatGPT, stratified controls): JSON-LD schema independently predicts a +6.5pp citation rate (38.5% vs 32.0%).
  • Ahrefs says (ChatGPT): adding schema mid-period produces no statistically meaningful citation lift on any AI surface.
  • Reconciliation: correlational evidence shows schema-using pages are cited more; causal evidence shows that adding schema doesn’t cause the lift. The most parsimonious interpretation is that schema is a marker of editorial / technical / publication-infrastructure maturity that correlates with citation, not the lever itself. AirOps’s stratification controls for word count / heading count / DA / query-match but cannot match on unobserved publisher-quality dimensions. Both findings can be true simultaneously.
  • Status: resolved (2026-05-19). The disagreement is methodological, not factual.
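The reconciliation rests on the difference between a cross-sectional gap and a difference-in-differences (DiD) estimate. The sketch below shows how both can come out of the same data. All four rates are hypothetical numbers chosen for illustration; they are not taken from AirOps or Ahrefs.

```python
def diff_in_diff(treated_pre: float, treated_post: float,
                 control_pre: float, control_post: float) -> float:
    """Change in the treated group minus change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Hypothetical citation rates in percent (illustration only).
# "Treated" pages add schema between the two periods; "control" pages never do.
treated_pre, treated_post = 38.0, 39.0
control_pre, control_post = 32.0, 33.0

# Cross-sectional view: compare schema vs non-schema pages at one point in time.
cross_sectional_gap = treated_post - control_post

# Causal view: did adding schema change the treated pages' trajectory?
did_estimate = diff_in_diff(treated_pre, treated_post, control_pre, control_post)

print(f"cross-sectional gap: {cross_sectional_gap:.1f}pp, DiD estimate: {did_estimate:.1f}pp")
```

In this toy setup the schema pages were already ahead before adding schema, so the cross-sectional gap is large while the DiD estimate is zero, which is the pattern the "marker of maturity" interpretation predicts.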

Methodology

  • Scale: 16,851 queries × 3 ChatGPT runs = 50,553 responses, 353,799 pages scraped, 815,484 scoring rows.
  • Capture: UI scraping (not API).
  • Embedding model: BAAI/bge-base-en-v1.5, 768 dimensions, page H1-H4 vs query embeddings.
  • Design: Observational with stratification controls (e.g., holding primary similarity constant when isolating other signals). NOT a matched difference-in-differences or randomized assignment design.
  • Engine: ChatGPT only.
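A stratified comparison of the kind described in the Design bullet can be sketched as follows: compare citation rates with and without a signal inside each stratum, then average the within-stratum gaps weighted by stratum size. This is an illustrative assumption about the general technique, not AirOps's actual procedure, and the rows are hypothetical.

```python
from collections import defaultdict


def stratified_gap(rows) -> float:
    """rows: iterable of (stratum, has_signal, cited).

    Returns the size-weighted mean of within-stratum citation-rate
    differences (signal minus no signal), in percentage points.
    """
    # Per stratum and group: [cited_count, total_count]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for stratum, has_signal, cited in rows:
        cell = counts[stratum][bool(has_signal)]
        cell[0] += int(cited)
        cell[1] += 1

    weighted_sum, weight = 0.0, 0
    for groups in counts.values():
        (c1, n1), (c0, n0) = groups[True], groups[False]
        if n1 == 0 or n0 == 0:
            continue  # no comparison group in this stratum
        n = n1 + n0
        weighted_sum += n * (c1 / n1 - c0 / n0)
        weight += n
    if weight == 0:
        raise ValueError("no stratum contains both groups")
    return 100 * weighted_sum / weight


# Hypothetical rows: (word-count stratum, has JSON-LD, was cited)
rows = [
    ("short", True, True), ("short", True, False),
    ("short", False, False), ("short", False, False),
    ("long", True, True), ("long", True, False),
    ("long", False, True), ("long", False, False),
]
print(stratified_gap(rows))
```

Stratifying removes confounding only from the variables used to form the strata, which is why the design note above distinguishes it from matched or randomized designs.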

Practical Implications

The retrieval-rank finding (4.1× gap between rank-1 and rank-10) is the dominant practical takeaway: the page-level signals AirOps measures are “table stakes for AI visibility, not differentiators.” None of them can overcome poor retrieval rank or weak query-page relevance. The practitioner playbook AirOps proposes:

  1. Optimize the page’s retrieval candidacy first — ensure indexability, quality content, semantic relevance. Don’t ship schema before fixing retrieval.
  2. Match headings to query intent: write H1-H4 headings that literally answer the query at high cosine similarity.
  3. Use focused content, not exhaustive coverage — covering 26-50% of fan-out sub-queries deeply beats covering 100% shallowly.
  4. Word count 500-2,000, FK grade 14-17, 4-10 H2-H4 headings — the sweet spots for articles.
  5. Ship JSON-LD schema — at minimum FAQPage / BreadcrumbList / Article for editorial pages, MedicalWebPage for healthcare. Ship for the parseability benefit AirOps measures, but don’t treat schema as a citation lever in isolation — Ahrefs’s causal study shows adding it to a page doesn’t cause the lift.
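For step 5, a minimal sketch of generating FAQPage JSON-LD from question/answer pairs. The structure follows the schema.org FAQPage, Question, and Answer types; the helper names and the sample content are hypothetical.

```python
import json


def faq_jsonld(pairs) -> dict:
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def as_script_tag(data: dict) -> str:
    """Serialize JSON-LD into a script element for the page head or body."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so the payload cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical content for illustration.
faq = faq_jsonld([("What is query fan-out?", "Sub-queries issued for one prompt.")])
print(as_script_tag(faq))
```

The markup should mirror questions and answers that are visibly on the page; as noted above, it is best treated as hygiene rather than a citation lever.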

Open Questions

  • Why does ChatGPT diverge from AIO on DA? ChatGPT shows no positive DA correlation (slight inverse). Digital Applied’s AIO study shows DA Pearson +0.61. Two engines, opposite findings on the same signal. Hypotheses: ChatGPT relies primarily on the underlying retrieval system (publicly Bing) which weights DA-proxies less; AIO inherits Google ranking signals which encode DA proxies heavily.
  • Schema causation vs correlation. The two methodologies (AirOps stratification vs Ahrefs matched DiD) point in different directions on whether schema has a causal effect. Resolution likely requires a randomized intervention study (impractical at scale).
  • 2-step fan-out. Why does ChatGPT preferentially fan out to exactly 2 sub-queries 88.6% of the time? Authors don’t speculate. ^[inferred] Likely a configured-not-emergent parameter.

Try It

  1. Pull ChatGPT’s retrieval rank for your top 20 queries. If you’re not in the top 10 candidate set, no on-page tactic will rescue you. Fix retrieval candidacy first.
  2. Audit H1-H4 cosine similarity against your target queries. Use any sentence-transformer model (BAAI/bge-base-en-v1.5 is what AirOps used) to score heading-to-query similarity. Rewrite headings under 0.50.
  3. Resist over-coverage. Cover 26-50% of the fan-out sub-queries deeply; don’t try to cover all of them. Use AI Mode’s expanded query view as a rough proxy for what the fan-out looks like, keeping in mind that the study itself covers ChatGPT only.
  4. Word-count audit. Trim pages over 5,000 words to 2,000-3,000. Length is a drag past the sweet spot.
  5. FK readability check. Most enterprise SEO content sits at FK 10-13. Push toward 16-17, the band with the highest citation rate in this study.
  6. Schema as table stakes, not lever. Ship FAQPage / Article / BreadcrumbList / HowTo where editorially valid. Don’t expect schema alone to move citations — the causal evidence (Ahrefs) is against it.
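Steps 2 and 3 can be sketched with plain cosine similarity over heading and query embeddings. Several details here are assumptions: taking the maximum over headings as the page score, the 0.8 threshold for counting a sub-query as covered, and the `toy_embed` stand-in, which exists only so the sketch runs without a model download.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Embedder = Callable[[str], Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_heading_score(query: str, headings: Sequence[str], embed: Embedder) -> float:
    """Highest heading-to-query similarity on the page (max is an assumption)."""
    q = embed(query)
    return max((cosine(q, embed(h)) for h in headings), default=0.0)


def band(score: float) -> str:
    """Bands from the summary: >=0.90 is the top group, <0.50 the bottom."""
    if score >= 0.90:
        return "high"
    if score < 0.50:
        return "low (rewrite candidate)"
    return "middle"


def coverage_share(sub_queries: Sequence[str], headings: Sequence[str],
                   embed: Embedder, threshold: float = 0.8) -> float:
    """Share of fan-out sub-queries matched by at least one heading.

    The 0.8 threshold is a hypothetical cut-off, not the study's definition.
    """
    if not sub_queries:
        return 0.0
    covered = sum(best_heading_score(sq, headings, embed) >= threshold for sq in sub_queries)
    return covered / len(sub_queries)


def toy_embed(text: str) -> Vector:
    """Stand-in embedder: word counts over a tiny vocabulary (illustration only)."""
    vocab = ["best", "crm", "small", "business", "pricing"]
    tokens = text.lower().split()
    return [float(tokens.count(w)) for w in vocab]


headings = ["Best CRM for small business", "Pricing"]
score = best_heading_score("best crm small business", headings, toy_embed)
share = coverage_share(["pricing", "best crm"], headings, toy_embed)
print(band(score), share)
```

For a real audit, `toy_embed` would be replaced by a sentence-embedding model. With the sentence-transformers package installed, one option is `model = SentenceTransformer("BAAI/bge-base-en-v1.5")` and passing `model.encode` as `embed`.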

Refresh — AirOps “From Retrieved to Cited” commercial-content companion (added 2026-05-19)

AirOps published a commercial-content companion to this study — “From Retrieved to Cited: How Commercial Content Earns Citations in AI Search” (airops-from-retrieved-to-cited-2026-05-19.md). Where the April 13 study mapped retrieval→citation mechanics, this one holds retrieval constant and asks which page structures earn citations at each buyer-journey stage (awareness → consideration → comparison → validation). The two reconcile cleanly: retrieval rank gates candidacy; structure decides selection.

  • Comparison pages with 3 tables earn 25.7% more citations, the single strongest lift in the study. Versus / competitor-comparison pages that rely on prose underperform; structured tables (pricing, features, limitations, tradeoffs) give AI search a cleaner side-by-side format. Audit these first.
  • Validation pages with 8 list sections earn up to 26.9% more citations. More broadly, commercial pages with 7-26 list sections were 6% to 15.2% more likely to be cited; lists are the strongest shared signal across every journey stage.
  • Early-discovery / awareness pages with 5-7 statistics are 20% more likely to be cited. Grounding category-introduction claims in data gives AI search more confidence to cite.
  • Shortlist pages averaging ≤10 words/sentence earn 18.8% more citations; pages averaging 11-14 words/sentence earn about 7% more. This reinforces the FK-readability and “easy to read, parse, extract” findings from the main study.
  • Through-line: content that is easier to read, parse, and extract performs better at every stage. The lever is structure (tables, lists, short sentences, inline stats), not length — the commercial-page-specific corroboration of this study’s “focused beats exhaustive” finding.
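The structural signals in this companion study (tables, lists, sentence length) can be roughly audited from a page's HTML with the standard library. Counting every `ul`/`ol` element as one "list section" is an assumption, since the summary does not give AirOps's definition, and the sentence splitter is deliberately naive.

```python
import re
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Counts tables and lists and collects visible text from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.tables = 0
        self.lists = 0
        self._hidden_depth = 0  # inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1
        elif tag in ("ul", "ol"):
            self.lists += 1  # nested lists are counted separately
        elif tag in ("script", "style"):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self._chunks.append(data)

    def text(self) -> str:
        return " ".join(self._chunks)


def audit(html: str) -> dict:
    """Return table count, list count, and average words per sentence."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    # Naive split: decimals such as "25.7" will be treated as sentence breaks.
    sentences = [s for s in re.split(r"[.!?]+", parser.text()) if s.strip()]
    words = sum(len(s.split()) for s in sentences)
    return {
        "tables": parser.tables,
        "lists": parser.lists,
        "avg_words_per_sentence": words / len(sentences) if sentences else 0.0,
    }


sample = "<table></table><ul><li>One two three.</li></ul><p>Four five six seven eight.</p>"
print(audit(sample))
```

Running this across comparison and validation pages gives a quick shortlist of prose-heavy pages to restructure first.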