Source: airops-fan-out-effect-citation-study-2026-04-13.md — AirOps Team in partnership with Kevin Indig (Growth Memo). Published 2026-04-13 on airops.com/report.
AirOps and Kevin Indig ran a large observational study of ChatGPT’s citation behavior: 16,851 unique queries, 50,553 ChatGPT responses (3 runs each), 353,799 pages scraped, and 815,484 scoring rows. The headline finding is that retrieval rank dominates everything else: rank-1 pages are cited 58.4% of the time versus 14.2% for rank-10 pages (a 4.1× gap). On-page signals (schema markup, heading match, word count, Flesch-Kincaid (FK) readability) all matter, but only at +5-11 percentage points each, which is meaningful but secondary to “did the retrieval system put your page in the candidate set?” Domain authority (DA) shows no positive correlation in this ChatGPT-only data. This is the deepest public dataset on ChatGPT citation mechanics to date.
Key Takeaways
- Retrieval rank is the dominant signal — by a factor of 4×. Rank 1: 58.4% citation rate. Rank 10: 14.2%. Winning the retrieval candidate set matters more than any individual on-page signal.
- Query fan-out is real but bounded. 88.6% of ChatGPT queries trigger exactly 2 fan-out sub-queries. 8.8% trigger zero. Only 2.5% trigger 4+. The popular “1 query → 6-10 fan-out queries” narrative is overstated for ChatGPT’s actual behavior in this dataset.
- Heading-to-query similarity matters (+10.8pp). Pages where H1-H4 headings match the user’s query at cosine similarity ≥0.90 are cited 41.0% vs 30.2% for pages at <0.50.
- JSON-LD schema shows a +6.5pp citation advantage in AirOps’s stratified analysis (38.5% vs 32.0%, independent of word count / heading count / DA / query-match score). Top types: MedicalWebPage (47.0%), BreadcrumbList (46.2%), FAQPage (45.6%).
- The schema finding is correlational and conflicts with Ahrefs’s causal result. AirOps’s stratification controls don’t isolate schema the way Ahrefs’s matched difference-in-differences design does. The two findings reconcile if schema is a marker of editorial/technical maturity rather than a causal lever. See the “Schema effect” section below.
- Focused beats exhaustive coverage (+4.2pp). Pages covering 26-50% of fan-out sub-queries beat pages covering 100%, controlling for primary similarity ≥0.8. Over-covering the topic is a drag on citation, not a help.
- Word count sweet spot 500-2,000. Pages over 5,000 words underperform pages under 500.
- FK readability 16-17 (college) wins (+6.3pp over FK <8). College-level writing beats both simple and overly academic text.
- Domain authority (DA) shows no positive correlation in ChatGPT, with a slight inverse trend in the highest DA quartile. This directly contradicts Digital Applied’s finding of DA Pearson +0.61 for Google AI Overviews (AIO). The likely explanation is that the two engines weight authority differently: ChatGPT leans heavily on retrieval-system signals, while AIO gives more weight to traditional Google ranking signals, which include DA-correlated factors.
- Page age: fresh content wins (+5.2pp). 30-89-day-old pages cited 32.8%, 5+-year-old pages cited 27.6%.
- ChatGPT-only scope. Authors explicitly flag that findings may not transfer to Google AI Mode, Perplexity, or Gemini. The retrieval-rank dominance is system-specific.
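The takeaways above mix two kinds of comparison: a relative ratio (the 4.1× rank gap) and absolute percentage-point gaps (the on-page signals). As a minimal sketch, the snippet below recomputes both from the rates quoted in the list; the dictionary labels and function names are illustrative, not taken from the study.

```python
# Citation rates (percent) exactly as quoted in the takeaways above.
RATES = {
    "retrieval rank (1 vs 10)": (58.4, 14.2),
    "heading similarity (>=0.90 vs <0.50)": (41.0, 30.2),
    "JSON-LD schema (with vs without)": (38.5, 32.0),
    "page age (30-89 days vs 5+ years)": (32.8, 27.6),
}


def gap_pp(high: float, low: float) -> float:
    """Absolute gap in percentage points, rounded to one decimal."""
    return round(high - low, 1)


def gap_ratio(high: float, low: float) -> float:
    """Relative gap: how many times larger the higher rate is."""
    return round(high / low, 1)


for label, (high, low) in RATES.items():
    print(f"{label}: +{gap_pp(high, low)}pp, {gap_ratio(high, low)}x")
```

Expressed in the same unit, the rank-1 vs rank-10 gap is 44.2pp, several times larger than any of the on-page gaps, which is the sense in which retrieval rank dominates.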
Schema effect: AirOps says +6.5pp, Ahrefs says no causal lift
- AirOps says (this article, correlational, ChatGPT, stratified controls): JSON-LD schema independently predicts a +6.5pp citation rate (38.5% vs 32.0%).
- Ahrefs says (ChatGPT): adding schema mid-period produces no statistically meaningful citation lift on any AI surface.
- Reconciliation: correlational evidence shows schema-using pages are cited more; causal evidence shows adding schema doesn’t cause the lift. The most parsimonious interpretation is that schema is a marker of editorial / technical / publication-infrastructure maturity that correlates with citation, not the lever itself. AirOps’s stratification controls for word count / heading count / DA / query-match but cannot match on unobserved publisher quality dimensions, so both can be simultaneously true.
- Status: resolved (2026-05-19) as a methodological difference, not a factual one.
Methodology
- Scale: 16,851 queries × 3 ChatGPT runs = 50,553 responses, 353,799 pages scraped, 815,484 scoring rows.
- Capture: UI scraping (not API).
- Embedding model: BAAI/bge-base-en-v1.5, 768 dimensions, page H1-H4 vs query embeddings.
- Design: Observational with stratification controls (e.g., holding primary similarity constant when isolating other signals). NOT a matched difference-in-differences or randomized assignment design.
- Engine: ChatGPT only.
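The heading-vs-query comparison described above comes down to cosine similarity between embedding vectors. The sketch below shows that calculation in plain Python, assuming the vectors have already been produced by an embedding model (the study used BAAI/bge-base-en-v1.5 at 768 dimensions). The toy 3-dimensional vectors, the function names, the "middle" band label, and the choice to take the best-matching heading are all assumptions for illustration; the note does not say how AirOps aggregated multiple headings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def best_heading_match(query_vec, heading_vecs) -> float:
    """Highest query-to-heading similarity (assumption: max aggregation)."""
    return max((cosine_similarity(query_vec, h) for h in heading_vecs), default=0.0)


def similarity_band(score: float) -> str:
    """Bucket a score using the two thresholds quoted in the takeaways."""
    if score >= 0.90:
        return "high (>=0.90)"
    if score < 0.50:
        return "low (<0.50)"
    return "middle"


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
headings = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
score = best_heading_match(query, headings)
print(f"{score:.3f} -> {similarity_band(score)}")
```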
Practical Implications
The retrieval-rank finding (4.1× gap between rank-1 and rank-10) is the dominant practical takeaway: the page-level signals AirOps measures are “table stakes for AI visibility, not differentiators.” None of them can overcome poor retrieval rank or weak query-page relevance. The practitioner playbook AirOps proposes:
- Optimize the page’s retrieval candidacy first — ensure indexability, quality content, semantic relevance. Don’t ship schema before fixing retrieval.
- Match headings to query intent: write H1-H4 headings that literally answer the query at high cosine similarity.
- Use focused content, not exhaustive coverage — covering 26-50% of fan-out sub-queries deeply beats covering 100% shallowly.
- Word count 500-2,000, FK grade 14-17, 4-10 H2-H4 headings — the sweet spots for articles.
- Ship JSON-LD schema — at minimum FAQPage / BreadcrumbList / Article for editorial pages, MedicalWebPage for healthcare. Ship for the parseability benefit AirOps measures, but don’t treat schema as a citation lever in isolation — Ahrefs’s causal study shows adding it to a page doesn’t cause the lift.
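For the schema item in the playbook above, a FAQPage block can be generated rather than hand-written. This is a minimal sketch using the schema.org FAQPage / Question / Answer vocabulary; the helper names and the placeholder question and answer are hypothetical, and real pages should only mark up FAQ content that actually appears on the page.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def as_script_tag(data) -> str:
    """Serialize to a JSON-LD script element for the page's HTML."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so answer text cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder content, not taken from the study.
example = faq_jsonld([("Example question?", "Example answer.")])
print(as_script_tag(example))
```

Generating the markup from the same source as the visible FAQ keeps the two in sync, which matters more than the markup itself if schema is a maturity marker rather than a lever.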
Open Questions
- Why does ChatGPT diverge from AIO on DA? ChatGPT shows no positive DA correlation (slight inverse). Digital Applied’s AIO study shows DA Pearson +0.61. Two engines, opposite findings on the same signal. Hypotheses: ChatGPT relies primarily on the underlying retrieval system (publicly Bing) which weights DA-proxies less; AIO inherits Google ranking signals which encode DA proxies heavily.
- Schema causation vs correlation. The two methodologies (AirOps’s stratification and Ahrefs’s matched DiD) disagree on whether schema has a causal effect. A definitive resolution likely requires a randomized intervention study, which is impractical at scale.
- Two-query fan-out. Why does ChatGPT fan out to exactly 2 sub-queries 88.6% of the time? The authors don’t speculate. ^[inferred] Likely a configured rather than emergent parameter.
Related
- Ahrefs Schema → AI Citations Causal Study — Matched DiD on 1,885 pages adding schema. Causal counterpoint to AirOps’s correlational +6.5pp finding.
- Zyppy AI Citation Ranking Factors Meta-Analysis — Cyrus Shepard’s 54-study aggregation. Lists Search Rank #2 (9.7/10) and Fan-out Rank #3 (9.3/10) — AirOps is the deepest single empirical backing for that #2 + #3 weight.
- Digital Applied 1,000 AIO Citation Pattern Study — Companion correlational study on AIO. Schema lift 2.3× there vs +6.5pp here; the magnitude difference is the AIO-vs-ChatGPT divergence.
- GEO-16 Framework (arXiv 2509.10762v1) — Academic correlational study on Brave/AIO/Perplexity. Structured Data r=0.63 corroborates schema-correlated-with-citation across engines.
- Google’s Generative AI Search Optimization Guide — Google’s official position aligns: AI Overviews + AI Mode use the same Search index, so winning retrieval rank IS winning AI citation candidacy.
- FLUQs Framework — Citation Labs’ content-strategy framework. AirOps’s heading-match-and-focused-coverage findings give the empirical backing for FLUQs’s “structure facts to survive LLM compression” thesis.
- GSC Autonomous SEO Engine — Operationalizes the retrieval-rank-first playbook AirOps validates.
Try It
- Pull ChatGPT’s retrieval rank for your top 20 queries. If you’re not in the top 10 candidate set, no on-page tactic will rescue you. Fix retrieval candidacy first.
- Audit H1-H4 cosine similarity against your target queries. Use any sentence-transformer model (BAAI/bge-base-en-v1.5 is what AirOps used) to score heading-to-query similarity. Rewrite headings under 0.50.
- Resist over-coverage. Cover 26-50% of the fan-out sub-queries deeply; don’t try to cover all of them. Google AI Mode’s expanded query view can give a rough picture of what a fan-out looks like, though the study measured ChatGPT only and its fan-out may differ.
- Word-count audit. Trim pages over 5,000 words to 2,000-3,000. Length is a drag past the sweet spot.
- FK readability check. Most enterprise SEO content sits at FK 10-13. Push toward 16-17, the band with the highest citation rate in this study (+6.3pp over FK <8).
- Schema as table stakes, not lever. Ship FAQPage / Article / BreadcrumbList / HowTo where editorially valid. Don’t expect schema alone to move citations — the causal evidence (Ahrefs) is against it.
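The word-count and FK checks in the list above can be approximated with the standard library alone. The sketch below uses the published Flesch-Kincaid grade formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, with a rough vowel-group syllable heuristic. That heuristic is an assumption and will drift from dictionary-based tools, so treat the scores as directional. The band labels follow the word-count thresholds quoted in this note; the function names are illustrative.

```python
import re


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; exact counts need a pronunciation dictionary."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Drop a likely-silent trailing "e" (but not "-le" or "-ee" endings).
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text: str):
    """Return (sentence count, word count, syllable count) for plain text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level; 0.0 for empty input."""
    sentences, words, syllables = text_stats(text)
    if sentences == 0 or words == 0:
        return 0.0
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def word_count_band(n: int) -> str:
    """Bucket a word count using the thresholds quoted in this note."""
    if n < 500:
        return "under 500"
    if n <= 2000:
        return "500-2,000 (sweet spot)"
    if n <= 5000:
        return "2,001-5,000"
    return "over 5,000"


sample = "Retrieval rank gates candidacy. Structure decides selection."
_, n_words, _ = text_stats(sample)
print(word_count_band(n_words), round(fk_grade(sample), 1))
```

Run it over extracted body text only (no navigation or footer), since boilerplate inflates word counts and distorts sentence length.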
Refresh — AirOps “From Retrieved to Cited” commercial-content companion (added 2026-05-19)
AirOps published a commercial-content companion to this study — “From Retrieved to Cited: How Commercial Content Earns Citations in AI Search” (airops-from-retrieved-to-cited-2026-05-19.md). Where the April 13 study mapped retrieval→citation mechanics, this one holds retrieval constant and asks which page structures earn citations at each buyer-journey stage (awareness → consideration → comparison → validation). The two reconcile cleanly: retrieval rank gates candidacy; structure decides selection.
- Comparison pages with 3 tables earn +25.7% more citations — the single strongest lift in the study. Versus / competitor-comparison pages relying on prose underperform; structured tables (pricing, features, limitations, tradeoffs) give AI search a cleaner side-by-side format. Audit these first.
- Validation pages with 8 list sections earn up to +26.9% more citations. More broadly, commercial pages with 7-26 list sections were +6% to +15.2% more likely to be cited — lists are the strongest shared signal across every journey stage.
- Early-discovery / awareness pages with 5-7 statistics earn +20% higher citation likelihood. Grounding category-introduction claims in data gives AI search more confidence to cite.
- Shortlist pages averaging ≤10 words/sentence earn +18.8% more citations; pages averaging 11-14 words/sentence earn ~+7%. Reinforces the FK-readability and “easy to read, parse, extract” findings from the main study.
- Through-line: content that is easier to read, parse, and extract performs better at every stage. The lever is structure (tables, lists, short sentences, inline stats), not length — the commercial-page-specific corroboration of this study’s “focused beats exhaustive” finding.
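The structural signals in the list above (tables, list sections, sentence length, inline statistics) can be counted from a page's HTML as a first-pass audit. The sketch below uses only Python's standard `html.parser`. How AirOps defined a "list section" or a "statistic" is not stated here, so counting top-level `<ul>`/`<ol>` elements and numeric tokens are stand-in assumptions, and the class and key names are hypothetical.

```python
import re
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Counts tables and top-level lists and collects visible text."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.lists = 0
        self._list_depth = 0
        self._skip_depth = 0  # inside <script> or <style>
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1
        elif tag in ("ul", "ol"):
            if self._list_depth == 0:
                self.lists += 1  # nested lists are not counted separately
            self._list_depth += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._list_depth > 0:
            self._list_depth -= 1
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def audit(html: str) -> dict:
    """Summarize structural signals for one page of HTML."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    # Split on sentence punctuation followed by whitespace or end of text,
    # so decimals such as "25.7%" are not treated as sentence breaks.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    words = text.split()
    return {
        "tables": parser.tables,
        "top_level_lists": parser.lists,
        "numeric_tokens": len(re.findall(r"\d[\d,.]*%?", text)),  # rough proxy for statistics
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }


print(audit("<p>One two three. Four five six.</p><table></table><ul><li>a</li></ul>"))
```

The counts are only a screening aid: a page with three tables of filler will score the same as one with three useful comparison tables.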