Source: geo-16-framework-arxiv-kumar-palkhouski-2025-09.md — Arlen Kumar (UC Berkeley) and Leanid Palkhouski (Wrodium Research). arXiv:2509.10762v1, submitted September 13, 2025.
The first academic paper to publish a structured AEO/GEO (answer engine optimization / generative engine optimization) auditing framework derived from a multi-engine empirical study. Kumar and Palkhouski audited 1,100 URLs against 70 B2B SaaS prompts and harvested 1,702 citations across Brave Summary, Google AI Overviews, and Perplexity sonar-pro. They built GEO-16, a 16-pillar scoring framework, from the dataset. Headline: pages with a GEO score ≥0.70 and ≥12 pillar hits achieve a 78% cross-engine citation rate. The top three correlated pillars are Metadata & Freshness (r=0.68), Semantic HTML (r=0.65), and Structured Data (r=0.63), all p<0.001. The paper is the strongest academic-rigor source on AI citation correlations as of this writing, and it explicitly flags its own observational design as non-causal.
Key Takeaways
- First academic AEO/GEO citation study. Authors are affiliated with UC Berkeley and Wrodium Research; submitted to arXiv September 2025 (preprint).
- Multi-engine dataset. 1,100 URLs / 1,702 citations / 70 prompts / 16 B2B SaaS verticals. Citation distribution: Brave 36.0%, Google AIO 35.1%, Perplexity 28.9%.
- GEO-16 = 16 pillars across 6 principles. People-first content (3 pillars), structured data (3), provenance (3), freshness (2), risk management (2), RAG optimisation (3). Each pillar is scored 0-3; the aggregate is G(u) = (1/48) Σ_j b_j(u) ∈ [0,1], where b_j(u) is the band score of pillar j on page u.
- Threshold finding. Pages with GEO ≥0.70 AND ≥12 pillar hits → 78% cross-engine citation rate. This is the headline operational threshold.
- Top three correlated pillars (r values, all p<0.001):
  - Metadata & Freshness: r=0.68, 95% CI [0.64, 0.72], +47% citation impact
  - Semantic HTML: r=0.65, 95% CI [0.61, 0.69], +42%
  - Structured Data: r=0.63, 95% CI [0.59, 0.67], +39%
- Mid-tier correlations: Evidence & Citations (r=0.61, +37%), Authority & Trust (r=0.59, +35%), Internal Linking (r=0.57, +33%).
- Observational, NOT causal. Authors explicitly write: “Our observational design may suffer from unobserved confounding (internal validity)” and “we do not experimentally vary publication venues, so causal effects of off-page authority remain unverified.” This is critical context for interpreting the correlation values.
- Pairs with the Ahrefs causal contradiction on Structured Data. GEO-16’s r=0.63 / +39% is correlational; Ahrefs’s matched difference-in-differences (DiD) study found no causal schema lift. Same reconciliation as the AirOps study: schema is a marker, not a lever.
- B2B SaaS scope. All 16 verticals are SaaS-adjacent. Authors flag external validity: results may not generalize to consumer / healthcare / news / non-English content.
Structured Data: GEO-16 says r=0.63 / +39%, Ahrefs says no causal lift
GEO-16 says (this paper; cross-sectional and observational; Brave/AIO/Perplexity; B2B SaaS): Structured Data r=0.63, 95% CI [0.59, 0.67], +39% citation impact.
Ahrefs says (ChatGPT): adding schema mid-period produces no statistically meaningful citation lift.
Reconciliation: GEO-16’s authors explicitly self-flag the observational, non-causal design. The r=0.63 captures the correlation between “pages that have schema” and “pages that get cited,” but pages with schema are not a random subset of pages; they are systematically more mature on editorial and technical dimensions. Ahrefs’s matched DiD isolates the intervention (adding schema) from the publisher characteristics, and the lift disappears under that isolation. The framework’s overall threshold finding (GEO ≥0.70 + ≥12 pillar hits → 78% citation rate) is still actionable as a “what predicts citation” model, but readers should not interpret “+39% citation impact” as “add schema and citations rise 39%.”
Status: resolved (2026-05-19). The difference is methodological, not factual.
The 16 Pillars (Grouped by Principle)
| Principle | Pillars | Count |
|---|---|---|
| People-first content | UX & Readability; Claims & Accuracy; Microcontent | 3 |
| Structured data | Semantic HTML; Structured Data; Metadata & Freshness | 3 |
| Provenance | Authority & Trust; Evidence & Citations; Transparency & Ethics | 3 |
| Freshness | Metadata & Freshness; Content Depth | 2 |
| Risk management | Claims & Accuracy; Transparency & Ethics | 2 |
| RAG optimisation | Internal Linking; External Linking; Engagement & Interaction | 3 |
Pillar scoring: Each pillar receives a band score b_j(u) ∈ {0,1,2,3}. A “pillar hit” occurs when b_j(u) ≥ 2.
Aggregate GEO score: G(u) = (1/48)Σ b_j(u) ∈ [0,1]. The denominator (48) = 16 pillars × max band 3.
Individual sub-signal weights (w_j,i) within pillars are not disclosed in v1.
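The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration of the aggregate formula and the hit rule only, not the authors' implementation: the function names and the placeholder pillar keys (`p1` to `p16`) are assumptions, and the sub-signal weighting inside each pillar is left out because v1 does not disclose it.

```python
from typing import Dict

MAX_BAND = 3      # highest band score a pillar can receive
NUM_PILLARS = 16  # GEO-16 pillar count
HIT_BAND = 2      # a "pillar hit" is a band score of 2 or 3


def geo_score(bands: Dict[str, int]) -> float:
    """Return G(u) = (1/48) * sum of band scores, a value in [0, 1]."""
    if len(bands) != NUM_PILLARS:
        raise ValueError(f"expected {NUM_PILLARS} pillar scores, got {len(bands)}")
    for name, band in bands.items():
        if band not in (0, 1, 2, 3):
            raise ValueError(f"band for {name!r} must be 0-3, got {band}")
    return sum(bands.values()) / (NUM_PILLARS * MAX_BAND)


def pillar_hits(bands: Dict[str, int]) -> int:
    """Count pillars whose band score reaches the hit band."""
    return sum(1 for band in bands.values() if band >= HIT_BAND)


def meets_threshold(bands: Dict[str, int],
                    min_score: float = 0.70,
                    min_hits: int = 12) -> bool:
    """Check the paper's headline threshold: G(u) >= 0.70 and >= 12 hits."""
    return geo_score(bands) >= min_score and pillar_hits(bands) >= min_hits


# Hypothetical page with placeholder pillar keys (not the paper's labels).
example = {f"p{i}": 3 for i in range(1, 11)}         # ten pillars at band 3
example.update({f"p{i}": 2 for i in range(11, 14)})  # three pillars at band 2
example.update({f"p{i}": 0 for i in range(14, 17)})  # three pillars at band 0

print(geo_score(example), pillar_hits(example), meets_threshold(example))
```

For this hypothetical page the band scores sum to 36, so G(u) = 36/48 = 0.75 with 13 pillar hits, which clears both parts of the threshold.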
The Six Principles — Verbatim Author Guidance
People-first content
“Lead with an answer-first summary (TL;DR or key takeaways), keep paragraphs compact, use descriptive headings/lists, and mark claims versus opinions explicitly.”
Structured data
“Maintain a single <h1> and logical <h2>/<h3> hierarchy; provide valid JSON-LD (Article/TechArticle/FAQPage) with datePublished, dateModified, author, and breadcrumb where relevant; expose canonical URLs and social cards. Ensure schema matches visible content.”
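Part of this guidance can be checked mechanically. The sketch below uses only Python's standard-library `html.parser` and `json` modules to test for a single `<h1>`, skipped heading levels, and the presence of `datePublished`, `dateModified`, and `author` in any JSON-LD block. The class name, the `audit_html` function, the report keys, and the sample markup are all assumptions made for illustration; the paper does not publish an audit script, and this does not validate schema against visible content.

```python
import json
from html.parser import HTMLParser

REQUIRED_FIELDS = ("datePublished", "dateModified", "author")


class PageAudit(HTMLParser):
    """Collect heading levels and JSON-LD blocks from an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = []   # heading levels in document order, e.g. [1, 2, 3]
        self.jsonld = []     # successfully parsed JSON-LD payloads
        self._in_jsonld = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.headings.append(int(tag[1]))
        elif tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.jsonld.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                pass  # invalid JSON-LD is not counted


def audit_html(html: str) -> dict:
    parser = PageAudit()
    parser.feed(html)
    parser.close()
    levels = parser.headings
    # A level is "skipped" when a heading jumps more than one level deeper.
    skipped = any(b - a > 1 for a, b in zip(levels, levels[1:]))
    blocks = [d for d in parser.jsonld if isinstance(d, dict)]
    missing = sorted(f for f in REQUIRED_FIELDS
                     if not any(f in d for d in blocks))
    return {
        "single_h1": levels.count(1) == 1,
        "hierarchy_ok": not skipped,
        "jsonld_blocks": len(parser.jsonld),
        "missing_fields": missing,
    }


# Hypothetical page: one h1, an h2 -> h4 jump, and JSON-LD without dateModified.
sample = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article",
 "headline": "Example", "datePublished": "2025-09-13",
 "author": {"@type": "Person", "name": "Example Author"}}
</script>
</head><body>
<h1>Title</h1><h2>Section</h2><h4>Skipped level</h4>
</body></html>
"""
report = audit_html(sample)
print(report)
```

On the sample markup the report flags the h2 to h4 jump and the missing `dateModified` field, while the single-`<h1>` check passes.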
Provenance
“Cite primary sources inline, include a reference section, favour authoritative domains (.gov/.edu/standards bodies), and perform link-health checks to avoid rot/redirect loops.”
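A link-health check of the kind mentioned here can be sketched as a redirect-chain walk. The version below is an illustration under stated assumptions: `check_link`, the `lookup` callback, the status labels, and the `max_hops` limit are all hypothetical, and the HTTP request itself is left to the caller so the logic can be exercised without network access.

```python
from typing import Callable, Optional, Tuple

# lookup(url) -> (status_code, redirect_target or None).
# Hypothetical interface: the caller supplies the actual HTTP request.
Lookup = Callable[[str], Tuple[int, Optional[str]]]


def check_link(url: str, lookup: Lookup, max_hops: int = 5) -> str:
    """Classify a link as ok, broken, redirect-loop, or too-many-redirects."""
    seen = set()
    current = url
    for _ in range(max_hops + 1):
        if current in seen:
            return "redirect-loop"       # this URL was already visited
        seen.add(current)
        status, target = lookup(current)
        if 300 <= status < 400 and target:
            current = target             # follow the redirect
            continue
        if 200 <= status < 300:
            return "ok"
        return "broken"                  # 4xx/5xx, or a redirect with no target
    return "too-many-redirects"


# Hypothetical responses standing in for real HTTP lookups.
fake_web = {
    "https://example.com/a": (301, "https://example.com/b"),
    "https://example.com/b": (302, "https://example.com/a"),
    "https://example.com/c": (200, None),
    "https://example.com/d": (404, None),
    "https://example.com/e": (301, "https://example.com/c"),
}


def fake_lookup(url: str) -> Tuple[int, Optional[str]]:
    return fake_web[url]


for link in fake_web:
    print(link, check_link(link, fake_lookup))
```

In a real audit the callback would issue a request that does not follow redirects automatically, so each hop is visible to the loop detector.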
Freshness, Risk Management, RAG Optimisation
Per-pillar guidance follows the same pattern. The full pillar-by-pillar guidance is in the source PDF.
Reported Correlations
| Pillar | Correlation (r) | p-value | 95% CI | Citation Impact |
|---|---|---|---|---|
| Metadata & Freshness | 0.68 | <0.001 | [0.64, 0.72] | +47% |
| Semantic HTML | 0.65 | <0.001 | [0.61, 0.69] | +42% |
| Structured Data | 0.63 | <0.001 | [0.59, 0.67] | +39% |
| Evidence & Citations | 0.61 | <0.001 | [0.57, 0.65] | +37% |
| Authority & Trust | 0.59 | <0.001 | [0.55, 0.63] | +35% |
| Internal Linking | 0.57 | <0.001 | [0.53, 0.61] | +33% |
Correlations for the other 10 pillars are not provided in v1.
Self-Flagged Limitations (Verbatim)
- Internal validity: “Our observational design may suffer from unobserved confounding.”
- Construct validity: GEO-16 captures only a subset of on-page quality signals.
- External validity: Dataset limited to English-language B2B SaaS pages from a single time point; results may not generalize to other languages, verticals, or future engine versions.
- Experimental limitation: “We do not experimentally vary publication venues, so causal effects of off-page authority remain unverified.”
- Confounding: Engine-specific personalization and A/B variation not fully accounted for.
Practical Use
GEO-16 is the best-formalized audit framework in this thesis cluster. Treat it as a score-this-page-out-of-1.0 rubric rather than a causal recipe:
- Threshold to chase: G(u) ≥ 0.70 + ≥12 pillar hits → 78% cross-engine citation rate in the paper’s dataset (an observed rate, not a guaranteed probability for any single page).
- Pillars to prioritize (highest measured correlations): Metadata & Freshness, Semantic HTML, Structured Data, Evidence & Citations.
- Pillars the framework formalizes but for which v1 reports no correlations: UX & Readability, Microcontent, Content Depth, Transparency & Ethics, External Linking, Engagement & Interaction, Claims & Accuracy.
Open Questions
- Individual sub-signal weights (w_j,i). v1 doesn’t disclose how each sub-signal within a pillar is weighted. Update needed when v2 or final publication ships.
- Correlations for the other 10 pillars. Only six pillars have published correlation values. The other ten are formalized in the framework but not yet measured against citation.
- External validity outside B2B SaaS. Authors flag this. A consumer-vertical or healthcare replication would be the highest-value follow-up.
- Versioning. v1 was submitted September 2025. A v2 or peer-reviewed final version may shift weights as engines change.
- Engine-specific weights. All six published correlations aggregate across Brave + AIO + Perplexity. Per-engine weights would help resolve some of the apparent contradictions between AirOps’s ChatGPT-only finding and Digital Applied’s AIO-only finding.
Related
- Ahrefs Schema → AI Citations Causal Study — Causal counterpoint to GEO-16’s correlational findings on Structured Data. Both sets of authors are upfront about the methodological boundary.
- AirOps + Kevin Indig Fan-Out Effect ChatGPT Study — Companion observational study. AirOps’s stratified +6.5pp schema finding maps directly onto GEO-16’s r=0.63 Structured Data correlation from a different dataset.
- Digital Applied 1,000 AIO Citation Pattern Study — AIO-only observational study with regression-style DA control. Schema 2.3× finding consistent with GEO-16’s +39% impact.
- Zyppy AI Citation Ranking Factors Meta-Analysis — Cyrus Shepard’s 54-study aggregation. GEO-16 is likely one of those 54 underlying studies. ^[inferred]
- Google’s Generative AI Search Optimization Guide — Google’s official position. Mostly aligned with GEO-16 on structured-data + semantic-HTML guidance; the divergence is causal-vs-correlational framing.
- FLUQs Framework — Practitioner framework. GEO-16’s Microcontent + Evidence-and-Citations pillars map onto FLUQs’s EchoBlocks-as-causal-triplets pattern.
Try It
- Score a sample of your top pages against the 16 pillars. Use the v1 paper’s band-score rubric (0-3 per pillar). Compute G(u) and count pillar hits.
- Target the threshold: GEO ≥0.70 + ≥12 pillar hits. Below that, the citation rate drops sharply per the paper.
- Sequence improvements by correlation strength: start with Metadata & Freshness (r=0.68), then Semantic HTML (r=0.65), then Structured Data (r=0.63). These are the three pillars with the strongest published evidence in the dataset.
- Don’t over-index on Structured Data alone. Per the contradiction with Ahrefs’s causal study, schema is best framed as a marker of editorial maturity. Get the underlying editorial and metadata practices right; let schema follow.
- Re-audit quarterly. Engines change. GEO-16’s authors flag that “results may not generalize to future engine versions.” Treat the framework as a snapshot, not a permanent ranking.
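The sequencing step can be sketched as a small triage helper. It takes a page's band scores for the six pillars with published correlations and returns the ones still below the hit band, ordered by the paper's r values. The function name and the sample scores are hypothetical, and because the correlations are observational, the ordering is a triage heuristic and not an expected-lift ranking.

```python
from typing import Dict, List

# Correlations reported in the v1 paper for the six measured pillars.
PUBLISHED_R = {
    "Metadata & Freshness": 0.68,
    "Semantic HTML": 0.65,
    "Structured Data": 0.63,
    "Evidence & Citations": 0.61,
    "Authority & Trust": 0.59,
    "Internal Linking": 0.57,
}


def improvement_queue(bands: Dict[str, int], hit_band: int = 2) -> List[str]:
    """Return measured pillars below the hit band, strongest correlation first.

    Pillars missing from `bands` are treated as band 0 (an assumption).
    """
    gaps = [p for p in PUBLISHED_R if bands.get(p, 0) < hit_band]
    return sorted(gaps, key=lambda p: PUBLISHED_R[p], reverse=True)


# Hypothetical band scores for one audited page.
page_bands = {
    "Metadata & Freshness": 1,
    "Semantic HTML": 3,
    "Structured Data": 0,
    "Evidence & Citations": 2,
    "Authority & Trust": 1,
    "Internal Linking": 3,
}

for pillar in improvement_queue(page_bands):
    print(pillar, PUBLISHED_R[pillar])
```

For the hypothetical page above, the queue is Metadata & Freshness, then Structured Data, then Authority & Trust; the other three pillars already count as hits.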