Source: geo-16-framework-arxiv-kumar-palkhouski-2025-09.md — Arlen Kumar (UC Berkeley) and Leanid Palkhouski (Wrodium Research). arXiv:2509.10762v1, submitted September 13, 2025.

This is the first academic paper to publish a structured AEO/GEO auditing framework derived from a multi-engine empirical study. Kumar and Palkhouski audited 1,100 URLs against 70 B2B SaaS prompts and harvested 1,702 citations across Brave Summary, Google AI Overviews, and Perplexity sonar-pro. From that dataset they built GEO-16, a 16-pillar scoring framework. Headline: pages scoring ≥0.70 on GEO with ≥12 pillar hits achieve a 78% cross-engine citation rate. The three most strongly correlated pillars are Metadata & Freshness (r=0.68), Semantic HTML (r=0.65), and Structured Data (r=0.63), all p<0.001. The paper is the most academically rigorous source on AI citation correlations as of this writing, and it explicitly flags its own observational design as non-causal.

Key Takeaways

  • First academic AEO/GEO citation study. Authors are affiliated with UC Berkeley and Wrodium Research; submitted to arXiv September 2025 (preprint).
  • Multi-engine dataset. 1,100 URLs / 1,702 citations / 70 prompts / 16 B2B SaaS verticals. Citation distribution: Brave 36.0%, Google AIO 35.1%, Perplexity 28.9%.
  • GEO-16 = 16 pillars across 6 principles. People-first content (3 pillars), structured data (3), provenance (3), freshness (2), risk management (2), RAG optimisation (3). Each pillar is scored 0-3; aggregate G(u) = (1/48)Σ b_j(u) ∈ [0,1].
  • Threshold finding. Pages with GEO ≥0.70 AND ≥12 pillar hits → 78% cross-engine citation rate. This is the headline operational threshold.
  • Top three correlated pillars (r values, all p<0.001):
    • Metadata & Freshness: r=0.68, 95% CI [0.64, 0.72], +47% citation impact
    • Semantic HTML: r=0.65, 95% CI [0.61, 0.69], +42%
    • Structured Data: r=0.63, 95% CI [0.59, 0.67], +39%
  • Mid-tier correlations: Evidence & Citations (r=0.61, +37%), Authority & Trust (r=0.59, +35%), Internal Linking (r=0.57, +33%).
  • Observational, NOT causal. Authors explicitly write: “Our observational design may suffer from unobserved confounding (internal validity)” and “we do not experimentally vary publication venues, so causal effects of off-page authority remain unverified.” This is critical context for interpreting the correlation values.
  • Pairs with the Ahrefs causal contradiction on Structured Data. GEO-16’s r=0.63 / +39% is correlational; Ahrefs’s matched difference-in-differences (DiD) analysis found no causal schema lift. Same reconciliation as the AirOps study: schema is a marker, not a lever.
  • B2B SaaS scope. All 16 verticals are SaaS-adjacent. Authors flag external validity: results may not generalize to consumer / healthcare / news / non-English content.

Structured Data: GEO-16 says r=0.63 / +39%, Ahrefs says no causal lift

GEO-16 says (this paper, cross-sectional observational, Brave/AIO/Perplexity, B2B SaaS): Structured Data r=0.63, 95% CI [0.59, 0.67], +39% citation impact. Ahrefs says (ChatGPT): adding schema mid-period produces no statistically meaningful citation lift.

Reconciliation: GEO-16’s authors explicitly self-flag the observational, non-causal design. The r=0.63 captures the correlation between “pages that have schema” and “pages that get cited,” but pages with schema are not a random subset of pages; they are systematically more mature on editorial and technical dimensions. Ahrefs’s matched difference-in-differences (DiD) design isolates the intervention (adding schema) from the publisher characteristics, and that isolation removes the lift. The framework’s overall threshold finding (GEO ≥0.70 + ≥12 pillar hits → 78% citation) is still actionable as a “what predicts citation” model, but readers should not interpret “+39% citation impact” as “add schema and citations rise 39%.”

Status: resolved (2026-05-19). The disagreement is a methodological difference, not a factual one.
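The gap between a cross-sectional comparison and a difference-in-differences estimate can be sketched in a few lines. This is a minimal illustration only: the function name `did_estimate` and every number below are invented toy values, not figures from GEO-16 or Ahrefs, and the sketch omits the matching step that Ahrefs's design reportedly includes.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: change in treated group minus change in control."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Toy numbers (hypothetical): citations per 100 prompts.
# "Treated" pages add schema between the two periods; "control" pages never do.
treated_pre, treated_post = 30, 34
control_pre, control_post = 20, 24

# Cross-sectional view after the change: schema pages look 10 points better.
cross_sectional_gap = treated_post - control_post

# DiD view: both groups improved by 4 points, so the estimated lift is 0.
lift = did_estimate(treated_pre, treated_post, control_pre, control_post)

print(cross_sectional_gap, lift)
```

In this toy setup the schema pages were already ahead before they added schema, so a snapshot comparison reports a gap while the before/after comparison attributes none of it to the intervention.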

The 16 Pillars (Grouped by Principle)

| Principle | Pillars | Count |
| --- | --- | --- |
| People-first content | UX & Readability; Claims & Accuracy; Microcontent | 3 |
| Structured data | Semantic HTML; Structured Data; Metadata & Freshness | 3 |
| Provenance | Authority & Trust; Evidence & Citations; Transparency & Ethics | 3 |
| Freshness | Metadata & Freshness; Content Depth | 2 |
| Risk management | Claims & Accuracy; Transparency & Ethics | 2 |
| RAG optimisation | Internal Linking; External Linking; Engagement & Interaction | 3 |

Pillar scoring: Each pillar receives a band score b_j(u) ∈ {0,1,2,3}. A “pillar hit” occurs when b_j(u) ≥ 2.

Aggregate GEO score: G(u) = (1/48)Σ b_j(u) ∈ [0,1]. The denominator (48) = 16 pillars × max band 3.

Individual sub-signal weights (w_j,i) within pillars are not disclosed in v1.
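The scoring rules above translate directly into code. The following is a minimal sketch that takes per-pillar band scores as given (since sub-signal weights are undisclosed, it does not attempt to derive bands). The function names and the placeholder pillar keys `p01`..`p16` are assumptions made for illustration; the 0.70 and 12-hit defaults are the threshold reported in the paper.

```python
from typing import Mapping

NUM_PILLARS = 16  # GEO-16 pillar count
MAX_BAND = 3      # highest band score a pillar can receive
HIT_BAND = 2      # a "pillar hit" is a band score of 2 or 3


def geo_score(bands: Mapping[str, int]) -> float:
    """Aggregate score G(u): sum of band scores divided by 16 * 3 = 48."""
    if len(bands) != NUM_PILLARS:
        raise ValueError(f"expected {NUM_PILLARS} pillars, got {len(bands)}")
    for name, band in bands.items():
        if band not in (0, 1, 2, 3):
            raise ValueError(f"band for {name!r} must be 0-3, got {band}")
    return sum(bands.values()) / (NUM_PILLARS * MAX_BAND)


def pillar_hits(bands: Mapping[str, int]) -> int:
    """Count pillars whose band score is at least HIT_BAND."""
    return sum(1 for band in bands.values() if band >= HIT_BAND)


def meets_threshold(bands: Mapping[str, int],
                    min_score: float = 0.70, min_hits: int = 12) -> bool:
    """True when both the score and the hit-count conditions hold."""
    return geo_score(bands) >= min_score and pillar_hits(bands) >= min_hits


# Placeholder keys (hypothetical); substitute the paper's pillar names.
strong_on_twelve = {f"p{i:02d}": (3 if i <= 12 else 0) for i in range(1, 17)}
uniform_twos = {f"p{i:02d}": 2 for i in range(1, 17)}

print(geo_score(strong_on_twelve), pillar_hits(strong_on_twelve),
      meets_threshold(strong_on_twelve))   # 0.75 12 True
print(round(geo_score(uniform_twos), 3), pillar_hits(uniform_twos),
      meets_threshold(uniform_twos))       # 0.667 16 False
```

The second example shows why the two conditions are checked separately: a page can hit every pillar at band 2 and still fall short of G(u) ≥ 0.70.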

The Six Principles — Verbatim Author Guidance

People-first content

“Lead with an answer-first summary (TL;DR or key takeaways), keep paragraphs compact, use descriptive headings/lists, and mark claims versus opinions explicitly.”

Structured data

“Maintain a single <h1> and logical <h2>/<h3> hierarchy; provide valid JSON-LD (Article/TechArticle/FAQPage) with datePublished, dateModified, author, and breadcrumb where relevant; expose canonical URLs and social cards. Ensure schema matches visible content.”
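As one way to act on the JSON-LD part of this guidance, the sketch below builds a minimal schema.org Article object carrying the fields the authors name (datePublished, dateModified, author). The helper name `article_jsonld` and all sample values are hypothetical; the paper does not prescribe a particular generator, and a real page would normally add more properties.

```python
import json
from datetime import date


def article_jsonld(headline: str, author_name: str,
                   published: date, modified: date, url: str) -> str:
    """Return a minimal schema.org Article object serialized as JSON-LD."""
    if modified < published:
        raise ValueError("dateModified must not precede datePublished")
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
        "mainEntityOfPage": url,
    }
    return json.dumps(data, indent=2)


# Sample values are placeholders, not taken from the paper.
snippet = article_jsonld(
    headline="How to audit a page against GEO-16",
    author_name="Jane Example",
    published=date(2025, 9, 13),
    modified=date(2025, 10, 1),
    url="https://example.com/geo-16-audit",
)
print(snippet)
```

The resulting string is meant to be embedded in the page inside a `<script type="application/ld+json">` element, and, per the guidance, its values should match what the visible page shows.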

Provenance

“Cite primary sources inline, include a reference section, favour authoritative domains (.gov/.edu/standards bodies), and perform link-health checks to avoid rot/redirect loops.”
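The redirect-loop part of a link-health check can be sketched without touching the network. In this minimal example the `redirects` dictionary stands in for real HTTP responses (an assumption for illustration; a production check would populate it from actual requests), and the function name, the `max_hops` default, and the result labels are all hypothetical.

```python
def check_link(url: str, redirects: dict, max_hops: int = 5) -> str:
    """Follow a redirect chain and classify it.

    `redirects` maps a URL to its redirect target; a URL absent from the
    map is treated as a final destination. Returns "ok", "loop", or
    "too_many_hops".
    """
    seen = {url}
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen:
            return "loop"           # chain revisits a URL it already passed
        if hops > max_hops:
            return "too_many_hops"  # chain is longer than we tolerate
        seen.add(url)
    return "ok"


# Hypothetical redirect data for three cited links.
redirects = {
    "https://example.com/old": "https://example.com/new",
    "https://example.com/a": "https://example.com/b",
    "https://example.com/b": "https://example.com/a",
}

for link in ("https://example.com/old", "https://example.com/a",
             "https://example.com/direct"):
    print(link, check_link(link, redirects))
```

Keeping the classification logic separate from the fetching step makes it easy to test and to reuse once real response data is plugged in.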

Freshness, Risk Management, RAG Optimisation

Per-pillar guidance follows the same pattern. The full pillar-by-pillar guidance is in the source PDF.

Reported Correlations

| Pillar | Correlation (r) | p-value | 95% CI | Citation Impact |
| --- | --- | --- | --- | --- |
| Metadata & Freshness | 0.68 | <0.001 | [0.64, 0.72] | +47% |
| Semantic HTML | 0.65 | <0.001 | [0.61, 0.69] | +42% |
| Structured Data | 0.63 | <0.001 | [0.59, 0.67] | +39% |
| Evidence & Citations | 0.61 | <0.001 | [0.57, 0.65] | +37% |
| Authority & Trust | 0.59 | <0.001 | [0.55, 0.63] | +35% |
| Internal Linking | 0.57 | <0.001 | [0.53, 0.61] | +33% |

Correlations for the other 10 pillars are not provided in v1.

Self-Flagged Limitations (Verbatim)

  • Internal validity: “Our observational design may suffer from unobserved confounding.”
  • Construct validity: GEO-16 captures only a subset of on-page quality signals.
  • External validity: Dataset limited to English-language B2B SaaS pages from a single time point; results may not generalize to other languages, verticals, or future engine versions.
  • Experimental limitation: “We do not experimentally vary publication venues, so causal effects of off-page authority remain unverified.”
  • Confounding: Engine-specific personalization and A/B variation not fully accounted for.

Practical Use

GEO-16 is the best-formalized audit framework in this thesis cluster. Treat it as a score-this-page-out-of-1.0 rubric rather than a causal recipe:

  • Threshold to chase: G(u) ≥ 0.70 + ≥12 pillar hits, which the paper associates with a 78% cross-engine citation rate (an observed rate in its dataset, not a guaranteed probability).
  • Pillars to prioritize (highest measured correlations): Metadata & Freshness, Semantic HTML, Structured Data, Evidence & Citations.
  • Pillars that the framework formalizes but for which v1 reports no correlations include: UX & Readability, Microcontent, Content Depth, Transparency & Ethics, External Linking, Engagement & Interaction, Claims & Accuracy.

Open Questions

  • Individual sub-signal weights (w_j,i). v1 doesn’t disclose how each sub-signal within a pillar is weighted. Update needed when v2 or final publication ships.
  • Correlations for the other 10 pillars. Only six pillars have published correlation values. The other ten are formalized in the framework but not yet measured against citation.
  • External validity outside B2B SaaS. Authors flag this. A consumer-vertical or healthcare replication would be the highest-value follow-up.
  • Versioning. v1 was submitted September 2025. A v2 or peer-reviewed final version may shift weights as engines change.
  • Engine-specific weights. All six published correlations aggregate across Brave + AIO + Perplexity. Per-engine correlations could help resolve some of the contradictions visible between AirOps’s ChatGPT-only finding and Digital Applied’s AIO-only finding.

Try It

  1. Score a sample of your top pages against the 16 pillars. Use the v1 paper’s band-score rubric (0-3 per pillar). Compute G(u) and count pillar hits.
  2. Target the threshold: GEO ≥0.70 + ≥12 pillar hits. Below that, the citation rate drops sharply per the paper.
  3. Sequence improvements by correlation strength: start with Metadata & Freshness (r=0.68), then Semantic HTML (r=0.65), then Structured Data (r=0.63). These are the three pillars with the strongest published evidence in the dataset.
  4. Don’t over-index on Structured Data alone. Per the contradiction with Ahrefs’s causal study, schema is best framed as a marker of editorial maturity. Get the underlying editorial and metadata practices right; let schema follow.
  5. Re-audit quarterly. Engines change. GEO-16’s authors flag that “results may not generalize to future engine versions.” Treat the framework as a snapshot, not a permanent ranking.
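Step 3 can be sketched as a simple ordering over the pillars a page currently misses. This is a hedged illustration: `REPORTED_R` holds the six correlations published in the paper, while the function name `improvement_queue`, the sample band scores, and the choice to put unmeasured pillars last in alphabetical order are assumptions. The ordering reflects correlation strength only and should not be read as a causal ranking.

```python
# Correlations reported in the paper for six pillars (pooled across engines).
REPORTED_R = {
    "Metadata & Freshness": 0.68,
    "Semantic HTML": 0.65,
    "Structured Data": 0.63,
    "Evidence & Citations": 0.61,
    "Authority & Trust": 0.59,
    "Internal Linking": 0.57,
}


def improvement_queue(bands: dict, hit_band: int = 2) -> list:
    """List pillars below the hit band, strongest reported correlation first.

    Pillars with no reported correlation sort last, alphabetically.
    """
    misses = [pillar for pillar, band in bands.items() if band < hit_band]
    return sorted(misses, key=lambda p: (-REPORTED_R.get(p, 0.0), p))


# Hypothetical partial audit of one page (band scores 0-3).
page = {
    "Structured Data": 1,
    "Semantic HTML": 3,
    "Metadata & Freshness": 0,
    "Microcontent": 1,
    "Content Depth": 0,
}

print(improvement_queue(page))
# ['Metadata & Freshness', 'Structured Data', 'Content Depth', 'Microcontent']
```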