The Voice-Agent Lead-Qualification Stack — Model, Orchestration, CRM

Source: wiki synthesis: Voice Agent Comparison — Moshi vs ElevenLabs vs OpenAI Realtime API vs Cartesia Sonic, n8n + Vapi Voice Agents — Outbound Lead Qualification, Voice AI & Conversation AI Public APIs

The wiki covers the local-business voice-call funnel (the running example: a dental practice) in three articles that never reference each other’s layer: which speech stack listens and speaks (the 4-way comparison), which workflow engine fires and tracks the calls (n8n + Vapi), and which CRM the agents and results live in (GoHighLevel’s Voice AI public API). This article assigns ownership per layer, flags where two layers claim the same job, and gives a minimal build order. It also partially answers the comparison article’s own open question — whether any of the four stacks fits a dental voice use case — by showing that for platform-assembled funnels the model choice is largely made for you, so the 4-way matters mainly when you self-assemble. ^[inferred]

Key Takeaways

Three layers, three owners. Model layer: which speech stack converses (Moshi / ElevenLabs / OpenAI Realtime / Cartesia, per the comparison). Orchestration layer: n8n owns the event glue — form trigger, phone normalization, call creation, polling, voicemail branch, result write. CRM layer: GoHighLevel owns where agents, call logs, transcripts, and qualified leads live, with full CRUD APIs and webhooks.
The demonstrated build is outbound speed-to-lead. The n8n + Vapi workflow calls every new website-form lead — the dental translation is a callback within 60 seconds, “before they shop a competitor” — qualifies conversationally, and writes Vapi structured outputs (interest, motivation, urgency, past experience, budget, paid intent, status) to a sheet or CRM.
Two layers overlap on owning the call itself. Vapi (driven from n8n over raw HTTP) and GHL Voice AI (CRM-native phone-call agents) both provide the “AI agent on a phone call.” The differences that matter: the n8n build polls (GET /call/:id in a wait loop) while GHL pushes webhooks for call outcomes and transcripts; GHL actions ship appointment booking and call transfer as first-class sub-resources; and GHL agents live in the same platform as the lead pipeline.
The model layer is only your decision if you self-assemble. OpenAI Realtime brings the most agent-native tool calling plus built-in SIP telephony; ElevenLabs is the fastest path to a full pipelined agent with a Twilio phone surface; Cartesia is a TTS leg for a pipeline you build; Moshi is the only self-hostable option. Neither the Vapi walkthrough nor the GHL API article documents which speech model runs underneath those platforms — the platform layers abstract the choice away. ^[inferred from what the sources do and do not state]
Treat every latency number as a hypothesis. The comparison’s sharpest finding: Cartesia markets sub-90ms TTS while an independent production benchmark measured P50 188ms. Benchmark your top two candidates in your own conditions before committing a phone funnel to either.
One ethics baseline survives every configuration: the agent introduces itself as AI on the first turn. The n8n source treats this as a default to keep in any deployment.
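The latency takeaway above can be made concrete with a small summary helper. This is a minimal sketch, assuming you have already collected time-to-first-audio samples (in milliseconds) for each candidate under your own conditions; the function names and the nearest-rank percentile method are illustrative choices, not something the sources prescribe.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Nearest-rank: smallest value with at least p% of samples at or below it.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Report the median and tail latency side by side for one candidate."""
    return {
        "n": len(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
    }
```

Comparing `summarize(...)` for the top two candidates on the same network and telephony path is usually more informative than a marketed single figure, since the tail (P95) is what callers notice as awkward pauses.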

What each layer owns

Model layer (from the 4-way comparison): architecture (native speech-to-speech vs pipeline), latency profile, pricing model (free-self-host vs token-based vs character/minute tiers), openness. Decision drivers: self-host requirement → Moshi; tool-calling depth + SIP → OpenAI Realtime; voice cloning/emotional range with minimal build → ElevenLabs; fastest TTS leg in a hand-rolled pipeline → Cartesia.
Orchestration layer (from the n8n + Vapi build): a six-stage loop of form trigger → code-node phone normalization with an IF branch (emit “incorrect format” on bad numbers and skip them rather than burning a failed call) → POST /call with assistantOverrides.variableValues for per-lead personalization ({{lead_name}}, {{lead_request}}) → poll until status: ended → voicemail check via endedReason → structured-output write. It also owns telephony pragmatics: Vapi’s 10 free US numbers carry a daily outbound cap, so production means importing a Twilio number.
CRM layer (from Conversation AI public APIs): programmatic agent management (create/patch/list/get/delete), agent actions (webhook invocation, appointment booking, call transfer, follow-up settings), call logs and transcripts filterable by agent/contact/call type/date range, Generations details for QA and compliance, and webhooks that push call outcomes instead of being polled. Auth discipline: a Private Integration Token scoped at the sub-account (location) level, not an Agency token.
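The normalization and call-creation stages of the orchestration layer can be sketched as follows. This is a hedged illustration, not the walkthrough's actual code node: it assumes US-only numbers, and the payload fields other than assistantOverrides.variableValues (assistantId, phoneNumberId, customer.number) are assumptions that should be checked against the Vapi API reference. The lead dictionary keys are hypothetical. No network call is made; the function only builds the body that would be sent to POST /call.

```python
import re

BAD_NUMBER = "incorrect format"  # sentinel the IF branch tests for


def normalize_us_phone(raw):
    """Return an E.164 US number, or the sentinel on anything unusable."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        return BAD_NUMBER
    return "+1" + digits


def build_call_payload(lead, assistant_id, phone_number_id):
    """Build the POST /call body, or None when the number is unusable."""
    number = normalize_us_phone(lead.get("phone"))
    if number == BAD_NUMBER:
        return None  # IF branch: skip the call instead of burning a failed one
    return {
        "assistantId": assistant_id,        # assumed field name
        "phoneNumberId": phone_number_id,   # assumed field name
        "customer": {"number": number},     # assumed field name
        "assistantOverrides": {
            "variableValues": {
                "lead_name": lead.get("name", ""),
                "lead_request": lead.get("request", ""),
            }
        },
    }
```

Returning None for bad numbers keeps the decision in one place, so the workflow branch only has to check whether a payload exists.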

Where the layers overlap — and how to resolve it

Structured extraction lives twice. Vapi structured outputs capture qualification fields during the call; GHL call logs, transcripts, and Generation details capture the same conversation from the CRM side. Resolution: whichever platform owns the call owns extraction; the other side should consume, not re-extract. ^[inferred]
Polling vs. webhooks. The n8n build’s wait-and-repoll loop (60s initial wait, 10s re-polls, plus a limit node to trim a Vapi duplicate-response bug) exists because the walkthrough polls. GHL’s surfaces push webhooks for outcomes and transcripts — if GHL owns the call layer, an n8n Webhook trigger replaces the entire polling branch. ^[inferred]
Actions vs. prompt-only agents. The Vapi assistant in the build qualifies and ends the call; GHL agent actions add appointment booking and call transfer as managed sub-resources. A dental funnel that should end in a booked slot, not just a qualified row, has a native home on the GHL side. ^[inferred]
Whether GHL replaces the Vapi layer outright is not established. The GHL source documents agent CRUD, actions, logs, and webhooks — it does not document firing an outbound call on a form event, which is the n8n + Vapi build’s whole premise. Until that’s confirmed, the honest architecture keeps n8n + Vapi for outbound speed-to-lead and uses GHL as the destination and management plane.
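For comparison with the webhook path, the polling branch described above can be sketched as a small loop. This is an illustration under stated assumptions: fetch_call is a hypothetical injected callable standing in for GET /call/:id, the max_polls cap is an added safeguard that the walkthrough does not mention, and treating any endedReason containing "voicemail" as a voicemail hit is a guess at the value format rather than documented behavior.

```python
import time


def wait_for_call_end(fetch_call, call_id, initial_wait=60, repoll_wait=10,
                      max_polls=30, sleep=time.sleep):
    """Poll until the call reports status 'ended', then return the call record."""
    sleep(initial_wait)  # 60s initial wait, as in the walkthrough
    for _ in range(max_polls):
        call = fetch_call(call_id)
        if call.get("status") == "ended":
            return call
        sleep(repoll_wait)  # 10s between re-polls
    raise TimeoutError(f"call {call_id} did not end after {max_polls} polls")


def is_voicemail(call):
    """Assumed check: voicemail appears somewhere in endedReason."""
    return "voicemail" in str(call.get("endedReason", "")).lower()
```

With a webhook trigger, all of wait_for_call_end disappears and only the is_voicemail style branch remains, which is the simplification the polling-versus-webhooks point is about.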

Minimal build order

Prototype the documented recipe end-to-end — n8n form trigger through Vapi call to a Google Sheet. Pin sample data in n8n so flow edits don’t burn real Vapi minutes.
Swap the sheet write for the CRM write. The n8n source’s own Try It section recommends exactly this: wire the result row into GoHighLevel so the qualified lead lands in the same pipeline as a manually handled inquiry. Use a sub-account-scoped Private Integration Token (PIT).
Replace polling with pushes where GHL owns the data — subscribe to call-outcome/transcript webhooks rather than re-polling. ^[inferred]
Evaluate consolidating the call layer into GHL Voice AI once the funnel runs inside GHL: fewer moving parts, booking/transfer actions, programmatic rollout of consistent agent configs across locations. Gate this on resolving the outbound-trigger open question below.
Only then revisit the model layer — and only if latency, cost, or voice quality actually bites. Use the comparison’s decision table, and benchmark the top two candidates rather than trusting marketed numbers.
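The sheet-to-CRM swap in step 2 mostly comes down to flattening the structured outputs into one record. The sketch below assumes the seven qualification fields named in the n8n build arrive as a dictionary; the snake_case keys, the qual_ prefix, the source tag, and the needs_review flag are all hypothetical, and the mapping onto real GoHighLevel contact or custom fields is deliberately left out because the sources do not document it.

```python
QUALIFICATION_FIELDS = (
    "interest", "motivation", "urgency", "past_experience",
    "budget", "paid_intent", "status",
)


def to_crm_record(lead, structured_outputs):
    """Flatten one call's structured outputs into a single CRM-ready record."""
    missing = [f for f in QUALIFICATION_FIELDS if f not in structured_outputs]
    record = {
        "name": lead["name"],
        "phone": lead["phone"],
        "source": "voice-agent",  # hypothetical tag to separate AI-qualified leads
    }
    for field in QUALIFICATION_FIELDS:
        # Prefix avoids collisions with the CRM's own contact fields.
        record[f"qual_{field}"] = structured_outputs.get(field)
    # Flag partial extractions so a human checks them before follow-up.
    record["needs_review"] = bool(missing)
    return record
```

Keeping the same function for both the sheet write and the CRM write means the prototype and the production path share one definition of a qualified lead.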

Try It

Build the Vapi assistant with 5-7 dynamic variables and structured-output fields matched to what a dental front desk actually asks (urgency, insurance, procedure interest, preferred appointment window — per the n8n article’s dental adaptation).
Import the Twilio number the practice already uses so callbacks show a familiar caller ID.
In a test GHL sub-account, create a Private Integration Token with Voice AI / Conversation AI scopes, call List Agents to get a real agentId, and confirm you can read call logs before wiring anything automated.
Wire the GHL webhook for call outcomes so transcripts arrive as events.
Keep the first-turn AI self-disclosure line in every assistant prompt.
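The last item can be enforced mechanically before an assistant goes live. This is a rough sketch of a pre-deployment check, assuming each assistant has a scripted first message available as a string; the phrase list is an illustrative guess at common disclosure wording and will need extending for your own scripts.

```python
import re

# Illustrative patterns only; extend to match the wording your prompts use.
DISCLOSURE = re.compile(
    r"\b(AI|artificial intelligence|virtual|automated) (assistant|agent)\b",
    re.IGNORECASE,
)


def first_turn_discloses_ai(first_message):
    """Return True when the opening line identifies the caller as an AI."""
    return bool(DISCLOSURE.search(first_message))
```

Running this over every assistant's first message in a test step catches a prompt edit that silently drops the disclosure line. The practice name in any sample message is made up.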

Related

Voice Agent Comparison — Moshi vs ElevenLabs vs OpenAI Realtime API vs Cartesia Sonic
n8n + Vapi Voice Agents — Outbound Lead Qualification
Voice AI & Conversation AI Public APIs
Voice Agents with ElevenLabs + Claude Code — the fastest self-assembled full-agent path if the platform layers don’t fit
OpenAI Realtime API — the self-assembled option with SIP telephony built in
AI Voice

Open Questions

Can GHL Voice AI initiate outbound, form-triggered calls (the speed-to-lead pattern), or is it inbound-answering only? The wiki’s GHL source covers agent management, not call initiation semantics.
What are the per-call economics across the assembled stack? The n8n source states Vapi pricing is per-minute but cites no dollar figure, and the API article does not cover GHL Voice AI pricing, so no cost data is available.
Which speech models and voices actually run under Vapi and GHL Voice AI? GHL exposes only a “Voices catalog” API reference, and the Vapi demo names a stock voice, “Elliot”. Without this, the model-layer comparison can’t be applied to the platform paths even in principle.

Jonathon's AI Wiki
