Source: wiki synthesis: Voice Agent Comparison — Moshi vs ElevenLabs vs OpenAI Realtime API vs Cartesia Sonic, n8n + Vapi Voice Agents — Outbound Lead Qualification, Voice AI & Conversation AI Public APIs
The wiki covers the local-business voice-call funnel (the running example: a dental practice) in three articles that never reference each other’s layer: which speech stack listens and speaks (the 4-way comparison), which workflow engine fires and tracks the calls (n8n + Vapi), and which CRM the agents and results live in (GoHighLevel’s Voice AI public API). This article assigns ownership per layer, flags where two layers claim the same job, and gives a minimal build order. It also partially answers the comparison article’s own open question — whether any of the four stacks fits a dental voice use case — by showing that for platform-assembled funnels the model choice is largely made for you, so the 4-way matters mainly when you self-assemble. ^[inferred]
Key Takeaways
- Three layers, three owners. Model layer: which speech stack converses (Moshi / ElevenLabs / OpenAI Realtime / Cartesia, per the comparison). Orchestration layer: n8n owns the event glue — form trigger, phone normalization, call creation, polling, voicemail branch, result write. CRM layer: GoHighLevel owns where agents, call logs, transcripts, and qualified leads live, with full CRUD APIs and webhooks.
- The demonstrated build is outbound speed-to-lead. The n8n + Vapi workflow calls every new website-form lead — the dental translation is a callback within 60 seconds, “before they shop a competitor” — qualifies conversationally, and writes Vapi structured outputs (interest, motivation, urgency, past experience, budget, paid intent, status) to a sheet or CRM.
- Two layers overlap on owning the call itself. Vapi (driven from n8n over raw HTTP) and GHL Voice AI (CRM-native phone-call agents) both provide the “AI agent on a phone call.” The differences that matter: the n8n build polls (`GET /call/:id` in a wait loop) while GHL pushes webhooks for call outcomes and transcripts; GHL actions ship appointment booking and call transfer as first-class sub-resources; and GHL agents live in the same platform as the lead pipeline.
- The model layer is only your decision if you self-assemble. OpenAI Realtime brings the most agent-native tool calling plus built-in SIP telephony; ElevenLabs is the fastest path to a full pipelined agent with a Twilio phone surface; Cartesia is a TTS leg for a pipeline you build; Moshi is the only self-hostable option. Neither the Vapi walkthrough nor the GHL API article documents which speech model runs underneath those platforms; the platform layers abstract the choice away. ^[inferred from what the sources do and do not state]
- Treat every latency number as a hypothesis. The comparison’s sharpest finding: Cartesia markets sub-90ms TTS while an independent production benchmark measured P50 188ms. Benchmark your top two candidates in your own conditions before committing a phone funnel to either.
- One ethics baseline survives every configuration: the agent introduces itself as AI on the first turn. The n8n source treats this as a default to keep in any deployment.
What each layer owns
- Model layer (from the 4-way comparison): architecture (native speech-to-speech vs pipeline), latency profile, pricing model (free-self-host vs token-based vs character/minute tiers), openness. Decision drivers: self-host requirement → Moshi; tool-calling depth + SIP → OpenAI Realtime; voice cloning/emotional range with minimal build → ElevenLabs; fastest TTS leg in a hand-rolled pipeline → Cartesia.
- Orchestration layer (from the n8n + Vapi build): the six-stage loop, which runs form trigger → code-node phone normalization (emit `incorrect format` on bad numbers rather than burning a failed call) → IF branch → `POST /call` with `assistantOverrides.variableValues` for per-lead personalization (`{{lead_name}}`, `{{lead_request}}`) → poll until `status: ended` → voicemail check via `endedReason` → structured-output write. Also owns telephony pragmatics: Vapi’s 10 free US numbers carry a daily outbound cap; production means importing a Twilio number.
- CRM layer (from Conversation AI public APIs): programmatic agent management (create/patch/list/get/delete), agent actions (webhook invocation, appointment booking, call transfer, follow-up settings), call logs and transcripts filterable by agent/contact/call type/date range, Generations details for QA and compliance, and webhooks that push call outcomes instead of being polled. Auth discipline: a Private Integration Token scoped at the sub-account (location) level, not an Agency token.
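The normalization and call-creation stages of the orchestration layer can be sketched as follows. This is a minimal illustration in Python, not the walkthrough's actual code node. It assumes US-only numbers, and the payload keys other than `assistantOverrides.variableValues` (`assistantId`, `phoneNumberId`, `customer.number`) as well as the lead field names are assumptions to check against the Vapi API reference.

```python
import re
from typing import Optional

BAD_NUMBER = "incorrect format"  # sentinel named in the walkthrough


def normalize_us_phone(raw: Optional[str]) -> str:
    """Return a +1XXXXXXXXXX number, or the sentinel for unusable input."""
    digits = re.sub(r"\D", "", raw or "")
    # Drop a leading US country code if the lead typed one.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return BAD_NUMBER
    return "+1" + digits


def build_call_payload(lead: dict, assistant_id: str, phone_number_id: str) -> Optional[dict]:
    """Build a body for POST /call, or None when the number is unusable.

    Returning None plays the role of the IF branch: no call is attempted.
    """
    phone = normalize_us_phone(lead.get("phone"))
    if phone == BAD_NUMBER:
        return None
    return {
        "assistantId": assistant_id,        # assumed key name
        "phoneNumberId": phone_number_id,   # assumed key name
        "customer": {"number": phone},      # assumed key name
        "assistantOverrides": {
            # Fills {{lead_name}} and {{lead_request}} in the assistant prompt.
            "variableValues": {
                "lead_name": lead.get("name", ""),
                "lead_request": lead.get("request", ""),
            }
        },
    }
```

Rejecting the number before the HTTP request is the point of the design: a malformed lead costs nothing and can be routed to a manual follow-up branch instead.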
Where the layers overlap — and how to resolve it
- Structured extraction lives twice. Vapi structured outputs capture qualification fields during the call; GHL call logs, transcripts, and Generation details capture the same conversation from the CRM side. Resolution: whichever platform owns the call owns extraction; the other side should consume, not re-extract. ^[inferred]
- Polling vs. webhooks. The n8n build’s wait-and-repoll loop (60s initial wait, 10s re-polls, plus a limit node to trim a Vapi duplicate-response bug) exists because the walkthrough polls. GHL’s surfaces push webhooks for outcomes and transcripts — if GHL owns the call layer, an n8n Webhook trigger replaces the entire polling branch. ^[inferred]
- Actions vs. prompt-only agents. The Vapi assistant in the build qualifies and ends the call; GHL agent actions add appointment booking and call transfer as managed sub-resources. A dental funnel that should end in a booked slot, not just a qualified row, has a native home on the GHL side. ^[inferred]
- Whether GHL replaces the Vapi layer outright is not established. The GHL source documents agent CRUD, actions, logs, and webhooks — it does not document firing an outbound call on a form event, which is the n8n + Vapi build’s whole premise. Until that’s confirmed, the honest architecture keeps n8n + Vapi for outbound speed-to-lead and uses GHL as the destination and management plane.
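For comparison with a webhook-driven design, the polling branch described above can be sketched like this. The 60s and 10s waits come from the walkthrough; the `max_polls` cap, the injected `get_call` function (standing in for `GET /call/:id`), and the substring match on `endedReason` are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_call_end(
    call_id: str,
    get_call: Callable[[str], dict],
    sleep: Callable[[float], None] = time.sleep,
    initial_wait: float = 60.0,
    repoll_wait: float = 10.0,
    max_polls: int = 30,  # assumed safety cap, not from the source
) -> dict:
    """Poll until the call reports status 'ended', then return it."""
    sleep(initial_wait)
    for _ in range(max_polls):
        call = get_call(call_id)
        if call.get("status") == "ended":
            return call
        sleep(repoll_wait)
    raise TimeoutError(f"call {call_id} did not end after {max_polls} polls")


def is_voicemail(call: dict) -> bool:
    """Assumed heuristic: treat any endedReason mentioning voicemail as one."""
    return "voicemail" in str(call.get("endedReason", "")).lower()
```

If GHL (or any platform) pushes the outcome as a webhook event, this whole function disappears and only the voicemail check and the result write remain.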
Minimal build order
- Prototype the documented recipe end-to-end — n8n form trigger through Vapi call to a Google Sheet. Pin sample data in n8n so flow edits don’t burn real Vapi minutes.
- Swap the sheet write for the CRM write. The n8n source’s own Try It recommends this: wire the result row into GoHighLevel so the qualified lead lands in the same pipeline as a manually handled inquiry. Use a Private Integration Token (PIT) scoped to the sub-account.
- Replace polling with pushes where GHL owns the data — subscribe to call-outcome/transcript webhooks rather than re-polling. ^[inferred]
- Evaluate consolidating the call layer into GHL Voice AI once the funnel runs inside GHL: fewer moving parts, booking/transfer actions, programmatic rollout of consistent agent configs across locations. Gate this on resolving the outbound-trigger open question below.
- Only then revisit the model layer — and only if latency, cost, or voice quality actually bites. Use the comparison’s decision table, and benchmark the top two candidates rather than trusting marketed numbers.
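The benchmarking step can be as simple as timing repeated requests and reporting the median. The sketch below assumes a hypothetical `synthesize` callable that wraps whichever candidate is under test; it times the whole call, so measuring time to first audio byte would need a streaming-aware wrapper. The P95 figure is an addition for context and uses the nearest-rank method.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latencies_ms(
    synthesize: Callable[[str], object],
    text: str,
    runs: int = 20,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Time `runs` sequential calls and return each duration in milliseconds."""
    samples = []
    for _ in range(runs):
        start = clock()
        synthesize(text)
        samples.append((clock() - start) * 1000.0)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Return P50 (median) and nearest-rank P95 of the samples."""
    ordered = sorted(samples)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"p50": statistics.median(ordered), "p95": ordered[p95_index]}
```

Run it against both candidates from the same network location and at the same time of day as the real funnel, since those conditions are what a marketed number leaves out.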
Try It
- Build the Vapi assistant with 5-7 dynamic variables and structured-output fields matched to what a dental front desk actually asks (urgency, insurance, procedure interest, preferred appointment window — per the n8n article’s dental adaptation).
- Import the Twilio number the practice already uses so callbacks show a familiar caller ID.
- In a test GHL sub-account, create a Private Integration Token with Voice AI / Conversation AI scopes, call `List Agents` to get a real `agentId`, and confirm you can read call logs before wiring anything automated.
- Wire the GHL webhook for call outcomes so transcripts arrive as events.
- Keep the first-turn AI self-disclosure line in every assistant prompt.
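Two of the checks above lend themselves to a small pre-deployment lint: that the assistant's first message discloses it is an AI, and that a call's structured output covers the dental fields. The sketch below is a keyword heuristic under assumed names; the snake_case field spellings and the shape of the structured-output dict are illustrative, not Vapi's documented format.

```python
import re
from typing import Dict, List

# Field list follows the dental adaptation named above; spellings are assumed.
DENTAL_FIELDS = (
    "urgency",
    "insurance",
    "procedure_interest",
    "preferred_appointment_window",
)


def discloses_ai(first_message: str) -> bool:
    """True when the opening line mentions 'AI' or 'artificial intelligence'."""
    pattern = r"\b(ai|artificial intelligence)\b"
    return re.search(pattern, first_message.lower()) is not None


def missing_fields(structured: Dict[str, str]) -> List[str]:
    """List the dental fields that are absent or empty in a call's output."""
    return [name for name in DENTAL_FIELDS if not structured.get(name)]
```

A keyword match cannot confirm the disclosure is clear to a listener, so treat a pass as a reminder to review the prompt, not as proof of compliance.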
Related
- Voice Agent Comparison — Moshi vs ElevenLabs vs OpenAI Realtime API vs Cartesia Sonic
- n8n + Vapi Voice Agents — Outbound Lead Qualification
- Voice AI & Conversation AI Public APIs
- Voice Agents with ElevenLabs + Claude Code — the fastest self-assembled full-agent path if the platform layers don’t fit
- OpenAI Realtime API — the self-assembled option with SIP telephony built in
- AI Voice
Open Questions
- Can GHL Voice AI initiate outbound, form-triggered calls (the speed-to-lead pattern), or is it inbound-answering only? The wiki’s GHL source covers agent management, not call initiation semantics.
- Per-call economics across the assembled stack: the n8n source states Vapi pricing is per-minute but cites no dollar figure, and the API article does not cover GHL Voice AI pricing, so no cost data is available.
- Which speech models and voices actually run under Vapi and GHL Voice AI? GHL exposes only a “Voices catalog” API reference, and the Vapi demo names a stock voice, “Elliot”. Without this, the model-layer comparison can’t be applied to the platform paths even in principle.