Voice Agent Comparison — Moshi vs ElevenLabs vs OpenAI Realtime API vs Cartesia Sonic

Source: wiki synthesis: moshi-kyutai, voice-agents-elevenlabs-claude-code, openai-realtime-api, cartesia-sonic, whisper-cli

Four voice-tech stacks now live in this topic, each occupying a genuinely different point in the design space rather than competing head-on. This article compares them side by side on architecture, latency, cost, and openness, so that choosing a stack for a given use case is a lookup, not a fresh research pass each time. whisper-cli sits alongside as the free/offline reference point, though it isn’t a live-conversation competitor to the four.

Key Takeaways

Two architectures, not four. Moshi and OpenAI’s Realtime API are native speech-to-speech: one model processes and generates audio directly, with no separate STT/LLM/TTS stages. ElevenLabs and Cartesia Sonic are pipeline-based: ElevenLabs is a full pipelined stack (STT→LLM→TTS) that also handles the orchestration; Cartesia Sonic is TTS-only, one leg of a pipeline you assemble yourself.
Moshi is the only open, self-hostable option. Code MIT+Apache 2.0, weights CC-BY 4.0. Every other stack here is closed and hosted-only. This is the single sharpest dividing line in the group — it determines whether you can run the model on your own infrastructure at all.
No single “fastest” stack: it depends on which latency number you trust. Moshi publishes ~200ms practical latency. Cartesia markets sub-90ms TTS latency, but an independent production benchmark (Coval) measured a P50 of 188ms and a P75 of 269ms. OpenAI doesn’t publish a headline number at all; independent sources cite 300ms to 2.3 seconds depending on the reasoning-effort setting. ElevenLabs’ pipelined architecture typically runs 500-800ms end-to-end. Cartesia’s figures also cover only the TTS leg, so they are not directly comparable to the end-to-end numbers for the other three. Vendor-marketed and independently measured numbers disagree often enough that any single figure should be treated as a starting hypothesis, not a fact, until benchmarked in your own conditions.
Pricing models aren’t directly comparable. Moshi is free (self-hosted compute cost only). OpenAI is token-based on audio duration (~$0.18-0.46/min uncached, $0.05-0.10/min cached, blended). ElevenLabs and Cartesia are character-or-minute-based subscription tiers (Cartesia: $0 free tier up to $299/mo Scale; Voice Agents billed separately at $0.06/min + telephony). Converting between these requires knowing your actual usage pattern, not just a quoted rate.
Independent comparisons converge on a three-way, not four-way, tradeoff: OpenAI wins on conversational intelligence and tool-use reliability; ElevenLabs wins on emotional range and voice-cloning quality; Cartesia wins on raw TTS latency (when the marketed number holds). Moshi’s differentiator isn’t in that same competition — it’s the only one you can self-host, which matters for privacy/cost-at-scale rather than head-to-head conversational quality.
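To make the pricing takeaway concrete, the sketch below turns the per-minute ranges quoted above into a monthly cost range for a given call volume. It is a rough illustration only: the class and constant names are hypothetical, the rates are the ones quoted in this article rather than live vendor pricing, and the Cartesia figure leaves out telephony because no telephony rate appears in the sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerMinuteRate:
    """A quoted per-minute price range in USD (hypothetical helper type)."""
    low: float
    high: float

    def monthly_cost(self, minutes: float) -> tuple[float, float]:
        """Return the (low, high) monthly cost in USD for a number of minutes."""
        return (round(self.low * minutes, 2), round(self.high * minutes, 2))


# Ranges as quoted in this article; treat them as starting hypotheses.
OPENAI_UNCACHED = PerMinuteRate(0.18, 0.46)
OPENAI_CACHED = PerMinuteRate(0.05, 0.10)
# Cartesia Voice Agents rate, excluding telephony (not quoted in the sources).
CARTESIA_VOICE_AGENTS = PerMinuteRate(0.06, 0.06)

if __name__ == "__main__":
    minutes_per_month = 1000  # assumed usage pattern; replace with your own
    for name, rate in [
        ("OpenAI uncached", OPENAI_UNCACHED),
        ("OpenAI cached", OPENAI_CACHED),
        ("Cartesia Voice Agents", CARTESIA_VOICE_AGENTS),
    ]:
        low, high = rate.monthly_cost(minutes_per_month)
        print(f"{name}: ${low:.2f} to ${high:.2f} per month")
```

Subscription tiers (ElevenLabs, Cartesia Sonic plans) and Moshi’s self-hosted compute cost do not reduce to a single per-minute rate, which is why they are left out of this sketch.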

Side-by-Side

Dimension	Moshi	ElevenLabs	OpenAI Realtime API	Cartesia Sonic
Architecture	Native speech-to-speech (single model)	Pipelined (STT→LLM→TTS)	Native speech-to-speech (single model)	TTS-only (one pipeline stage)
Open vs. closed	Open-weight, self-hostable	Closed, hosted only	Closed, hosted only	Closed, hosted only
Latency (marketed)	~160ms theoretical / ~200ms practical (L4 GPU)	~500-800ms typical	Not published	Sonic 3.5: sub-90ms model latency
Latency (independently measured)	Not independently benchmarked in sources gathered	Not independently benchmarked in sources gathered	300ms-2.3s (reasoning-effort dependent)	Coval production P50: 188ms, P75: 269ms
Pricing model	Free (self-host compute only)	Character/minute subscription tiers	Token-based (audio in/out)	Character/minute subscription tiers + separate Voice Agents rate
Independent-comparison strength	Self-host + latency	Emotional/voice-cloning quality	Conversational intelligence, tool-use	Raw TTS latency (marketed)
Voices / cloning	Two fixed voices (Moshika/Moshiko)	Yes, full library + cloning	10 voices (Cedar, Marin + 8 legacy)	Yes, instant clone from 10-second sample
Deployment surfaces	PyTorch / MLX (Apple Silicon) / Rust runtimes	Dashboard, website widget, phone (Twilio)	WebRTC, WebSocket, SIP	REST/WebSocket API

Which One, For What

Need to self-host, control the model, or avoid per-minute vendor billing entirely → Moshi. It is the only option that runs on your own infrastructure, and the MLX runtime makes it genuinely usable on a local Mac.
Need a full agent (persona + knowledge base + tool calls) with minimal build effort, and voice-cloning/emotional range matters → ElevenLabs. Its pipelined architecture and dashboard tooling (or the Claude-Code-configured path in this vault’s ElevenLabs article) make it the fastest path to a working sales/support agent.
Need strong tool-calling reliability, MCP integration, or telephony (SIP) built in → OpenAI Realtime API. gpt-realtime-2’s parallel tool calls and remote MCP support are the most agent-native feature set of the four.
Building your own pipeline and want the fastest TTS leg specifically, willing to bring your own STT+LLM → Cartesia Sonic — but benchmark the actual latency in production rather than trusting the marketed figure alone.
Need free, offline, private transcription (not conversation) → whisper-cli — not a competitor in this table’s conversational-latency race, but the right default for batch/dictation/captioning work where no live back-and-forth is needed.
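The mapping above is simple enough to write down as a lookup table. The sketch below is one way to encode it; the `Constraint` names and the `recommend` function are hypothetical and only restate the recommendations in this section, they are not part of any vendor SDK.

```python
from enum import Enum


class Constraint(Enum):
    """Decision drivers from the list above (names are illustrative)."""
    SELF_HOST = "self-host, control the model, or avoid per-minute billing"
    FULL_AGENT_MINIMAL_BUILD = "full agent with minimal build effort"
    TOOL_CALLING_OR_SIP = "tool calling, MCP integration, or built-in telephony"
    FASTEST_TTS_LEG = "fastest TTS leg in a self-assembled pipeline"
    OFFLINE_TRANSCRIPTION = "free, offline, private transcription"


# One recommendation per constraint, taken directly from this section.
STACK_FOR = {
    Constraint.SELF_HOST: "Moshi",
    Constraint.FULL_AGENT_MINIMAL_BUILD: "ElevenLabs",
    Constraint.TOOL_CALLING_OR_SIP: "OpenAI Realtime API",
    Constraint.FASTEST_TTS_LEG: "Cartesia Sonic",
    Constraint.OFFLINE_TRANSCRIPTION: "whisper-cli",
}


def recommend(constraints: list[Constraint]) -> list[str]:
    """Return the recommended stacks for the given constraints, without duplicates."""
    picks: list[str] = []
    for constraint in constraints:
        stack = STACK_FOR[constraint]
        if stack not in picks:
            picks.append(stack)
    return picks


if __name__ == "__main__":
    print(recommend([Constraint.SELF_HOST]))
    print(recommend([Constraint.TOOL_CALLING_OR_SIP, Constraint.FASTEST_TTS_LEG]))
```

If `recommend` returns more than one stack, the constraints point in different directions, and the deciding constraint has to be chosen before the vendor.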

Implementation

Tool/Service: N/A. This is a comparison article; see each linked article’s own Implementation section for setup details.
Integration notes: None of these four are strictly interchangeable at the API level. Switching stacks means re-architecting the STT/LLM/TTS boundary (or removing it entirely for Moshi/OpenAI Realtime). Prototype the conversational flow against one stack before assuming portability to another.
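The STT/LLM/TTS boundary mentioned here can be pictured with a small sketch. Everything in it is hypothetical: the class names are invented for illustration, and the stand-in functions in the demo are toy callables, not calls into any of the four vendors’ APIs. The point is only that a pipelined agent exposes a text hand-off between stages, while a native speech-to-speech agent has no such seam.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class VoiceAgent(Protocol):
    """Common shape: audio in, audio out (hypothetical interface)."""

    def respond(self, audio_in: bytes) -> bytes: ...


@dataclass
class PipelinedAgent:
    """Three swappable stages with text passed between them."""
    stt: Callable[[bytes], str]   # speech-to-text stage
    llm: Callable[[str], str]     # text reasoning stage
    tts: Callable[[str], bytes]   # text-to-speech stage (where a TTS-only service would sit)

    def respond(self, audio_in: bytes) -> bytes:
        text_in = self.stt(audio_in)
        text_out = self.llm(text_in)
        return self.tts(text_out)


@dataclass
class NativeSpeechAgent:
    """One model maps audio to audio; there is no text boundary to swap at."""
    model: Callable[[bytes], bytes]

    def respond(self, audio_in: bytes) -> bytes:
        return self.model(audio_in)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any external service.
    pipelined = PipelinedAgent(
        stt=lambda audio: audio.decode("utf-8"),
        llm=str.upper,
        tts=lambda text: text.encode("utf-8"),
    )
    native = NativeSpeechAgent(model=lambda audio: audio[::-1])
    print(pipelined.respond(b"hello"))
    print(native.respond(b"hello"))
```

Code written against the three-stage shape has nowhere to plug in when the target is a single audio-to-audio model, which is why moving between the two families is a re-architecture and not a configuration change.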

Related

Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model
Voice Agents with Claude Code + ElevenLabs (Nate Herk)
OpenAI Realtime API — Native Speech-to-Speech Voice Agents
Cartesia Sonic — Low-Latency Text-to-Speech (State-Space Model)
whisper-cli (whisper.cpp) — Free Local Speech-to-Text
Vapi Voice Agents (n8n) — the workflow-orchestration alternative to any of these four for building a voice agent without an ElevenLabs/Claude-Code-direct integration
OpenClaw on Rabbit R1 — a hardware surface that could pair with Moshi or ElevenLabs at the model layer

Try It

Match the “Which One, For What” section above against the actual constraint driving the decision (self-host requirement? tool-calling depth? voice-cloning fidelity? raw latency?) before defaulting to whichever vendor is best-known.
If latency is the deciding factor, don’t trust any single marketed number in the table above — the Cartesia-vs-Coval gap shows why. Run a small production benchmark against the top 2 candidates in your actual network/geographic conditions.
For a first build with minimal setup, the ElevenLabs + Claude Code recipe has the most complete end-to-end walkthrough already in this wiki.
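For the benchmarking step above, a minimal harness might look like the following. It is a sketch under stated assumptions: `measure_ms` times whatever zero-argument callable you pass in (you would wrap your own request to each candidate stack), and `percentile` uses the nearest-rank method, which may differ slightly from the method used by Coval or any vendor. The function names are illustrative.

```python
import math
import time
from typing import Callable


def measure_ms(call: Callable[[], object], runs: int) -> list[float]:
    """Time `call` the given number of times; return wall-clock latencies in ms."""
    samples: list[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples to summarize")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: list[float]) -> dict[str, float]:
    """Report the median, the P75, and the worst case, since tails matter for voice."""
    return {
        "p50": percentile(samples, 50),
        "p75": percentile(samples, 75),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Placeholder workload; replace with a real request to the stack under test.
    samples = measure_ms(lambda: time.sleep(0.01), runs=20)
    print(summarize(samples))
```

Run it from the network and region your users are actually in, and compare P50 and P75 side by side: a marketed figure can match the median while the tail is what callers notice.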

Open Questions

No source in this cluster independently benchmarked Moshi’s or ElevenLabs’ production latency the way Coval did for Cartesia — the comparison table’s “not independently benchmarked” cells are a real gap, not just an omission.
Voice quality / naturalness (as opposed to latency and architecture) was not systematically compared across all four in any single source gathered — this table is latency/cost/architecture-focused, not a blind-listening-test result.
Whether any of these four is actually the right fit for a WEO/OmniPresence dental-client voice-agent use case (e.g. appointment booking, intake) is unexplored — this article is a general-purpose comparison, not a WEO-specific recommendation.

Jonathon's AI Wiki
