Source: wiki synthesis: moshi-kyutai, voice-agents-elevenlabs-claude-code, openai-realtime-api, cartesia-sonic, whisper-cli
Four voice-tech stacks now live in this topic, each occupying a genuinely different point in the design space rather than competing head-on. This article synthesizes them side by side — architecture, latency, cost, and openness — so the choice of stack for a given use case is a lookup, not a fresh research pass each time. whisper-cli sits alongside as the free/offline reference point, though it isn’t a live-conversation competitor to the other three.
Key Takeaways
- Two architectures, not four. Moshi and OpenAI’s Realtime API are native speech-to-speech — one model processes and generates audio directly, no separate STT/LLM/TTS stages. ElevenLabs and Cartesia Sonic are pipeline components: ElevenLabs is a full pipelined stack (STT→LLM→TTS) that also does the orchestration; Cartesia Sonic is TTS-only, one leg of a pipeline you assemble yourself.
- Moshi is the only open, self-hostable option. Code MIT+Apache 2.0, weights CC-BY 4.0. Every other stack here is closed and hosted-only. This is the single sharpest dividing line in the group — it determines whether you can run the model on your own infrastructure at all.
- No single “fastest” stack: it depends on which latency number you trust. Moshi publishes ~200ms practical latency. Cartesia markets sub-90ms TTS latency, but an independent production benchmark (Coval) measured a P50 of 188ms and a P75 of 269ms. Cartesia’s figures cover only the TTS leg of a pipeline, while the other three are end-to-end. OpenAI doesn’t publish a headline number at all; independent sources cite 300ms to 2.3 seconds depending on the reasoning-effort setting. ElevenLabs’ pipelined architecture typically runs 500-800ms end-to-end. Vendor-marketed and independently measured numbers disagree often enough that any single figure should be treated as a starting hypothesis, not a fact, until benchmarked in your own conditions.
- Pricing models aren’t directly comparable. Moshi is free (self-hosted compute cost only). OpenAI is token-based, billed on audio in and out (~$0.05-0.10/min cached, blended). ElevenLabs and Cartesia use character- or minute-based subscription tiers (Cartesia: $299/mo Scale; Voice Agents billed separately at $0.06/min plus telephony). Converting between these requires knowing your actual usage pattern, not just a quoted rate.
- Independent comparisons converge on a three-way, not four-way, tradeoff: OpenAI wins on conversational intelligence and tool-use reliability; ElevenLabs wins on emotional range and voice-cloning quality; Cartesia wins on raw TTS latency (when the marketed number holds). Moshi’s differentiator isn’t in that same competition — it’s the only one you can self-host, which matters for privacy/cost-at-scale rather than head-to-head conversational quality.
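The pricing point above can be made concrete with a small cost model. This is a minimal sketch, assuming every pricing scheme can be reduced to a flat monthly fee plus a per-minute rate. The `Plan` class, `break_even_minutes`, and the $500/mo self-hosting budget are illustrative assumptions, not figures from any vendor; only the $0.05-0.10/min range comes from the takeaways above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """One pricing model reduced to a flat monthly fee plus a per-minute rate."""
    name: str
    flat_monthly_usd: float  # subscription fee or fixed self-host compute cost
    per_minute_usd: float    # usage-based rate per conversation minute

    def monthly_cost(self, minutes: float) -> float:
        return self.flat_monthly_usd + self.per_minute_usd * minutes


def break_even_minutes(fixed: Plan, metered: Plan) -> float:
    """Monthly minutes above which `fixed` becomes cheaper than `metered`."""
    rate_gap = metered.per_minute_usd - fixed.per_minute_usd
    if rate_gap <= 0:
        return float("inf")  # the metered plan never costs more per minute
    # Clamp at zero: a negative result means `fixed` is cheaper from the start.
    return max(0.0, (fixed.flat_monthly_usd - metered.flat_monthly_usd) / rate_gap)


# Per-minute endpoints quoted above for token-based billing.
metered_low = Plan("metered, low estimate", 0.0, 0.05)
metered_high = Plan("metered, high estimate", 0.0, 0.10)
# HYPOTHETICAL: a self-hosted GPU budget, not a figure from any source.
self_hosted = Plan("self-hosted (assumed $500/mo compute)", 500.0, 0.0)

if __name__ == "__main__":
    for minutes in (1_000, 5_000, 20_000):
        for plan in (metered_low, metered_high, self_hosted):
            cost = plan.monthly_cost(minutes)
            print(f"{minutes:>6} min  {plan.name:<40} ${cost:,.2f}")
```

Under these assumed numbers the self-hosted plan only pays off somewhere between 5,000 and 10,000 minutes a month, which is why a quoted rate alone cannot settle the comparison.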
Side-by-Side
| Dimension | Moshi | ElevenLabs | OpenAI Realtime API | Cartesia Sonic |
|---|---|---|---|---|
| Architecture | Native speech-to-speech (single model) | Pipelined (STT→LLM→TTS) | Native speech-to-speech (single model) | TTS-only (one pipeline stage) |
| Open vs. closed | Open-weight, self-hostable | Closed, hosted only | Closed, hosted only | Closed, hosted only |
| Latency (marketed) | ~160ms theoretical / ~200ms practical (L4 GPU) | ~500-800ms typical | Not published | Sonic 3.5: sub-90ms model latency |
| Latency (independently measured) | Not independently benchmarked in sources gathered | Not independently benchmarked in sources gathered | 300ms-2.3s (reasoning-effort dependent) | Coval production P50: 188ms, P75: 269ms |
| Pricing model | Free (self-host compute only) | Character/minute subscription tiers | Token-based (audio in/out) | Character/minute subscription tiers + separate Voice Agents rate |
| Independent-comparison strength | Self-host + latency | Emotional/voice-cloning quality | Conversational intelligence, tool-use | Raw TTS latency (marketed) |
| Voices / cloning | Two fixed voices (Moshika/Moshiko), no cloning | Yes, full library + cloning | 10 preset voices (Cedar, Marin + 8 legacy) | Yes, instant clone from 10-second sample |
| Deployment surfaces | PyTorch / MLX (Apple Silicon) / Rust runtimes | Dashboard, website widget, phone (Twilio) | WebRTC, WebSocket, SIP | REST/WebSocket API |
Which One, For What
- Need to self-host, control the model, or avoid per-minute vendor billing entirely → Moshi. The only option that runs on your own infrastructure; MLX runtime makes it genuinely usable on a local Mac.
- Need a full agent (persona + knowledge base + tool calls) with minimal build effort, and voice-cloning/emotional range matters → ElevenLabs. Its pipelined architecture and dashboard tooling (or the Claude-Code-configured path in this vault’s ElevenLabs article) offer the fastest path to a working sales/support agent.
- Need strong tool-calling reliability, MCP integration, or telephony (SIP) built in → OpenAI Realtime API. gpt-realtime-2’s parallel tool calls and remote MCP support are the most agent-native feature set of the four.
- Building your own pipeline and want the fastest TTS leg specifically, willing to bring your own STT+LLM → Cartesia Sonic, but benchmark the actual latency in production rather than trusting the marketed figure alone.
- Need free, offline, private transcription (not conversation) → whisper-cli — not a competitor in this table’s conversational-latency race, but the right default for batch/dictation/captioning work where no live back-and-forth is needed.
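Since the stated goal is to make stack choice a lookup, the list above can be written down as one. The following is a minimal sketch: the constraint keys and the `recommend` helper are hypothetical names invented for illustration, and the mapping simply restates the bullets above.

```python
# Constraint -> stack, restating the "Which One, For What" bullets.
# The constraint keys are hypothetical labels, not terms from any vendor.
RECOMMENDATION = {
    "self_host": "Moshi",
    "turnkey_agent_with_voice_cloning": "ElevenLabs",
    "tool_calling_mcp_or_sip": "OpenAI Realtime API",
    "fastest_tts_leg": "Cartesia Sonic",
    "offline_transcription": "whisper-cli",
}


def recommend(constraints: list[str]) -> list[str]:
    """Return the stack suggested for each constraint, in order, without repeats.

    More than one result means no single stack covers every stated
    constraint, so the constraints need to be prioritized first.
    """
    unknown = [c for c in constraints if c not in RECOMMENDATION]
    if unknown:
        raise KeyError(f"unknown constraints: {unknown}")
    picks: list[str] = []
    for constraint in constraints:
        stack = RECOMMENDATION[constraint]
        if stack not in picks:
            picks.append(stack)
    return picks


if __name__ == "__main__":
    print(recommend(["self_host"]))
    print(recommend(["tool_calling_mcp_or_sip", "fastest_tts_leg"]))
```

Returning a list rather than a single name is deliberate: when two constraints point at different stacks, the conflict stays visible instead of being resolved by an arbitrary priority order.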
Implementation
Tool/Service: N/A. This is a comparison article; see each linked article’s own Implementation section for setup details.
Integration notes: None of these four are strictly interchangeable at the API level. Switching stacks means re-architecting the STT/LLM/TTS boundary (or removing it entirely for Moshi/OpenAI Realtime). Prototype the conversational flow against one stack before assuming portability to another.
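One way to picture the boundary described in the integration notes is a thin interface that both architectures can satisfy. This is a sketch only: `VoiceBackend`, `PipelinedBackend`, and `NativeBackend` are hypothetical names, the callables are toy stand-ins rather than real vendor SDK calls, and a single request/response turn is a simplification of the streaming, full-duplex behavior real stacks expose.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class VoiceBackend(Protocol):
    """Hypothetical common interface: audio in, audio out, one turn at a time."""

    def respond(self, audio_in: bytes) -> bytes: ...


@dataclass
class PipelinedBackend:
    """STT -> LLM -> TTS, with each stage swappable independently."""
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)


@dataclass
class NativeBackend:
    """Speech-to-speech: one model, no text boundary to hook into."""
    model: Callable[[bytes], bytes]

    def respond(self, audio_in: bytes) -> bytes:
        return self.model(audio_in)


def run_turn(backend: VoiceBackend, audio_in: bytes) -> bytes:
    return backend.respond(audio_in)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any vendor dependency.
    pipelined = PipelinedBackend(
        stt=lambda audio: audio.decode("utf-8"),
        llm=lambda text: text.upper(),
        tts=lambda text: text.encode("utf-8"),
    )
    native = NativeBackend(model=lambda audio: audio[::-1])
    print(run_turn(pipelined, b"hello"))
    print(run_turn(native, b"hello"))
```

Anything that depends on the intermediate transcript (logging, moderation, prompt injection of retrieved context) lives inside `PipelinedBackend` and has no equivalent hook in `NativeBackend`, which is the portability gap the notes warn about.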
Related
- Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model
- Voice Agents with Claude Code + ElevenLabs (Nate Herk)
- OpenAI Realtime API — Native Speech-to-Speech Voice Agents
- Cartesia Sonic — Low-Latency Text-to-Speech (State-Space Model)
- whisper-cli (whisper.cpp) — Free Local Speech-to-Text
- Vapi Voice Agents (n8n) — the workflow-orchestration alternative to any of these four for building a voice agent without an ElevenLabs/Claude-Code-direct integration
- OpenClaw on Rabbit R1 — a hardware surface that could pair with Moshi or ElevenLabs at the model layer
Try It
- Match the “Which One, For What” section above against the actual constraint driving the decision (self-host requirement? tool-calling depth? voice-cloning fidelity? raw latency?) before defaulting to whichever vendor is best-known.
- If latency is the deciding factor, don’t trust any single marketed number in the table above — the Cartesia-vs-Coval gap shows why. Run a small production benchmark against the top 2 candidates in your actual network/geographic conditions.
- For a first build with minimal setup, the ElevenLabs + Claude Code recipe has the most complete end-to-end walkthrough already in this wiki.
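The benchmark suggested above can start as small as the harness below. It is a minimal sketch using only the standard library: `time_calls`, `percentile`, and `summarize` are hypothetical helper names, the nearest-rank percentile is one of several common definitions (a published benchmark may use another), and `call` is a placeholder for whatever request you issue against each candidate stack.

```python
import math
import time
from typing import Callable, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def time_calls(call: Callable[[], object], runs: int) -> list[float]:
    """Invoke `call` repeatedly and return wall-clock latencies in milliseconds."""
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        call()  # placeholder: one request to the stack under test
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


def summarize(latencies_ms: Sequence[float]) -> dict[str, float]:
    """Report the median plus tail percentiles, not just an average."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 75, 95)}


if __name__ == "__main__":
    # Stand-in workload; replace with a real request from your own network.
    samples = time_calls(lambda: time.sleep(0.01), runs=20)
    print(summarize(samples))
```

Reporting P75 and P95 alongside P50 matters because the gap between marketed and measured figures tends to show up in the tail rather than the median.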
Open Questions
- No source in this cluster independently benchmarked Moshi’s or ElevenLabs’ production latency the way Coval did for Cartesia — the comparison table’s “not independently benchmarked” cells are a real gap, not just an omission.
- Voice quality / naturalness (as opposed to latency and architecture) was not systematically compared across all four in any single source gathered — this table is latency/cost/architecture-focused, not a blind-listening-test result.
- Whether any of these four is actually the right fit for a WEO/OmniPresence dental-client voice-agent use case (e.g. appointment booking, intake) is unexplored — this article is a general-purpose comparison, not a WEO-specific recommendation.