Source: wiki synthesis: moshi-kyutai, voice-agents-elevenlabs-claude-code, openai-realtime-api, cartesia-sonic, whisper-cli

Four voice-tech stacks now live in this topic, each occupying a genuinely different point in the design space rather than competing head-on. This article synthesizes them side by side — architecture, latency, cost, and openness — so the choice of stack for a given use case is a lookup, not a fresh research pass each time. whisper-cli sits alongside as the free/offline reference point, though it isn’t a live-conversation competitor to the other three.

Key Takeaways

  • Two architectures, not four. Moshi and OpenAI’s Realtime API are native speech-to-speech — one model processes and generates audio directly, no separate STT/LLM/TTS stages. ElevenLabs and Cartesia Sonic are pipeline components: ElevenLabs is a full pipelined stack (STT→LLM→TTS) that also does the orchestration; Cartesia Sonic is TTS-only, one leg of a pipeline you assemble yourself.
  • Moshi is the only open, self-hostable option. Code MIT+Apache 2.0, weights CC-BY 4.0. Every other stack here is closed and hosted-only. This is the single sharpest dividing line in the group — it determines whether you can run the model on your own infrastructure at all.
  • No single “fastest” stack — it depends which latency number you trust. Moshi publishes ~200ms practical latency. Cartesia markets sub-90ms TTS latency but an independent production benchmark measured P50 188ms with a long tail past 269ms. OpenAI doesn’t publish a headline number at all — independent sources cite 300ms to 2.3 seconds depending on the reasoning-effort setting. ElevenLabs’ pipelined architecture typically runs 500-800ms end-to-end. Vendor-marketed and independently-measured numbers disagree often enough that any single figure should be treated as a starting hypothesis, not a fact, until benchmarked in your own conditions.
  • Pricing models aren’t directly comparable. Moshi is free (self-hosted compute cost only). OpenAI is token-based on audio duration (~0.05-0.10/min cached, blended). ElevenLabs and Cartesia are character-or-minute-based subscription tiers (Cartesia: 299/mo Scale; Voice Agents billed separately at $0.06/min+telephony). Converting between these requires knowing your actual usage pattern, not just a quoted rate.
  • Independent comparisons converge on a three-way, not four-way, tradeoff: OpenAI wins on conversational intelligence and tool-use reliability; ElevenLabs wins on emotional range and voice-cloning quality; Cartesia wins on raw TTS latency (when the marketed number holds). Moshi’s differentiator isn’t in that same competition — it’s the only one you can self-host, which matters for privacy/cost-at-scale rather than head-to-head conversational quality.

Side-by-Side

DimensionMoshiElevenLabsOpenAI Realtime APICartesia Sonic
ArchitectureNative speech-to-speech (single model)Pipelined (STT→LLM→TTS)Native speech-to-speech (single model)TTS-only (one pipeline stage)
Open vs. closedOpen-weight, self-hostableClosed, hosted onlyClosed, hosted onlyClosed, hosted only
Latency (marketed)~160ms theoretical / ~200ms practical (L4 GPU)~500-800ms typicalNot publishedSonic 3.5: sub-90ms model latency
Latency (independently measured)Not independently benchmarked in sources gatheredNot independently benchmarked in sources gathered300ms-2.3s (reasoning-effort dependent)Coval production P50: 188ms, P75: 269ms
Pricing modelFree (self-host compute only)Character/minute subscription tiersToken-based (audio in/out)Character/minute subscription tiers + separate Voice Agents rate
Independent-comparison strengthSelf-host + latencyEmotional/voice-cloning qualityConversational intelligence, tool-useRaw TTS latency (marketed)
Voice cloningTwo fixed voices (Moshika/Moshiko)Yes, full library + cloning10 voices (Cedar, Marin + 8 legacy)Yes, instant clone from 10-second sample
Deployment surfacesPyTorch / MLX (Apple Silicon) / Rust runtimesDashboard, website widget, phone (Twilio)WebRTC, WebSocket, SIPREST/WebSocket API

Which One, For What

  • Need to self-host, control the model, or avoid per-minute vendor billing entirely → Moshi. The only option that runs on your own infrastructure; MLX runtime makes it genuinely usable on a local Mac.
  • Need a full agent (persona + knowledge base + tool calls) with minimal build effort, and voice-cloning/emotional range matters → ElevenLabs. Its pipelined architecture and dashboard tooling (or the Claude-Code-configured path in this vault’s ElevenLabs article) is the fastest path to a working sales/support agent.
  • Need strong tool-calling reliability, MCP integration, or telephony (SIP) built in → OpenAI Realtime API. gpt-realtime-2’s parallel tool calls and remote MCP support are the most agent-native feature set of the four.
  • Building your own pipeline and want the fastest TTS leg specifically, willing to bring your own STT+LLM → Cartesia Sonic — but benchmark the actual latency in production rather than trusting the marketed figure alone.
  • Need free, offline, private transcription (not conversation)whisper-cli — not a competitor in this table’s conversational-latency race, but the right default for batch/dictation/captioning work where no live back-and-forth is needed.

Implementation

Tool/Service: N/A — this is a comparison article; see each linked article’s own Implementation section for setup details. Integration notes: None of these four are strictly interchangeable at the API level — switching stacks means re-architecting the STT/LLM/TTS boundary (or removing it entirely for Moshi/OpenAI Realtime). Prototype the conversational flow against one stack before assuming portability to another.

Try It

  1. Match the “Which One, For What” section above against the actual constraint driving the decision (self-host requirement? tool-calling depth? voice-cloning fidelity? raw latency?) before defaulting to whichever vendor is best-known.
  2. If latency is the deciding factor, don’t trust any single marketed number in the table above — the Cartesia-vs-Coval gap shows why. Run a small production benchmark against the top 2 candidates in your actual network/geographic conditions.
  3. For a first build with minimal setup, the ElevenLabs + Claude Code recipe has the most complete end-to-end walkthrough already in this wiki.

Open Questions

  • No source in this cluster independently benchmarked Moshi’s or ElevenLabs’ production latency the way Coval did for Cartesia — the comparison table’s “not independently benchmarked” cells are a real gap, not just an omission.
  • Voice quality / naturalness (as opposed to latency and architecture) was not systematically compared across all four in any single source gathered — this table is latency/cost/architecture-focused, not a blind-listening-test result.
  • Whether any of these four is actually the right fit for a WEO/OmniPresence dental-client voice-agent use case (e.g. appointment booking, intake) is unexplored — this article is a general-purpose comparison, not a WEO-specific recommendation.