Source: ai-research/cartesia-sonic-product-page-2026.md, ai-research/cartesia-sonic-announcement-blog.md, ai-research/cartesia-sonic-3-5-docs.md, ai-research/cartesia-pricing-2026.md, ai-research/coval-tts-providers-2026-benchmark.md, ai-research/marktechpost-tts-models-2026-benchmark.md
Sonic is Cartesia’s real-time text-to-speech (TTS-only) model line. It is not a full-duplex conversational model like Moshi or the OpenAI Realtime API; it is a single pipeline stage, typically paired with any LLM plus a separate STT stage, the same architectural role ElevenLabs occupies. Cartesia was founded by Albert Gu and Karan Goel, who (with Tri Dao) originated the S4 and Mamba state-space-model (SSM) architectures in academia before commercializing them. Sonic runs on Cartesia’s own SSM inference stack rather than a Transformer, which is the mechanical reason it markets itself specifically on latency.
Key Takeaways
- SSM architecture, not Transformer. Per Cartesia’s own launch claims (vs. a parameter-matched Transformer baseline): 20% lower validation perplexity, 2x lower WER, 1.5x lower time-to-first-audio, 2x faster inference, 4x higher throughput. SSM inference scales linearly with sequence length vs. Transformers’ quadratic scaling — the structural reason it stays fast under load.
- Version history moved fast: Sonic 1 (135ms model latency at launch) → Sonic 2 (40ms) → Sonic 3 (GA April 2026, ~90ms model latency / 190ms end-to-end per Cartesia) → Sonic 3.5 (May 2026, current recommended-stable release, sub-90ms marketed / ~82ms end-to-end time-to-first-audio per one independent benchmark), plus a Sonic Turbo variant (~40ms). Sonic briefly held #1 on the Artificial Analysis TTS leaderboard before being overtaken.
- Marketed latency and independently measured latency disagree; this is a real, unresolved gap, not a rounding difference. ^[ambiguous] Cartesia’s own marketing claims a “model latency” of under 90–100ms for Sonic 3.5. An independent production benchmark (Coval) measured Sonic 3 at P50 188ms, IQR 100ms, P75 269ms in practice: meaningfully higher and far more variable than the marketed figure, with a non-trivial share of requests exceeding the ~300ms threshold where conversational latency starts to feel laggy. Different sources also measure genuinely different things (model latency vs. time-to-first-byte vs. time-to-first-audio vs. end-to-end vs. production P50), so cross-vendor latency numbers, including the comparisons in this article, are not strictly apples-to-apples.
- The ranking flips depending on which number you read. Using the vendors’ own marketed figures, ElevenLabs Flash v2.5 (~75ms) beats Cartesia’s ~90ms claim. Using Coval’s independent production P50 data, Cartesia (188ms) beats ElevenLabs Turbo/Flash (264–288ms). Treat any single-number latency claim for any TTS vendor, including this one, with real skepticism until benchmarked in your own production conditions.
- Voice cloning: instant clone from a 10-second sample (per the product page; some secondary sources cite 3 seconds). Two tiers — Instant Voice Cloning (from the Pro plan) and Professional Voice Cloning (higher tiers, training fee).
- 42 languages natively, including 9 Indian languages. Sonic 3.5 added multilingual improvements (Hebrew, Japanese, Spanish, Hindi, German, Korean, French) and fixed long-session voice drift.
- Even a fast TTS leg doesn’t collapse a pipeline. Sonic makes the TTS stage faster, but in a pipelined stack (STT → LLM → TTS) the end-to-end budget per Coval’s analysis still runs 500–700ms — Sonic narrows one stage, it doesn’t eliminate the pipeline the way Moshi’s or OpenAI Realtime API’s single-model architecture does.
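The ranking flip and the spread in Coval's numbers can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the takeaways above; the dictionary and function names are illustrative, and the low end of the ElevenLabs range is used so the comparison is as favorable to ElevenLabs as the data allows. Note that the marketed figure is for Sonic 3.5 while the measured one is for Sonic 3, so this is a rough illustration rather than a like-for-like comparison.

```python
# Latency figures in milliseconds, as quoted in the takeaways above.
MARKETED_MS = {"Cartesia Sonic 3.5": 90, "ElevenLabs Flash v2.5": 75}
# Coval production P50; ElevenLabs Turbo/Flash is reported as a range.
MEASURED_P50_MS = {"Cartesia Sonic 3": 188, "ElevenLabs Turbo/Flash": (264, 288)}


def fastest(figures):
    """Return the name with the lowest latency, taking the low end of any range."""
    def low_end(value):
        return value[0] if isinstance(value, tuple) else value
    return min(figures, key=lambda name: low_end(figures[name]))


marketed_winner = fastest(MARKETED_MS)
measured_winner = fastest(MEASURED_P50_MS)

# Coval reports P75 = 269 ms and IQR = 100 ms for Sonic 3, so P25 = P75 - IQR.
sonic3_p25 = 269 - 100

# How far the measured P50 sits above the marketed ~90 ms figure.
gap_ratio = 188 / 90

print(marketed_winner, "|", measured_winner, "|", sonic3_p25, "|", round(gap_ratio, 2))
```

The two winners differ, the implied P25 is 169ms, and the measured P50 is roughly twice the marketed figure, which is the gap the takeaways describe.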
Pricing
| Tier | Price | Credits/mo | Notes |
|---|---|---|---|
| Free | $0 | 20,000 | — |
| Pro | … ($4 annual) | 100,000 | Adds commercial license + instant voice cloning |
| Startup | … ($37 annual) | 1,250,000 | Adds Professional voice cloning |
| Scale | … ($224 annual) | 8,000,000 | — |
| Enterprise | Custom | Custom | — |
Voice Agents (Cartesia’s “Line” product) bill separately: $0.06/min plus $0.014/min for telephony.
Implementation
Tool/Service: Cartesia Sonic (sonic-3.5 rolling-stable, dated snapshots like sonic-3.5-2026-05-04, or sonic-latest beta). REST/WebSocket API.
Setup: API key from Cartesia dashboard; third-party framework plugins exist, e.g. LiveKit’s livekit-agents[cartesia].
Cost: Free tier (20K credits/mo) up through the paid tiers listed under Pricing; Voice Agents (“Line”) at $0.06/min + telephony surcharge.
Integration notes: TTS-only — pair with a separate STT model (e.g. whisper-cli for local/offline, or a cloud STT) and any LLM to build a full pipelined voice agent, the same shape as an ElevenLabs-based stack. $100M raised late 2025 (Kleiner Perkins, Index, Lightspeed, NVIDIA) — a well-capitalized vendor, not an early-stage risk.
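The pipelined shape described above can be sketched as three swappable stages. This is a minimal illustration, not Cartesia's SDK: `VoicePipeline`, `pick_tts_model`, and the stub stages are hypothetical names, and the policy of pinning the dated snapshot in production is an assumption about how one might use the three model identifiers, not a documented recommendation.

```python
from dataclasses import dataclass
from typing import Callable

# Model identifiers exactly as named in the Tool/Service line above.
ROLLING_STABLE = "sonic-3.5"
DATED_SNAPSHOT = "sonic-3.5-2026-05-04"
BETA = "sonic-latest"


def pick_tts_model(environment: str) -> str:
    """Hypothetical policy: dated snapshot in production, rolling-stable in staging, beta elsewhere."""
    policy = {"production": DATED_SNAPSHOT, "staging": ROLLING_STABLE}
    return policy.get(environment, BETA)


@dataclass
class VoicePipeline:
    """One conversational turn as three separate stages: STT -> LLM -> TTS."""
    transcribe: Callable[[bytes], str]       # STT stage (e.g. a whisper-cli wrapper)
    respond: Callable[[str], str]            # any LLM
    synthesize: Callable[[str, str], bytes]  # TTS stage: (text, model_id) -> audio
    tts_model: str

    def turn(self, audio_in: bytes) -> bytes:
        user_text = self.transcribe(audio_in)
        reply_text = self.respond(user_text)
        return self.synthesize(reply_text, self.tts_model)


# Stub stages so the sketch runs offline; replace each with a real client.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    respond=lambda text: f"You said: {text}",
    synthesize=lambda text, model: f"[{model}] {text}".encode("utf-8"),
    tts_model=pick_tts_model("production"),
)
audio_out = pipeline.turn(b"hello")
print(audio_out)
```

Keeping the TTS stage behind a plain callable means a Sonic-backed `synthesize` can be swapped for an ElevenLabs-backed one without touching the STT or LLM stages, which is what makes a side-by-side evaluation practical.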
Related
- Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model — the open-weight, single-model full-duplex alternative to a Sonic-based pipeline
- OpenAI Realtime API — the other native-speech-to-speech option; unlike Sonic, doesn’t require pairing a separate STT/LLM stage
- ElevenLabs voice agents on Claude Code — the closest direct competitor: same pipelined-TTS role, closed/commercial, stronger on emotional/voice-cloning quality per independent comparisons
- Voice Agent Comparison — Moshi vs ElevenLabs vs OpenAI Realtime vs Cartesia Sonic — the full four-way comparison, including the marketed-vs-measured latency caveat in context
- whisper-cli (whisper.cpp) — a free/local STT stage that could pair with Sonic’s TTS stage in a self-hosted-leaning pipeline
Try It
- Before trusting any vendor’s marketed latency number (Cartesia’s or anyone else’s), test time-to-first-audio in your own production network conditions — the Coval benchmark shows marketed and measured latency can diverge by 2x or more.
- If building a pipelined voice agent, evaluate Sonic 3.5 against ElevenLabs Flash/Turbo on the specific latency percentile that matters for the use case (P50 vs P75 vs worst-case), not just the headline number.
- Use the Free tier (20,000 credits/mo) to benchmark voice quality and instant-cloning fidelity before committing to a paid tier.
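A vendor-neutral harness for the first two items might look like the sketch below. It assumes the TTS client can be wrapped as a function that takes text and yields audio chunks; `fake_tts_stream` is a stand-in for such a wrapper, not a real client, and the 300ms threshold is the "starts to feel laggy" figure cited earlier in this note.

```python
import statistics
import time
from typing import Callable, Iterable, List


def time_to_first_audio_ms(stream_fn: Callable[[str], Iterable[bytes]], text: str) -> float:
    """Milliseconds from issuing the request to receiving the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in stream_fn(text):
        if chunk:
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without producing audio")


def summarize(samples_ms: List[float]) -> dict:
    """Report the percentiles that matter, not just a single headline number."""
    q1, q2, q3 = statistics.quantiles(samples_ms, n=4)  # quartile cut points
    return {
        "p50": q2,
        "p75": q3,
        "iqr": q3 - q1,
        "max": max(samples_ms),
        "share_over_300ms": sum(s > 300 for s in samples_ms) / len(samples_ms),
    }


def fake_tts_stream(text: str) -> Iterable[bytes]:
    """Stand-in for a real streaming TTS call (hypothetical)."""
    time.sleep(0.01)    # simulated network + model latency
    yield b"\x00" * 320  # one chunk of silent audio


samples = [time_to_first_audio_ms(fake_tts_stream, "hello") for _ in range(8)]
report = summarize(samples)
print(report)
```

Running the same harness against each candidate from the same network location, with the same input text, removes most of the measurement-methodology differences that make published numbers hard to compare.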
Open Questions
- No source reconciled why Cartesia’s marketed model-latency and Coval’s measured production P50 diverge so widely (network conditions, request batching, geographic routing, and measurement methodology are all plausible factors, but none were confirmed in the sources gathered). ^[ambiguous]
- Whether Sonic’s Voice Agents (“Line”) product is a viable alternative to the DIY STT+LLM+TTS pipeline for any use case this vault tracks (e.g. dental-client intake) is unexplored.