Cartesia Sonic — Low-Latency Text-to-Speech (State-Space Model)

Source: ai-research/cartesia-sonic-product-page-2026.md, ai-research/cartesia-sonic-announcement-blog.md, ai-research/cartesia-sonic-3-5-docs.md, ai-research/cartesia-pricing-2026.md, ai-research/coval-tts-providers-2026-benchmark.md, ai-research/marktechpost-tts-models-2026-benchmark.md

Sonic is Cartesia’s real-time text-to-speech (TTS-only) model line. It is not a full-duplex conversational model like Moshi or the OpenAI Realtime API; it is a single pipeline stage, typically paired with any LLM plus a separate STT stage, the same architectural role ElevenLabs occupies. Cartesia was founded by Albert Gu and Karan Goel, who (with Tri Dao) originated the S4 and Mamba state-space-model (SSM) architectures in academia before commercializing them. Sonic runs on Cartesia’s own SSM inference stack rather than a Transformer, which is the architectural reason it markets itself specifically on latency.

Key Takeaways

SSM architecture, not Transformer. Per Cartesia’s own launch claims (vs. a parameter-matched Transformer baseline): 20% lower validation perplexity, 2x lower WER, 1.5x lower time-to-first-audio, 2x faster inference, 4x higher throughput. SSM inference scales linearly with sequence length vs. Transformers’ quadratic scaling — the structural reason it stays fast under load.
Version history moved fast: Sonic 1 (135ms model latency at launch) → Sonic 2 (40ms) → Sonic 3 (GA April 2026, ~90ms model latency / 190ms end-to-end per Cartesia) → Sonic 3.5 (May 2026, current recommended-stable release, sub-90ms marketed / ~82ms end-to-end time-to-first-audio per one independent benchmark) → a Sonic Turbo variant (~40ms). Sonic briefly held #1 on the Artificial Analysis TTS leaderboard before being overtaken.
Marketed latency and independently measured latency disagree, and the gap is real and unresolved rather than a rounding difference. ^[ambiguous] Cartesia’s own marketing claims a “model latency” of sub-90ms to 100ms for Sonic 3.5. An independent production benchmark (Coval) measured Sonic 3 at P50 188ms, IQR 100ms, P75 269ms in practice: meaningfully higher and far more variable than the marketed figure, with a non-trivial share of requests exceeding the ~300ms threshold where conversational latency starts to feel laggy. Different sources also measure genuinely different things (model latency vs. time-to-first-byte vs. time-to-first-audio vs. end-to-end vs. production P50), so cross-vendor latency numbers, including the comparisons in this article, are not strictly apples-to-apples.
The ranking flips depending on which number you read. Using each vendor’s marketed figures, ElevenLabs Flash v2.5 (~75ms) beats Cartesia’s ~90ms claim. Using Coval’s independent production P50 data, Cartesia (188ms) beats ElevenLabs Turbo/Flash (264–288ms). Treat any single-number latency claim for any TTS vendor, including this one, with real skepticism until you have benchmarked it under your own production conditions.
Voice cloning: instant clone from a 10-second sample (per the product page; some secondary sources cite 3 seconds). Two tiers — Instant Voice Cloning (from the Pro plan) and Professional Voice Cloning (higher tiers, training fee).
42 languages natively, including 9 Indian languages. Sonic 3.5 added multilingual improvements (Hebrew, Japanese, Spanish, Hindi, German, Korean, French) and fixed long-session voice drift.
Even a fast TTS leg doesn’t collapse a pipeline. Sonic makes the TTS stage faster, but in a pipelined stack (STT → LLM → TTS) the end-to-end budget per Coval’s analysis still runs 500–700ms — Sonic narrows one stage, it doesn’t eliminate the pipeline the way Moshi’s or OpenAI Realtime API’s single-model architecture does.
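
The percentile figures quoted above (P50, IQR, P75, share of requests over ~300ms) can be reproduced from your own latency samples. The following is a minimal sketch using only the Python standard library; the function name `summarize_latency`, the 300ms default threshold, and the sample values are illustrative assumptions, not Coval's methodology or data.

```python
from statistics import quantiles


def summarize_latency(samples_ms, threshold_ms=300.0):
    """Return quartiles, IQR, and the share of samples above threshold_ms."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # method="inclusive" treats the samples as the whole population,
    # so the quartiles never fall outside the observed min/max.
    p25, p50, p75 = quantiles(samples_ms, n=4, method="inclusive")
    over = sum(1 for s in samples_ms if s > threshold_ms)
    return {
        "p25": p25,
        "p50": p50,
        "p75": p75,
        "iqr": p75 - p25,
        "share_over_threshold": over / len(samples_ms),
    }


if __name__ == "__main__":
    # Made-up samples for illustration only; not benchmark data.
    samples = [120.0, 150.0, 180.0, 200.0, 260.0, 310.0, 400.0, 170.0]
    for name, value in summarize_latency(samples).items():
        print(f"{name}: {value}")
```

Reporting the IQR and the over-threshold share alongside P50 makes it harder for a single headline number to hide the variability the takeaways above describe.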

Pricing

Tier	Price	Credits/mo	Notes
Free	$0	20,000	—
Pro	$5/mo ($4 annual)	100,000	Adds commercial license + instant voice cloning
Startup	$49/mo ($37 annual)	1,250,000	Adds Professional voice cloning
Scale	$299/mo ($224 annual)	8,000,000	—
Enterprise	Custom	Custom	—

Voice Agents (Cartesia’s “Line” product) bill separately: $0.06/min + $0.014/min for telephony.
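
To compare tiers, it can help to turn the figures above into an effective price per 1,000 credits and a per-call-minute cost. The sketch below is plain arithmetic on the monthly (non-annual) prices and the per-minute rates listed in this section. How a credit maps to characters or seconds of audio is not stated in the sources, so the calculation stays in credits; the function names are illustrative.

```python
# Monthly price (USD) and credits per month, copied from the pricing table.
PLANS = {
    "Pro": (5.00, 100_000),
    "Startup": (49.00, 1_250_000),
    "Scale": (299.00, 8_000_000),
}

AGENT_RATE_PER_MIN = 0.06       # Line voice agents, USD per minute
TELEPHONY_RATE_PER_MIN = 0.014  # telephony surcharge, USD per minute


def usd_per_1k_credits(plan):
    """Effective monthly price per 1,000 credits for a paid plan."""
    price, credits = PLANS[plan]
    return price / credits * 1000


def voice_agent_cost(minutes, telephony=True):
    """Cost in USD of `minutes` of Line usage, optionally with telephony."""
    rate = AGENT_RATE_PER_MIN + (TELEPHONY_RATE_PER_MIN if telephony else 0.0)
    return minutes * rate


if __name__ == "__main__":
    for plan in PLANS:
        print(f"{plan}: ${usd_per_1k_credits(plan):.4f} per 1K credits")
    print(f"1,000 agent minutes with telephony: ${voice_agent_cost(1000):.2f}")
```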

Implementation

Tool/Service: Cartesia Sonic (sonic-3.5 rolling-stable, dated snapshots like sonic-3.5-2026-05-04, or sonic-latest beta). REST/WebSocket API.
Setup: API key from the Cartesia dashboard; third-party framework plugins exist, e.g. LiveKit’s livekit-agents[cartesia].
Cost: Free tier (20K credits/mo) up through the $299/mo Scale tier; Voice Agents billed separately at $0.06/min + telephony surcharge.
Integration notes: TTS-only. Pair with a separate STT model (e.g. whisper-cli for local/offline, or a cloud STT) and any LLM to build a full pipelined voice agent, the same shape as an ElevenLabs-based stack. $100M raised late 2025 (Kleiner Perkins, Index, Lightspeed, NVIDIA), so this is a well-capitalized vendor, not an early-stage risk.
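
The pipelined shape described in the integration notes (STT, then LLM, then TTS) can be sketched as three interchangeable callables with per-stage timing. Everything here is hypothetical scaffolding: `run_turn` and the stub stages are illustrative names, no vendor SDK is called, and a production agent would stream between stages rather than run them strictly one after another.

```python
import time
from typing import Callable, Dict, Tuple


def run_turn(
    audio_in: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Tuple[bytes, Dict[str, float]]:
    """Run one STT -> LLM -> TTS turn and record each stage's wall time in ms."""
    timings: Dict[str, float] = {}

    def timed(name, fn, arg):
        t0 = time.perf_counter()
        out = fn(arg)
        timings[name] = (time.perf_counter() - t0) * 1000.0
        return out

    text = timed("stt", stt, audio_in)
    reply = timed("llm", llm, text)
    audio_out = timed("tts", tts, reply)
    timings["total"] = timings["stt"] + timings["llm"] + timings["tts"]
    return audio_out, timings


# Stub stages so the sketch runs without any vendor SDK or network access.
def stub_stt(audio: bytes) -> str:
    return "hello"


def stub_llm(text: str) -> str:
    return text.upper()


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    audio, timings = run_turn(b"\x00\x01", stub_stt, stub_llm, stub_tts)
    print(audio, timings)
```

Keeping each stage behind a plain callable means the TTS leg can be swapped (for example Sonic vs. another provider) while the per-stage timings stay comparable.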

Related

Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model — the open-weight, single-model full-duplex alternative to a Sonic-based pipeline
OpenAI Realtime API — the other native-speech-to-speech option; unlike Sonic, doesn’t require pairing a separate STT/LLM stage
ElevenLabs voice agents on Claude Code — the closest direct competitor: same pipelined-TTS role, closed/commercial, stronger on emotional/voice-cloning quality per independent comparisons
Voice Agent Comparison — Moshi vs ElevenLabs vs OpenAI Realtime vs Cartesia Sonic — the full four-way comparison, including the marketed-vs-measured latency caveat in context
whisper-cli (whisper.cpp) — a free/local STT stage that could pair with Sonic’s TTS stage in a self-hosted-leaning pipeline

Try It

Before trusting any vendor’s marketed latency number (Cartesia’s or anyone else’s), test time-to-first-audio in your own production network conditions — the Coval benchmark shows marketed and measured latency can diverge by 2x or more.
If building a pipelined voice agent, evaluate Sonic 3.5 against ElevenLabs Flash/Turbo on the specific latency percentile that matters for the use case (P50 vs P75 vs worst-case), not just the headline number.
Use the Free tier (20,000 credits/mo) to benchmark voice quality and instant-cloning fidelity before committing to a paid tier.
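
A minimal harness for the first suggestion above might look like the following. It times the gap between issuing a request and receiving the first non-empty audio chunk from any iterable of byte chunks. `fake_tts_stream` is a hypothetical stand-in that simulates latency with `time.sleep`; it is not Cartesia's client API, and a real test would replace it with a function that opens an actual streaming request from your production network.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio_ms(start_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from issuing the request to the first non-empty audio chunk."""
    t0 = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # ignore empty keep-alive chunks
            return (time.perf_counter() - t0) * 1000.0
    raise RuntimeError("stream ended without producing audio")


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a real streaming TTS call."""
    time.sleep(delay_s)   # simulates network plus model latency
    yield b"\x00" * 320   # first audio chunk
    yield b"\x00" * 320   # later chunks are never consumed by the timer


if __name__ == "__main__":
    runs = [time_to_first_audio_ms(fake_tts_stream) for _ in range(5)]
    print([round(r, 1) for r in runs])
```

Collect enough runs to report percentiles rather than a single figure, and keep the measurement point (first audio chunk at the client) fixed when comparing vendors.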

Open Questions

No source reconciled why Cartesia’s marketed model-latency and Coval’s measured production P50 diverge so widely (network conditions, request batching, geographic routing, and measurement methodology are all plausible factors, but none were confirmed in the sources gathered). ^[ambiguous]
Whether Cartesia’s Voice Agents (“Line”) product is a viable alternative to the DIY STT+LLM+TTS pipeline for any use case this vault tracks (e.g. dental-client intake) is unexplored.
