Source: ai-research/cartesia-sonic-product-page-2026.md, ai-research/cartesia-sonic-announcement-blog.md, ai-research/cartesia-sonic-3-5-docs.md, ai-research/cartesia-pricing-2026.md, ai-research/coval-tts-providers-2026-benchmark.md, ai-research/marktechpost-tts-models-2026-benchmark.md

Sonic is Cartesia’s real-time text-to-speech (TTS) model line. It is not a full-duplex conversational model like Moshi or the OpenAI Realtime API; it is a single pipeline stage, typically paired with any LLM plus a separate STT stage, the same architectural role ElevenLabs occupies. Cartesia was founded by Albert Gu and Karan Goel, who (with Tri Dao) originated the S4 and Mamba state-space-model (SSM) architectures in academia before commercializing them. Sonic runs on Cartesia’s own SSM inference stack rather than a Transformer, which is the architectural reason it markets itself specifically on latency.
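To make the SSM-versus-Transformer point concrete, here is a minimal toy sketch of a scalar linear state-space recurrence, alongside a rough count of work per sequence. It is not Cartesia's architecture; the function names, the coefficients `a`, `b`, `c`, and the abstract "work units" are all assumptions chosen for illustration.

```python
def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Run a toy scalar state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    The state h is a single number, so each step costs the same
    no matter how long the sequence already is.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


def step_costs(seq_len):
    """Count abstract work units for a sequence of seq_len tokens.

    recurrent: one fixed-size state update per token.
    attention: token t looks back at all t tokens so far,
               which sums to seq_len * (seq_len + 1) / 2.
    """
    recurrent = seq_len
    attention = seq_len * (seq_len + 1) // 2
    return recurrent, attention


if __name__ == "__main__":
    print(ssm_scan([1.0, 0.0, 0.0]))
    for n in (10, 100, 1000):
        print(n, step_costs(n))
```

The recurrent count grows linearly with sequence length while the attention count grows quadratically, which is the scaling argument the vendor's latency claims rest on.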

Key Takeaways

  • SSM architecture, not Transformer. Per Cartesia’s own launch claims (vs. a parameter-matched Transformer baseline): 20% lower validation perplexity, 2x lower WER, 1.5x lower time-to-first-audio, 2x faster inference, 4x higher throughput. SSM inference scales linearly with sequence length vs. Transformers’ quadratic scaling — the structural reason it stays fast under load.
  • Version history moved fast: Sonic 1 (135ms model latency at launch) → Sonic 2 (40ms) → Sonic 3 (GA April 2026, ~90ms model latency / 190ms end-to-end per Cartesia) → Sonic 3.5 (May 2026, current recommended-stable release, sub-90ms marketed / ~82ms end-to-end time-to-first-audio per one independent benchmark) → a Sonic Turbo variant (~40ms). Sonic briefly held #1 on the Artificial Analysis TTS leaderboard before being overtaken.
  • Marketed latency and independently-measured latency disagree — this is a real, unresolved gap, not a rounding difference. ^[ambiguous] Cartesia’s own marketing claims sub-90–100ms “model latency” for Sonic 3.5. An independent production benchmark (Coval) measured Sonic 3 at P50 188ms, IQR 100ms, P75 269ms in practice — meaningfully higher and far more variable than the marketed figure, with a non-trivial share of requests exceeding the ~300ms threshold where conversational latency starts to feel laggy. Different sources also measure genuinely different things (model-latency vs. time-to-first-byte vs. time-to-first-audio vs. end-to-end vs. production P50), so cross-vendor latency numbers — including the comparisons in this article — are not strictly apples-to-apples.
  • The ranking flips depending on which number you read. On each vendor’s own marketed figures, ElevenLabs Flash v2.5 (~75ms) beats Cartesia’s ~90ms claim. On Coval’s independent production P50 data, Cartesia (188ms, measured on Sonic 3) beats ElevenLabs Turbo/Flash (264–288ms). Treat any single-number latency claim for any TTS vendor, including this one, with skepticism until it is benchmarked in your own production conditions.
  • Voice cloning: instant clone from a 10-second sample (per the product page; some secondary sources cite 3 seconds). Two tiers — Instant Voice Cloning (from the Pro plan) and Professional Voice Cloning (higher tiers, training fee).
  • 42 languages natively, including 9 Indian languages. Sonic 3.5 added multilingual improvements (Hebrew, Japanese, Spanish, Hindi, German, Korean, French) and fixed long-session voice drift.
  • Even a fast TTS leg doesn’t collapse a pipeline. Sonic makes the TTS stage faster, but in a pipelined stack (STT → LLM → TTS) the end-to-end budget per Coval’s analysis still runs 500–700ms — Sonic narrows one stage, it doesn’t eliminate the pipeline the way Moshi’s or OpenAI Realtime API’s single-model architecture does.
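The takeaways above lean on percentile statistics (P50, P75, IQR) and on the share of requests above a ~300ms threshold. The sketch below shows one way to compute those from your own latency samples. The function names are hypothetical, and the linear-interpolation percentile is an assumption; Coval's exact method is not stated in the sources.

```python
import math


def percentile(sorted_samples, pct):
    """Linear-interpolated percentile (0-100) of an ascending-sorted list."""
    if not sorted_samples:
        raise ValueError("need at least one sample")
    rank = (pct / 100) * (len(sorted_samples) - 1)
    lo = math.floor(rank)
    hi = math.ceil(rank)
    frac = rank - lo
    return sorted_samples[lo] + (sorted_samples[hi] - sorted_samples[lo]) * frac


def summarize_latency(samples_ms, laggy_threshold_ms=300.0):
    """Summarize latency samples (milliseconds) the way the benchmarks report them."""
    ordered = sorted(samples_ms)
    p25 = percentile(ordered, 25)
    p50 = percentile(ordered, 50)
    p75 = percentile(ordered, 75)
    over = sum(1 for s in ordered if s > laggy_threshold_ms)
    return {
        "p50_ms": p50,
        "p75_ms": p75,
        "iqr_ms": p75 - p25,
        "share_over_threshold": over / len(ordered),
    }


if __name__ == "__main__":
    # Made-up sample values, only to show the call shape.
    print(summarize_latency([100, 200, 300, 400, 500]))
```

Comparing vendors on the same percentile, computed the same way over the same kind of measurement, avoids the apples-to-oranges problem described above.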

Pricing

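| Tier | Price | Credits/mo | Notes |
| --- | --- | --- | --- |
| Free | $0 | 20,000 | |
| Pro | … ($4 annual) | 100,000 | Adds commercial license + instant voice cloning |
| Startup | … ($37 annual) | 1,250,000 | Adds Professional voice cloning |
| Scale | … ($224 annual) | 8,000,000 | |
| Enterprise | Custom | Custom | |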

Voice Agents (Cartesia’s “Line” product) bill separately: $0.06/min plus $0.014/min for telephony.
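Per-minute billing for voice agents reduces to simple arithmetic. The sketch below is a hypothetical helper, not part of any Cartesia SDK. Its default rates are the figures quoted in this note (0.06/min base, 0.014/min telephony), read here as US dollars, which is an assumption; confirm against the current pricing page before budgeting.

```python
def voice_agent_cost(minutes, base_per_min=0.06, telephony_per_min=0.014,
                     uses_telephony=True):
    """Estimate voice-agent spend for a number of call minutes.

    The default rates are assumptions taken from this note's figures.
    """
    rate = base_per_min
    if uses_telephony:
        rate += telephony_per_min
    # Round to 4 decimals to hide floating-point noise in the result.
    return round(minutes * rate, 4)


if __name__ == "__main__":
    print(voice_agent_cost(1000))                        # with telephony
    print(voice_agent_cost(1000, uses_telephony=False))  # without telephony
```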

Implementation

Tool/Service: Cartesia Sonic (sonic-3.5 rolling-stable, dated snapshots like sonic-3.5-2026-05-04, or sonic-latest beta). REST/WebSocket API.

Setup: API key from the Cartesia dashboard; third-party framework plugins exist, e.g. LiveKit’s livekit-agents[cartesia].

Cost: Free tier (20K credits/mo) up through the paid tiers; Voice Agents bill at $0.06/min plus a telephony surcharge.

Integration notes: TTS-only. Pair it with a separate STT model (e.g. whisper-cli for local/offline, or a cloud STT) and any LLM to build a full pipelined voice agent, the same shape as an ElevenLabs-based stack. Cartesia raised $100M in late 2025 (Kleiner Perkins, Index, Lightspeed, NVIDIA), so it is a well-capitalized vendor rather than an early-stage risk.
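The pipelined shape described here (STT → LLM → TTS) can be sketched as three pluggable stages with per-stage timing. Everything below is hypothetical: the stage callables are stand-ins rather than Cartesia, Whisper, or LLM client calls, and a production agent would stream between stages instead of running them strictly in sequence.

```python
import time
from typing import Callable, Dict, Tuple


def run_turn(audio_in: bytes,
             stt: Callable[[bytes], str],
             llm: Callable[[str], str],
             tts: Callable[[str], bytes],
             clock: Callable[[], float] = time.perf_counter,
             ) -> Tuple[bytes, Dict[str, float]]:
    """Run one conversational turn and report how long each stage took."""
    t0 = clock()
    transcript = stt(audio_in)   # speech -> text
    t1 = clock()
    reply = llm(transcript)      # text -> text
    t2 = clock()
    audio_out = tts(reply)       # text -> speech (the stage Sonic would fill)
    t3 = clock()
    timings = {
        "stt_ms": (t1 - t0) * 1000,
        "llm_ms": (t2 - t1) * 1000,
        "tts_ms": (t3 - t2) * 1000,
        "total_ms": (t3 - t0) * 1000,
    }
    return audio_out, timings


if __name__ == "__main__":
    # Stub stages, used only to show the call shape.
    audio, timings = run_turn(
        b"\x00\x01",
        stt=lambda audio: "hello",
        llm=lambda text: text.upper(),
        tts=lambda text: text.encode("utf-8"),
    )
    print(audio, timings)
```

Keeping the stages as plain callables makes it straightforward to swap one TTS vendor for another and compare the `tts_ms` share of the total budget.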

Try It

  1. Before trusting any vendor’s marketed latency number (Cartesia’s or anyone else’s), test time-to-first-audio in your own production network conditions — the Coval benchmark shows marketed and measured latency can diverge by 2x or more.
  2. If building a pipelined voice agent, evaluate Sonic 3.5 against ElevenLabs Flash/Turbo on the specific latency percentile that matters for the use case (P50 vs P75 vs worst-case), not just the headline number.
  3. Use the Free tier (20,000 credits/mo) to benchmark voice quality and instant-cloning fidelity before committing to a paid tier.
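For step 1, a minimal sketch of timing time-to-first-audio against any streaming TTS source. `measure_ttfa` and `fake_stream` are hypothetical names; in practice `open_stream` would wrap whichever vendor client yields audio chunks, which is not shown here because no client API is documented in the sources.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Optional


def measure_ttfa(open_stream: Callable[[], Iterable[bytes]],
                 clock: Callable[[], float] = time.perf_counter,
                 ) -> Dict[str, Optional[float]]:
    """Time the first non-empty audio chunk and the full stream.

    The clock starts before the stream is opened, so connection
    setup counts toward time-to-first-audio.
    """
    start = clock()
    ttfa_ms = None
    total_bytes = 0
    for chunk in open_stream():
        if ttfa_ms is None and chunk:
            ttfa_ms = (clock() - start) * 1000
        total_bytes += len(chunk)
    total_ms = (clock() - start) * 1000
    return {"ttfa_ms": ttfa_ms, "total_ms": total_ms, "bytes": total_bytes}


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a vendor stream: two chunks with artificial delays."""
    time.sleep(0.05)
    yield b"\x00" * 320
    time.sleep(0.01)
    yield b"\x00" * 320


if __name__ == "__main__":
    print(measure_ttfa(fake_stream))
```

Running this many times from the production network and region, then summarizing the results by percentile, gives numbers that are comparable across vendors.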

Open Questions

  • No source reconciled why Cartesia’s marketed model-latency and Coval’s measured production P50 diverge so widely (network conditions, request batching, geographic routing, and measurement methodology are all plausible factors, but none were confirmed in the sources gathered). ^[ambiguous]
  • Whether Cartesia’s Voice Agents (“Line”) product is a viable alternative to the DIY STT+LLM+TTS pipeline for any use case this vault tracks (e.g. dental-client intake) is unexplored.