OpenAI Realtime API — Native Speech-to-Speech Voice Agents

Source: ai-research/openai-realtime-api-gpt-realtime-launch-2025.md, ai-research/openai-realtime-api-gpt-realtime-2-launch-2026.md, ai-research/openai-realtime-api-docs-overview-2026.md, ai-research/openai-realtime-vs-elevenlabs-cartesia-gemini-comparison-2026.md, ai-research/getstream-top-6-realtime-speech-to-speech-apis-2025.md

The OpenAI Realtime API is OpenAI’s product for building live voice agents — a single model processes incoming audio and generates audio output directly, the same architectural category as Moshi and the opposite of pipelined stacks like ElevenLabs Conversational AI (STT → LLM → TTS). It reached general availability August 28, 2025 (after a public beta since October 2024) on the gpt-realtime model, with a successor generation — gpt-realtime-2, plus siblings gpt-realtime-translate and gpt-realtime-whisper — shipping May 7, 2026.

Key Takeaways

Native speech-to-speech, not a pipeline. OpenAI’s own framing: “Unlike traditional pipelines that chain together multiple models across speech-to-text and text-to-speech, the Realtime API processes and generates audio directly through a single model and API.” This puts it in the same architectural category as Moshi and squarely against ElevenLabs’ and Cartesia’s pipelined/TTS-only approach.
No published headline latency number — unlike Moshi’s ~200ms or Cartesia’s marketed sub-100ms TTS latency, OpenAI doesn’t publish a single millisecond figure. Independent sources describe “very low” latency with strong barge-in/interruption handling; one comparison piece cites first-audio latency of 300ms to 2.3 seconds depending on the reasoning-effort level set for the session — a latency/intelligence tradeoff gpt-realtime-2 introduced that didn’t exist in the original model.
Token-based pricing, not per-minute. $32 per 1M audio input tokens ($0.40 cached), $64 per 1M audio output tokens, flat across gpt-realtime and gpt-realtime-2. User audio = 1 token per 100ms; assistant audio = 1 token per 50ms (so 60 seconds of user speech ≈ 600 tokens; 60 seconds of TTS output ≈ 1,200 tokens). Blended real-world cost is roughly $0.18–0.46/minute uncached, $0.05–0.10/minute with prompt caching.
gpt-realtime-2 (May 2026) added: adjustable reasoning effort (minimal/low/medium/high/xhigh, low default), 128K context (up from 32K), parallel tool calls with audible “tool transparency” phrases, remote MCP server support, image input mid-session, SIP phone calling, graceful-failure recovery phrases, and two new voices (Cedar and Marin, alongside 8 legacy voices).
Three integration transports: WebRTC (recommended for browser/mobile — native VAD and echo-cancellation, near-zero added latency), WebSocket (server-side agents and telephony bridges — you handle base64 PCM16 audio and VAD yourself), and SIP (direct PSTN/PBX telephony). The official path is the Agents SDK (RealtimeAgent/RealtimeSession via @openai/agents/realtime), which wraps ephemeral-key auth and WebRTC.
Sibling models are per-minute, not token-based: gpt-realtime-translate ($0.034/min, 70+ languages in / 13 out) and gpt-realtime-whisper ($0.017/min) are narrower, cheaper models for translation-only and transcription-only use cases respectively, distinct from the full conversational gpt-realtime line.
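The pricing arithmetic above can be checked with a short sketch. The rates and token durations are the ones quoted in this article; the function and constant names are illustrative only. It counts newly spoken audio only, so treat the result as a lower bound: it does not model conversation history being billed again as input on later turns, which is one plausible reason the blended per-minute figures quoted above are higher.

```python
# Rates quoted in this article (USD per 1M audio tokens).
AUDIO_INPUT_PER_M = 32.00
AUDIO_OUTPUT_PER_M = 64.00

# Token durations quoted in this article.
USER_MS_PER_TOKEN = 100       # user audio: 1 token per 100 ms
ASSISTANT_MS_PER_TOKEN = 50   # assistant audio: 1 token per 50 ms

# Per-minute sibling models (USD per minute), as quoted in this article.
SIBLING_PER_MIN = {
    "gpt-realtime-translate": 0.034,
    "gpt-realtime-whisper": 0.017,
}


def audio_tokens(seconds: float, ms_per_token: int) -> int:
    """Convert an audio duration into a token count at the given rate."""
    return round(seconds * 1000 / ms_per_token)


def fresh_audio_cost(user_seconds: float, assistant_seconds: float) -> float:
    """USD cost of newly spoken audio only (no cached or re-billed history)."""
    in_tokens = audio_tokens(user_seconds, USER_MS_PER_TOKEN)
    out_tokens = audio_tokens(assistant_seconds, ASSISTANT_MS_PER_TOKEN)
    return (in_tokens * AUDIO_INPUT_PER_M + out_tokens * AUDIO_OUTPUT_PER_M) / 1_000_000


def sibling_cost(model: str, minutes: float) -> float:
    """USD cost for a per-minute sibling model."""
    return SIBLING_PER_MIN[model] * minutes


if __name__ == "__main__":
    print(audio_tokens(60, USER_MS_PER_TOKEN))        # user tokens in one minute
    print(audio_tokens(60, ASSISTANT_MS_PER_TOKEN))   # assistant tokens in one minute
    print(f"${fresh_audio_cost(30, 30):.4f} for 30 s of each speaker")
```

Under these rates a minute of assistant speech costs about four times a minute of user speech (twice the tokens at twice the price), so assistant verbosity is the main cost lever in this simplified model.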

How It Compares

Independent sources frame this as a three-way tradeoff rather than a strict ranking:

Dimension	OpenAI Realtime API	Moshi	ElevenLabs	Cartesia Sonic
Architecture	Native speech-to-speech	Native speech-to-speech	Pipelined (STT→LLM→TTS)	TTS-only (one pipeline stage)
Open vs. closed	Closed, hosted only	Open-weight, self-hostable	Closed, hosted only	Closed, hosted only
Headline strength (per independent comparisons)	Conversational intelligence, tool-use reliability	Latency + self-host	Emotional/voice-cloning quality	Raw TTS latency
Pricing model	Token-based (audio in/out)	Free (self-hosted compute cost only)	Character/minute-based	Character/minute-based

OpenAI Realtime and Gemini Flash Live are consistently grouped as native S2S systems, while ElevenLabs and Cartesia are grouped as pipeline/TTS-based, per the independent comparison sources gathered here. The sharpest single-axis contrast is open vs. closed: Moshi is the only self-hostable, open-weight option among the four voice stacks this topic covers.

Implementation

Tool/Service: OpenAI Realtime API, gpt-realtime / gpt-realtime-2, endpoint /v1/realtime.
Setup: Ephemeral secrets via POST /v1/realtime/client_secrets; official Agents SDK (@openai/agents/realtime) recommended over raw WebRTC/WebSocket handling for most use cases.
Cost: $32/1M audio input tokens ($0.40 cached), $64/1M audio output tokens; gpt-realtime-translate $0.034/min; gpt-realtime-whisper $0.017/min.
Integration notes: WebRTC for browser/mobile clients (recommended default), WebSocket for server-side/telephony-bridge agents, SIP for direct phone-system integration. The default reasoning effort is low; raising it trades latency for response quality, a tuning knob unique to gpt-realtime-2 among the voice stacks in this topic.
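A minimal sketch of the two pieces the raw (non-SDK) path makes you handle yourself: minting an ephemeral secret, and packaging PCM16 audio for the WebSocket transport. Assumptions, not confirmed by this article: the api.openai.com base URL, the JSON body shape of the client_secrets request, and the input_audio_buffer.append event name with a base64 audio field. Check each against the current API reference before relying on it. The function names are hypothetical, and the request is only built here, never sent.

```python
import base64
import json
import struct
import urllib.request

API_BASE = "https://api.openai.com"  # assumption: standard OpenAI API host


def build_client_secret_request(api_key: str, model: str = "gpt-realtime") -> urllib.request.Request:
    """Build (but do not send) the POST that mints an ephemeral client secret."""
    # Assumed body shape; verify against the API reference.
    body = {"session": {"type": "realtime", "model": model}}
    return urllib.request.Request(
        url=f"{API_BASE}/v1/realtime/client_secrets",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def pcm16_append_event(samples: list[int]) -> str:
    """Pack signed 16-bit samples as little-endian PCM16 and wrap them in a
    JSON event carrying the audio as base64 (assumed event shape)."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


if __name__ == "__main__":
    req = build_client_secret_request("sk-your-server-side-key")
    print(req.get_method(), req.full_url)
    print(pcm16_append_event([0, 1, -1]))
```

The intent of the ephemeral-secret step is that the long-lived API key stays on the server and only the short-lived secret reaches the browser or mobile client. The Agents SDK wraps this flow, which is why the article recommends starting there.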

Related

Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model — the open-weight, self-hostable counterpart in the same native-S2S architecture category
ElevenLabs voice agents on Claude Code — the pipelined (STT→LLM→TTS) commercial alternative, strongest on voice-cloning/emotional quality
Cartesia Sonic — a pipelined TTS-only model optimized specifically for the text-to-speech leg’s latency
Voice Agent Comparison — Moshi vs ElevenLabs vs OpenAI Realtime vs Cartesia Sonic — the full four-way comparison this article feeds into
whisper-cli (whisper.cpp) — the free/offline/non-conversational counterpart for batch transcription use cases this API isn’t priced or designed for

Try It

Read the Agents SDK quickstart (@openai/agents/realtime) before touching raw WebRTC/WebSocket handling — it wraps the ephemeral-key auth flow that’s otherwise easy to get wrong.
If building a browser-based voice feature, default to the WebRTC transport for the built-in VAD/echo-cancellation; reserve WebSocket for server-side or telephony-bridge use cases.
Test the low vs high reasoning-effort settings on gpt-realtime-2 for a target use case and measure the actual latency delta before committing to a tier — the 300ms-to-2.3s range is wide enough to matter for UX.
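For the reasoning-effort comparison in the last step, a small harness like the following could measure time to first audio chunk per setting. It is a sketch, not tied to any SDK: open_stream is a hypothetical callable you would write yourself, taking an effort label and returning a lazy iterator of audio chunks that starts the request on first iteration (if it starts earlier, the timing below would miss part of the latency).

```python
import statistics
import time
from typing import Callable, Iterable


def time_to_first_chunk(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Seconds from starting to consume `stream` until its first chunk arrives."""
    start = clock()
    for _ in stream:
        return clock() - start
    raise ValueError("stream produced no audio chunks")


def compare_efforts(
    open_stream: Callable[[str], Iterable[bytes]],
    efforts: tuple = ("low", "high"),
    runs: int = 5,
) -> dict:
    """Median time-to-first-chunk per reasoning-effort setting."""
    return {
        effort: statistics.median(
            time_to_first_chunk(open_stream(effort)) for _ in range(runs)
        )
        for effort in efforts
    }


if __name__ == "__main__":
    # Stand-in stream that simulates latency; replace with a real session.
    def fake_stream(effort: str):
        time.sleep(0.01 if effort == "low" else 0.05)
        yield b"\x00\x00"

    for effort, seconds in compare_efforts(fake_stream, runs=3).items():
        print(f"{effort}: {seconds * 1000:.0f} ms to first chunk")
```

The median is used rather than the mean so that a single slow run (cold connection, network hiccup) does not dominate a small sample. Run it against the same prompt and transport for each setting so that only the effort level varies.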

Open Questions

No independently measured production latency benchmark (analogous to the Coval benchmark used for Cartesia Sonic) was found for the OpenAI Realtime API; the 300ms–2.3s figure comes from a single comparison source and is not corroborated.
Whether gpt-realtime-2’s SIP/telephony support has been used in any dental/marketing/agency context this vault tracks is unestablished — worth revisiting if a voice-agent-for-client-intake use case comes up (see GoHighLevel for the adjacent CRM/voice-AI surface).
