Source: ai-research/openai-realtime-api-gpt-realtime-launch-2025.md, ai-research/openai-realtime-api-gpt-realtime-2-launch-2026.md, ai-research/openai-realtime-api-docs-overview-2026.md, ai-research/openai-realtime-vs-elevenlabs-cartesia-gemini-comparison-2026.md, ai-research/getstream-top-6-realtime-speech-to-speech-apis-2025.md
The OpenAI Realtime API is OpenAI’s product for building live voice agents — a single model processes incoming audio and generates audio output directly, the same architectural category as Moshi and the opposite of pipelined stacks like ElevenLabs Conversational AI (STT → LLM → TTS). It reached general availability August 28, 2025 (after a public beta since October 2024) on the gpt-realtime model, with a successor generation — gpt-realtime-2, plus siblings gpt-realtime-translate and gpt-realtime-whisper — shipping May 7, 2026.
Key Takeaways
- Native speech-to-speech, not a pipeline. OpenAI’s own framing: “Unlike traditional pipelines that chain together multiple models across speech-to-text and text-to-speech, the Realtime API processes and generates audio directly through a single model and API.” This puts it in the same architectural category as Moshi and squarely against ElevenLabs’ and Cartesia’s pipelined/TTS-only approach.
- No published headline latency number. Unlike Moshi’s ~200ms or Cartesia’s marketed sub-100ms TTS latency, OpenAI doesn’t publish a single millisecond figure. Independent sources describe “very low” latency with strong barge-in/interruption handling; one comparison piece cites first-audio latency of 300ms to 2.3 seconds depending on the reasoning-effort level set for the session, a latency/intelligence tradeoff that gpt-realtime-2 introduced and that didn’t exist in the original model.
- Token-based pricing, not per-minute. 0.40 cached), $0.18–0.46/minute uncached, $0.05–0.10/minute with prompt caching.
- gpt-realtime-2 (May 2026) added: adjustable reasoning effort (minimal/low/medium/high/xhigh, low default), 128K context (up from 32K), parallel tool calls with audible “tool transparency” phrases, remote MCP server support, image input mid-session, SIP phone calling, graceful-failure recovery phrases, and two new voices (Cedar and Marin, alongside 8 legacy voices).
- Three integration transports: WebRTC (recommended for browser/mobile, with native VAD and echo-cancellation and near-zero added latency), WebSocket (server-side agents and telephony bridges, where you handle base64 PCM16 audio and VAD yourself), and SIP (direct PSTN/PBX telephony). The official path is the Agents SDK (RealtimeAgent/RealtimeSession via @openai/agents/realtime), which wraps ephemeral-key auth and WebRTC.
- Sibling models are per-minute, not token-based: gpt-realtime-translate ($0.034/min) and gpt-realtime-whisper ($0.017/min) are narrower, cheaper models for translation-only and transcription-only use cases respectively, distinct from the full conversational gpt-realtime line.
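On the WebSocket transport, the "base64 PCM16" handling mentioned above can be sketched roughly as follows. This is a minimal illustration, not the SDK's implementation: the helper names `float_to_pcm16_bytes` and `audio_append_event` are hypothetical, the event shape (`input_audio_buffer.append` with a base64 `audio` field) reflects the Realtime API's client events as generally documented, and the sample rate and channel layout the session expects are not covered by this note, so check them before sending real audio.

```python
import base64
import json
import struct


def float_to_pcm16_bytes(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # clamp overshoot so packing never overflows
        ints.append(int(round(clipped * 32767)))
    return struct.pack("<%dh" % len(ints), *ints)


def audio_append_event(samples):
    """Wrap one audio chunk as a JSON text frame for the WebSocket connection.

    Hypothetical helper: only the event type and the base64 `audio` field
    are taken from the API; everything else here is illustrative.
    """
    payload = base64.b64encode(float_to_pcm16_bytes(samples)).decode("ascii")
    return json.dumps({"type": "input_audio_buffer.append", "audio": payload})


# One tiny chunk; a real client would stream chunks from a microphone or call leg.
frame = audio_append_event([0.0, 0.25, -1.0, 2.0])
```

Clamping before scaling is a deliberate choice here: a sample slightly outside [-1.0, 1.0] would otherwise raise a packing error and drop the whole chunk. Voice-activity detection is left out of the sketch; on this transport it is also the caller's job.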
How It Compares
Independent sources frame this as a three-way tradeoff rather than a strict ranking:
| Dimension | OpenAI Realtime API | Moshi | ElevenLabs | Cartesia Sonic |
|---|---|---|---|---|
| Architecture | Native speech-to-speech | Native speech-to-speech | Pipelined (STT→LLM→TTS) | TTS-only (one pipeline stage) |
| Open vs. closed | Closed, hosted only | Open-weight, self-hostable | Closed, hosted only | Closed, hosted only |
| Headline strength (per independent comparisons) | Conversational intelligence, tool-use reliability | Latency + self-host | Emotional/voice-cloning quality | Raw TTS latency |
| Pricing model | Token-based (audio in/out) | Free (self-hosted compute cost only) | Character/minute-based | Character/minute-based |
OpenAI + Gemini Flash Live are consistently grouped as native S2S systems; ElevenLabs and Cartesia are grouped as pipeline/TTS-based per the independent comparison sources gathered here. The sharpest single-axis contrast is open vs. closed: Moshi is the only self-hostable, open-weight option among the four voice stacks this topic covers.
Implementation
Tool/Service: OpenAI Realtime API, gpt-realtime / gpt-realtime-2, endpoint /v1/realtime.
Setup: Ephemeral secrets via POST /v1/realtime/client_secrets; official Agents SDK (@openai/agents/realtime) recommended over raw WebRTC/WebSocket handling for most use cases.
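For a sense of what the ephemeral-secret step involves when the SDK is not doing it for you, here is a hedged sketch of the server-side request. The path comes from this note; the host `https://api.openai.com`, the request body shape (`session.type`, `session.model`), and both function names are assumptions made for illustration, and the response format is deliberately not parsed beyond JSON decoding.

```python
import json
import urllib.request

# Assumed host; the note itself only gives the path.
CLIENT_SECRETS_URL = "https://api.openai.com/v1/realtime/client_secrets"


def build_client_secret_request(api_key, model="gpt-realtime"):
    """Build (but do not send) the POST that mints an ephemeral client secret."""
    body = {"session": {"type": "realtime", "model": model}}  # assumed body shape
    return urllib.request.Request(
        CLIENT_SECRETS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,  # long-lived key stays on the server
            "Content-Type": "application/json",
        },
        method="POST",
    )


def mint_client_secret(api_key, model="gpt-realtime"):
    """Send the request and return the decoded JSON (needs network and a real key)."""
    with urllib.request.urlopen(build_client_secret_request(api_key, model)) as resp:
        return json.load(resp)
```

The point of the split is that only the short-lived secret returned by this call should reach the browser or mobile client; the standard API key never leaves the backend.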
Cost: 0.40 cached), gpt-realtime-translate $0.034/min; gpt-realtime-whisper $0.017/min.
Integration notes: WebRTC for browser/mobile clients (recommended default), WebSocket for server-side/telephony-bridge agents, SIP for direct phone-system integration. low reasoning effort is the default — raising it trades latency for response quality, a tuning knob unique to gpt-realtime-2 among the voice stacks in this topic.
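The per-minute figures quoted in this note can be turned into a rough monthly budget. The sketch below is only arithmetic over those figures, assuming they are USD per minute of conversation; the `MinuteRate` type, the `RATES` keys, and `monthly_cost_range` are hypothetical names, and actual token-based billing will vary with how much audio each side produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinuteRate:
    low: float   # assumed USD per minute, lower bound
    high: float  # assumed USD per minute, upper bound


# Ranges as quoted in this note; the labels are illustrative only.
RATES = {
    "conversational_uncached": MinuteRate(0.18, 0.46),
    "conversational_cached": MinuteRate(0.05, 0.10),
    "gpt-realtime-whisper": MinuteRate(0.017, 0.017),
}


def monthly_cost_range(rate_key, minutes_per_month):
    """Return (low, high) estimated monthly spend in USD, rounded to cents."""
    rate = RATES[rate_key]
    return (
        round(rate.low * minutes_per_month, 2),
        round(rate.high * minutes_per_month, 2),
    )


# Example: 10,000 agent-minutes per month under each assumption.
for key in RATES:
    print(key, monthly_cost_range(key, 10_000))
```

At that volume the gap between the cached and uncached ranges is large enough that prompt-caching behaviour is worth confirming before budgeting.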
Related
- Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model — the open-weight, self-hostable counterpart in the same native-S2S architecture category
- ElevenLabs voice agents on Claude Code — the pipelined (STT→LLM→TTS) commercial alternative, strongest on voice-cloning/emotional quality
- Cartesia Sonic — a pipelined TTS-only model optimized specifically for the text-to-speech leg’s latency
- Voice Agent Comparison — Moshi vs ElevenLabs vs OpenAI Realtime vs Cartesia Sonic — the full four-way comparison this article feeds into
- whisper-cli (whisper.cpp) — the free/offline/non-conversational counterpart for batch transcription use cases this API isn’t priced or designed for
Try It
- Read the Agents SDK quickstart (@openai/agents/realtime) before touching raw WebRTC/WebSocket handling; it wraps the ephemeral-key auth flow that’s otherwise easy to get wrong.
- If building a browser-based voice feature, default to the WebRTC transport for the built-in VAD/echo-cancellation; reserve WebSocket for server-side or telephony-bridge use cases.
- Test the low vs high reasoning-effort settings on gpt-realtime-2 for a target use case and measure the actual latency delta before committing to a tier; the 300ms-to-2.3s range is wide enough to matter for UX.
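One way to run the reasoning-effort comparison is a small timing harness like the one below. It is a sketch under stated assumptions: `start_turn` is a caller-supplied callable (hypothetical) that sends one user turn and blocks until the first audio chunk arrives, and how the effort level is actually set on the session is left to that callable, since this note does not specify the configuration field.

```python
import statistics
import time


def measure_first_audio_latency(start_turn, trials=5, clock=time.perf_counter):
    """Time `trials` calls of start_turn and return the elapsed seconds per call."""
    samples = []
    for _ in range(trials):
        began = clock()
        start_turn()  # must return only once the first audio chunk has arrived
        samples.append(clock() - began)
    return samples


def summarize(samples):
    """Reduce raw timings to the median and the worst observed case."""
    ordered = sorted(samples)
    return {"median_s": statistics.median(ordered), "worst_s": ordered[-1]}


def compare_efforts(turn_factories, trials=5, clock=time.perf_counter):
    """turn_factories maps an effort label (e.g. "low") to a start_turn callable."""
    return {
        effort: summarize(measure_first_audio_latency(fn, trials, clock))
        for effort, fn in turn_factories.items()
    }
```

Reporting the worst case alongside the median matters for voice UX, because a single multi-second pause is more noticeable to a caller than a slightly slower average.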
Open Questions
- No independently-measured production latency benchmark (analogous to the Coval benchmark used for Cartesia Sonic) was found for the OpenAI Realtime API — the 300ms–2.3s figure is from a single comparison source, not corroborated.
- Whether gpt-realtime-2’s SIP/telephony support has been used in any dental/marketing/agency context this vault tracks is unestablished; worth revisiting if a voice-agent-for-client-intake use case comes up (see GoHighLevel for the adjacent CRM/voice-AI surface).