Source: ai-research/openai-realtime-api-gpt-realtime-launch-2025.md, ai-research/openai-realtime-api-gpt-realtime-2-launch-2026.md, ai-research/openai-realtime-api-docs-overview-2026.md, ai-research/openai-realtime-vs-elevenlabs-cartesia-gemini-comparison-2026.md, ai-research/getstream-top-6-realtime-speech-to-speech-apis-2025.md

The OpenAI Realtime API is OpenAI’s product for building live voice agents — a single model processes incoming audio and generates audio output directly, the same architectural category as Moshi and the opposite of pipelined stacks like ElevenLabs Conversational AI (STT → LLM → TTS). It reached general availability August 28, 2025 (after a public beta since October 2024) on the gpt-realtime model, with a successor generation — gpt-realtime-2, plus siblings gpt-realtime-translate and gpt-realtime-whisper — shipping May 7, 2026.

Key Takeaways

  • Native speech-to-speech, not a pipeline. OpenAI’s own framing: “Unlike traditional pipelines that chain together multiple models across speech-to-text and text-to-speech, the Realtime API processes and generates audio directly through a single model and API.” This puts it in the same architectural category as Moshi and squarely against ElevenLabs’ and Cartesia’s pipelined/TTS-only approach.
  • No published headline latency number — unlike Moshi’s ~200ms or Cartesia’s marketed sub-100ms TTS latency, OpenAI doesn’t publish a single millisecond figure. Independent sources describe “very low” latency with strong barge-in/interruption handling; one comparison piece cites first-audio latency of 300ms to 2.3 seconds depending on the reasoning-effort level set for the session — a latency/intelligence tradeoff gpt-realtime-2 introduced that didn’t exist in the original model.
  • Token-based pricing, not per-minute: … ($0.40 cached), $0.18–0.46/minute uncached, $0.05–0.10/minute with prompt caching.
  • gpt-realtime-2 (May 2026) added: adjustable reasoning effort (minimal/low/medium/high/xhigh, low default), 128K context (up from 32K), parallel tool calls with audible “tool transparency” phrases, remote MCP server support, image input mid-session, SIP phone calling, graceful-failure recovery phrases, and two new voices (Cedar and Marin, alongside 8 legacy voices).
  • Three integration transports: WebRTC (recommended for browser/mobile — native VAD and echo-cancellation, near-zero added latency), WebSocket (server-side agents and telephony bridges — you handle base64 PCM16 audio and VAD yourself), and SIP (direct PSTN/PBX telephony). The official path is the Agents SDK (RealtimeAgent/RealtimeSession via @openai/agents/realtime), which wraps ephemeral-key auth and WebRTC.
  • Sibling models are per-minute, not token-based: gpt-realtime-translate ($0.034/min) and gpt-realtime-whisper ($0.017/min) are narrower, cheaper models for translation-only and transcription-only use cases respectively, distinct from the full conversational gpt-realtime line.
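
For the WebSocket transport mentioned above, where the caller handles base64 PCM16 audio itself, the encoding step might look like the following minimal sketch. The helper names are hypothetical, and the event shape (`input_audio_buffer.append` with a base64 `audio` field) is an assumption based on the Realtime API's client events that should be checked against the current reference before use.

```python
import base64
import json
import struct


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes.

    Hypothetical helper: out-of-range samples are clipped, not rejected.
    """
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def audio_append_event(pcm16_bytes):
    """Wrap raw PCM16 bytes in a JSON text frame with a base64 payload.

    Assumption: the server accepts an "input_audio_buffer.append" event
    whose "audio" field carries the base64-encoded chunk.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    })
```

Each string returned by `audio_append_event` would be sent as one text frame over the socket. On the WebRTC transport none of this is needed, which is part of why it is the recommended default for browser and mobile clients.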

How It Compares

Independent sources frame this as a three-way tradeoff rather than a strict ranking:

| Dimension | OpenAI Realtime API | Moshi | ElevenLabs | Cartesia Sonic |
|---|---|---|---|---|
| Architecture | Native speech-to-speech | Native speech-to-speech | Pipelined (STT→LLM→TTS) | TTS-only (one pipeline stage) |
| Open vs. closed | Closed, hosted only | Open-weight, self-hostable | Closed, hosted only | Closed, hosted only |
| Headline strength (per independent comparisons) | Conversational intelligence, tool-use reliability | Latency + self-host | Emotional/voice-cloning quality | Raw TTS latency |
| Pricing model | Token-based (audio in/out) | Free (self-hosted compute cost only) | Character/minute-based | Character/minute-based |

OpenAI and Gemini Flash Live are consistently grouped as native S2S systems, while ElevenLabs and Cartesia are grouped as pipeline/TTS-based, per the independent comparison sources gathered here. The sharpest single-axis contrast is open vs. closed: Moshi is the only self-hostable, open-weight option among the four voice stacks this topic covers.

Implementation

Tool/Service: OpenAI Realtime API, gpt-realtime / gpt-realtime-2, endpoint /v1/realtime.
Setup: Ephemeral secrets via POST /v1/realtime/client_secrets; the official Agents SDK (@openai/agents/realtime) is recommended over raw WebRTC/WebSocket handling for most use cases.
Cost: … ($0.40 cached); gpt-realtime-translate $0.034/min; gpt-realtime-whisper $0.017/min.
Integration notes: WebRTC for browser/mobile clients (recommended default), WebSocket for server-side/telephony-bridge agents, SIP for direct phone-system integration. The default reasoning effort is low; raising it trades latency for response quality, a tuning knob unique to gpt-realtime-2 among the voice stacks in this topic.
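
As a rough illustration of the setup step, the sketch below builds (but does not send) the server-side request that would mint an ephemeral secret. Only the path /v1/realtime/client_secrets comes from this note; the host, the JSON body shape, and the function name are assumptions to verify against the official API reference.

```python
import json
import urllib.request

API_BASE = "https://api.openai.com"  # assumption: default public API host


def build_client_secret_request(api_key, model="gpt-realtime"):
    """Build the POST request for an ephemeral client secret.

    Assumption: the body nests the model under a "session" object of
    type "realtime". Nothing is sent until the caller passes the
    returned object to urllib.request.urlopen().
    """
    body = {"session": {"type": "realtime", "model": model}}
    return urllib.request.Request(
        API_BASE + "/v1/realtime/client_secrets",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The point of the flow is that this request runs on a trusted server holding the long-lived API key, and only the short-lived secret from the response is handed to the browser or mobile client. The Agents SDK wraps the same flow, which is why it is the suggested starting point.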

Try It

  1. Read the Agents SDK quickstart (@openai/agents/realtime) before touching raw WebRTC/WebSocket handling — it wraps the ephemeral-key auth flow that’s otherwise easy to get wrong.
  2. If building a browser-based voice feature, default to the WebRTC transport for the built-in VAD/echo-cancellation; reserve WebSocket for server-side or telephony-bridge use cases.
  3. Test the low vs high reasoning-effort settings on gpt-realtime-2 for a target use case and measure the actual latency delta before committing to a tier — the 300ms-to-2.3s range is wide enough to matter for UX.
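
For step 3, one way to measure the latency delta is a small transport-agnostic harness like the sketch below. It assumes a caller-supplied `make_stream(effort)` function (hypothetical, not part of any SDK) that starts a response at the given reasoning-effort level and yields audio chunks as they arrive; how that function talks to the API is left open.

```python
import statistics
import time


def time_to_first_chunk(start_stream):
    """Seconds from starting a response until its first audio chunk."""
    started = time.perf_counter()
    for _chunk in start_stream():
        # Stop at the first chunk: only first-audio latency is measured.
        return time.perf_counter() - started
    raise RuntimeError("stream ended without producing audio")


def compare_efforts(make_stream, efforts=("low", "high"), trials=5):
    """Return the median first-audio latency (seconds) per effort level.

    make_stream is a hypothetical callable: make_stream(effort) must
    return an iterable of audio chunks for one response.
    """
    results = {}
    for effort in efforts:
        samples = [
            time_to_first_chunk(lambda: make_stream(effort))
            for _ in range(trials)
        ]
        results[effort] = statistics.median(samples)
    return results
```

The median is used rather than the mean so that a single slow network round trip does not dominate a small trial count. Running it from the same region and transport as the production client keeps the numbers comparable to what users would experience.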

Open Questions

  • No independently measured production latency benchmark (analogous to the Coval benchmark used for Cartesia Sonic) was found for the OpenAI Realtime API; the 300ms–2.3s figure comes from a single comparison source and is uncorroborated.
  • Whether gpt-realtime-2’s SIP/telephony support has been used in any dental/marketing/agency context this vault tracks is unestablished — worth revisiting if a voice-agent-for-client-intake use case comes up (see GoHighLevel for the adjacent CRM/voice-AI surface).