AI Voice

Voice models, voice agents, real-time speech infrastructure, and end-to-end conversational AI. Covers both open-source foundation models (Moshi, Mimi) and commercial voice agent stacks (ElevenLabs Conversational AI). Distinct from AI Video & Content Production (which covers HeyGen avatar models, lipsync, and video composition with audio); this topic is voice-first / audio-first.

Articles

Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model — 7B Temporal Transformer + small Depth Transformer + Mimi neural audio codec. Real-time conversational AI with theoretical 160ms / practical ~200ms latency on L4 GPU — single foundation model handles full-duplex dialogue with no STT→LLM→TTS pipeline. Three runtimes: PyTorch (research, 24GB+ VRAM), MLX (Apple Silicon local), Rust (production with CUDA/Metal). Code MIT + Apache, weights CC-BY 4.0. 10,163 stars.
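The 160ms theoretical figure above can be reproduced with a little arithmetic. This is a minimal sketch, assuming the breakdown given in the Moshi README: one Mimi frame at 12.5 Hz (80ms) plus one frame of acoustic delay (another 80ms). The function and constant names are illustrative and do not come from the Moshi codebase; the 200ms figure is taken from the entry above.

```python
# Latency arithmetic for the Moshi entry. All names are hypothetical;
# only the numeric figures come from the article and the Moshi README.

MIMI_FRAME_RATE_HZ = 12.5    # Mimi latent frames per second (assumed from the README)
ACOUSTIC_DELAY_FRAMES = 1    # assumed: one frame of acoustic delay
PRACTICAL_MOSHI_MS = 200.0   # practical latency on an L4 GPU, per the entry above


def frame_ms(rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / rate_hz


def theoretical_latency_ms(rate_hz: float, delay_frames: int) -> float:
    """One frame to buffer the input plus the acoustic delay, in milliseconds."""
    return frame_ms(rate_hz) * (1 + delay_frames)


if __name__ == "__main__":
    theory = theoretical_latency_ms(MIMI_FRAME_RATE_HZ, ACOUSTIC_DELAY_FRAMES)
    overhead = PRACTICAL_MOSHI_MS - theory
    print(f"frame: {frame_ms(MIMI_FRAME_RATE_HZ):.0f} ms")
    print(f"theoretical: {theory:.0f} ms")
    print(f"practical overhead on L4: about {overhead:.0f} ms")
```

Under these assumptions the gap between the theoretical and practical figures is roughly 40ms, which would cover model compute and I/O on the L4.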

Cross-cutting links

Voice-relevant articles in other topics:

ElevenLabs voice agents on Claude Code — commercial closed-source voice-agent stack with broad voice library + multi-language support. Pipeline architecture (STT → LLM → TTS) with ~500-800ms typical latency. Closest competitor to Moshi for builder use cases.
yt-dlp — YouTube transcript / audio extraction tool used by bin/yt-transcript in this vault, plus local whisper-cli fallback for captionless videos via ~/.whisper-models/ggml-*.en.bin. Source dependency for last30days and claude-video.
HeyGen Hyperframes — HTML video composition with TTS / lipsync layers.
HeyGen Studio Automation — Avatar V production pipeline (TTS + lipsync + multi-clip composition).
OpenClaw on Rabbit R1 — voice as input to a self-hosted agent fleet. Pocket-hardware voice surface; pairs with Moshi or ElevenLabs at the model layer if you self-host both.
Code with Claude 2026 keynote — Anthropic conference frame for voice/agent integration.

Open Questions

Should voice-agents-elevenlabs-claude-code migrate from claude-ai/ to this topic? Probably yes — the article is structurally about the voice stack, and the Claude-Code-orchestration angle is one section of it. Hold off until at least one more voice article justifies the topic, then move with cross-link.
A voice-agent comparison article (Moshi vs ElevenLabs vs OpenAI Realtime API vs Cartesia Sonic) is the obvious next connection candidate once 3+ voice articles exist.
Whisper-cli local STT, OpenAI Realtime API, and Cartesia Sonic are obvious next ingest candidates as siblings to the Moshi article.

Jonathon's AI Wiki
