Voice models, voice agents, real-time speech infrastructure, and end-to-end conversational AI. Covers both open-source foundation models (Moshi, Mimi) and commercial voice agent stacks (ElevenLabs Conversational AI). Distinct from AI Video & Content Production (which covers HeyGen avatar models, lipsync, and video composition with audio); this topic is voice-first / audio-first.
Articles
- Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model — 7B Temporal Transformer + small Depth Transformer + Mimi neural audio codec. Real-time conversational AI with theoretical 160ms / practical ~200ms latency on L4 GPU — single foundation model handles full-duplex dialogue with no STT→LLM→TTS pipeline. Three runtimes: PyTorch (research, 24GB+ VRAM), MLX (Apple Silicon local), Rust (production with CUDA/Metal). Code MIT + Apache, weights CC-BY 4.0. 10,163 stars.
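The latency figures above follow from Mimi's frame size. A back-of-envelope sketch, assuming Mimi's published 12.5 Hz frame rate (80 ms per codec frame); the 160 ms theoretical floor is one frame of listening plus one frame of speaking, and the gap to the ~200 ms practical number is model compute per step:

```python
# Sketch: where Moshi's latency floor comes from, assuming Mimi's
# published 12.5 Hz frame rate. Numbers are back-of-envelope, not
# measurements.

MIMI_FRAME_RATE_HZ = 12.5
frame_ms = 1000 / MIMI_FRAME_RATE_HZ      # 80.0 ms per codec frame

# Theoretical floor: one frame to ingest audio + one frame to emit it.
theoretical_ms = 2 * frame_ms             # 160.0 ms

# The article's ~200 ms practical figure on an L4 GPU implies the
# remainder is per-step model compute.
practical_ms = 200
compute_overhead_ms = practical_ms - theoretical_ms

print(f"frame: {frame_ms:.0f} ms, floor: {theoretical_ms:.0f} ms, "
      f"compute overhead at ~{practical_ms} ms practical: {compute_overhead_ms:.0f} ms")
```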
Cross-cutting links
Voice-relevant articles in other topics:
- ElevenLabs voice agents on Claude Code — commercial closed-source voice-agent stack with broad voice library + multi-language support. Pipeline architecture (STT → LLM → TTS) with ~500-800ms typical latency. Closest competitor to Moshi for builder use cases.
- yt-dlp — YouTube transcript / audio extraction tool used by `bin/yt-transcript` in this vault, plus a local whisper-cli fallback for captionless videos via `~/.whisper-models/ggml-*.en.bin`. Source dependency for last30days and claude-video.
- HeyGen Hyperframes — HTML video composition with TTS / lipsync layers.
- HeyGen Studio Automation — Avatar V production pipeline (TTS + lipsync + multi-clip composition).
- OpenClaw on Rabbit R1 — voice as input to a self-hosted agent fleet. Pocket-hardware voice surface; pairs with Moshi (self-hosted) or ElevenLabs (hosted) at the model layer.
- Code with Claude 2026 keynote — Anthropic conference frame for voice/agent integration.
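The ~500-800 ms figure for the ElevenLabs-style cascaded stack is the sum of sequential stages, which is why a single full-duplex model undercuts it. A rough comparison with hypothetical per-stage budgets (the stage numbers are illustrative placeholders, not benchmarks):

```python
# Illustrative latency budget for a cascaded STT -> LLM -> TTS voice
# agent vs. a single full-duplex model. All per-stage numbers are
# hypothetical, chosen to land mid-band in the article's ~500-800 ms
# range.

pipeline_stages_ms = {
    "stt_final_transcript": 200,  # endpointing + final STT result
    "llm_first_token": 250,       # time to first LLM token
    "tts_first_audio": 150,       # time to first synthesized audio
}

pipeline_total_ms = sum(pipeline_stages_ms.values())
full_duplex_ms = 200  # Moshi's practical figure from the article

print(f"cascaded: ~{pipeline_total_ms} ms, full-duplex: ~{full_duplex_ms} ms")
```

The point of the sketch: cascaded latency is additive across stages, so each stage must be optimized separately, whereas the full-duplex model pays only its codec frame plus per-step compute.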
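The yt-dlp-plus-whisper fallback noted above can be sketched as a command builder. The flags follow the two tools' documented CLIs, but the helper names, the specific model filename, and the overall shape are assumptions; the vault's real `bin/yt-transcript` logic is not shown here:

```python
# Hypothetical sketch of the transcript-with-fallback flow: try
# YouTube captions via yt-dlp first, fall back to local whisper-cli
# (whisper.cpp) on downloaded audio when no captions exist.
from pathlib import Path

def caption_cmd(url: str) -> list[str]:
    # Fetch auto-generated English captions without downloading video.
    return ["yt-dlp", "--skip-download", "--write-auto-subs",
            "--sub-langs", "en", url]

def fallback_cmd(audio_path: str,
                 model_dir: str = "~/.whisper-models") -> list[str]:
    # Local STT via whisper.cpp's CLI; model filename is an assumed
    # example matching the ggml-*.en.bin pattern.
    model = str(Path(model_dir).expanduser() / "ggml-base.en.bin")
    return ["whisper-cli", "-m", model, "-f", audio_path, "--output-txt"]

print(caption_cmd("https://youtu.be/VIDEO_ID"))
print(fallback_cmd("talk.wav"))
```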
Open Questions
- Should voice-agents-elevenlabs-claude-code migrate from `claude-ai/` to this topic? Probably yes — the article is structurally about the voice stack, and the Claude-Code-orchestration angle is one section of it. Hold off until at least one more voice article justifies the topic, then move with a cross-link.
- A voice-agent comparison article (Moshi vs ElevenLabs vs OpenAI Realtime API vs Cartesia Sonic) is the obvious next connection candidate once 3+ voice articles exist.
- Whisper-cli local STT, the OpenAI Realtime API, and Cartesia Sonic are the natural next ingest candidates as siblings to the Moshi article.