Voice models, voice agents, real-time speech infrastructure, and end-to-end conversational AI. Covers both open-source foundation models (Moshi, Mimi) and commercial voice agent stacks (ElevenLabs Conversational AI). Distinct from AI Video & Content Production (which covers HeyGen avatar models, lipsync, and video composition with audio); this topic is voice-first / audio-first.
Articles
- Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model — 7B Temporal Transformer + small Depth Transformer + Mimi neural audio codec. Real-time conversational AI with 160 ms theoretical and ~200 ms practical latency on an L4 GPU. A single foundation model handles full-duplex dialogue, with no STT→LLM→TTS pipeline. Three runtimes: PyTorch (research, 24GB+ VRAM), MLX (local on Apple Silicon), Rust (production, with CUDA/Metal). Code is licensed MIT + Apache; weights are CC-BY 4.0. 10,163 GitHub stars.
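The 160 ms theoretical figure in the Moshi entry can be reproduced with simple arithmetic. The sketch below assumes, based on Kyutai's published description rather than on anything stated on this page, that Mimi emits frames at 12.5 Hz and that Moshi adds one frame of acoustic delay. The constant and function names are illustrative only.

```python
# Assumption (not from this page): Mimi frame rate and Moshi's acoustic delay.
MIMI_FRAME_RATE_HZ = 12.5      # codec frames per second
ACOUSTIC_DELAY_FRAMES = 1      # extra frames of delay before audio is emitted

PRACTICAL_LATENCY_MS = 200.0   # approximate L4 GPU figure quoted in the entry


def frame_ms(frame_rate_hz: float = MIMI_FRAME_RATE_HZ) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def theoretical_latency_ms(
    frame_rate_hz: float = MIMI_FRAME_RATE_HZ,
    delay_frames: int = ACOUSTIC_DELAY_FRAMES,
) -> float:
    """One frame to buffer the input plus the acoustic delay frames."""
    return frame_ms(frame_rate_hz) * (1 + delay_frames)


if __name__ == "__main__":
    theory = theoretical_latency_ms()
    print(f"frame duration:      {frame_ms():.0f} ms")
    print(f"theoretical latency: {theory:.0f} ms")
    print(f"practical overhead:  {PRACTICAL_LATENCY_MS - theory:.0f} ms")
```

Under these assumptions the gap between the theoretical and practical numbers (about 40 ms) is what remains for model compute and I/O on the L4.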
Cross-cutting links
Voice-relevant articles in other topics:
- ElevenLabs voice agents on Claude Code — commercial closed-source voice-agent stack with broad voice library + multi-language support. Pipeline architecture (STT → LLM → TTS) with ~500-800ms typical latency. Closest competitor to Moshi for builder use cases.
- yt-dlp — YouTube transcript / audio extraction tool used by `bin/yt-transcript` in this vault, plus local whisper-cli fallback for captionless videos via `~/.whisper-models/ggml-*.en.bin`. Source dependency for last30days and claude-video.
- HeyGen Hyperframes — HTML video composition with TTS / lipsync layers.
- HeyGen Studio Automation — Avatar V production pipeline (TTS + lipsync + multi-clip composition).
- OpenClaw on Rabbit R1 — voice as input to a self-hosted agent fleet. Pocket-hardware voice surface; pairs with Moshi or ElevenLabs at the model layer if you self-host both.
- Code with Claude 2026 keynote — Anthropic conference frame for voice/agent integration.
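The captions-first, whisper-cli-fallback flow mentioned in the yt-dlp entry above might look roughly like the sketch below. It is not the vault's actual `bin/yt-transcript` script, whose contents are not shown here: the function names and output layout are hypothetical, and the code only builds command lines rather than running them. The yt-dlp and whisper-cli flags used are standard ones, but whether whisper-cli needs the audio resampled first (for example to 16 kHz mono) depends on the installed build and is left out.

```python
from pathlib import Path
from typing import List, Optional

# Model directory named in the yt-dlp entry; everything else is illustrative.
MODEL_DIR = Path.home() / ".whisper-models"


def pick_model(model_dir: Path = MODEL_DIR) -> Optional[Path]:
    """Return the first English ggml model (ggml-*.en.bin) by name, or None."""
    candidates = sorted(model_dir.glob("ggml-*.en.bin"))
    return candidates[0] if candidates else None


def caption_cmd(url: str, out_dir: Path) -> List[str]:
    """yt-dlp command that fetches English captions without the video."""
    return [
        "yt-dlp", "--skip-download",
        "--write-subs", "--write-auto-subs", "--sub-langs", "en",
        "-o", str(out_dir / "%(id)s.%(ext)s"),
        url,
    ]


def audio_cmd(url: str, out_dir: Path) -> List[str]:
    """yt-dlp command that extracts the audio track as WAV (fallback path)."""
    return [
        "yt-dlp", "-x", "--audio-format", "wav",
        "-o", str(out_dir / "%(id)s.%(ext)s"),
        url,
    ]


def whisper_cmd(model: Path, wav: Path) -> List[str]:
    """whisper-cli command that transcribes a WAV file to a .txt file."""
    return ["whisper-cli", "-m", str(model), "-f", str(wav), "-otxt"]
```

A caller would run `caption_cmd` first and, only if no caption file appears, run `audio_cmd` followed by `whisper_cmd` with the model returned by `pick_model`, for example via `subprocess.run(cmd, check=True)`.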