Voice models, voice agents, real-time speech infrastructure, and end-to-end conversational AI. Covers both open-source foundation models (Moshi, Mimi) and commercial voice agent stacks (ElevenLabs Conversational AI). Distinct from AI Video & Content Production (which covers HeyGen avatar models, lipsync, and video composition with audio); this topic is voice-first / audio-first.

Articles

  • Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model — 7B Temporal Transformer + small Depth Transformer + Mimi neural audio codec. Real-time conversational AI with theoretical 160ms / practical ~200ms latency on L4 GPU — single foundation model handles full-duplex dialogue with no STT→LLM→TTS pipeline. Three runtimes: PyTorch (research, 24GB+ VRAM), MLX (Apple Silicon local), Rust (production with CUDA/Metal). Code MIT + Apache, weights CC-BY 4.0. 10,163 stars.
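The 160ms / ~200ms figures above follow from Mimi's frame-based streaming. A back-of-envelope sketch, assuming Mimi's 12.5 Hz frame rate (80ms frames), a two-frame theoretical floor, and roughly 40ms of per-step model compute on the L4 (the overhead figure is an assumption chosen to match the reported practical latency, not a measurement):

```python
# Back-of-envelope latency for a frame-based full-duplex speech model.
# Assumptions: Mimi codec frames at 12.5 Hz (80 ms each); theoretical
# floor is two frame durations; ~40 ms per-step compute on an L4 GPU.
FRAME_RATE_HZ = 12.5
frame_ms = 1000 / FRAME_RATE_HZ           # 80.0 ms per codec frame

theoretical_ms = 2 * frame_ms             # 160.0 ms floor
compute_overhead_ms = 40                  # assumed model compute per step
practical_ms = theoretical_ms + compute_overhead_ms

print(f"frame: {frame_ms:.0f} ms, floor: {theoretical_ms:.0f} ms, "
      f"practical: ~{practical_ms:.0f} ms")
```

The point of the sketch: the floor is set by codec frame size, not model size, which is why a single full-duplex model can undercut any cascaded pipeline.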

Voice-relevant articles in other topics:

  • ElevenLabs voice agents on Claude Code — commercial closed-source voice-agent stack with broad voice library + multi-language support. Pipeline architecture (STT → LLM → TTS) with ~500-800ms typical latency. Closest competitor to Moshi for builder use cases.
  • yt-dlp — YouTube transcript / audio extraction tool used by bin/yt-transcript in this vault, with a local whisper-cli fallback (models at ~/.whisper-models/ggml-*.en.bin) for captionless videos. Source dependency for last30days and claude-video.
  • HeyGen Hyperframes — HTML video composition with TTS / lipsync layers.
  • HeyGen Studio Automation — Avatar V production pipeline (TTS + lipsync + multi-clip composition).
  • OpenClaw on Rabbit R1 — voice as input to a self-hosted agent fleet. Pocket-hardware voice surface; pairs with Moshi (self-hosted) or ElevenLabs (hosted) at the model layer.
  • Code with Claude 2026 keynote — Anthropic conference frame for voice/agent integration.
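For the cascaded architectures in the list above (ElevenLabs-style STT → LLM → TTS), latency accumulates stage by stage because each stage waits on the previous one. A sketch with illustrative stage budgets (all three numbers are assumptions for illustration, not vendor benchmarks):

```python
# Illustrative time-to-first-audio for a cascaded voice agent.
# Stages serialize, so their budgets add; a full-duplex model pays
# only its codec-frame floor plus compute. All figures are assumed.
cascade_ms = {
    "stt_final_transcript": 200,   # assumed: endpointing + final ASR result
    "llm_first_token": 250,        # assumed: time to first generated token
    "tts_first_audio": 150,        # assumed: time to first synthesized chunk
}
total_ms = sum(cascade_ms.values())
print(f"cascaded time-to-first-audio: ~{total_ms} ms")
```

With these assumed budgets the total lands at ~600ms, inside the ~500-800ms band cited for the pipeline stack, versus ~200ms for Moshi's single-model path.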

Open Questions

  • Should voice-agents-elevenlabs-claude-code migrate from claude-ai/ to this topic? Probably yes — the article is structurally about the voice stack, and the Claude Code orchestration angle is only one section of it. Hold off until at least one more voice article justifies the topic, then move it with a cross-link.
  • A voice-agent comparison article (Moshi vs ElevenLabs vs OpenAI Realtime API vs Cartesia Sonic) is the obvious next connection candidate once 3+ voice articles exist.
  • Whisper-cli local STT, the OpenAI Realtime API, and Cartesia Sonic are natural next ingest candidates as siblings to the Moshi article.