Source: Kyutai Labs Moshi README (2026-05-08) (github.com/kyutai-labs/moshi)
Moshi is Kyutai Labs’ speech-text foundation model and full-duplex spoken dialogue framework — a single 7B-parameter model that handles real-time voice conversation end-to-end with theoretical 160ms / practical ~200ms latency on an L4 GPU. Distinct from cascaded voice-agent stacks (STT → LLM → TTS) like ElevenLabs Conversational AI: Moshi has dual simultaneous audio streams (its own speech + user input), predicts text tokens as inner monologue, and uses Mimi, a state-of-the-art streaming neural audio codec that compresses 24 kHz audio to 12.5 Hz / 1.1 kbps with 80ms streaming latency. Three runtimes ship in the repo: PyTorch (research, 24GB+ GPU VRAM), MLX (Apple Silicon local), and Rust (production with CUDA/Metal). Code is MIT + Apache 2.0; weights (Moshika female / Moshiko male) are CC-BY 4.0 on HuggingFace. 10,163 GitHub stars at fetch, 951 forks, active maintenance (last push 2026-05-05). Paper: arXiv:2410.00037. Demo: moshi.chat.
Key Takeaways
- Full-duplex is the load-bearing architectural choice. Conventional voice agents are pipeline-stitched: speech-to-text → LLM → text-to-speech, each stage adding latency. Moshi is a single foundation model with dual audio streams — it speaks while the user is speaking and listens while it speaks. The result is sub-300ms practical latency, which beats nearly every cascaded stack including ElevenLabs Conversational AI (typically 500-800ms in good conditions).
- The latency claim has two numbers.
- Theoretical: 160ms = 80ms frame size + 80ms acoustic delay, baked into the architecture.
- Practical: ~200ms on L4 GPU. This is the response-onset latency, not the perceived end-of-response.
- Practical latency on Apple Silicon (MLX) and Rust runtimes will differ; not published in the README.
- Inner monologue via predicted text tokens. Moshi predicts text tokens alongside speech generation as an internal scratchpad to improve coherence and quality. This is the same conceptual move as extended thinking in text models, applied to speech: the model “thinks” in text while it produces audio.
- Mimi is independently load-bearing. The streaming neural audio codec is shipped as a separate module and is reusable for anyone building real-time speech systems even outside Moshi.
- 24 kHz audio → 12.5 Hz representation at 1.1 kbps, with 80ms streaming latency.
- Transformers in both encoder and decoder.
- Distillation loss matching WavLM self-supervised representations — captures semantic-rich features, not just acoustic ones.
- Adversarial training with feature matching for output quality.
- There’s a separate Rust crate `rustymimi` (current 0.2.2, Sept 2024) for production use. (A minimal standalone-Mimi sketch in PyTorch follows this list.)
- Three runtimes, three deployment targets.
  - PyTorch (`pip install -U moshi`) — research path. Web server + Gradio UI on `localhost:8998`. Requires 24GB+ GPU VRAM.
  - MLX (`pip install -U moshi_mlx`) — Apple Silicon native, Python 3.12 recommended. Run `python -m moshi_mlx.local -q 4` for 4-bit quantization. Lowest-friction path on M-series Macs — runs Moshi locally without a GPU server.
  - Rust (build from source in `rust/`, `cargo run --features cuda --bin moshi-backend -r`) — production path. CUDA or Metal. Required for serious deployment where latency or throughput matters.
- Available model checkpoints. Two voices (Moshika female, Moshiko male) released on HuggingFace under the kyutai collection in multiple quantizations:
- PyTorch: bf16, int8
- MLX: int4, int8, bf16
- Candle/Rust: int8, bf16
- All weights under CC-BY 4.0 (commercial use permitted with attribution).
- System-level differentiators vs OpenAI Realtime API. OpenAI shipped a comparable full-duplex API in late 2024 (closed-source, server-side, $0.06/min audio out, ~300ms latency). Moshi is the open-source equivalent — locally runnable on a single GPU or Apple Silicon, MIT/Apache code, CC-BY weights, no per-minute cost. Trade-off: OpenAI Realtime is one API call; Moshi requires hosting infrastructure.
- System-level differentiators vs ElevenLabs Conversational AI. ElevenLabs is closed-source, has a much broader voice library (commercial-grade voices in 30+ languages), and a polished Conversational AI product surface. Moshi has two voices, English-primary, but is open weights, locally runnable, and roughly 2.5-4× faster on response onset (~200ms vs 500-800ms).
- Authors and credentials. Alexandre Défossez (lead, formerly Meta AI / FAIR — also lead author of Demucs and Encodec), Pierre Mazaré, Manu Orsini et al. Kyutai Labs is a Paris-based AI lab funded by Iliad / Schmidt Sciences / CMA-CGM ($300M endowment). Track record: open weights, real papers, real benchmarks.
- Production trajectory. Created Aug 2024, the repo is on its second year of active development, with the Rust runtime maturing. The deploy.sh + docker-compose.yml + swarm-config.yml in the root suggest production-deployment tooling is shipping alongside the model code itself.
- Two related Kyutai repos worth knowing.
  - `kyutai-labs/moshi-finetune` — fine-tuning code for adapting Moshi to specific voices, languages, or domains.
  - The `rustymimi` Rust crate — Mimi codec standalone for embedding into non-Moshi audio pipelines.
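
A minimal sketch of using Mimi standalone through the PyTorch package, based on the loader helpers shown in the upstream README (`loaders.DEFAULT_REPO`, `loaders.MIMI_NAME`, `loaders.get_mimi`); names may shift between releases, so verify against the installed `moshi` version. The comments also work through the bitrate and latency arithmetic implied by the numbers above, assuming the 8-codebook, 2048-entry configuration described in the paper.

```python
# Sketch: encode/decode 24 kHz audio with Mimi standalone (PyTorch path).
# Loader names follow the upstream README and may differ across versions.
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # Moshi itself consumes the first 8 codebooks

# Arithmetic check against the published figures:
#   frame duration   = 1 / 12.5 Hz                      = 80 ms  (the codec's streaming latency)
#   samples / frame  = 24_000 / 12.5                    = 1920
#   bitrate          = 8 codebooks * 11 bits * 12.5 Hz  = 1100 bps ~= 1.1 kbps
#   theoretical Moshi latency = 80 ms frame + 80 ms acoustic delay = 160 ms

wav = torch.randn(1, 1, mimi.sample_rate)  # [batch, channels=1, samples]; here 1 s of noise
with torch.no_grad():
    codes = mimi.encode(wav)            # [batch, 8, T] discrete tokens at 12.5 Hz
    reconstructed = mimi.decode(codes)  # back to a 24 kHz waveform
```

The same codec is reachable from Rust (and non-Python pipelines) via the `rustymimi` crate.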
Where this fits in the wiki
- Seeds the new `ai-voice/` topic as the first article. Topic created 2026-05-08; cross-references existing voice-adjacent content in claude-ai and ai-video-content.
- Open-source counterpart to ElevenLabs Conversational AI. Same problem (voice agents), opposite trade-offs: open vs closed, locally hosted vs SaaS, single voice vs catalog, sub-300ms vs 500-800ms latency, free-after-hosting-cost vs $0.30/min API. The two are likely complementary in production: ElevenLabs for breadth-of-voices use cases, Moshi for sub-300ms latency or fully-local deployment requirements.
- Pairs with OpenClaw on Rabbit R1. R1 provides the voice input hardware; OpenClaw provides the agent fleet; Moshi could provide the local foundation model layer if a user wanted a fully-local voice loop. Pocket-hardware voice → self-hosted agent gateway → self-hosted speech model. All-open-stack.
- Composes with Claude Managed Agents for hybrid stacks. A Moshi-based voice front-end could route to Anthropic’s Managed Agents for the heavy reasoning work, getting Moshi’s sub-300ms response onset for ack-and-continue UX while Claude does the actual problem-solving in the background.
- Mimi codec is reusable independently. Anyone building real-time audio infrastructure (transcription pipelines, audio compression, voice activity detection) can pull in the codec without using Moshi itself. Worth flagging for agent-infra use cases.
- Latency claim worth verifying via the last30days research skill. Sub-300ms full-duplex is a strong claim; community reproductions on r/LocalLLaMA and similar would be worth aggregating.
Implementation
- Tool/Service: Kyutai Labs Moshi (model + codec + framework, https://github.com/kyutai-labs/moshi)
- Setup paths:
  - PyTorch (research): `pip install -U moshi` → `python -m moshi.server` → `localhost:8998`. Requires 24GB+ GPU VRAM (e.g., L4 / A10 / A100 / H100).
  - MLX (local Mac): `pip install -U moshi_mlx` (Python 3.12 recommended) → `python -m moshi_mlx.local -q 4`. Apple Silicon M1+; 4-bit quantization fits on consumer Macs.
  - Rust (production): clone repo, build the `rust/` directory with `cargo run --features cuda --bin moshi-backend -r`. CUDA or Metal.
- Cost:
- Code: free (MIT + Apache 2.0).
- Model weights: free for commercial use under CC-BY 4.0 (attribution required).
- Compute: GPU (24GB+ for PyTorch) or Apple Silicon (M1+ for MLX) or CUDA/Metal-enabled Rust deployment. Self-hosted; no per-minute API cost.
- Integration notes:
  - Two voices ship out of the box: Moshika (female) and Moshiko (male). For other voices, fine-tune via `kyutai-labs/moshi-finetune`.
  - Quantization shrinks memory footprint substantially: int4 MLX is the most aggressive practical setting; bf16 keeps full quality (see the back-of-envelope VRAM sketch after this list).
  - Web UI is built into the PyTorch server via Gradio (`--gradio-tunnel` for public access).
  - Mimi codec is reusable standalone via the `rustymimi` Rust crate or via the `moshi.models` PyTorch import.
  - For full-stack-open voice agents: Moshi (speech foundation) + OpenClaw or self-hosted Hermes (agent layer) + local toolset (file system, browser, MCP servers).
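
To make the quantization trade-off concrete: a back-of-envelope, weights-only memory estimate for the 7B model at the shipped precisions. This counts parameters only; the codec, KV cache, and activations need headroom on top, which is consistent with the 24GB+ requirement for the bf16 PyTorch path and int4 fitting consumer Apple Silicon.

```python
# Weights-only memory for a 7B-parameter model at each shipped precision.
# Runtime needs additional headroom (Mimi, KV cache, activations).
PARAMS = 7e9
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision}: ~{PARAMS * nbytes / 2**30:.1f} GiB of weights")
# bf16: ~13.0 GiB, int8: ~6.5 GiB, int4: ~3.3 GiB
```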
Open Questions
- Latency on MLX (Apple Silicon). README publishes 200ms practical on L4 GPU. What’s the practical latency on M3 Pro / M4 Max? Worth a community benchmark.
- Voice fidelity vs ElevenLabs. Moshi has two voices; ElevenLabs has hundreds across languages with broadcast-quality fidelity. For applications where voice naturalness is the key UX axis (audiobook narration, character voice), Moshi may not compete. For sub-300ms turn-taking conversational UX, Moshi likely wins.
- Multi-language support. README is English-primary; arXiv paper would clarify multilingual training data and inference quality.
- Production stability of the Rust runtime. The Rust crate `rustymimi` is at 0.2.2; the main `moshi-backend` binary doesn’t have published releases. Production-readiness signal is mixed — actively maintained, but no semver-stable release tag.
- Memory pressure during long conversations. Full-duplex streaming + KV cache could grow unbounded across an hours-long conversation. README doesn’t specify whether conversation length is bounded by a context-window equivalent or by the streaming-cache architecture. Worth checking the paper. (A rough cache-growth estimate follows this list.)
- Composition with Memory + Dreaming. Could Moshi be the voice front-end for a Managed Agents system that uses Mahes’ Memory + Dreaming? Architecturally yes — Moshi handles voice I/O, Managed Agents handle reasoning + memory consolidation. Implementation specifics unclear.
- Inner-monologue text-token output. Moshi predicts text tokens as inner monologue. Is that text exposed via the API for logging / debugging / hybrid pipelines, or strictly internal? Unclear from README.
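
For the memory-pressure question, an illustrative KV-cache growth estimate. The layer count and width below are assumptions (plausible for a ~7B transformer), not figures from the README or paper; the point is the growth rate at 12.5 steps per second, not the absolute numbers.

```python
# Illustrative KV-cache growth for an unbounded full-duplex session.
# N_LAYERS and D_MODEL are ASSUMED values, not published Moshi figures.
FRAME_RATE_HZ = 12.5   # temporal-transformer steps per second (from the README)
N_LAYERS = 32          # assumption
D_MODEL = 4096         # assumption
BYTES_PER_VALUE = 2    # bf16
kv_per_step = 2 * N_LAYERS * D_MODEL * BYTES_PER_VALUE  # key + value across all layers

for minutes in (10, 60, 180):
    steps = minutes * 60 * FRAME_RATE_HZ
    print(f"{minutes:>3} min -> ~{steps * kv_per_step / 2**30:.1f} GiB of KV cache")
# ~3.7 GiB at 10 min, ~22 GiB at 1 h, ~66 GiB at 3 h under these assumptions;
# without a sliding window or cache eviction, hours-long sessions would not fit a single GPU.
```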
Try It
- On a Mac (easiest path): `pip install -U moshi_mlx` then `python -m moshi_mlx.local -q 4`. Talk to Moshi locally with no GPU server, no API key, no rate limit. Apple Silicon M1+ minimum.
- In the browser: moshi.chat — Kyutai’s hosted demo. Fastest path to “what does this actually sound like.”
- For research / prototyping: `pip install -U moshi` then `python -m moshi.server --gradio-tunnel`. Gradio UI on `localhost:8998` with a tunneled public URL for sharing.
- For production self-hosting: clone the repo, build the Rust runtime (`cargo run --features cuda --bin moshi-backend -r`). Pair with the `swarm-config.yml` and `docker-compose.yml` in the repo root for multi-instance deployment.
- Pair with ElevenLabs voice agents for a side-by-side comparison. Same prompts, same conversation, both stacks. Note the latency difference, the voice-quality difference, and the deployment-ops difference. Useful baseline before committing to either stack.
- Read the paper if any of these are relevant to your work: streaming codec design, full-duplex dialogue training, inner-monologue prediction, latency-vs-quality trade-offs in speech foundation models.
- For all-open-stack voice agents: combine Moshi (speech) + self-hosted OpenClaw / Hermes (agent layer) + local MCP servers (tools). The closest analog to a self-hosted Siri/Alexa as of 2026.
Related
- ElevenLabs Voice Agents on Claude Code — closed-source counterpart; the two stacks are the canonical voice-agent comparison
- Claude Managed Agents — could pair with Moshi as the reasoning layer behind a Moshi voice front-end
- Memory and Dreaming for Self-Learning Agents — voice agents persist memory through Managed Agents primitives
- Code with Claude 2026 — Opening Keynote — Anthropic’s broader 2026 agent platform context
- OpenClaw on Rabbit R1 — pocket-hardware voice surface; Moshi is the model layer that could pair with R1+OpenClaw for an all-open voice stack
- HeyGen Hyperframes — video composition with TTS layers (different problem; complementary to Moshi for video output)
- HeyGen Studio Automation — TTS + lipsync + multi-clip video pipeline
- yt-dlp — YouTube transcript / audio extraction tool used by `bin/yt-transcript`, plus local whisper-cli fallback for captionless videos
- Karpathy — Vibe Coding to Agentic Engineering — software 1.0/2.0/3.0 framing applies to the voice-agent stack composition
- 2026 Claude Code AIOS Pattern — voice as one of the surfaces in the Cowork-native AIOS
- Agents & Agentic Systems — Mimi codec is reusable infra for agent-side audio