Source: ai-research/whisper-cpp-github-readme-2026.md, ai-research/whisper-cpp-models-readme-2026.md, ai-research/whisper-cpp-apple-silicon-benchmark-2026.md, ai-research/whisper-cpp-vs-faster-whisper-comparison-2026.md, ai-research/whisper-cpp-vs-cloud-stt-privacy-cost-2026.md
whisper-cli is the command-line binary shipped by whisper.cpp (ggml-org/whisper.cpp, formerly ggerganov/whisper.cpp) — a plain C/C++ reimplementation of OpenAI’s Whisper speech-recognition model, running entirely on-device with zero required dependencies. It is not OpenAI’s hosted Whisper API; it’s the free, self-hosted, offline path to the same model weights. This wiki already references it in passing (the yt-dlp article’s captionless-video fallback via ~/.whisper-models/ggml-*.en.bin) — this article covers it as a general-purpose local STT tool in its own right, and as the free/private/offline sibling to the real-time conversational voice stacks covered elsewhere in this topic (Moshi, OpenAI Realtime API, ElevenLabs).
Key Takeaways
- Free and fully local. MIT licensed, no API key, no network call, no per-minute or per-token cost. Runs the same Whisper model weights OpenAI ships, converted to a custom single-file
ggmlbinary format (params + mel filters + vocab + weights). - Broad hardware backend support. Metal (first-class on Apple Silicon) + Core ML/ANE offload for the encoder, CUDA (NVIDIA), Vulkan (cross-vendor GPU), ROCm/HIP (AMD), OpenVINO (Intel), CANN (Ascend NPU), MUSA (Moore Threads), and plain CPU (AVX2/NEON). Integer quantization (e.g.
-q5_0) shrinks model files further —large-v3-q5_0drops from 2.9 GiB to 1.1 GiB. - Model size is a real tradeoff, not just a knob. Six sizes from
tiny(39M params, 7.6% WER, ~32x realtime) tolarge-v3(1.55B params, 2.5% WER, 1x baseline speed);large-v3-turbo(~809M params) gets close tolarge-v2quality at meaningfully faster inference. - Metal makes this a genuinely good Mac tool. On Apple Silicon (M2), Metal gives ~4.4x speedup over CPU-only, and Core ML/ANE offload adds a further >3x on the encoder specifically. This is the deciding factor vs. alternatives like faster-whisper, which has no Metal backend (CPU-only on Mac) despite beating whisper.cpp on NVIDIA hardware (~12x realtime vs. ~8x on an RTX 4070, large-v3).
- Not built for live conversational latency. A
whisper-streamexample exists for real-time mic transcription, with practical latency around 0.5–2s behind live speech — usable for captioning/dictation, but not competitive with purpose-built full-duplex conversational models like Moshi or OpenAI Realtime API. Position it as the offline/batch/privacy tier, not a low-latency conversational competitor. - Privacy and cost are the real selling points. Audio never leaves the device — a clean win for GDPR-sensitive use cases (no DPA needed) — versus OpenAI’s hosted Whisper endpoint (~120–480/month saved at 8 hours/week of meeting transcription. Accuracy is essentially identical to cloud Whisper since it’s the same weights; cloud still edges ahead on marginal multilingual coverage.
Model Sizes
| Model | Params | Disk | RAM | English WER | Speed (RTX 4070 reference) |
|---|---|---|---|---|---|
| tiny | 39M | 75 MiB | ~273 MB | 7.6% | ~32x realtime |
| base | 74M | 142 MiB | ~388 MB | 5.0% | ~16x |
| small | 244M | 466 MiB | ~852 MB | 3.4% | ~6x |
| medium | 769M | 1.5 GiB | ~2.1 GB | 2.9% | ~2x |
| large-v3 | 1.55B | 2.9 GiB | ~3.9 GB | 2.5% | 1x (baseline) |
| large-v3-turbo | ~809M | 1.5 GiB | — | ~large-v2 quality | faster than large-v3 |
Suffix conventions: .en = English-only variant (smaller, slightly more accurate for English-only use); -q5_0 = quantized (roughly halves disk size); -tdrz = built-in speaker-turn diarization. A Silero-based voice activity detector (VAD) can pre-filter silence before decoding to save compute.
Setup
Quickest path (macOS): brew install whisper-cpp — installs a prebuilt whisper-cli binary directly.
From source:
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build -j --config Release
./models/download-ggml-model.sh base.en
./build/bin/whisper-cli -f samples/jfk.wavwhisper-cli only accepts 16-bit WAV natively — convert other formats with ffmpeg first, or build with FFmpeg support for broader format handling. Run whisper-cli -h for the full flag reference. Real-time mic transcription is available via the whisper-stream example (--step 500 --length 5000).
Performance
Real-time factor (RTF) is the standard whisper.cpp benchmark metric — RTF 0.1 means 30 seconds of audio transcribes in 3 seconds. On Apple Silicon (M2) with Metal: small model RTF ~0.08–0.35 depending on source; large-v3 RTF ~0.45. On NVIDIA (RTX 4070): whisper.cpp’s CUDA path runs large-v3 at ~8x realtime, while faster-whisper’s CTranslate2/int8 path reaches ~12x — faster-whisper wins on NVIDIA hardware and VRAM efficiency, but whisper.cpp is the clear pick on Apple Silicon since faster-whisper has no Metal backend at all.
Implementation
Tool/Service: whisper.cpp (ggml-org/whisper.cpp), MIT license, free.
Setup: brew install whisper-cpp (macOS) or build from source via CMake; download a GGML model (./models/download-ggml-model.sh <size>).
Cost: $0 — fully local, no API calls, no usage metering.
Integration notes: Best fit as a batch/offline transcription tool (captioning, meeting transcripts, dictation) rather than a live-conversation voice-agent backend — see Moshi or OpenAI Realtime API for that use case. Already used in this vault as the captionless-video fallback for yt-dlp transcript extraction (~/.whisper-models/ggml-*.en.bin).
Related
- Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model — the real-time conversational counterpart; whisper.cpp is offline/batch, Moshi is full-duplex live dialogue
- OpenAI Realtime API — cloud-hosted native speech-to-speech; opposite tradeoff (paid, low-latency, no self-host) to whisper.cpp’s (free, offline, higher-latency)
- Voice Agent Comparison — Moshi vs ElevenLabs vs OpenAI Realtime vs Cartesia Sonic — where whisper.cpp sits as the free/private/offline reference point against the four live conversational stacks
- yt-dlp — the existing vault use case that already depends on whisper-cli as a captionless-video fallback
- ElevenLabs voice agents on Claude Code — the commercial pipelined alternative for live use cases whisper.cpp isn’t suited for
Try It
brew install whisper-cppon a Mac, download thebase.enmodel, and transcribe a sample WAV file to confirm the setup works.- If already using
yt-dlpfor transcript extraction in this vault, verify which whisper model size is configured at~/.whisper-models/and consider whether a larger/smaller model better fits the accuracy/speed tradeoff for that use case. - For any local meeting/dictation transcription need, compare
whisper-cliagainst a cloud API’s per-minute cost before defaulting to the cloud option — the local option is usually free once set up.
Open Questions
- No streaming/real-time latency benchmark was found for whisper.cpp on Apple Silicon specifically (only the general “0.5–2s behind live speech” figure) — would need direct measurement to place it precisely against Moshi/OpenAI Realtime API on a latency axis.
- Multilingual accuracy gap vs. cloud Whisper endpoints is asserted but not quantified in the sources gathered.