Source: ai-research/whisper-cpp-github-readme-2026.md, ai-research/whisper-cpp-models-readme-2026.md, ai-research/whisper-cpp-apple-silicon-benchmark-2026.md, ai-research/whisper-cpp-vs-faster-whisper-comparison-2026.md, ai-research/whisper-cpp-vs-cloud-stt-privacy-cost-2026.md

whisper-cli is the command-line binary shipped by whisper.cpp (ggml-org/whisper.cpp, formerly ggerganov/whisper.cpp) — a plain C/C++ reimplementation of OpenAI’s Whisper speech-recognition model, running entirely on-device with zero required dependencies. It is not OpenAI’s hosted Whisper API; it’s the free, self-hosted, offline path to the same model weights. This wiki already references it in passing (the yt-dlp article’s captionless-video fallback via ~/.whisper-models/ggml-*.en.bin) — this article covers it as a general-purpose local STT tool in its own right, and as the free/private/offline sibling to the real-time conversational voice stacks covered elsewhere in this topic (Moshi, OpenAI Realtime API, ElevenLabs).

Key Takeaways

  • Free and fully local. MIT licensed, no API key, no network call, no per-minute or per-token cost. Runs the same Whisper model weights OpenAI ships, converted to a custom single-file ggml binary format (params + mel filters + vocab + weights).
  • Broad hardware backend support. Metal (first-class on Apple Silicon) + Core ML/ANE offload for the encoder, CUDA (NVIDIA), Vulkan (cross-vendor GPU), ROCm/HIP (AMD), OpenVINO (Intel), CANN (Ascend NPU), MUSA (Moore Threads), and plain CPU (AVX2/NEON). Integer quantization (e.g. -q5_0) shrinks model files further — large-v3-q5_0 drops from 2.9 GiB to 1.1 GiB.
  • Model size is a real tradeoff, not just a knob. Six sizes from tiny (39M params, 7.6% WER, ~32x the speed of large-v3) to large-v3 (1.55B params, 2.5% WER, the 1x speed baseline); large-v3-turbo (~809M params) gets close to large-v2 quality at meaningfully faster inference.
  • Metal makes this a genuinely good Mac tool. On Apple Silicon (M2), Metal gives ~4.4x speedup over CPU-only, and Core ML/ANE offload adds a further >3x on the encoder specifically. This is the deciding factor vs. alternatives like faster-whisper, which has no Metal backend (CPU-only on Mac) despite beating whisper.cpp on NVIDIA hardware (~12x realtime vs. ~8x on an RTX 4070, large-v3).
  • Not built for live conversational latency. A whisper-stream example exists for real-time mic transcription, with practical latency around 0.5–2s behind live speech — usable for captioning/dictation, but not competitive with purpose-built full-duplex conversational models like Moshi or OpenAI Realtime API. Position it as the offline/batch/privacy tier, not a low-latency conversational competitor.
  • Privacy and cost are the real selling points. Audio never leaves the device, a clean win for GDPR-sensitive use cases (no DPA needed), versus OpenAI’s hosted Whisper endpoint (~$120–480/month saved at 8 hours/week of meeting transcription). Accuracy is essentially identical to cloud Whisper since it’s the same weights; cloud still edges ahead on marginal multilingual coverage.

Model Sizes

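| Model | Params | Disk | RAM | English WER | Speed relative to large-v3 (RTX 4070 reference) |
|---|---|---|---|---|---|
| tiny | 39M | 75 MiB | ~273 MB | 7.6% | ~32x |
| base | 74M | 142 MiB | ~388 MB | 5.0% | ~16x |
| small | 244M | 466 MiB | ~852 MB | 3.4% | ~6x |
| medium | 769M | 1.5 GiB | ~2.1 GB | 2.9% | ~2x |
| large-v3 | 1.55B | 2.9 GiB | ~3.9 GB | 2.5% | 1x (baseline) |
| large-v3-turbo | ~809M | 1.5 GiB | n/a | ~large-v2 quality | faster than large-v3 |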

Suffix conventions: .en = English-only variant (smaller, slightly more accurate for English-only use); -q5_0 = quantized (cuts disk size by more than half, e.g. large-v3 from 2.9 GiB to 1.1 GiB); -tdrz = built-in speaker-turn diarization. A Silero-based voice activity detector (VAD) can pre-filter silence before decoding to save compute.
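
As a rough illustration of how these suffixes combine into a model filename, the sketch below assembles names following the ggml-<size><suffixes>.bin pattern seen in paths such as ~/.whisper-models/ggml-*.en.bin. The model_file helper and its argument order are hypothetical and exist only for this example; they are not part of whisper.cpp.

```shell
# Hypothetical helper (not part of whisper.cpp): build the expected ggml
# filename from a model size plus optional suffixes.
model_file() {
    size="$1"          # e.g. base, small, large-v3
    variant="${2:-}"   # optional: ".en" for English-only, "" for multilingual
    quant="${3:-}"     # optional: "-q5_0" for a quantized file, "" otherwise
    printf 'ggml-%s%s%s.bin\n' "$size" "$variant" "$quant"
}

model_file base .en            # prints ggml-base.en.bin
model_file large-v3 "" -q5_0   # prints ggml-large-v3-q5_0.bin
```

The printed name is what would be passed to whisper-cli via -m once the corresponding file has been downloaded.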

Setup

Quickest path (macOS): brew install whisper-cpp — installs a prebuilt whisper-cli binary directly.

From source:

git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build -j --config Release
./models/download-ggml-model.sh base.en
./build/bin/whisper-cli -f samples/jfk.wav

whisper-cli only accepts 16-bit WAV natively; convert other formats with ffmpeg first, or build with FFmpeg support for broader format handling. Run whisper-cli -h for the full flag reference. Real-time mic transcription is available via the whisper-stream example; with --step 500 --length 5000 it samples audio every 500 ms and transcribes over a 5000 ms window.
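
A minimal sketch of the conversion step, assuming ffmpeg is on PATH and the input filename has an extension. The 16 kHz mono settings follow the conversion command commonly shown alongside whisper.cpp; wav_name and to_wav are hypothetical helper names introduced here.

```shell
# Hypothetical helper: derive the .wav output path by swapping the extension.
# Assumes the input has an extension and is not itself a .wav file.
wav_name() {
    printf '%s.wav\n' "${1%.*}"
}

# Convert any ffmpeg-readable file to 16 kHz mono 16-bit PCM WAV.
to_wav() {
    ffmpeg -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le "$(wav_name "$1")"
}

# Usage (requires ffmpeg and whisper-cli; the model path is an example):
#   to_wav talk.mp3 && whisper-cli -m models/ggml-base.en.bin -f talk.wav
```

Keeping the conversion in its own function makes it easy to drop once a build with FFmpeg support is in place.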

Performance

Real-time factor (RTF) is the standard whisper.cpp benchmark metric — RTF 0.1 means 30 seconds of audio transcribes in 3 seconds. On Apple Silicon (M2) with Metal: small model RTF ~0.08–0.35 depending on source; large-v3 RTF ~0.45. On NVIDIA (RTX 4070): whisper.cpp’s CUDA path runs large-v3 at ~8x realtime, while faster-whisper’s CTranslate2/int8 path reaches ~12x — faster-whisper wins on NVIDIA hardware and VRAM efficiency, but whisper.cpp is the clear pick on Apple Silicon since faster-whisper has no Metal backend at all.
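
The RTF arithmetic above, and its relationship to the "Nx realtime" figures, can be sketched with two small awk-based helpers. The function names are illustrative only; the inputs are the numbers quoted in this section.

```shell
# RTF = processing time / audio duration (lower is faster).
rtf() {
    awk -v audio="$1" -v proc="$2" 'BEGIN { printf "%.2f\n", proc / audio }'
}

# Speed as a multiple of realtime is the reciprocal of RTF.
realtime_multiple() {
    awk -v r="$1" 'BEGIN { printf "%.1f\n", 1 / r }'
}

rtf 30 3                  # prints 0.10 (30 s of audio transcribed in 3 s)
realtime_multiple 0.125   # prints 8.0  (RTF 0.125 is 8x realtime)
realtime_multiple 0.45    # prints 2.2  (large-v3 on M2 with Metal)
```

To measure RTF on a given machine, time a whisper-cli run (for example with the shell's time keyword) and divide the elapsed seconds by the clip length.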

Implementation

Tool/Service: whisper.cpp (ggml-org/whisper.cpp), MIT license, free.
Setup: brew install whisper-cpp (macOS) or build from source via CMake; download a GGML model (./models/download-ggml-model.sh <size>).
Cost: $0. Fully local, no API calls, no usage metering.
Integration notes: Best fit as a batch/offline transcription tool (captioning, meeting transcripts, dictation) rather than a live-conversation voice-agent backend; see Moshi or OpenAI Realtime API for that use case. Already used in this vault as the captionless-video fallback for yt-dlp transcript extraction (~/.whisper-models/ggml-*.en.bin).
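
For the batch use case, a minimal sketch of a directory-wide transcription loop might look like the following. The transcribe_dir function and the WHISPER_CLI and WHISPER_MODEL environment overrides are assumptions made for this example; -m, -f, -otxt and -of are whisper-cli flags, but confirm them against whisper-cli -h for the installed version.

```shell
# Transcribe every .wav file in a directory to a sibling .txt file.
# WHISPER_CLI and WHISPER_MODEL are hypothetical overrides for this sketch.
transcribe_dir() {
    dir="$1"
    cli="${WHISPER_CLI:-whisper-cli}"
    model="${WHISPER_MODEL:-models/ggml-base.en.bin}"
    for wav in "$dir"/*.wav; do
        [ -e "$wav" ] || continue   # skip when the glob matches nothing
        # -otxt writes a plain-text transcript; -of sets the output path
        # without its extension, so a.wav produces a.txt next to it.
        "$cli" -m "$model" -f "$wav" -otxt -of "${wav%.wav}"
    done
}

# Usage: transcribe_dir ./recordings
```

Files are processed sequentially on purpose: a single whisper-cli run already uses multiple threads or the GPU, so parallel runs would mostly compete for the same resources.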

Try It

  1. brew install whisper-cpp on a Mac, download the base.en model, and transcribe a sample WAV file to confirm the setup works.
  2. If already using yt-dlp for transcript extraction in this vault, verify which whisper model size is configured at ~/.whisper-models/ and consider whether a larger/smaller model better fits the accuracy/speed tradeoff for that use case.
  3. For any local meeting/dictation transcription need, compare whisper-cli against a cloud API’s per-minute cost before defaulting to the cloud option — the local option is usually free once set up.
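
The cost comparison in step 3 comes down to simple arithmetic. The sketch below estimates a monthly cloud bill from weekly hours and a per-minute rate; the rate passed in is a placeholder rather than any provider's actual price, and the 52/12 weeks-per-month factor is an assumption.

```shell
# Estimated monthly cloud cost for a weekly transcription workload.
# rate_per_minute is a placeholder: substitute the provider's current price.
monthly_cloud_cost() {
    hours_per_week="$1"
    rate_per_minute="$2"
    awk -v h="$hours_per_week" -v r="$rate_per_minute" \
        'BEGIN { printf "%.2f\n", h * 60 * r * 52 / 12 }'
}

monthly_cloud_cost 8 0.01   # prints 20.80 for a hypothetical 0.01/minute rate
```

The local alternative has no per-minute charge, so the figure printed here is the recurring amount to weigh against the one-time setup effort.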

Open Questions

  • No streaming/real-time latency benchmark was found for whisper.cpp on Apple Silicon specifically (only the general “0.5–2s behind live speech” figure) — would need direct measurement to place it precisely against Moshi/OpenAI Realtime API on a latency axis.
  • Multilingual accuracy gap vs. cloud Whisper endpoints is asserted but not quantified in the sources gathered.