whisper-cli (whisper.cpp) — Free Local Speech-to-Text

Source: ai-research/whisper-cpp-github-readme-2026.md, ai-research/whisper-cpp-models-readme-2026.md, ai-research/whisper-cpp-apple-silicon-benchmark-2026.md, ai-research/whisper-cpp-vs-faster-whisper-comparison-2026.md, ai-research/whisper-cpp-vs-cloud-stt-privacy-cost-2026.md

whisper-cli is the command-line binary shipped by whisper.cpp (ggml-org/whisper.cpp, formerly ggerganov/whisper.cpp) — a plain C/C++ reimplementation of OpenAI’s Whisper speech-recognition model, running entirely on-device with zero required dependencies. It is not OpenAI’s hosted Whisper API; it’s the free, self-hosted, offline path to the same model weights. This wiki already references it in passing (the yt-dlp article’s captionless-video fallback via ~/.whisper-models/ggml-*.en.bin) — this article covers it as a general-purpose local STT tool in its own right, and as the free/private/offline sibling to the real-time conversational voice stacks covered elsewhere in this topic (Moshi, OpenAI Realtime API, ElevenLabs).

Key Takeaways

Free and fully local. MIT licensed, no API key, no network call, no per-minute or per-token cost. Runs the same Whisper model weights OpenAI ships, converted to a custom single-file ggml binary format (params + mel filters + vocab + weights).
Broad hardware backend support. Metal (first-class on Apple Silicon) + Core ML/ANE offload for the encoder, CUDA (NVIDIA), Vulkan (cross-vendor GPU), ROCm/HIP (AMD), OpenVINO (Intel), CANN (Ascend NPU), MUSA (Moore Threads), and plain CPU (AVX2/NEON). Integer quantization (e.g. -q5_0) shrinks model files further — large-v3-q5_0 drops from 2.9 GiB to 1.1 GiB.
Model size is a real tradeoff, not just a knob. Six sizes from tiny (39M params, 7.6% WER, ~32x realtime) to large-v3 (1.55B params, 2.5% WER, 1x baseline speed); large-v3-turbo (~809M params) gets close to large-v2 quality at meaningfully faster inference.
Metal makes this a genuinely good Mac tool. On Apple Silicon (M2), Metal gives ~4.4x speedup over CPU-only, and Core ML/ANE offload adds a further >3x on the encoder specifically. This is the deciding factor vs. alternatives like faster-whisper, which has no Metal backend (CPU-only on Mac) despite beating whisper.cpp on NVIDIA hardware (~12x realtime vs. ~8x on an RTX 4070, large-v3).
Not built for live conversational latency. A whisper-stream example exists for real-time mic transcription, with practical latency around 0.5–2s behind live speech — usable for captioning/dictation, but not competitive with purpose-built full-duplex conversational models like Moshi or OpenAI Realtime API. Position it as the offline/batch/privacy tier, not a low-latency conversational competitor.
Privacy and cost are the real selling points. Audio never leaves the device — a clean win for GDPR-sensitive use cases (no DPA needed) — versus OpenAI’s hosted Whisper endpoint (~ $0.006/ min) orco m p a r ab l ec l o u d u s a g e - ba se d p r i c in g . O n eso u rcees t ima t e d$ 120–480/month saved at 8 hours/week of meeting transcription. Accuracy is essentially identical to cloud Whisper since it’s the same weights; cloud still edges ahead on marginal multilingual coverage.

Model Sizes

Model	Params	Disk	RAM	English WER	Speed (RTX 4070 reference)
tiny	39M	75 MiB	~273 MB	7.6%	~32x realtime
base	74M	142 MiB	~388 MB	5.0%	~16x
small	244M	466 MiB	~852 MB	3.4%	~6x
medium	769M	1.5 GiB	~2.1 GB	2.9%	~2x
large-v3	1.55B	2.9 GiB	~3.9 GB	2.5%	1x (baseline)
large-v3-turbo	~809M	1.5 GiB	—	~large-v2 quality	faster than large-v3

Suffix conventions: .en = English-only variant (smaller, slightly more accurate for English-only use); -q5_0 = quantized (roughly halves disk size); -tdrz = built-in speaker-turn diarization. A Silero-based voice activity detector (VAD) can pre-filter silence before decoding to save compute.

Setup

Quickest path (macOS): brew install whisper-cpp — installs a prebuilt whisper-cli binary directly.

From source:

git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build -j --config Release
./models/download-ggml-model.sh base.en
./build/bin/whisper-cli -f samples/jfk.wav

whisper-cli only accepts 16-bit WAV natively — convert other formats with ffmpeg first, or build with FFmpeg support for broader format handling. Run whisper-cli -h for the full flag reference. Real-time mic transcription is available via the whisper-stream example (--step 500 --length 5000).

Performance

Real-time factor (RTF) is the standard whisper.cpp benchmark metric — RTF 0.1 means 30 seconds of audio transcribes in 3 seconds. On Apple Silicon (M2) with Metal: small model RTF ~0.08–0.35 depending on source; large-v3 RTF ~0.45. On NVIDIA (RTX 4070): whisper.cpp’s CUDA path runs large-v3 at ~8x realtime, while faster-whisper’s CTranslate2/int8 path reaches ~12x — faster-whisper wins on NVIDIA hardware and VRAM efficiency, but whisper.cpp is the clear pick on Apple Silicon since faster-whisper has no Metal backend at all.

Implementation

Tool/Service: whisper.cpp (ggml-org/whisper.cpp), MIT license, free. Setup: brew install whisper-cpp (macOS) or build from source via CMake; download a GGML model (./models/download-ggml-model.sh <size>). Cost: $0 — fully local, no API calls, no usage metering. Integration notes: Best fit as a batch/offline transcription tool (captioning, meeting transcripts, dictation) rather than a live-conversation voice-agent backend — see Moshi or OpenAI Realtime API for that use case. Already used in this vault as the captionless-video fallback for yt-dlp transcript extraction (~/.whisper-models/ggml-*.en.bin).

Moshi — Kyutai Labs’ Full-Duplex Speech Foundation Model — the real-time conversational counterpart; whisper.cpp is offline/batch, Moshi is full-duplex live dialogue
OpenAI Realtime API — cloud-hosted native speech-to-speech; opposite tradeoff (paid, low-latency, no self-host) to whisper.cpp’s (free, offline, higher-latency)
Voice Agent Comparison — Moshi vs ElevenLabs vs OpenAI Realtime vs Cartesia Sonic — where whisper.cpp sits as the free/private/offline reference point against the four live conversational stacks
yt-dlp — the existing vault use case that already depends on whisper-cli as a captionless-video fallback
ElevenLabs voice agents on Claude Code — the commercial pipelined alternative for live use cases whisper.cpp isn’t suited for

Try It

brew install whisper-cpp on a Mac, download the base.en model, and transcribe a sample WAV file to confirm the setup works.
If already using yt-dlp for transcript extraction in this vault, verify which whisper model size is configured at ~/.whisper-models/ and consider whether a larger/smaller model better fits the accuracy/speed tradeoff for that use case.
For any local meeting/dictation transcription need, compare whisper-cli against a cloud API’s per-minute cost before defaulting to the cloud option — the local option is usually free once set up.

Open Questions

No streaming/real-time latency benchmark was found for whisper.cpp on Apple Silicon specifically (only the general “0.5–2s behind live speech” figure) — would need direct measurement to place it precisely against Moshi/OpenAI Realtime API on a latency axis.
Multilingual accuracy gap vs. cloud Whisper endpoints is asserted but not quantified in the sources gathered.

Jonathon's AI Wiki

Explorer

whisper-cli (whisper.cpp) — Free Local Speech-to-Text

Key Takeaways

Model Sizes

Setup

Performance

Implementation

Try It

Open Questions

Graph View

Table of Contents

Backlinks

Jonathon's AI Wiki

Explorer

whisper-cli (whisper.cpp) — Free Local Speech-to-Text

Key Takeaways

Model Sizes

Setup

Performance

Implementation

Related

Try It

Open Questions

Graph View

Table of Contents

Backlinks