Hermes on Apple Silicon — Local Model, Backend & Quant Guide (Mac/MLX)

Source: raw/reddit-1uc7rw5.md — r/hermesagent “Mac + MLX Megathread — Hermes Agent on Apple Silicon” (OP Jonathan_Rivera, 142 score, last updated 2026-06-21; community-aggregated from 20+ threads + GitHub issues + benchmarks).

The most-asked Hermes-on-Mac question is “what model do I download, and why is tool calling broken?” This is the consolidated community answer for running Hermes Agent locally on Apple Silicon — model-by-RAM picks, the backend landscape, and the pitfalls that waste an afternoon. Caveat: this is a community megathread, not Nous’s official docs, and the local-model/quant/backend stack changes weekly — treat specific model names, tok/s figures, and bug numbers as time-stamped (June 2026) and verify against linked sources before relying on them. ^[the recommendations are extracted from the thread; the “best pick” framing is community consensus, not a benchmark this wiki ran]

Key Takeaways

7B–9B is the floor for agent/tool use; 2B–4B models are chatbots, not agents. One tester: gemma4:e2b on an M4 16GB “can’t even handle one request”; Qwen3.5-9B “sort of worked.” Tool calling for 6+ chained calls is the real bar.
The current Mac default (32GB+) is Qwen3.6-35B-A3B (MoE, ~3B active/token, ~20GB at 4-bit MLX, runs like a 3B for speed). 16GB floor = Qwen3.5-9B Q4_K_M with a quantized KV cache and 64K (not 128K) context.
Tool calling breaking is the #1 failure mode — and it’s usually the backend, not the model. Test /v1/chat/completions with a tool-calling prompt before blaming a model; if one backend fails, try another.
MTP (multi-token prediction / speculative decode) is a net loss on Apple Metal — the opposite of what model cards market. Do not enable it on Mac.
Memory bandwidth beats chip generation. An M3 Max (~400 GB/s) generates tokens faster than an M4 Pro (~273 GB/s). Shopping for Hermes: Max > Pro > base, even a generation older.
The agent context tax is real and punishing on local Macs (context = RAM). Hermes’ orchestrator can spend ~15K tokens just to reply “hi.” The community consensus is a hybrid stack: fast local model for routine work + cheap cloud fallback for hard tasks.

Which model to download (by RAM)

Mac RAM	Download	Backend	Why
8GB	Qwen3.5-4B Q4_K_M / Gemma 4 E2B Q4	Ollama / llama.cpp	Simple chat only — not heavy agent work
16GB	Qwen3.5-9B Q4_K_M or MLX 4-bit	llama.cpp (compat) / Ollama	Practical floor; preserve RAM for context, quantize KV
24GB	Qwen3.6-27B Q4_K_M or Qwen3.6-35B-A3B 4-bit MLX	llama.cpp (dense) / MLX-LM (MoE)	Dense = stronger/predictable coding; MoE = faster decode
32–48GB	Qwen3.6-35B-A3B 4-bit MLX (OptiQ if available)	MLX-LM / oMLX	The Mac sweet spot — ~3B active, ~20GB file
48–64GB	Qwen3.6-27B Q6_K/Q8, or 35B-A3B 8-bit	llama.cpp (dense quality) / MLX (MoE speed)	Q6/Q8 is the serious-agent quant
64GB+	Gemma 4 26B-A4B Q4 (MoE alt), Qwen3.6-27B Q8	MLX / llama.cpp	More RAM → better quants + longer context, not automatically a better model

Start with stock models. Uncensored variants (Heretic, HauhauCS) are advanced options — “the boring model that follows schema for 6+ tool calls beats the spicy one that talks itself into a ditch.”

Backend landscape

llama.cpp (GGUF) — max compatibility, full KV-cache quantization control (critical on 16GB), vision/mmproj, Jinja templates; fastest time-to-first-token in Hermes’ own testing. Predictable behavior often wins for tool loops despite slower raw generation.
MLX-LM / oMLX — 20–30% faster generation, gap widens on MoE (35B-A3B 4-bit: ~61 tok/s vs ~17 for dense 27B 4-bit on M1 Max 64GB). Known June-2026 bugs: Qwen3.5/3.6 non-Coder tool-parser mismatch (mlx-lm #1293), MTP-variant multi-turn failures (#1292).
Ollama — easiest setup; MLX backend (v0.19+) ~2× decode. Known KV-cache memory leak on M4 Max (#16698) that swap-deaths token gen — set OLLAMA_KV_CACHE_TYPE=q8_0 + OLLAMA_FLASH_ATTENTION=1.
LM Studio — GUI comfort; tool-call parser bugs on Qwen/Gemma in some versions (upgrade first).
Rapid-MLX (new, June 2026) — claims 2–4× faster than Ollama, 0.08s cached TTFT, 17 tool parsers / 100% tool-calling, drop-in OpenAI replacement; currently the strongest Mac backend for tool-calling reliability (3,000+ stars, actively maintained). Newer/less battle-tested than llama.cpp/Ollama.

Critical pitfalls (read before you waste 4 hours)

MTP = slower on Mac. llama.cpp #23752: 25.3 → 19.3 tok/s; Qwen3.6-35B self-MTP collapses to 1.93 tok/s. Never enable --spec-type draft-mtp on Metal.
Tool calling breaks across backends — test before trusting; rotate backends before blaming the model.
16GB is the floor, not the sweet spot. Usable RAM after macOS is ~10–12GB; a strong 9B at decent quant beats a crippled 27B in swap.
Bandwidth > chip generation (see above).
Agent context tax — quantize KV, keep context conservative (64K), consider a small fast orchestrator that delegates heavy work to cloud sub-agents.
KV cache is the hidden memory killer — always --cache-type-k q8_0 --cache-type-v q4_0. One M3 Pro 18GB user went from timeouts to working via 4-bit KV quant + Hermes 0.8.0 lazy skill loading + 40K context (first message ~14K tokens, then ~600).
Qwen overthinks — on RAM-limited Macs, turn thinking off to escape timeout territory.

Sampling (Qwen3.6, Hermes agent work)

Workload	Thinking	Temp	Top-P	Top-K	Presence
Coding / tool loops	ON	0.6	0.95	20	0.0
Research / chat	ON	0.8–1.0	0.95	20	0.0
Summarization (latency)	OFF	0.7	0.8	20	1.5

Try It

Match RAM → model from the table; stay stock first. On 16GB start with Qwen3.5-9B Q4_K_M.
Run a local OpenAI-compatible server (llama.cpp llama-server … --jinja --cache-type-k q8_0 --cache-type-v q4_0, or Ollama/MLX-LM/Rapid-MLX).
Point Hermes at it: hermes config set model.provider custom:local-mac; model.base_url http://127.0.0.1:8080/v1; model.api_key local-no-key; model.default <id>.
Verify tool calling immediately — curl /v1/models, then a tool-calling prompt. If it loops/hangs, switch backend before switching model.
Adopt the hybrid pattern — local for routine, a cheap cloud fallback (e.g. a $20 C o d e x pl an, orK imik 2.6 a t <$ 1/Mtok) for hard tasks; one reported M4 Max setup runs 95% local at ~$1/week.

Open Questions

How durable are these picks? Models/quants/backends turn over weekly — the article is a June-2026 snapshot; re-verify against the linked GitHub issues and HuggingFace pages before quoting figures.
Official Nous guidance — Nous’s Run Local LLMs on Mac doc should be the tiebreaker where it disagrees with community consensus.

Hermes Memory Providers — local context is RAM-bound; provider choice interacts with the context tax.
Hermes Autonomous SWE Workflow — the cheapest-successful-outcome routing (local model + cloud fallback) this guide’s hybrid pattern serves.
Hermes Profiles & Multi-Instance — run a light local orchestrator profile that delegates heavy work.
Nate Herk 1-Hour Course — operator hygiene + deployment context.
Hermes Desktop (Official) — the GUI install that wraps the same local-endpoint config.

Jonathon's AI Wiki

Explorer

Hermes on Apple Silicon — Local Model, Backend & Quant Guide (Mac/MLX)

Key Takeaways

Which model to download (by RAM)

Backend landscape

Critical pitfalls (read before you waste 4 hours)

Sampling (Qwen3.6, Hermes agent work)

Try It

Open Questions

Graph View

Table of Contents

Backlinks

Jonathon's AI Wiki

Explorer

Hermes on Apple Silicon — Local Model, Backend & Quant Guide (Mac/MLX)

Key Takeaways

Which model to download (by RAM)

Backend landscape

Critical pitfalls (read before you waste 4 hours)

Sampling (Qwen3.6, Hermes agent work)

Try It

Open Questions

Related

Graph View

Table of Contents

Backlinks