Source: raw/reddit-1uc7rw5.md — r/hermesagent “Mac + MLX Megathread — Hermes Agent on Apple Silicon” (OP Jonathan_Rivera, 142 score, last updated 2026-06-21; community-aggregated from 20+ threads + GitHub issues + benchmarks).
The most-asked Hermes-on-Mac question is “what model do I download, and why is tool calling broken?” This is the consolidated community answer for running Hermes Agent locally on Apple Silicon — model-by-RAM picks, the backend landscape, and the pitfalls that waste an afternoon. Caveat: this is a community megathread, not Nous’s official docs, and the local-model/quant/backend stack changes weekly — treat specific model names, tok/s figures, and bug numbers as time-stamped (June 2026) and verify against linked sources before relying on them. ^[the recommendations are extracted from the thread; the “best pick” framing is community consensus, not a benchmark this wiki ran]
Key Takeaways
- 7B–9B is the floor for agent/tool use; 2B–4B models are chatbots, not agents. One tester: gemma4:e2b on an M4 16GB “can’t even handle one request”; Qwen3.5-9B “sort of worked.” Tool calling for 6+ chained calls is the real bar.
- The current Mac default (32GB+) is Qwen3.6-35B-A3B (MoE, ~3B active/token, ~20GB at 4-bit MLX, runs like a 3B for speed). 16GB floor = Qwen3.5-9B Q4_K_M with a quantized KV cache and 64K (not 128K) context.
- Tool calling breaking is the #1 failure mode — and it’s usually the backend, not the model. Test
/v1/chat/completionswith a tool-calling prompt before blaming a model; if one backend fails, try another. - MTP (multi-token prediction / speculative decode) is a net loss on Apple Metal — the opposite of what model cards market. Do not enable it on Mac.
- Memory bandwidth beats chip generation. An M3 Max (~400 GB/s) generates tokens faster than an M4 Pro (~273 GB/s). Shopping for Hermes: Max > Pro > base, even a generation older.
- The agent context tax is real and punishing on local Macs (context = RAM). Hermes’ orchestrator can spend ~15K tokens just to reply “hi.” The community consensus is a hybrid stack: fast local model for routine work + cheap cloud fallback for hard tasks.
Which model to download (by RAM)
| Mac RAM | Download | Backend | Why |
|---|---|---|---|
| 8GB | Qwen3.5-4B Q4_K_M / Gemma 4 E2B Q4 | Ollama / llama.cpp | Simple chat only — not heavy agent work |
| 16GB | Qwen3.5-9B Q4_K_M or MLX 4-bit | llama.cpp (compat) / Ollama | Practical floor; preserve RAM for context, quantize KV |
| 24GB | Qwen3.6-27B Q4_K_M or Qwen3.6-35B-A3B 4-bit MLX | llama.cpp (dense) / MLX-LM (MoE) | Dense = stronger/predictable coding; MoE = faster decode |
| 32–48GB | Qwen3.6-35B-A3B 4-bit MLX (OptiQ if available) | MLX-LM / oMLX | The Mac sweet spot — ~3B active, ~20GB file |
| 48–64GB | Qwen3.6-27B Q6_K/Q8, or 35B-A3B 8-bit | llama.cpp (dense quality) / MLX (MoE speed) | Q6/Q8 is the serious-agent quant |
| 64GB+ | Gemma 4 26B-A4B Q4 (MoE alt), Qwen3.6-27B Q8 | MLX / llama.cpp | More RAM → better quants + longer context, not automatically a better model |
Start with stock models. Uncensored variants (Heretic, HauhauCS) are advanced options — “the boring model that follows schema for 6+ tool calls beats the spicy one that talks itself into a ditch.”
Backend landscape
- llama.cpp (GGUF) — max compatibility, full KV-cache quantization control (critical on 16GB), vision/mmproj, Jinja templates; fastest time-to-first-token in Hermes’ own testing. Predictable behavior often wins for tool loops despite slower raw generation.
- MLX-LM / oMLX — 20–30% faster generation, gap widens on MoE (35B-A3B 4-bit: ~61 tok/s vs ~17 for dense 27B 4-bit on M1 Max 64GB). Known June-2026 bugs: Qwen3.5/3.6 non-Coder tool-parser mismatch (mlx-lm #1293), MTP-variant multi-turn failures (#1292).
- Ollama — easiest setup; MLX backend (v0.19+) ~2× decode. Known KV-cache memory leak on M4 Max (#16698) that swap-deaths token gen — set
OLLAMA_KV_CACHE_TYPE=q8_0+OLLAMA_FLASH_ATTENTION=1. - LM Studio — GUI comfort; tool-call parser bugs on Qwen/Gemma in some versions (upgrade first).
- Rapid-MLX (new, June 2026) — claims 2–4× faster than Ollama, 0.08s cached TTFT, 17 tool parsers / 100% tool-calling, drop-in OpenAI replacement; currently the strongest Mac backend for tool-calling reliability (3,000+ stars, actively maintained). Newer/less battle-tested than llama.cpp/Ollama.
Critical pitfalls (read before you waste 4 hours)
- MTP = slower on Mac. llama.cpp #23752: 25.3 → 19.3 tok/s; Qwen3.6-35B self-MTP collapses to 1.93 tok/s. Never enable
--spec-type draft-mtpon Metal. - Tool calling breaks across backends — test before trusting; rotate backends before blaming the model.
- 16GB is the floor, not the sweet spot. Usable RAM after macOS is ~10–12GB; a strong 9B at decent quant beats a crippled 27B in swap.
- Bandwidth > chip generation (see above).
- Agent context tax — quantize KV, keep context conservative (64K), consider a small fast orchestrator that delegates heavy work to cloud sub-agents.
- KV cache is the hidden memory killer — always
--cache-type-k q8_0 --cache-type-v q4_0. One M3 Pro 18GB user went from timeouts to working via 4-bit KV quant + Hermes 0.8.0 lazy skill loading + 40K context (first message ~14K tokens, then ~600). - Qwen overthinks — on RAM-limited Macs, turn thinking off to escape timeout territory.
Sampling (Qwen3.6, Hermes agent work)
| Workload | Thinking | Temp | Top-P | Top-K | Presence |
|---|---|---|---|---|---|
| Coding / tool loops | ON | 0.6 | 0.95 | 20 | 0.0 |
| Research / chat | ON | 0.8–1.0 | 0.95 | 20 | 0.0 |
| Summarization (latency) | OFF | 0.7 | 0.8 | 20 | 1.5 |
Try It
- Match RAM → model from the table; stay stock first. On 16GB start with Qwen3.5-9B Q4_K_M.
- Run a local OpenAI-compatible server (llama.cpp
llama-server … --jinja --cache-type-k q8_0 --cache-type-v q4_0, or Ollama/MLX-LM/Rapid-MLX). - Point Hermes at it:
hermes config set model.provider custom:local-mac;model.base_url http://127.0.0.1:8080/v1;model.api_key local-no-key;model.default <id>. - Verify tool calling immediately —
curl /v1/models, then a tool-calling prompt. If it loops/hangs, switch backend before switching model. - Adopt the hybrid pattern — local for routine, a cheap cloud fallback (e.g. a 1/Mtok) for hard tasks; one reported M4 Max setup runs 95% local at ~$1/week.
Open Questions
- How durable are these picks? Models/quants/backends turn over weekly — the article is a June-2026 snapshot; re-verify against the linked GitHub issues and HuggingFace pages before quoting figures.
- Official Nous guidance — Nous’s Run Local LLMs on Mac doc should be the tiebreaker where it disagrees with community consensus.
Related
- Hermes Memory Providers — local context is RAM-bound; provider choice interacts with the context tax.
- Hermes Autonomous SWE Workflow — the cheapest-successful-outcome routing (local model + cloud fallback) this guide’s hybrid pattern serves.
- Hermes Profiles & Multi-Instance — run a light local orchestrator profile that delegates heavy work.
- Nate Herk 1-Hour Course — operator hygiene + deployment context.
- Hermes Desktop (Official) — the GUI install that wraps the same local-endpoint config.