Source: raw/reddit-1uc7rw5.md — r/hermesagent “Mac + MLX Megathread — Hermes Agent on Apple Silicon” (OP Jonathan_Rivera, 142 score, last updated 2026-06-21; community-aggregated from 20+ threads + GitHub issues + benchmarks).

The most-asked Hermes-on-Mac question is “what model do I download, and why is tool calling broken?” This is the consolidated community answer for running Hermes Agent locally on Apple Silicon — model-by-RAM picks, the backend landscape, and the pitfalls that waste an afternoon. Caveat: this is a community megathread, not Nous’s official docs, and the local-model/quant/backend stack changes weekly — treat specific model names, tok/s figures, and bug numbers as time-stamped (June 2026) and verify against linked sources before relying on them. ^[the recommendations are extracted from the thread; the “best pick” framing is community consensus, not a benchmark this wiki ran]

Key Takeaways

  • 7B–9B is the floor for agent/tool use; 2B–4B models are chatbots, not agents. One tester: gemma4:e2b on an M4 16GB “can’t even handle one request”; Qwen3.5-9B “sort of worked.” Tool calling for 6+ chained calls is the real bar.
  • The current Mac default (32GB+) is Qwen3.6-35B-A3B (MoE, ~3B active/token, ~20GB at 4-bit MLX, runs like a 3B for speed). 16GB floor = Qwen3.5-9B Q4_K_M with a quantized KV cache and 64K (not 128K) context.
  • Tool calling breaking is the #1 failure mode — and it’s usually the backend, not the model. Test /v1/chat/completions with a tool-calling prompt before blaming a model; if one backend fails, try another.
  • MTP (multi-token prediction / speculative decode) is a net loss on Apple Metal — the opposite of what model cards market. Do not enable it on Mac.
  • Memory bandwidth beats chip generation. An M3 Max (~400 GB/s) generates tokens faster than an M4 Pro (~273 GB/s). Shopping for Hermes: Max > Pro > base, even a generation older.
  • The agent context tax is real and punishing on local Macs (context = RAM). Hermes’ orchestrator can spend ~15K tokens just to reply “hi.” The community consensus is a hybrid stack: fast local model for routine work + cheap cloud fallback for hard tasks.

Which model to download (by RAM)

Mac RAMDownloadBackendWhy
8GBQwen3.5-4B Q4_K_M / Gemma 4 E2B Q4Ollama / llama.cppSimple chat only — not heavy agent work
16GBQwen3.5-9B Q4_K_M or MLX 4-bitllama.cpp (compat) / OllamaPractical floor; preserve RAM for context, quantize KV
24GBQwen3.6-27B Q4_K_M or Qwen3.6-35B-A3B 4-bit MLXllama.cpp (dense) / MLX-LM (MoE)Dense = stronger/predictable coding; MoE = faster decode
32–48GBQwen3.6-35B-A3B 4-bit MLX (OptiQ if available)MLX-LM / oMLXThe Mac sweet spot — ~3B active, ~20GB file
48–64GBQwen3.6-27B Q6_K/Q8, or 35B-A3B 8-bitllama.cpp (dense quality) / MLX (MoE speed)Q6/Q8 is the serious-agent quant
64GB+Gemma 4 26B-A4B Q4 (MoE alt), Qwen3.6-27B Q8MLX / llama.cppMore RAM → better quants + longer context, not automatically a better model

Start with stock models. Uncensored variants (Heretic, HauhauCS) are advanced options — “the boring model that follows schema for 6+ tool calls beats the spicy one that talks itself into a ditch.”

Backend landscape

  • llama.cpp (GGUF) — max compatibility, full KV-cache quantization control (critical on 16GB), vision/mmproj, Jinja templates; fastest time-to-first-token in Hermes’ own testing. Predictable behavior often wins for tool loops despite slower raw generation.
  • MLX-LM / oMLX — 20–30% faster generation, gap widens on MoE (35B-A3B 4-bit: ~61 tok/s vs ~17 for dense 27B 4-bit on M1 Max 64GB). Known June-2026 bugs: Qwen3.5/3.6 non-Coder tool-parser mismatch (mlx-lm #1293), MTP-variant multi-turn failures (#1292).
  • Ollama — easiest setup; MLX backend (v0.19+) ~2× decode. Known KV-cache memory leak on M4 Max (#16698) that swap-deaths token gen — set OLLAMA_KV_CACHE_TYPE=q8_0 + OLLAMA_FLASH_ATTENTION=1.
  • LM Studio — GUI comfort; tool-call parser bugs on Qwen/Gemma in some versions (upgrade first).
  • Rapid-MLX (new, June 2026) — claims 2–4× faster than Ollama, 0.08s cached TTFT, 17 tool parsers / 100% tool-calling, drop-in OpenAI replacement; currently the strongest Mac backend for tool-calling reliability (3,000+ stars, actively maintained). Newer/less battle-tested than llama.cpp/Ollama.

Critical pitfalls (read before you waste 4 hours)

  1. MTP = slower on Mac. llama.cpp #23752: 25.3 → 19.3 tok/s; Qwen3.6-35B self-MTP collapses to 1.93 tok/s. Never enable --spec-type draft-mtp on Metal.
  2. Tool calling breaks across backends — test before trusting; rotate backends before blaming the model.
  3. 16GB is the floor, not the sweet spot. Usable RAM after macOS is ~10–12GB; a strong 9B at decent quant beats a crippled 27B in swap.
  4. Bandwidth > chip generation (see above).
  5. Agent context tax — quantize KV, keep context conservative (64K), consider a small fast orchestrator that delegates heavy work to cloud sub-agents.
  6. KV cache is the hidden memory killer — always --cache-type-k q8_0 --cache-type-v q4_0. One M3 Pro 18GB user went from timeouts to working via 4-bit KV quant + Hermes 0.8.0 lazy skill loading + 40K context (first message ~14K tokens, then ~600).
  7. Qwen overthinks — on RAM-limited Macs, turn thinking off to escape timeout territory.

Sampling (Qwen3.6, Hermes agent work)

WorkloadThinkingTempTop-PTop-KPresence
Coding / tool loopsON0.60.95200.0
Research / chatON0.8–1.00.95200.0
Summarization (latency)OFF0.70.8201.5

Try It

  1. Match RAM → model from the table; stay stock first. On 16GB start with Qwen3.5-9B Q4_K_M.
  2. Run a local OpenAI-compatible server (llama.cpp llama-server … --jinja --cache-type-k q8_0 --cache-type-v q4_0, or Ollama/MLX-LM/Rapid-MLX).
  3. Point Hermes at it: hermes config set model.provider custom:local-mac; model.base_url http://127.0.0.1:8080/v1; model.api_key local-no-key; model.default <id>.
  4. Verify tool calling immediatelycurl /v1/models, then a tool-calling prompt. If it loops/hangs, switch backend before switching model.
  5. Adopt the hybrid pattern — local for routine, a cheap cloud fallback (e.g. a 1/Mtok) for hard tasks; one reported M4 Max setup runs 95% local at ~$1/week.

Open Questions

  • How durable are these picks? Models/quants/backends turn over weekly — the article is a June-2026 snapshot; re-verify against the linked GitHub issues and HuggingFace pages before quoting figures.
  • Official Nous guidance — Nous’s Run Local LLMs on Mac doc should be the tiebreaker where it disagrees with community consensus.