Source: raw/Hermes_Architecture_EXPLAINED_-_Memory_Context_Gateways.md (YouTube video)
Creator: (unnamed explainer channel) · URL: https://youtu.be/n32qq7Kwzh0 · Platform: YouTube · Type: code-level architecture walkthrough
A high-level, code-level walkthrough of how the Hermes agent is built — meant both for understanding how to use it and for anyone wanting to build a similar agent. The presenter sketches the system on a whiteboard-style canvas and traces each piece: the bird’s-eye view, the agent loop, context construction, context compression, the messaging gateway, the three memory layers, and cron jobs. The framing throughout is “it’s actually very simple” — Hermes is a minimalist agent loop comparable to Pi or OpenCode, with the gateway and memory layers being what make it distinctive.
Key Takeaways
- Five connected pieces. A central AI agent core (the agentic loop) plus three ways to reach it (CLI, gateway, API), and the core wires into pre-installed tools, skills, and memory (internal + external). The presenter repeatedly stresses the design is intentionally simple.
- The agent loop is minimalist. On every user message: build context → send to LLM → execute any tool calls in a loop → return a final response → run a memory update pass. The memory-update step is what makes Hermes “continuously learning.”
- Context is mostly markdown.
soul.md(personality) +memory/user.md(facts about you) +memory/memory.md(arbitrary learnings), plus skill/tool descriptions and message history, assembled fresh each turn. - Context compression at 50%. Triggered by default when the context hits 50% of the window; estimated by
chars / 4on the first message and by the LLM’s actualusagefield afterward; the summarizer prompt lives incontext_compressor.pyand is far richer than Pi’s. - The gateway is the popularity driver. An always-running
asyncioloop that polls Telegram / Slack / WhatsApp / email / SMS / Discord, rebuilds context from scratch per inbound message, and includes a session manager for interrupt / steer / queue. - Memory has three layers: markdown files, a local SQLite transcript store (with a bare-text table for similarity search), and optional external providers (Mem0, Supermemory, Honcho).
- Cron is its own loop, not system cron. A
tick()runs every minute against~/.hermes/cron/jobs.json(plain JSON, not SQLite as the docs claim) and delivers results to your “home” messaging platform via a system notification, not thesend_messagetool.
Bird’s-Eye View
- AI agent core = the agentic loop. Everything else connects into this central piece.
- Three connection surfaces:
- CLI — runs when you type
hermeson the command line. - Gateway — an always-running system that lets you reach the agent via messaging services (Telegram, email, Slack, etc.).
- API — programmatic access.
- CLI — runs when you type
- The core ships with batteries included. On install, Hermes comes pre-loaded with a set of tools, a set of skills, and memory.
- Memory splits two ways at the top level:
- External memory — optional third-party providers (e.g. Mem0).
- Internal memory — the session transcripts plus modifiable files like
soul.md(personality / agent + user info).
The Agent Loop
Runs every single time the user sends a message — described as similar to minimalist agents like Pi or OpenCode:
- User sends a message.
- Hermes builds its context — pulls from internal memory and the pre-prepared prompts.
- Context + message history sent to the LLM.
- Tool-call loop — the LLM may call a tool (web search, read/write/update files, etc.); Hermes executes it, returns the result, and repeats as long as the model keeps requesting tools.
- Final response delivered to the user.
- Memory update — the agent analyzes the interaction and writes anything worth remembering into memory, so it has “learned” for next time. This is the loop’s distinguishing final step and the basis of Hermes’ continuous-learning claim.
Context Construction
The “build context” step assembles, in order:
- System prompt.
soul.md— the agent’s personality (tone, inspirations, goals, approach). Empty by default on a fresh install; if not set, Hermes falls back to a default system prompt declaring it an always-on Hermes assistant. The presenter likens a goodsoul.mdto Claude’s well-written system prompts — you personalize it for yourself.memory/user.md— facts about the user. Hermes auto-updates this whenever it learns something about you (e.g. “I’m a software engineer working on X”).memory/memory.md— arbitrary facts the agent finds useful: how to use tools, workflows, things learned in conversation. Auto-updated, governed by the goals set insoul.md.- External-memory summary — a summary of relevant past threads / sessions, but only if external memory is configured (absent by default).
- Skill descriptions + tool descriptions.
- Message history — the full conversation, or a summary once it exceeds the context threshold (see compression below).
Context Compression
- Window sizes assumed: roughly 250K to 1M tokens.
- Default trigger: 50% of the context window. Configurable at setup to 70% or 80% (recommended for smaller models / smaller windows).
- What it does: summarizes previous messages, appends the summary to the context, and replaces the prior messages with that summary.
- Two check moments:
- Before each turn (before the LLM call) — on every iteration.
- On error — when the LLM returns a context-window error.
- Two token-counting methods:
- First message: no usage data yet, so Hermes approximates with total characters ÷ 4 rather than running a tokenizer (cheaper, “good enough”).
- Subsequent messages: uses the
usagefield the LLM returns (input/output/usage varies by provider) — more accurate because it reflects the model’s own tokenizer. Running a real tokenizer at this stage is deemed too expensive.
- The compressor prompt lives in
context_compressor.py(around line ~1400, shifts with refactors). It asks the LLM for a multi-section summary: full goal, constraints, completed actions, active state, historical progress, current blockers, key decisions, resolved questions, relevant files, critical context, previous summaries, next turns to incorporate, and turns to summarize. The presenter notes it is much richer / less minimalist than Pi’s summarizer.
The Gateway
- What it is: the always-running system that lets you talk to the agent through many message providers — Telegram, WhatsApp, email, SMS, Slack, Discord. The presenter argues it is “the part that made it as mainstream / as popular as it became.” ^[inferred: presenter’s opinion, not a measured claim]
- Core mechanism: starts an
asyncioloop that runs continuously, polling/waiting on each configured gateway. Different providers are polled differently — some via webhooks, some via a tiny ~1-second polling loop (Telegram can be polled this way), some via websockets. - Per-provider configuration. It is not one universal gateway; each integration is configured independently via
hermes setup gateway→ choose provider (e.g. Telegram) → create bot ID → set the allowed user IDs. - Builds context from scratch per message. Because a gateway receives a single inbound message (not the whole conversation), it must reconstruct the message history itself — this is where the loop’s “build context” step is most load-bearing.
- Session identity → SQLite. The message-history key is composed of the gateway name (e.g.
Telegram) + the session ID the provider returned + other IDs. On each new inbound message the gateway queries all prior messages for that key from the local SQLite database, appends them to the context, and sends the result to the agent. - Session manager. Decides what happens when a message arrives while the agent is already working: interrupt (
/interrupt), steer (/steer), or queue (plain send). It governs which messages actually reach the LLM and when.
Memory (Three Layers)
- Markdown files —
soul.md, plusmemory/user.mdandmemory/memory.md. Always appended to the context window right after the system prompt. - SQLite database — stores full transcripts of every session across many tables / data models (different views of the same data). It is where gateway sessions are pulled from (keyed by gateway + session ID). Notably it includes a bare-text table holding only the conversation text, to make similarity search easy.
- External memory (opt-in, off by default) — pluggable third-party providers, several of them free:
- Mem0 (spoken “MemZero”) — uses similarity search.
- Supermemory — requires sending the full conversation history after each turn; uses an LLM to extract the relevant memory.
- Honcho — works differently again.
- Query timing: when configured, external memory is queried on the second message (i.e. after the first response), effectively guessing what the user’s next question will be — analogous to how a human recalls prior context a beat after being asked. Practical tip: if Hermes doesn’t recall something on the first try, describe it in one message and ask a follow-up — the second message will have queried external memory.
- Most users don’t enable external memory, but the presenter recommends it as a large improvement to how Hermes learns from you.
Cron Jobs
- Its own loop, not server cron. Hermes runs a dedicated process that calls a
tick()function every minute;tickpolls the list of scheduled jobs and executes any due that minute. - What they enable: natural-language scheduled automation — e.g. “every morning email me the latest AI news,” “every Friday email my boss an update,” “daily post AI developments to my community Slack.”
- Storage discrepancy. The documentation says jobs are stored in SQLite, but the presenter’s analysis found no cron jobs in SQLite and the code doesn’t read them from there. They are actually stored as plain JSON at
~/.hermes/cron/jobs.json(each job with its prompt and instructions).tickreads this file each minute. ^[ambiguous: docs vs. observed code conflict — presenter sides with the code] - Run outputs. Inside
~/.hermes/cron/there is anoutput/directory containing one folder per job ID, and within each arun.mdmarkdown file per run. - Delivery is system-level, not a tool call. Cron does not call the
send_messagetool. It delivers results as a system notification to your “home” messaging platform — the gateway/user ID you designated as “home” during setup (e.g. your Telegram user ID).
Related
- Hermes Agent (topic index)
- Codex App-Server Runtime — the runtime that can swap Hermes’ own tool loop for Codex CLI on
openai/*turns; complements this loop-level view ^[inferred] - Hermes Memory Providers — deeper comparison of the external-memory backends (Mem0, Supermemory, Honcho) this video introduces
- Hermes MemoryKit — an 8-layer stack extending the three-tier memory model described here
- Hermes 1-Hour Course (Nate Herk) — the operator-facing “Five Pillars” framing (memory / skills / soul / crons / self-improving loop) of the same architecture ^[inferred]
- Hermes Agent Masterclass — covers the SQLite FTS5 session-search layer in build-it-live detail
- Profiles & Multi-Instance — how
config.yaml,SOUL.md,MEMORY.md,USER.mdare cloned per profile - Hermes Security Model — defense-in-depth context for the gateway’s allowed-user-ID and command-approval surfaces ^[inferred]
Try It
- Inspect your own files. Open
soul.md,memory/user.md, andmemory/memory.mdto see exactly what is being prepended to every prompt. Ifsoul.mdis empty, write one to escape the generic default personality. - Read the compressor prompt. Open
context_compressor.py(search for the summarization prompt near ~line 1400) to see the full list of summary sections — useful as a template if you build your own agent. - Tune the compression threshold at setup. On smaller-context models, set it to 70-80% instead of the 50% default.
- Enable an external memory provider. Several (Mem0, Supermemory, Honcho) are free — turning one on materially improves cross-session recall. Remember it queries on the second message, so structure recall as describe-then-follow-up.
- Find your cron jobs on disk. Look in
~/.hermes/cron/jobs.jsonfor definitions and~/.hermes/cron/output/<job-id>/run.mdfor past run logs — don’t expect them in SQLite. - Set your “home” platform during gateway setup so scheduled cron results land where you actually read them.