Source: Memory and dreaming for self-learning agents (Anthropic / Mahes — PM, Platform team — Code with Claude 2026, May 7 2026 conference talk, ~10 min, YouTube RtywqDFBYnQ)
Standalone talk by Mahes (Product Manager on Anthropic’s Platform team — same team that shipped MCP and Skills) covering two announcements stacked together: Memory in Claude Managed Agents (public beta launched a few weeks before the talk), and Dreaming (research preview in Managed Agents API, launching live during the talk). The thesis: memory is the next agent primitive — the missing piece between MCP-augmented agents and continuously self-improving agents that get better at their job day-by-day. Two named customer outcomes anchor the substantive claims: Rakuten — 90% drop in first-pass mistakes with memory; Harvey — 6× increase in task completion rate on a legal benchmark with Dreaming. Goes well beyond what the Code with Claude 2026 keynote covered on either feature; this is the canonical engineering-detail source.
Key Takeaways
- Memory is the next primitive after MCP and Skills. Mahes situates it explicitly: MCP gave agents access to external tools and data; Claude Code + Agent SDK gave them powerful harnesses; Skills (October launch) let them pick up brand-new capabilities from other agents or users. Memory is what unlocks continuous self-learning and context management over long-horizon tasks — letting agents learn about success criteria, common mistakes, working strategies, environment-specific knowledge, and (most ambitiously) learn from other agents in the same environment.
- Memory is modeled as a file system — not a key-value store, not a vector DB. Same design rationale as Skills: agents already manage virtual environments and file systems competently, so model memory the same way. Files with hierarchy + format. Claude uses familiar `bash` and `grep` tools to read/write/organize. Opus 4.7 is state-of-the-art at file-system-based memory (claim) — better at discerning what’s worth remembering, structuring it across files, keeping it organized.
- Three layers Anthropic identified for a frontier memory system.
- Storage layer — where data lives, what attribution metadata sits alongside it.
- Structure-and-content layer — file-system model + Skills-as-procedural-memory.
- Process layer — how often memory updates, what triggers updates, what sources decide changes.
- Memory API solves layers 1-2; Dreaming solves layer 3.
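The file-system framing (the structure-and-content layer) can be sketched with ordinary path operations — a hierarchy of plain files the agent reads, greps, and rewrites. The directory layout and file names below are illustrative, not Anthropic's actual schema:

```python
import tempfile
from pathlib import Path

# Illustrative memory-store layout: hierarchy + format, nothing more exotic.
root = Path(tempfile.mkdtemp()) / "memory"
(root / "runbooks").mkdir(parents=True)
(root / "incidents").mkdir()

(root / "runbooks" / "dispatch.md").write_text(
    "# dispatch service\n- restart workers before scaling\n"
)
(root / "incidents" / "2026-05-06.md").write_text(
    "P1: CPU spike in dispatch; root cause = retry storm from upstream\n"
)

# grep-style retrieval: scan files for a term instead of querying a vector DB.
def recall(term: str) -> list[str]:
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*.md")
        if term in p.read_text()
    )

print(recall("retry"))  # ['incidents/2026-05-06.md']
```

The point of the design is that nothing here is memory-specific: the same `bash`/`grep` competence the model already has for codebases carries over directly.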
- Rakuten outcome. “Dropped first-pass mistakes in their internal knowledge agents by 90%” because agents catch mistakes and share them with the next iteration of agents. Side effect: better token efficiency, lower cost, and lower latency from memory deployment (less re-investigation of solved problems).
- Memory primitive — four enterprise-grade properties.
- Permission scopes. One agent can have read-only access to one memory store and read-write to another. Demo: SRE agent has read-only access to org-wide knowledge / runbooks / SLO guidelines; read-write access to its own SRE working-memory store; read access to the codebase memory store.
- Optimistic concurrency. With hundreds-to-thousands of agents reading/writing the same memory, agents use a content hash precondition to verify they’re not clobbering another agent’s update before applying their own. (Same pattern as ETag / If-Match in HTTP.)
- Version history + attribution. Full audit log of every memory update, with attribution metadata: which agent made the change, when, which session. Agents can be given access to the audit log too — not just developers. “Most sought-after property” per Mahes after talking to customers.
- Standalone API. Portable. Customers wanted to do their own PII scanning, custom cleanup pipelines, cloning into external systems. Anthropic deliberately did not lock memory inside the Managed Agents harness.
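The talk names the optimistic-concurrency mechanic but shows no API syntax. A minimal sketch of the content-hash precondition (same shape as HTTP ETag / If-Match), using a toy in-memory store rather than the real Memory API:

```python
import hashlib

class MemoryStore:
    """Toy store with content-hash write preconditions (ETag / If-Match style)."""

    def __init__(self, content: str = ""):
        self._content = content

    def read(self) -> tuple[str, str]:
        # Return content plus its hash; the hash is the precondition token.
        return self._content, hashlib.sha256(self._content.encode()).hexdigest()

    def write(self, new_content: str, expected_hash: str) -> bool:
        # Apply the write only if nobody changed the store since we read it.
        current = hashlib.sha256(self._content.encode()).hexdigest()
        if current != expected_hash:
            return False  # precondition failed: re-read, merge, retry
        self._content = new_content
        return True

store = MemoryStore("runbook v1")
content, h = store.read()                     # agent A reads
store.write("runbook v2 (agent B)", h)        # agent B writes first: succeeds
ok = store.write(content + " + A note", h)    # agent A's stale write is rejected
print(ok)  # False: agent A must re-read and retry
```

With hundreds of agents on one store, the losing writer re-reads, merges its note into the newer content, and retries — no write is silently clobbered.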
- Dreaming — the new primitive launched live in the talk. “A process that looks for patterns and mistakes across your recent agent sessions and their transcripts and automatically produces organized and up-to-date memory content.” Research preview in Managed Agents API.
- Dreaming is async, batch, and out-of-band. Doesn’t run inside a session. Triggered three ways:
- Cron-style schedule via Console or API.
- Plugged into existing pipelines — kick off when an agent finishes a task and is spinning down (“save those learnings before exit”).
- Manual via the Claude Console UI.
- Out-of-band design is deliberate: keeps the hot path fast (no latency added to active sessions), separates the memory-quality objective from the task-completion objective, and lets the dreaming agent see across multiple agents’ sessions for shared patterns no single agent could detect from its own perspective.
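The end-of-task trigger can be sketched as a session-end hook. The `start_dreaming_run` function below is hypothetical — the talk names the trigger, not the API — and the harness is a stand-in for whatever agent runtime you already have:

```python
from typing import Callable

def start_dreaming_run(memory_store_id: str, lookback_days: int = 7) -> dict:
    """Placeholder for a (hypothetical) Managed Agents API call that queues
    an async Dreaming job over recent session transcripts."""
    return {"store": memory_store_id, "lookback_days": lookback_days, "status": "queued"}

class AgentSession:
    """Minimal harness: run a task, then fire registered hooks on spin-down."""

    def __init__(self, memory_store_id: str):
        self.memory_store_id = memory_store_id
        self._on_end: list[Callable[[], dict]] = []

    def on_session_end(self, hook: Callable[[], dict]) -> None:
        self._on_end.append(hook)

    def finish(self) -> list[dict]:
        # "Save those learnings before exit": hooks only queue async work,
        # so the hot path never waits on memory upkeep.
        return [hook() for hook in self._on_end]

session = AgentSession("sre-working-memory")
session.on_session_end(lambda: start_dreaming_run(session.memory_store_id))
results = session.finish()
print(results[0]["status"])  # queued
```

The cron and Console triggers are just other entry points into the same queued job; the hook variant matters because it guarantees learnings are consolidated while the session's transcripts are freshest.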
- Harvey outcome. Deployed Dreaming on one of their legal benchmarks. Saw a 6× increase in task completion rate on a “pretty realistic legal scenario.” Concrete, named, dramatic.
- Dreaming output is a diff applied to a memory store. From the live demo (an SRE-agent fleet handling P1 alerts):
- Pattern discovery: “a bunch of these agents were triggered exactly 60 seconds after an upstream CPU spike → there’s likely retry logic that’s inefficient.” No single agent saw the pattern; Dreaming did.
- Deduplication and curation: 5 redundant memory entries collapsed to 1.
- Stale-entry removal: caught one entry no longer valid based on transcript evidence.
- Verification backfill: appends a “verified at this time based on this transcript” note so downstream agents can rely on it tomorrow.
- Test-time-compute analogy for memory quality. Mahes frames Dreaming using the same scaling-law shape as test-time compute / thinking models: let an agent spend more tokens on memory upkeep to get better downstream outcomes. Memory becomes a dedicated objective separate from task completion — “memory is going to be increasingly load-bearing.”
- Search-index analogy. Dreaming is the upfront-effort step that produces the high-quality, up-to-date index; downstream agents reading from the memory store get to amortize that effort across many retrievals. “We can amortize this effort across all those agents that are reading from a memory store.”
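The amortization claim is ordinary index economics. The token figures below are made up for illustration (the talk gives no numbers), but the break-even shape is the point:

```python
# Illustrative only: token figures are invented, not from the talk.
dreaming_tokens = 500_000        # one batch Dreaming run over a week of transcripts
saved_tokens_per_read = 8_000    # re-investigation a downstream agent now skips
reads_per_week = 200             # agents consulting the curated store

weekly_savings = saved_tokens_per_read * reads_per_week
net = weekly_savings - dreaming_tokens
break_even_reads = dreaming_tokens / saved_tokens_per_read

print(net, break_even_reads)  # net tokens saved per week; reads needed to break even
```

Under these assumptions the run pays for itself after roughly 63 reads; everything past that is the amortized gain Mahes describes.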
- The frontier memory system as Anthropic now sees it. Memory (real-time read/write during a session) on the left; Dreaming (comprehensive batch process to verify, organize, enrich memory) on the right. Dreaming is the bridge between intermediate per-task memory and large-scale shared knowledge bases that Anthropic expects to see across enterprise multi-agent fleets.
Where it fits in the wiki
- Refreshes Claude Dreaming. The existing article was written from the keynote’s brief Dreaming demo (Caitlin’s drone-landing playbook, ~5 minutes of stage time). This talk is the engineering deep-dive — adds the Harvey 6× number, the file-system memory model, the optimistic-concurrency mechanic, the version-history requirement, the SRE demo, and the test-time-compute / search-index framing. Refreshing claude-dreaming.md with a “Memory + Dreaming as one system” section is the right downstream move.
- Slots into the Code with Claude 2026 keynote tree. The keynote was the umbrella; this talk is a specific deep-dive session at the same conference. Other deep-dive sessions from the conference would each get their own article and link forward from the keynote.
- Pairs with Claude Managed Agents as the canonical Memory + Dreaming reference. Memory shipped in public beta a few weeks before the talk; Dreaming shipped in research preview during the talk. Both are inside the Managed Agents API.
- Composes with Managed Agents cookbook coverage. The cookbook ships the multiagent + outcomes patterns; memory is the substrate they both write to / read from. A memory-aware variant of the multi-agent coordinator pattern is the next likely cookbook addition.
- Reframes the Skills story. Mahes calls Skills “procedural memory with a lightweight spec.” Memory + Dreaming + Skills now form a stacked memory taxonomy: Skills are how agents acquire reusable capabilities; Memory is how they accumulate per-task / per-environment knowledge; Dreaming is how they consolidate it.
Implementation
- Tool/Service: Memory + Dreaming inside Claude Managed Agents API.
- Memory: public beta (launched a few weeks before May 7 2026).
- Dreaming: research preview (launched May 7 2026).
- Setup:
- Memory: enable in Managed Agents API. Stores accessed via standard tools (`bash`, `grep`) or via the standalone Memory API for out-of-harness curation.
- Dreaming: kick off via Claude Console (UI), via Managed Agents API (cron-able), or via end-of-task hook in your existing harness.
- Cost: Pay-as-you-go — Dreaming spends additional tokens on memory upkeep (test-time-compute-style). Vendor pitch: amortized across many retrievals from the same memory store, so per-retrieval cost falls.
- Integration notes:
- Permission scopes — set per agent per memory store (read-only, read-write, no-access). Demo’s three-store SRE pattern (org-wide read-only + service-specific read-write + codebase context) is the recommended starting shape for multi-agent fleets.
- Optimistic concurrency — content-hash preconditions on every write, so simultaneous agents can’t silently overwrite each other.
- Version history — full audit log; metadata: which agent, when, which session. Roll back / inspect / give agents access for “what changed and why.”
- Standalone API — portable; bring your own PII scanner, cleanup pipeline, archival/clone destination.
- Dreaming triggers — recommended pattern is “kick off when an agent finishes a task and is spinning down” so learnings don’t sit unwritten.
- Async / out-of-band — does not block active sessions, does not add latency to the hot path.
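The per-agent, per-store scope table from the demo can be sketched as a simple ACL check. Store and agent names follow the demo; the enforcement code itself is illustrative, not the real permission model:

```python
# Demo's three-store SRE pattern as an ACL: one scope per agent per store.
SCOPES = {
    "sre-agent": {
        "org-knowledge": "ro",       # runbooks, SLO guidelines
        "sre-working-memory": "rw",  # its own working store
        "codebase-memory": "ro",     # codebase context
    },
}

def check(agent: str, store: str, op: str) -> bool:
    """op is 'read' or 'write'; a missing entry means no access at all."""
    scope = SCOPES.get(agent, {}).get(store)
    if scope is None:
        return False
    return op == "read" or scope == "rw"

assert check("sre-agent", "org-knowledge", "read")
assert not check("sre-agent", "org-knowledge", "write")   # org store is read-only
assert check("sre-agent", "sre-working-memory", "write")
assert not check("unknown-agent", "sre-working-memory", "read")
```

The asymmetry is the lever: many agents can consume curated org knowledge while only designated writers (or Dreaming) can mutate it.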
- Demo workflow (SRE agent fleet, paraphrased from the talk):
- P1 alert from dispatch service → spin up SRE agent A with access to three memory stores (org-wide RO, SRE RW, codebase RO).
- SRE agent A investigates CPU utilization, traffic patterns, recent PRs. Writes findings to SRE memory store (RW).
- Same alert pages a few minutes later → SRE agent B spins up. Reads SRE memory store first. Sees note from A. Short-circuits investigation. Token efficiency + intelligence gain without code changes.
- Overnight: kick off Dreaming on the SRE memory store with the past 7 days of sessions. Dreaming agent spins up sub-agents to look through transcripts, identifies the 60-second-after-CPU-spike pattern, deduplicates 5 redundant entries → 1, removes 1 stale entry, adds verification note. Diff applied to memory store.
- Next day’s SRE agents start with a richer, deduped, verified memory store.
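The overnight curation step — dedupe, prune stale entries, stamp survivors with a verification note — can be sketched as a pure function over memory entries. The entry structure is illustrative; the real Dreaming agent works from transcripts, not flags:

```python
from datetime import date

def dream(entries: list[dict], today: date) -> list[dict]:
    """Batch curation pass: collapse duplicates, drop entries already judged
    stale, and stamp everything that survives with a verification date."""
    seen: set[str] = set()
    curated = []
    for e in entries:
        if e.get("stale") or e["content"] in seen:
            continue  # prune stale + redundant entries
        seen.add(e["content"])
        curated.append({**e, "verified": today.isoformat()})
    return curated

entries = [
    {"content": "restart dispatch workers on retry storm"},
    {"content": "restart dispatch workers on retry storm"},  # duplicate
    {"content": "old rollback procedure", "stale": True},
    {"content": "CPU spike precedes alert by 60s"},
]
curated = dream(entries, date(2026, 5, 8))
print(len(curated))  # 2 entries survive, each stamped with a verified date
```

The output is the "diff applied to a memory store" from the demo: next-day agents read a smaller, deduped store where every entry carries evidence of when it was last verified.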
Open Questions
- Dreaming token cost at scale. Mahes invokes the test-time-compute analogy — but doesn’t quantify how much compute Dreaming spends per session it considers, or per memory-store update. Worth tracking once usage data emerges.
- Inter-agent memory contamination. Optimistic concurrency prevents silent overwrites, but doesn’t prevent one agent writing wrong/misleading content into a shared memory store and Dreaming consolidating it. What’s the abuse / bad-actor / poisoned-memory threat model?
- Memory portability across model versions. If Memory is file-system-modeled and Opus 4.7 is “state-of-the-art at file-system memory,” what happens when Opus 5 ships? Is the memory format model-independent or do you re-Dream when models change?
- Dreaming + Skills overlap. Skills are “procedural memory with a lightweight spec.” Dreaming reflects on past sessions and updates a memory store. When does a learning surface as a Skill (durable, portable, version-controlled) vs as Memory (per-environment, dynamic, agent-managed)? The Mahes framing implies they coexist; the harness boundary is where the next clarification belongs.
- Standalone Memory API surface. “Customers do PII scanning, cleanup, cloning.” What auth model — does the Memory API support tenant-scoped tokens, organization-wide tokens, etc.? Not addressed in the talk.
- Per-customer Dream-isolation. Dreaming reflects on agent-session transcripts. For multi-tenant SaaS deployments built on Managed Agents, what’s the boundary that prevents one tenant’s transcripts from contaminating another’s memory store via Dreaming? Implicitly, customer-managed memory stores are the boundary, but worth verifying in the docs.
Try It
- Watch the talk (YouTube `RtywqDFBYnQ`, ~10 min). Mahes goes through the SRE-agent demo around 8:00 — short and concrete.
- Read the existing Claude Managed Agents article first for the API surface that Memory + Dreaming live inside.
- Try Memory in Managed Agents if you have access. Start with the three-store SRE pattern (org-wide RO + service-specific RW + codebase RO) since that’s the demo’s reference shape and the permission-scope mechanic is the easy lever for multi-agent fleets.
- Schedule one Dreaming run on a memory store after a multi-session experiment. Look at the diff: is the deduplication useful? Are stale-entry removals correct? Is the verification-note pattern usable downstream?
- For agentic CI scenarios specifically, the Mahes “trigger Dreaming when agent spins down” pattern is a clean fit — wire it into the harness’s session-end hook.
- Pair with the Managed Agents cookbook patterns. Multi-agent + outcomes generate more diverse session traces; Memory absorbs them; Dreaming consolidates them. The three primitives together are the “self-learning agent” loop the talk describes.
- Skim the Karpathy “Vibe Coding to Agentic Engineering” talk for the broader frame — Karpathy’s “LLM knowledge bases as understanding tools” thesis is the user-side mirror of what Mahes is building agent-side.
Related
- Claude Dreaming — entity article (refreshed alongside this ingest)
- Code with Claude 2026 — Opening Keynote — umbrella talk; this is one of its deep-dives
- Claude Managed Agents — the API surface
- Managed Agents cookbook (multiagent + outcomes)
- Agent Skills Overview — Skills as procedural memory
- skills repo
- Opus 4.7 Best Practices — Opus 4.7 cited as state-of-the-art at file-system memory
- Karpathy — From Vibe Coding to Agentic Engineering
- 2026 Claude Code AIOS Pattern