Source: Memory and dreaming for self-learning agents (Anthropic / Mahes — PM, Platform team — Code with Claude 2026, May 7 2026 conference talk, ~10 min, YouTube RtywqDFBYnQ)

Standalone talk by Mahes (Product Manager on Anthropic’s Platform team — same team that shipped MCP and Skills) covering two announcements stacked together: Memory in Claude Managed Agents (public beta launched a few weeks before the talk), and Dreaming (research preview in Managed Agents API, launching live during the talk). The thesis: memory is the next agent primitive — the missing piece between MCP-augmented agents and continuously self-improving agents that get better at their job day-by-day. Two named customer outcomes anchor the substantive claims: Rakuten — 90% drop in first-pass mistakes with memory; Harvey — 6× increase in task completion rate on a legal benchmark with Dreaming. Goes well beyond what the Code with Claude 2026 keynote covered on either feature; this is the canonical engineering-detail source.

Key Takeaways

  • Memory is the next primitive after MCP and Skills. Mahes situates it explicitly: MCP gave agents access to external tools and data; Claude Code + Agent SDK gave them powerful harnesses; Skills (October launch) let them pick up brand-new capabilities from other agents or users. Memory is what unlocks continuous self-learning and context management over long-horizon tasks — letting agents learn about success criteria, common mistakes, working strategies, environment-specific knowledge, and (most ambitiously) learn from other agents in the same environment.
  • Memory is modeled as a file system — not a key-value store, not a vector DB. Same design rationale as Skills: agents already manage virtual environments and file systems competently, so model memory the same way. Memory lives in files with hierarchy and format; Claude reads, writes, and organizes them with the familiar bash and grep tools. Opus 4.7 is claimed to be state-of-the-art at file-system-based memory — better at discerning what’s worth remembering, structuring it across files, and keeping it organized.
  • Three layers Anthropic identified for a frontier memory system.
    1. Storage layer — where data lives, what attribution metadata sits alongside it.
    2. Structure-and-content layer — file-system model + Skills-as-procedural-memory.
    3. Process layer — how often memory updates, what triggers updates, what sources decide changes.
    The Memory API solves layers 1-2; Dreaming solves layer 3.
  • Rakuten outcome. “Dropped first-pass mistakes in their internal knowledge agents by 90%” because agents catch mistakes and share them with the next iteration of agents. Side effects of the memory deployment: better token efficiency, lower cost, and lower latency (less re-investigation of already-solved problems).
  • Memory primitive — four enterprise-grade properties.
    • Permission scopes. One agent can have read-only access to one memory store and read-write to another. Demo: SRE agent has read-only access to org-wide knowledge / runbooks / SLO guidelines; read-write access to its own SRE working-memory store; read access to the codebase memory store.
    • Optimistic concurrency. With hundreds-to-thousands of agents reading/writing the same memory, agents use a content hash precondition to verify they’re not clobbering another agent’s update before applying their own. (Same pattern as ETag / If-Match in HTTP.)
    • Version history + attribution. Full audit log of every memory update, with attribution metadata: which agent made the change, when, which session. Agents can be given access to the audit log too — not just developers. “Most sought-after property” per Mahes after talking to customers.
    • Standalone API. Portable. Customers wanted to do their own PII scanning, custom cleanup pipelines, cloning into external systems. Anthropic deliberately did not lock memory inside the Managed Agents harness.
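  The optimistic-concurrency mechanic can be sketched with a toy in-process store. The talk doesn’t show the real Memory API’s call shapes, so every name below (`ToyMemoryStore`, `write_if_match`, the file paths) is invented; only the content-hash precondition itself — the ETag / If-Match pattern — comes from the source.

```python
import hashlib

class ToyMemoryStore:
    """Toy stand-in for a shared memory store, illustrating the
    content-hash precondition (ETag / If-Match) pattern. All names
    are invented; this is not the real Memory API."""

    def __init__(self):
        self.files = {}  # path -> bytes

    def read(self, path):
        data = self.files.get(path, b"")
        return data, hashlib.sha256(data).hexdigest()

    def write_if_match(self, path, new_data, expected_hash):
        current = self.files.get(path, b"")
        if hashlib.sha256(current).hexdigest() != expected_hash:
            return False  # stale hash: another agent wrote first; re-read and retry
        self.files[path] = new_data
        return True

store = ToyMemoryStore()
store.files["sre/runbook.md"] = b"on CPU alert: check recent PRs"

# Agents A and B read the same version concurrently.
data, h = store.read("sre/runbook.md")

# A commits first; B's write is rejected because its hash is now stale.
assert store.write_if_match("sre/runbook.md", data + b"\ncheck retry config", h)
assert not store.write_if_match("sre/runbook.md", data + b"\nscale out", h)
```

  On a conflict the losing agent re-reads (picking up the winner’s update) and retries its write against the fresh hash — the same loop HTTP clients run with If-Match.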
  • Dreaming — the new primitive launched live in the talk. “A process that looks for patterns and mistakes across your recent agent sessions and their transcripts and automatically produces organized and up-to-date memory content.” Research preview in Managed Agents API.
  • Dreaming is async, batch, and out-of-band. Doesn’t run inside a session. Triggered three ways:
    1. Cron-style schedule via Console or API.
    2. Plugged into existing pipelines — kick off when an agent finishes a task and is spinning down (“save those learnings before exit”).
    3. Manual via the Claude Console UI. Out-of-band design is deliberate: keeps the hot path fast (no latency added to active sessions), separates memory-quality objective from task-completion objective, lets the dreaming agent see across multiple agents’ sessions for shared patterns no single agent could detect from its own perspective.
  • Harvey outcome. Deployed Dreaming on one of their legal benchmarks. Saw a 6× increase in task completion rate on a “pretty realistic legal scenario.” Concrete, named, dramatic.
  • Dreaming output is a diff applied to a memory store. From the live demo (an SRE-agent fleet handling P1 alerts):
    • Pattern discovery: “a bunch of these agents were triggered exactly 60 seconds after an upstream CPU spike → there’s likely retry logic that’s inefficient.” No single agent saw the pattern; Dreaming did.
    • Deduplication and curation: 5 redundant memory entries collapsed to 1.
    • Stale-entry removal: caught one entry no longer valid based on transcript evidence.
    • Verification backfill: appends a “verified at this time based on this transcript” note so downstream agents can rely on it tomorrow.
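  The diff operations above can be mirrored in a toy consolidation pass. The entry schema, the `consolidate` helper, and the example entries are all invented for illustration; only the four operations (surface a new pattern, deduplicate, drop stale entries, backfill verification notes) come from the demo.

```python
from datetime import date

# Toy consolidation pass: dedupe entries, drop stale ones, stamp the
# survivors with a verification note. Schema and names are invented.
def consolidate(entries, verified_on):
    seen, out = set(), []
    for e in entries:
        if e.get("stale"):          # stale-entry removal
            continue
        if e["text"] in seen:       # deduplication
            continue
        seen.add(e["text"])
        out.append({**e, "verified": str(verified_on)})  # verification backfill
    return out

entries = [
    {"text": "dispatch P1s fire 60s after upstream CPU spikes"},  # new pattern
    {"text": "restart dispatch on OOM"},
    {"text": "restart dispatch on OOM"},                # duplicate, collapsed
    {"text": "flag X causes latency", "stale": True},   # contradicted by transcripts
]

print(consolidate(entries, date(2026, 5, 8)))  # 2 entries survive, both stamped
```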
  • Test-time-compute analogy for memory quality. Mahes frames Dreaming using the same scaling-law shape as test-time compute / thinking models: let an agent spend more tokens on memory upkeep to get better downstream outcomes. Memory becomes a dedicated objective separate from task completion — “memory is going to be increasingly load-bearing.”
  • Search-index analogy. Dreaming is the upfront-effort step that produces the high-quality, up-to-date index; downstream agents reading from the memory store get to amortize that effort across many retrievals. “We can amortize this effort across all those agents that are reading from a memory store.”
  • The frontier memory system as Anthropic now sees it. Memory (real-time read/write during a session) on the left; Dreaming (comprehensive batch process to verify, organize, enrich memory) on the right. Dreaming is the bridge between intermediate per-task memory and large-scale shared knowledge bases that Anthropic expects to see across enterprise multi-agent fleets.

Where it fits in the wiki

  • Refreshes Claude Dreaming. The existing article was written from the keynote’s brief Dreaming demo (Caitlin’s drone-landing playbook, ~5 minutes of stage time). This talk is the engineering deep-dive — adds the Harvey 6× number, the file-system memory model, the optimistic-concurrency mechanic, the version-history requirement, the SRE demo, and the test-time-compute / search-index framing. Refreshing claude-dreaming.md with a “Memory + Dreaming as one system” section is the right downstream move.
  • Slots into the Code with Claude 2026 keynote tree. The keynote was the umbrella; this talk is a specific deep-dive session at the same conference. Other deep-dive sessions from the conference would each get their own article and link forward from the keynote.
  • Pairs with Claude Managed Agents as the canonical Memory + Dreaming reference. Memory shipped in public beta a few weeks before the talk; Dreaming shipped in research preview during the talk. Both are inside the Managed Agents API.
  • Composes with Managed Agents cookbook coverage. The cookbook ships the multiagent + outcomes patterns; memory is the substrate they both write to / read from. A memory-aware variant of the multi-agent coordinator pattern is the next likely cookbook addition.
  • Reframes the Skills story. Mahes calls Skills “procedural memory with a lightweight spec.” Memory + Dreaming + Skills now form a stacked memory taxonomy: Skills are how agents acquire reusable capabilities; Memory is how they accumulate per-task / per-environment knowledge; Dreaming is how they consolidate it.

Implementation

  • Tool/Service: Memory + Dreaming inside Claude Managed Agents API.
    • Memory: public beta (launched a few weeks before May 7 2026).
    • Dreaming: research preview (launched May 7 2026).
  • Setup:
    • Memory: enable in Managed Agents API. Stores accessed via standard tools (bash, grep) or via the standalone Memory API for out-of-harness curation.
    • Dreaming: kick off via Claude Console (UI), via Managed Agents API (cron-able), or via end-of-task hook in your existing harness.
  • Cost: Pay-as-you-go — Dreaming spends additional tokens on memory upkeep (test-time-compute-style). Vendor pitch: amortized across many retrievals from the same memory store, so per-retrieval cost falls.
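  The amortization pitch reduces to simple arithmetic. Every number below is an invented placeholder — the talk quantifies none of these figures.

```python
# Back-of-envelope for the amortization claim. All numbers are
# hypothetical placeholders, not figures from the talk.
dream_tokens   = 200_000  # one Dreaming run over a week of sessions
saved_per_read = 5_000    # tokens saved per downstream retrieval
reads_per_week = 400      # retrievals against the store that week

break_even = dream_tokens / saved_per_read          # retrievals to pay it back
net_saving = reads_per_week * saved_per_read - dream_tokens

print(break_even)  # 40.0
print(net_saving)  # 1800000
```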
  • Integration notes:
    • Permission scopes — set per agent per memory store (read-only, read-write, no-access). Demo’s three-store SRE pattern (org-wide read-only + service-specific read-write + codebase context) is the recommended starting shape for multi-agent fleets.
    • Optimistic concurrency — content-hash preconditions on every write, so simultaneous agents can’t silently overwrite each other.
    • Version history — full audit log; metadata: which agent, when, which session. Roll back / inspect / give agents access for “what changed and why.”
    • Standalone API — portable; bring your own PII scanner, cleanup pipeline, archival/clone destination.
    • Dreaming triggers — recommended pattern is “kick off when an agent finishes a task and is spinning down” so learnings don’t sit unwritten.
    • Async / out-of-band — does not block active sessions, does not add latency to the hot path.
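  The permission-scope note can be sketched as a tiny access-control table using the demo’s three-store SRE shape. The store names and the `can` helper are invented; the talk specifies only the scope semantics (read-only, read-write, no-access by default).

```python
# Toy per-agent, per-store permission scopes; names are invented,
# shaped after the demo's three-store SRE pattern.
SCOPES = {
    "sre-agent": {
        "org-knowledge": "ro",  # runbooks, SLO guidelines
        "sre-working":   "rw",  # the agent's own working memory
        "codebase":      "ro",
    },
}

def can(agent, store, op):
    scope = SCOPES.get(agent, {}).get(store, "none")
    return scope == "rw" or (scope == "ro" and op == "read")

assert can("sre-agent", "org-knowledge", "read")
assert not can("sre-agent", "org-knowledge", "write")
assert can("sre-agent", "sre-working", "write")
assert not can("sre-agent", "codebase", "write")
assert not can("other-agent", "sre-working", "read")  # no-access by default
```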
  • Demo workflow (SRE agent fleet, paraphrased from the talk):
    1. P1 alert from dispatch service → spin up SRE agent A with access to three memory stores (org-wide RO, SRE RW, codebase RO).
    2. SRE agent A investigates CPU utilization, traffic patterns, recent PRs. Writes findings to SRE memory store (RW).
    3. Same alert pages a few minutes later → SRE agent B spins up. Reads SRE memory store first. Sees note from A. Short-circuits investigation. Token efficiency + intelligence gain without code changes.
    4. Overnight: kick off Dreaming on the SRE memory store with the past 7 days of sessions. Dreaming agent spins up sub-agents to look through transcripts, identifies the 60-second-after-CPU-spike pattern, deduplicates 5 redundant entries → 1, removes 1 stale entry, adds verification note. Diff applied to memory store.
    5. Next day’s SRE agents start with a richer, deduped, verified memory store.
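  Steps 1-3 of the workflow can be replayed as a toy loop: agent A pays for the investigation and writes its finding; agent B reads the store first and short-circuits. Everything here is invented illustration of the pattern, not real harness code.

```python
# Toy replay of steps 1-3: write-then-reuse through a shared store.
# Function, alert IDs, and the finding text are all invented.
def handle_alert(agent, alert, sre_store):
    if alert in sre_store:                        # memory hit: skip re-investigation
        return f"{agent} reused: {sre_store[alert]}"
    finding = "CPU spike driven by inefficient retry logic"  # "expensive" work
    sre_store[alert] = finding                    # write to the RW store
    return f"{agent} investigated: {finding}"

sre_store = {}
print(handle_alert("agent-A", "P1-dispatch", sre_store))  # full investigation
print(handle_alert("agent-B", "P1-dispatch", sre_store))  # short-circuits via memory
```

  Step 4 (the overnight Dreaming pass) then operates on the accumulated store out-of-band, so neither agent pays latency for the consolidation.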

Open Questions

  • Dreaming token cost at scale. Mahes invokes the test-time-compute analogy — but doesn’t quantify how much compute Dreaming spends per session it considers, or per memory-store update. Worth tracking once usage data emerges.
  • Inter-agent memory contamination. Optimistic concurrency prevents silent overwrites, but doesn’t prevent one agent writing wrong/misleading content into a shared memory store and Dreaming consolidating it. What’s the abuse / bad-actor / poisoned-memory threat model?
  • Memory portability across model versions. If Memory is file-system-modeled and Opus 4.7 is “state-of-the-art at file-system memory,” what happens when Opus 5 ships? Is the memory format model-independent or do you re-Dream when models change?
  • Dreaming + Skills overlap. Skills are “procedural memory with a lightweight spec.” Dreaming reflects on past sessions and updates a memory store. When does a learning surface as a Skill (durable, portable, version-controlled) vs as Memory (per-environment, dynamic, agent-managed)? The Mahes framing implies they coexist; the harness boundary is where the next clarification belongs.
  • Standalone Memory API surface. “Customers do PII scanning, cleanup, cloning.” What auth model — does the Memory API support tenant-scoped tokens, organization-wide tokens, etc.? Not addressed in the talk.
  • Per-customer Dreaming isolation. Dreaming reflects on agent-session transcripts. For multi-tenant SaaS deployments built on Managed Agents, what boundary prevents one tenant’s transcripts from contaminating another’s memory store via Dreaming? Implicitly, customer-managed memory stores are the boundary, but worth verifying in the docs.

Try It

  1. Watch the talk (YouTube RtywqDFBYnQ, ~10 min). Mahes goes through the SRE-agent demo around 8:00 — short and concrete.
  2. Read the existing Claude Managed Agents article first for the API surface that Memory + Dreaming live inside.
  3. Try Memory in Managed Agents if you have access. Start with the three-store SRE pattern (org-wide RO + service-specific RW + codebase RO) since that’s the demo’s reference shape and the permission-scope mechanic is the easy lever for multi-agent fleets.
  4. Schedule one Dreaming run on a memory store after a multi-session experiment. Look at the diff: is the deduplication useful? Are stale-entry removals correct? Is the verification-note pattern usable downstream?
  5. For agentic CI scenarios specifically, the Mahes “trigger Dreaming when agent spins down” pattern is a clean fit — wire it into the harness’s session-end hook.
  6. Pair with the Managed Agents cookbook patterns. Multi-agent + outcomes generate more diverse session traces; Memory absorbs them; Dreaming consolidates them. The three primitives together are the “self-learning agent” loop the talk describes.
  7. Skim the Karpathy “Vibe Coding to Agentic Engineering” talk for the broader frame — Karpathy’s “LLM knowledge bases as understanding tools” thesis is the user-side mirror of what Mahes is building agent-side.