Reflexio — Self-Improvement Harness for AI Agents (ReflexioAI)

Source: ai-research/reflexioai-reflexio-github-2026-05-18.md (README + benchmark/gdpval/RESULTS.md + gh CLI metadata, fetched 2026-05-18) Repo: github.com/ReflexioAI/reflexio (Apache 2.0, Python ≥3.12, 200★ at ingest) PyPI: reflexio-ai (server), reflexio-client (SDK) Homepage: reflexio.ai Created: 2026-04-11 (5 weeks before ingest); active development

External self-improvement harness that sits alongside an AI agent, reads completed runs, and produces a copy-pasteable recipe for the next run of the same task. Two artifact types: user profiles (per-user facts like “production region is us-west-2”) and agent playbooks (procedural recipes like “confirm region before deploying”). Drop-in integrations for Claude Code, LangChain, and OpenClaw. Architectural sibling to Browserbase Autobrowse (which graduates winning browser strategies into a SKILL.md) and to the on-board memory layer in Nous Hermes Agent — but as an external layer, not baked into the agent.

Key Takeaways

External, not in-agent. Reflexio runs as a service (FastAPI backend, SQLite, ports 8081/8082). The agent publishes completed conversations; Reflexio extracts profiles + playbooks; the agent retrieves them next run. The agent itself stays the same.
Two artifact types. User profiles are per-user facts. Agent playbooks are procedural recipes — “trigger / instruction / pitfall” SOPs. Versioning is current → pending → archived with an approval workflow.
Expert mode is the load-bearing feature. Publish a human expert’s ideal response alongside the agent response via expert_content; Reflexio compares them, filters out stylistic deltas, and writes a playbook from the substantive differences (missing info, wrong approach, reasoning gaps). This is the closest thing in the repo to genuine transfer learning.
Headline benchmark claim: −81% planning steps / −72% tokens on Hermes running MiniMax-M2.7. On 4 of 5 selected GDPVal knowledge-work tasks, against a warm baseline (the same agent re-running the task after its own native self-improvement has fired). Real benchmark, real numbers — read the caveats below before quoting.
Cross-host aggregate is more conservative: −50% steps / −57% tokens (median across 7 measurements over 4 qualifying tasks on both hosts). The README’s headline is the Hermes-host-specific slice.
Honest failure case documented. The 5th task (Police legal reference) doesn’t benefit; the writeup explains why — warm baseline saturates the step budget on Hermes, or the recipe’s context overhead inflates tokens on a short OpenSpace baseline.
57 ms p50 retrieval at ~3,000 indexed rows (SQLite, local Apple Silicon, 30 trials × 20 fixed queries). Fast enough for in-loop use.
Multi-provider via LiteLLM. OpenAI, Anthropic, Gemini, OpenRouter, Azure, MiniMax, custom endpoints. Pin the internal pipeline to a separate model than the host agent (the benchmark uses gpt-5-mini for Reflexio while the host runs MiniMax-M2.7).

Architecture

Client (SDK / Web UI)
  → FastAPI Backend
    → Reflexio Orchestrator
      → GenerationService
        ├─ ProfileGenerationService  → Extractor(s) → Deduplicator → Storage
        ├─ PlaybookGenerationService → Extractor(s) → Deduplicator → Storage
        └─ GroupEvaluationScheduler  → Evaluator(s) → Storage (deferred 10 min)

The shape is familiar — extract → dedupe → store → retrieve is what every memory/RAG layer eventually converges on. The novel bits:

Two-class storage. Profiles and playbooks are separate first-class entity types with their own extractors, schemas, and approval workflows. Most agent memory systems collapse both into “memories” / “facts.”
should_generate gate. Reflexio’s extractor can decline to write a playbook when a task has inherently low compressibility (the docs cite news-article writing as the canonical example). The system knows when there’s nothing worth caching.
Deferred group evaluation. Session-level evaluation fires 10 minutes after the last request, not synchronously — enables shadow A/B comparisons (regular vs shadow agent) and tool-usage analysis without blocking the foreground loop.

Benchmark — honest read

The benchmark protocol is the most interesting thing in the repo. Three runs per task per host:

run	what it measures
Cold	Fresh agent, no Reflexio — first time on the task
Warm	Same agent re-running, with its own native learning active — the honest baseline
Warm + Reflexio	Warm agent + cheat sheet extracted from the cold run

The null hypothesis is explicit: Warm + Reflexio = Warm alone (Reflexio just re-does the agent’s own work). Any third-run savings are on top of whatever the host agent learned for itself. This is a more rigorous setup than the usual “compare against cold.” ^[inferred]

Headline numbers caveats ^[inferred]:

The −81% / −72% in the README is the Hermes-host median across 4 qualifying tasks (Lawyer −80%/−64%, VA −82%/−85%, Federal −83%/−80%, RE-PIP −43%/−37%). The cross-host aggregate is −50% / −57%. Both are reported in RESULTS.md; the README leads with the bigger one.
N=5 tasks out of 50 in the public GDPVal subset. Selection criterion (“base agent can complete end-to-end in the iteration budget”) filters out harder tasks. Fair filter, small sample.
Reflexio sees the cold run of the same task before extracting the recipe used in the Warm + Reflexio run. So what’s measured is task-specific memoization with retrieval, not transfer learning across tasks. The marketing line (“what one user teaches, every user benefits from”) is broader than what the benchmark actually demonstrates.
Warm baseline is one re-run. Tougher baselines (host agent with explicit reflection, N>2 retries, a hand-written hint) would likely compress the Reflexio delta.

Counter-balancing positives the discussion section earns:

The 5th task failure (Police legal reference) is documented, not buried. Two distinct failure modes named: warm-baseline-saturates-step-budget (Hermes case) and recipe-context-overhead-inflates-tokens-on-short-baseline (OpenSpace case).
OpenSpace’s Lawyer COPPA 0.9 → 0.2 quality crash from Cold → Warm is attributed to OpenSpace’s own skill evolver capturing a bad shortcut, not to Reflexio. Reflexio holds cost+score on top of the broken baseline. Author isolates host-internal regressions from harness contribution.
The “creative tasks correctly rejected” pattern (news-article writing) is presented as a feature, not hidden as a gap. Self-aware about scope.

Read the discussion section in benchmark/gdpval/RESULTS.md directly — it’s better than most pre-funding benchmark writeups.

Provenance + identity caution

Author identity is 5 weeks old. ^[inferred] The GitHub account yilu331 was created 2026-04-11, the same day the repo was created. No name, no bio, no company, no blog. 1 public repo, 1 follower. All 134 commits across the repo are by the same person (the second contributor yyiilluu is the same person with different username casing). Zero outside contributors.
This is consistent with either (a) a stealth founder launching publicly before raising, (b) a serious solo project from someone who decided GitHub starts today, or (c) something else worth knowing about. ^[inferred]
The code is real, Apache 2.0, on PyPI, and the benchmark is reproducible (commands documented). The risk is not “does it work” — it’s “is there a second engineer who can maintain it if Yi Lu disappears.”

Integrations

Surface	Integration shape
Claude Code	Hooks into Claude Code sessions to automatically capture corrections and preferences. Code at `reflexio/integrations/claude_code/`.
LangChain	Drop-in callbacks for chains and agents. Code at `reflexio/integrations/langchain/`.
OpenClaw	Native integration with the OpenClaw agent framework. Code at `reflexio/integrations/openclaw/` + `openclaw-embedded/`.
Hermes	No first-party integration shipped, but Hermes is the headline host in the GDPVal benchmark. Integration shape is “agent publishes interactions via the Reflexio SDK; Reflexio writes playbooks back into the agent’s context.”

Implementation

Tool/Service: ReflexioAI/reflexio (Apache 2.0, self-hosted Python service)

Setup:

pip install reflexio-ai
 
reflexio services start    # API (8081), Docs (8082), SQLite storage at ~/.reflexio
reflexio services stop

Open http://localhost:8082 for the interactive docs UI. Configure at least one LLM API key (OpenAI, Anthropic, etc.) before publishing.

Cost:

Runtime: free (self-hosted, MIT-friendly Apache 2.0)
Inference: pay the LLM provider directly via LiteLLM
The benchmark setup pins Reflexio’s internal pipeline to openai/gpt-5-mini while the host agent runs minimax/MiniMax-M2.7 — separating the harness’s reasoning cost from the agent’s task cost is the recommended pattern

Integration notes:

SDK shape: client.publish_interaction(user_id, interactions, agent_version, session_id) to feed conversations in; client.search_user_profiles() and client.get_agent_playbooks() to pull them out
expert_content field on an interaction triggers expert-comparison mode (publish ideal-response alongside agent-response → Reflexio diffs them → playbook)
Versioning workflow lets you stage playbook changes (pending) for review before promoting (current)
Default group evaluation runs 10 min after the last session activity — async, doesn’t block the agent loop

Try It

Install and publish a single corrective conversation. Walk through the pip install reflexio-ai → publish → search flow from the README. Confirm it extracts a user profile and an agent playbook from a 4-turn correction.
Wire into Claude Code. Use reflexio/integrations/claude_code/ to capture corrections automatically from a real Claude Code session. Compare: what does Reflexio surface from a week of corrections that you wouldn’t have written down by hand?
Compare against Hermes’ native memory. Hermes already does on-board self-improvement. Run a repeatable task (e.g. a weekly summary cron) on Hermes alone for a week, then with Reflexio publishing alongside. Measure: planning steps + tokens + output quality. Does the external layer add anything Hermes wasn’t already learning?
Use expert mode. Pick a task where you have a clean human-written ideal response. Publish 5–10 (agent response, expert response) pairs and read the resulting playbook. This is closer to a fair test of the transfer-learning claim than the benchmark protocol is.
Reproduce the GDPVal benchmark. The benchmark/gdpval/RESULTS.md ships a copy-pasteable reproduction recipe — backend on port 8091, deterministic LLM seed via REFLEXIO_LLM_SEED=0, 5 task UUIDs, both hosts (OpenSpace + Hermes), --phases p1,p2,p3. Wallclock ~3 hours. Verify the headline numbers on your own hardware before quoting them.

Nous Hermes Agent — the headline host in Reflexio’s benchmark; Hermes already does on-board memory + skill auto-generation, so Reflexio’s claim is specifically about adding an external layer on top of that
Browserbase Autobrowse — closest architectural sibling: iterates a browser-agent strategy until convergence, then graduates the winning approach into a markdown SKILL.md. Same “harness extracts a reusable recipe from a successful run” pattern, different domain (browser agents vs general task agents)
Agent Workflow Patterns — Anthropic’s sequential/parallel/evaluator-optimizer taxonomy; Reflexio sits outside the agent loop, so it’s orthogonal to which workflow shape the agent runs
Claude Managed Agents — Anthropic’s hosted alternative to “long-running autonomous agent with memory.” Reflexio is the BYO-harness option that lives next to any agent (including Managed Agents) rather than replacing the runtime
Memory & Dreaming — Self-Learning Agents — the Anthropic thesis Reflexio is a community implementation of; the architectural debate is “memory in the agent” (Hermes, Anthropic) vs “memory beside the agent” (Reflexio)
Alex Albert — Inside How Anthropic Is Building the Next Claude — describes dreaming-as-overnight-memory-pruning; same problem space, in-model approach instead of external harness
Skill Systems — Orchestrator + Child Pattern — community Claude-as-manager pattern; pairs naturally with Reflexio’s playbook output as the “child skill” content
Karpathy — From Vibe Coding to Agentic Engineering — the broader framing for why agent self-improvement is a hot pattern in 2026
Loop Engineering (Cobus Greyling) — the loop-design discipline (readiness ladder + failure-mode catalog) around the same “harness extracts a reusable recipe from a successful run” pattern Reflexio implements.

Open Questions

Who is Yi Lu? ^[inferred] The GitHub identity is 5 weeks old with no metadata; the repo has no second engineer. Stealth founder? Anthropic / Nous engineer using an alt? Public information at ingest gives no answer. Worth knowing before depending on the project.
What’s the difference between Reflexio’s playbook extractor and DSPy / GEPA? The repo mentions a “playbook-optimizer” with GEPA — exact relationship between Reflexio’s extractor pipeline and Stanford’s Genetic-Pareto algorithm for prompt optimization is unclear from the README alone. Code inspection or docs read needed.
Does the −81% number reproduce on a fresh box? Reproduction recipe is documented; nobody outside the author has confirmed numbers publicly yet (as of 2026-05-18). Cross-host −50% / −57% is the conservative read until a second party reproduces.
How does the system behave when the user-correction is wrong? The benchmark assumes corrections are ground truth (“never deploy production to us-east-1” example). Real users sometimes correct an agent with worse advice — does Reflexio have any defense against learning incorrect corrections, or does it just absorb them?
Is the expert_content workflow load-bearing in production deployments, or is it aspirational? The discussion section doesn’t surface any benchmark of expert-mode specifically. The most defensible transfer-learning claim sits in this mode, but the GDPVal benchmark doesn’t use it.

Jonathon's AI Wiki

Explorer

Reflexio — Self-Improvement Harness for AI Agents (ReflexioAI)

Key Takeaways

Architecture

Benchmark — honest read

Provenance + identity caution

Integrations

Implementation

Try It

Open Questions

Graph View

Table of Contents

Backlinks

Jonathon's AI Wiki

Explorer

Reflexio — Self-Improvement Harness for AI Agents (ReflexioAI)

Key Takeaways

Architecture

Benchmark — honest read

Provenance + identity caution

Integrations

Implementation

Try It

Related

Open Questions

Graph View

Table of Contents

Backlinks