Source: reflexioai-reflexio-github-2026-05-18 (README + benchmark/gdpval/RESULTS.md + gh CLI metadata, fetched 2026-05-18) Repo: github.com/ReflexioAI/reflexio (Apache 2.0, Python ≥3.12, 200★ at ingest) PyPI: reflexio-ai (server), reflexio-client (SDK) Homepage: reflexio.ai Created: 2026-04-11 (5 weeks before ingest); active development

External self-improvement harness that sits alongside an AI agent, reads completed runs, and produces a copy-pasteable recipe for the next run of the same task. Two artifact types: user profiles (per-user facts like “production region is us-west-2”) and agent playbooks (procedural recipes like “confirm region before deploying”). Drop-in integrations for Claude Code, LangChain, and OpenClaw. Architectural sibling to Browserbase Autobrowse (which graduates winning browser strategies into a SKILL.md) and to the on-board memory layer in Nous Hermes Agent — but as an external layer, not baked into the agent.

Key Takeaways

  • External, not in-agent. Reflexio runs as a service (FastAPI backend, SQLite, ports 8081/8082). The agent publishes completed conversations; Reflexio extracts profiles + playbooks; the agent retrieves them next run. The agent itself stays the same.
  • Two artifact types. User profiles are per-user facts. Agent playbooks are procedural recipes — “trigger / instruction / pitfall” SOPs. Versioning is current → pending → archived with an approval workflow.
  • Expert mode is the load-bearing feature. Publish a human expert’s ideal response alongside the agent response via expert_content; Reflexio compares them, filters out stylistic deltas, and writes a playbook from the substantive differences (missing info, wrong approach, reasoning gaps). This is the closest thing in the repo to genuine transfer learning.
  • Headline benchmark claim: −81% planning steps / −72% tokens on Hermes running MiniMax-M2.7. On 4 of 5 selected GDPVal knowledge-work tasks, against a warm baseline (the same agent re-running the task after its own native self-improvement has fired). Real benchmark, real numbers — read the caveats below before quoting.
  • Cross-host aggregate is more conservative: −50% steps / −57% tokens (median across 7 measurements over 4 qualifying tasks on both hosts). The README’s headline is the Hermes-host-specific slice.
  • Honest failure case documented. The 5th task (Police legal reference) doesn’t benefit; the writeup explains why — warm baseline saturates the step budget on Hermes, or the recipe’s context overhead inflates tokens on a short OpenSpace baseline.
  • 57 ms p50 retrieval at ~3,000 indexed rows (SQLite, local Apple Silicon, 30 trials × 20 fixed queries). Fast enough for in-loop use.
  • Multi-provider via LiteLLM. OpenAI, Anthropic, Gemini, OpenRouter, Azure, MiniMax, custom endpoints. Pin the internal pipeline to a separate model than the host agent (the benchmark uses gpt-5-mini for Reflexio while the host runs MiniMax-M2.7).

Architecture

Client (SDK / Web UI)
  → FastAPI Backend
    → Reflexio Orchestrator
      → GenerationService
        ├─ ProfileGenerationService  → Extractor(s) → Deduplicator → Storage
        ├─ PlaybookGenerationService → Extractor(s) → Deduplicator → Storage
        └─ GroupEvaluationScheduler  → Evaluator(s) → Storage (deferred 10 min)

The shape is familiar — extract → dedupe → store → retrieve is what every memory/RAG layer eventually converges on. The novel bits:

  • Two-class storage. Profiles and playbooks are separate first-class entity types with their own extractors, schemas, and approval workflows. Most agent memory systems collapse both into “memories” / “facts.”
  • should_generate gate. Reflexio’s extractor can decline to write a playbook when a task has inherently low compressibility (the docs cite news-article writing as the canonical example). The system knows when there’s nothing worth caching.
  • Deferred group evaluation. Session-level evaluation fires 10 minutes after the last request, not synchronously — enables shadow A/B comparisons (regular vs shadow agent) and tool-usage analysis without blocking the foreground loop.

Benchmark — honest read

The benchmark protocol is the most interesting thing in the repo. Three runs per task per host:

runwhat it measures
ColdFresh agent, no Reflexio — first time on the task
WarmSame agent re-running, with its own native learning active — the honest baseline
Warm + ReflexioWarm agent + cheat sheet extracted from the cold run

The null hypothesis is explicit: Warm + Reflexio = Warm alone (Reflexio just re-does the agent’s own work). Any third-run savings are on top of whatever the host agent learned for itself. This is a more rigorous setup than the usual “compare against cold.” ^[inferred]

Headline numbers caveats ^[inferred]:

  • The −81% / −72% in the README is the Hermes-host median across 4 qualifying tasks (Lawyer −80%/−64%, VA −82%/−85%, Federal −83%/−80%, RE-PIP −43%/−37%). The cross-host aggregate is −50% / −57%. Both are reported in RESULTS.md; the README leads with the bigger one.
  • N=5 tasks out of 50 in the public GDPVal subset. Selection criterion (“base agent can complete end-to-end in the iteration budget”) filters out harder tasks. Fair filter, small sample.
  • Reflexio sees the cold run of the same task before extracting the recipe used in the Warm + Reflexio run. So what’s measured is task-specific memoization with retrieval, not transfer learning across tasks. The marketing line (“what one user teaches, every user benefits from”) is broader than what the benchmark actually demonstrates.
  • Warm baseline is one re-run. Tougher baselines (host agent with explicit reflection, N>2 retries, a hand-written hint) would likely compress the Reflexio delta.

Counter-balancing positives the discussion section earns:

  • The 5th task failure (Police legal reference) is documented, not buried. Two distinct failure modes named: warm-baseline-saturates-step-budget (Hermes case) and recipe-context-overhead-inflates-tokens-on-short-baseline (OpenSpace case).
  • OpenSpace’s Lawyer COPPA 0.9 → 0.2 quality crash from Cold → Warm is attributed to OpenSpace’s own skill evolver capturing a bad shortcut, not to Reflexio. Reflexio holds cost+score on top of the broken baseline. Author isolates host-internal regressions from harness contribution.
  • The “creative tasks correctly rejected” pattern (news-article writing) is presented as a feature, not hidden as a gap. Self-aware about scope.

Read the discussion section in benchmark/gdpval/RESULTS.md directly — it’s better than most pre-funding benchmark writeups.

Provenance + identity caution

  • Author identity is 5 weeks old. ^[inferred] The GitHub account yilu331 was created 2026-04-11, the same day the repo was created. No name, no bio, no company, no blog. 1 public repo, 1 follower. All 134 commits across the repo are by the same person (the second contributor yyiilluu is the same person with different username casing). Zero outside contributors.
  • This is consistent with either (a) a stealth founder launching publicly before raising, (b) a serious solo project from someone who decided GitHub starts today, or (c) something else worth knowing about. ^[inferred]
  • The code is real, Apache 2.0, on PyPI, and the benchmark is reproducible (commands documented). The risk is not “does it work” — it’s “is there a second engineer who can maintain it if Yi Lu disappears.”

Integrations

SurfaceIntegration shape
Claude CodeHooks into Claude Code sessions to automatically capture corrections and preferences. Code at reflexio/integrations/claude_code/.
LangChainDrop-in callbacks for chains and agents. Code at reflexio/integrations/langchain/.
OpenClawNative integration with the OpenClaw agent framework. Code at reflexio/integrations/openclaw/ + openclaw-embedded/.
HermesNo first-party integration shipped, but Hermes is the headline host in the GDPVal benchmark. Integration shape is “agent publishes interactions via the Reflexio SDK; Reflexio writes playbooks back into the agent’s context.”

Implementation

Tool/Service: ReflexioAI/reflexio (Apache 2.0, self-hosted Python service)

Setup:

pip install reflexio-ai
 
reflexio services start    # API (8081), Docs (8082), SQLite storage at ~/.reflexio
reflexio services stop

Open http://localhost:8082 for the interactive docs UI. Configure at least one LLM API key (OpenAI, Anthropic, etc.) before publishing.

Cost:

  • Runtime: free (self-hosted, MIT-friendly Apache 2.0)
  • Inference: pay the LLM provider directly via LiteLLM
  • The benchmark setup pins Reflexio’s internal pipeline to openai/gpt-5-mini while the host agent runs minimax/MiniMax-M2.7 — separating the harness’s reasoning cost from the agent’s task cost is the recommended pattern

Integration notes:

  • SDK shape: client.publish_interaction(user_id, interactions, agent_version, session_id) to feed conversations in; client.search_user_profiles() and client.get_agent_playbooks() to pull them out
  • expert_content field on an interaction triggers expert-comparison mode (publish ideal-response alongside agent-response → Reflexio diffs them → playbook)
  • Versioning workflow lets you stage playbook changes (pending) for review before promoting (current)
  • Default group evaluation runs 10 min after the last session activity — async, doesn’t block the agent loop

Try It

  1. Install and publish a single corrective conversation. Walk through the pip install reflexio-ai → publish → search flow from the README. Confirm it extracts a user profile and an agent playbook from a 4-turn correction.
  2. Wire into Claude Code. Use reflexio/integrations/claude_code/ to capture corrections automatically from a real Claude Code session. Compare: what does Reflexio surface from a week of corrections that you wouldn’t have written down by hand?
  3. Compare against Hermes’ native memory. Hermes already does on-board self-improvement. Run a repeatable task (e.g. a weekly summary cron) on Hermes alone for a week, then with Reflexio publishing alongside. Measure: planning steps + tokens + output quality. Does the external layer add anything Hermes wasn’t already learning?
  4. Use expert mode. Pick a task where you have a clean human-written ideal response. Publish 5–10 (agent response, expert response) pairs and read the resulting playbook. This is closer to a fair test of the transfer-learning claim than the benchmark protocol is.
  5. Reproduce the GDPVal benchmark. The benchmark/gdpval/RESULTS.md ships a copy-pasteable reproduction recipe — backend on port 8091, deterministic LLM seed via REFLEXIO_LLM_SEED=0, 5 task UUIDs, both hosts (OpenSpace + Hermes), --phases p1,p2,p3. Wallclock ~3 hours. Verify the headline numbers on your own hardware before quoting them.
  • Nous Hermes Agent — the headline host in Reflexio’s benchmark; Hermes already does on-board memory + skill auto-generation, so Reflexio’s claim is specifically about adding an external layer on top of that
  • Browserbase Autobrowse — closest architectural sibling: iterates a browser-agent strategy until convergence, then graduates the winning approach into a markdown SKILL.md. Same “harness extracts a reusable recipe from a successful run” pattern, different domain (browser agents vs general task agents)
  • Agent Workflow Patterns — Anthropic’s sequential/parallel/evaluator-optimizer taxonomy; Reflexio sits outside the agent loop, so it’s orthogonal to which workflow shape the agent runs
  • Claude Managed Agents — Anthropic’s hosted alternative to “long-running autonomous agent with memory.” Reflexio is the BYO-harness option that lives next to any agent (including Managed Agents) rather than replacing the runtime
  • Memory & Dreaming — Self-Learning Agents — the Anthropic thesis Reflexio is a community implementation of; the architectural debate is “memory in the agent” (Hermes, Anthropic) vs “memory beside the agent” (Reflexio)
  • Alex Albert — Inside How Anthropic Is Building the Next Claude — describes dreaming-as-overnight-memory-pruning; same problem space, in-model approach instead of external harness
  • Skill Systems — Orchestrator + Child Pattern — community Claude-as-manager pattern; pairs naturally with Reflexio’s playbook output as the “child skill” content
  • Karpathy — From Vibe Coding to Agentic Engineering — the broader framing for why agent self-improvement is a hot pattern in 2026

Open Questions

  • Who is Yi Lu? ^[inferred] The GitHub identity is 5 weeks old with no metadata; the repo has no second engineer. Stealth founder? Anthropic / Nous engineer using an alt? Public information at ingest gives no answer. Worth knowing before depending on the project.
  • What’s the difference between Reflexio’s playbook extractor and DSPy / GEPA? The repo mentions a “playbook-optimizer” with GEPA — exact relationship between Reflexio’s extractor pipeline and Stanford’s Genetic-Pareto algorithm for prompt optimization is unclear from the README alone. Code inspection or docs read needed.
  • Does the −81% number reproduce on a fresh box? Reproduction recipe is documented; nobody outside the author has confirmed numbers publicly yet (as of 2026-05-18). Cross-host −50% / −57% is the conservative read until a second party reproduces.
  • How does the system behave when the user-correction is wrong? The benchmark assumes corrections are ground truth (“never deploy production to us-east-1” example). Real users sometimes correct an agent with worse advice — does Reflexio have any defense against learning incorrect corrections, or does it just absorb them?
  • Is the expert_content workflow load-bearing in production deployments, or is it aspirational? The discussion section doesn’t surface any benchmark of expert-mode specifically. The most defensible transfer-learning claim sits in this mode, but the GDPVal benchmark doesn’t use it.