Source: reflexioai-reflexio-github-2026-05-18 (README + benchmark/gdpval/RESULTS.md + gh CLI metadata, fetched 2026-05-18)
Repo: github.com/ReflexioAI/reflexio (Apache 2.0, Python ≥3.12, 200★ at ingest)
PyPI: reflexio-ai (server), reflexio-client (SDK)
Homepage: reflexio.ai
Created: 2026-04-11 (5 weeks before ingest); active development
External self-improvement harness that sits alongside an AI agent, reads completed runs, and produces a copy-pasteable recipe for the next run of the same task. Two artifact types: user profiles (per-user facts like “production region is us-west-2”) and agent playbooks (procedural recipes like “confirm region before deploying”). Drop-in integrations for Claude Code, LangChain, and OpenClaw. Architectural sibling to Browserbase Autobrowse (which graduates winning browser strategies into a SKILL.md) and to the on-board memory layer in Nous Hermes Agent — but as an external layer, not baked into the agent.
Key Takeaways
- External, not in-agent. Reflexio runs as a service (FastAPI backend, SQLite, ports 8081/8082). The agent publishes completed conversations; Reflexio extracts profiles + playbooks; the agent retrieves them next run. The agent itself stays the same.
- Two artifact types. User profiles are per-user facts. Agent playbooks are procedural recipes — “trigger / instruction / pitfall” SOPs. Versioning is current → pending → archived with an approval workflow.
- Expert mode is the load-bearing feature. Publish a human expert’s ideal response alongside the agent response via
expert_content; Reflexio compares them, filters out stylistic deltas, and writes a playbook from the substantive differences (missing info, wrong approach, reasoning gaps). This is the closest thing in the repo to genuine transfer learning. - Headline benchmark claim: −81% planning steps / −72% tokens on Hermes running MiniMax-M2.7. On 4 of 5 selected GDPVal knowledge-work tasks, against a warm baseline (the same agent re-running the task after its own native self-improvement has fired). Real benchmark, real numbers — read the caveats below before quoting.
- Cross-host aggregate is more conservative: −50% steps / −57% tokens (median across 7 measurements over 4 qualifying tasks on both hosts). The README’s headline is the Hermes-host-specific slice.
- Honest failure case documented. The 5th task (Police legal reference) doesn’t benefit; the writeup explains why — warm baseline saturates the step budget on Hermes, or the recipe’s context overhead inflates tokens on a short OpenSpace baseline.
- 57 ms p50 retrieval at ~3,000 indexed rows (SQLite, local Apple Silicon, 30 trials × 20 fixed queries). Fast enough for in-loop use.
- Multi-provider via LiteLLM. OpenAI, Anthropic, Gemini, OpenRouter, Azure, MiniMax, custom endpoints. Pin the internal pipeline to a separate model than the host agent (the benchmark uses
gpt-5-minifor Reflexio while the host runsMiniMax-M2.7).
Architecture
Client (SDK / Web UI)
→ FastAPI Backend
→ Reflexio Orchestrator
→ GenerationService
├─ ProfileGenerationService → Extractor(s) → Deduplicator → Storage
├─ PlaybookGenerationService → Extractor(s) → Deduplicator → Storage
└─ GroupEvaluationScheduler → Evaluator(s) → Storage (deferred 10 min)
The shape is familiar — extract → dedupe → store → retrieve is what every memory/RAG layer eventually converges on. The novel bits:
- Two-class storage. Profiles and playbooks are separate first-class entity types with their own extractors, schemas, and approval workflows. Most agent memory systems collapse both into “memories” / “facts.”
should_generategate. Reflexio’s extractor can decline to write a playbook when a task has inherently low compressibility (the docs cite news-article writing as the canonical example). The system knows when there’s nothing worth caching.- Deferred group evaluation. Session-level evaluation fires 10 minutes after the last request, not synchronously — enables shadow A/B comparisons (regular vs shadow agent) and tool-usage analysis without blocking the foreground loop.
Benchmark — honest read
The benchmark protocol is the most interesting thing in the repo. Three runs per task per host:
| run | what it measures |
|---|---|
| Cold | Fresh agent, no Reflexio — first time on the task |
| Warm | Same agent re-running, with its own native learning active — the honest baseline |
| Warm + Reflexio | Warm agent + cheat sheet extracted from the cold run |
The null hypothesis is explicit: Warm + Reflexio = Warm alone (Reflexio just re-does the agent’s own work). Any third-run savings are on top of whatever the host agent learned for itself. This is a more rigorous setup than the usual “compare against cold.” ^[inferred]
Headline numbers caveats ^[inferred]:
- The −81% / −72% in the README is the Hermes-host median across 4 qualifying tasks (Lawyer −80%/−64%, VA −82%/−85%, Federal −83%/−80%, RE-PIP −43%/−37%). The cross-host aggregate is −50% / −57%. Both are reported in
RESULTS.md; the README leads with the bigger one. - N=5 tasks out of 50 in the public GDPVal subset. Selection criterion (“base agent can complete end-to-end in the iteration budget”) filters out harder tasks. Fair filter, small sample.
- Reflexio sees the cold run of the same task before extracting the recipe used in the Warm + Reflexio run. So what’s measured is task-specific memoization with retrieval, not transfer learning across tasks. The marketing line (“what one user teaches, every user benefits from”) is broader than what the benchmark actually demonstrates.
- Warm baseline is one re-run. Tougher baselines (host agent with explicit reflection, N>2 retries, a hand-written hint) would likely compress the Reflexio delta.
Counter-balancing positives the discussion section earns:
- The 5th task failure (Police legal reference) is documented, not buried. Two distinct failure modes named: warm-baseline-saturates-step-budget (Hermes case) and recipe-context-overhead-inflates-tokens-on-short-baseline (OpenSpace case).
- OpenSpace’s Lawyer COPPA
0.9 → 0.2quality crash from Cold → Warm is attributed to OpenSpace’s own skill evolver capturing a bad shortcut, not to Reflexio. Reflexio holds cost+score on top of the broken baseline. Author isolates host-internal regressions from harness contribution. - The “creative tasks correctly rejected” pattern (news-article writing) is presented as a feature, not hidden as a gap. Self-aware about scope.
Read the discussion section in benchmark/gdpval/RESULTS.md directly — it’s better than most pre-funding benchmark writeups.
Provenance + identity caution
- Author identity is 5 weeks old. ^[inferred] The GitHub account
yilu331was created 2026-04-11, the same day the repo was created. No name, no bio, no company, no blog. 1 public repo, 1 follower. All 134 commits across the repo are by the same person (the second contributoryyiilluuis the same person with different username casing). Zero outside contributors. - This is consistent with either (a) a stealth founder launching publicly before raising, (b) a serious solo project from someone who decided GitHub starts today, or (c) something else worth knowing about. ^[inferred]
- The code is real, Apache 2.0, on PyPI, and the benchmark is reproducible (commands documented). The risk is not “does it work” — it’s “is there a second engineer who can maintain it if Yi Lu disappears.”
Integrations
| Surface | Integration shape |
|---|---|
| Claude Code | Hooks into Claude Code sessions to automatically capture corrections and preferences. Code at reflexio/integrations/claude_code/. |
| LangChain | Drop-in callbacks for chains and agents. Code at reflexio/integrations/langchain/. |
| OpenClaw | Native integration with the OpenClaw agent framework. Code at reflexio/integrations/openclaw/ + openclaw-embedded/. |
| Hermes | No first-party integration shipped, but Hermes is the headline host in the GDPVal benchmark. Integration shape is “agent publishes interactions via the Reflexio SDK; Reflexio writes playbooks back into the agent’s context.” |
Implementation
Tool/Service: ReflexioAI/reflexio (Apache 2.0, self-hosted Python service)
Setup:
pip install reflexio-ai
reflexio services start # API (8081), Docs (8082), SQLite storage at ~/.reflexio
reflexio services stopOpen http://localhost:8082 for the interactive docs UI. Configure at least one LLM API key (OpenAI, Anthropic, etc.) before publishing.
Cost:
- Runtime: free (self-hosted, MIT-friendly Apache 2.0)
- Inference: pay the LLM provider directly via LiteLLM
- The benchmark setup pins Reflexio’s internal pipeline to
openai/gpt-5-miniwhile the host agent runsminimax/MiniMax-M2.7— separating the harness’s reasoning cost from the agent’s task cost is the recommended pattern
Integration notes:
- SDK shape:
client.publish_interaction(user_id, interactions, agent_version, session_id)to feed conversations in;client.search_user_profiles()andclient.get_agent_playbooks()to pull them out expert_contentfield on an interaction triggers expert-comparison mode (publish ideal-response alongside agent-response → Reflexio diffs them → playbook)- Versioning workflow lets you stage playbook changes (
pending) for review before promoting (current) - Default group evaluation runs 10 min after the last session activity — async, doesn’t block the agent loop
Try It
- Install and publish a single corrective conversation. Walk through the
pip install reflexio-ai → publish → searchflow from the README. Confirm it extracts a user profile and an agent playbook from a 4-turn correction. - Wire into Claude Code. Use
reflexio/integrations/claude_code/to capture corrections automatically from a real Claude Code session. Compare: what does Reflexio surface from a week of corrections that you wouldn’t have written down by hand? - Compare against Hermes’ native memory. Hermes already does on-board self-improvement. Run a repeatable task (e.g. a weekly summary cron) on Hermes alone for a week, then with Reflexio publishing alongside. Measure: planning steps + tokens + output quality. Does the external layer add anything Hermes wasn’t already learning?
- Use expert mode. Pick a task where you have a clean human-written ideal response. Publish 5–10 (agent response, expert response) pairs and read the resulting playbook. This is closer to a fair test of the transfer-learning claim than the benchmark protocol is.
- Reproduce the GDPVal benchmark. The
benchmark/gdpval/RESULTS.mdships a copy-pasteable reproduction recipe — backend on port 8091, deterministic LLM seed viaREFLEXIO_LLM_SEED=0, 5 task UUIDs, both hosts (OpenSpace + Hermes),--phases p1,p2,p3. Wallclock ~3 hours. Verify the headline numbers on your own hardware before quoting them.
Related
- Nous Hermes Agent — the headline host in Reflexio’s benchmark; Hermes already does on-board memory + skill auto-generation, so Reflexio’s claim is specifically about adding an external layer on top of that
- Browserbase Autobrowse — closest architectural sibling: iterates a browser-agent strategy until convergence, then graduates the winning approach into a markdown
SKILL.md. Same “harness extracts a reusable recipe from a successful run” pattern, different domain (browser agents vs general task agents) - Agent Workflow Patterns — Anthropic’s sequential/parallel/evaluator-optimizer taxonomy; Reflexio sits outside the agent loop, so it’s orthogonal to which workflow shape the agent runs
- Claude Managed Agents — Anthropic’s hosted alternative to “long-running autonomous agent with memory.” Reflexio is the BYO-harness option that lives next to any agent (including Managed Agents) rather than replacing the runtime
- Memory & Dreaming — Self-Learning Agents — the Anthropic thesis Reflexio is a community implementation of; the architectural debate is “memory in the agent” (Hermes, Anthropic) vs “memory beside the agent” (Reflexio)
- Alex Albert — Inside How Anthropic Is Building the Next Claude — describes dreaming-as-overnight-memory-pruning; same problem space, in-model approach instead of external harness
- Skill Systems — Orchestrator + Child Pattern — community Claude-as-manager pattern; pairs naturally with Reflexio’s playbook output as the “child skill” content
- Karpathy — From Vibe Coding to Agentic Engineering — the broader framing for why agent self-improvement is a hot pattern in 2026
Open Questions
- Who is Yi Lu? ^[inferred] The GitHub identity is 5 weeks old with no metadata; the repo has no second engineer. Stealth founder? Anthropic / Nous engineer using an alt? Public information at ingest gives no answer. Worth knowing before depending on the project.
- What’s the difference between Reflexio’s playbook extractor and DSPy / GEPA? The repo mentions a “playbook-optimizer” with GEPA — exact relationship between Reflexio’s extractor pipeline and Stanford’s Genetic-Pareto algorithm for prompt optimization is unclear from the README alone. Code inspection or docs read needed.
- Does the −81% number reproduce on a fresh box? Reproduction recipe is documented; nobody outside the author has confirmed numbers publicly yet (as of 2026-05-18). Cross-host −50% / −57% is the conservative read until a second party reproduces.
- How does the system behave when the user-correction is wrong? The benchmark assumes corrections are ground truth (“never deploy production to us-east-1” example). Real users sometimes correct an agent with worse advice — does Reflexio have any defense against learning incorrect corrections, or does it just absorb them?
- Is the
expert_contentworkflow load-bearing in production deployments, or is it aspirational? The discussion section doesn’t surface any benchmark of expert-mode specifically. The most defensible transfer-learning claim sits in this mode, but the GDPVal benchmark doesn’t use it.