Source: From one person to 80 — Scaling a hypergrowth engineering org with Claude Code (Code with Claude 2026, May 7 2026, customer talk — Yav (Product) + Gabriel (AI Lead) from Base44)

Yav and Gabriel from Base44 — the vibe-coding platform Wix acquired in 2025 — walk through how the engineering org scaled from solo founder (Maor) to 80 engineers in three phases, with Claude Code as the lever that kept onboarding, code review, eval, and QA from collapsing under headcount growth. The thesis throughout: keep processes radically simple, use past actions to encode taste, dogfood your own product, and let the bottleneck move forward rather than building heavyweight machinery up front. The talk reads as a counterweight to the “build an eval framework from day one” reflex — Base44 deliberately deferred evals until the headcount and traffic justified them, then went all-in with a user-simulator suite built on Stagehand.

Key Takeaways

  • Three-phase scaling timeline. Phase 1: solo founder Maor builds Base44 end of 2024, profitable by April 2025, building-in-public on LinkedIn/Twitter, two-person team. Phase 2 (post-Wix acquisition): 2 → 15 engineers, four challenges (onboarding doesn’t scale, code review doesn’t scale, can’t sit with each beta tester, broad product surface to cover). Phase 3 (recent merge of another vibe-coding product): 40 → 80 engineers overnight, new challenges around experimentation at scale, evals, and QA.
  • Onboarding via two prompts, not docs. Instead of building maintained onboarding docs, every new engineer runs two prompts in Claude Code: (1) “Go over all the commits and tell me what everyone cares about” — produces a live org-map by area; (2) “Give me a mermaid chart of how this component works” — produces real-time architecture diagrams. The key insight: docs go stale, prompts don’t. The map regenerates against current state every time.
  • WhatsApp integration anecdote. A new engineer onboarded on Thursday using just those two prompts. By Sunday morning, the WhatsApp/Meta API integration (assumed to be a 1-2 week effort touching the agentic flow, a new API, and the integration itself) was complete and in a PR, with two or three small comments from the Claude PR reviewer, and ready for production.
  • PR review via distilled comment history. Maor was very cautious about backend/agent code. Couldn’t scale his manual review across 15 engineers. Solution: after 1-2 weeks of accumulated PR comments from Maor, ran Claude over the comment pool to extract the most important review rules → put those rules in a Claude-driven PR-review instruction → ran it every couple days. Past actions become the taste-encoding mechanism — not a guideline committee.
  • The “frustration metric” replaced the eval suite for the first 15-engineer era. Naive instinct = build an eval suite. Real solution = look at production conversation traffic: when things work, users go quiet (move to next feature); when things break, users get loud (“Why is this broken? I can’t believe it’s not working.”). Used a simple LLM-as-judge IQ-model classifier to score message frustration level. Every new agent version gets rolled to a small % of users; frustration-level delta is the leading indicator across prompt changes, model changes, and infrastructure changes. Skipped eval-suite buildout entirely until phase 3 justified it.
  • AB-test guidelines distilled from PostHog history. Phase 3 needed PR-time experimentation verdicts (just ship / gradual rollout / AB test / how long / which KPIs). Same past-actions-encode-taste pattern: Claude Code hooked to PostHog MCP, ingested the last 100 experiments + matching PRs, produced the first iteration of guidelines. Hours later: per-PR Claude verdict on whether to AB test, with KPI list and duration. Combined with Base44-built dashboard (dogfooded) connecting BigQuery → PostHog → GitHub → AI cost telemetry.
  • Eval suite as user-simulator on Stagehand. When evals finally became the right ROI in phase 3, the team built a CI/CD pipeline where every AI code change spins up a real Base44 app instance + uses Stagehand to simulate real user actions. The key epiphany: rejection is not eval failure — when a user’s first prompt produces a partially-broken app, the eval should pipe the rejection back to the agent and assert it can fix it. Metrics tracked: latency, turn count, cost, credit burn. Canonical eval = “build me a hello world app” (smoke test for “did we break anything”); scales up to multi-change scenarios + compaction-mechanism tests.
  • QA via Claude Code + Playwright MCP + abstraction skills. Phase 3 QA gap: testing deep edge cases (subscription tier + specific credit limit + feature permutation) is tedious for humans, and waiting for a QA engineer is expensive. The team wrapped common flows in skills so Claude doesn’t re-learn the platform every time, and built CLI tools that abstract the APIs and database specifically for test setup (for example, overriding the subscription tier directly in the DB to reach the case quickly, which is what a good QA engineer would do manually). A meta-skill ties them all together. On PR open: the agent triggers, creates a test plan in a Base44 app, runs it, and reports back with screenshots + capability gaps. The skill abstractions cover 80%; the other 20% is surfaced as “I couldn’t test that”, an explicit boundary instead of a false-positive pass.
  • Four cross-cutting principles named at the close. (1) Bold and simple — work very hard not to build complex things when it’s not the right time; defer eval suite buildout until justified. (2) Encode taste from past actions — guidelines committee replaced by Claude reading the last week of decisions. (3) Dogfood your product — Base44 built the experimentation dashboard and QA workspace IN Base44; pairs with Anthropic’s stated practice of building Claude Code in Claude Code. (4) The bottleneck keeps moving — current frontier: post-validation (did the bug fix actually reduce support tickets? did the feature actually get used?); automate post-merge business-metric verification too.
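The QA takeaway above depends on keeping "could not test" separate from "passed". A minimal sketch of that three-state report follows, assuming hypothetical names throughout (`Outcome`, `CaseResult`, `summarize`, and the sample cases are illustrative and not from the talk):

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"
    FAILED = "failed"
    UNTESTED = "untested"  # the agent lacked a skill or tool for this case


@dataclass(frozen=True)
class CaseResult:
    name: str
    outcome: Outcome
    note: str = ""


def summarize(results: list[CaseResult]) -> dict:
    """Summarize a QA run; untested cases are reported as gaps, never as passes."""
    tested = [r for r in results if r.outcome is not Outcome.UNTESTED]
    return {
        "coverage": len(tested) / len(results) if results else 0.0,
        "failed": [r.name for r in results if r.outcome is Outcome.FAILED],
        "gaps": [f"{r.name}: {r.note}" for r in results if r.outcome is Outcome.UNTESTED],
        "all_tested_passed": bool(tested)
        and all(r.outcome is Outcome.PASSED for r in tested),
    }


# Hypothetical run: four cases exercised, one the agent could not reach.
SAMPLE_RUN = [
    CaseResult("free tier, credits exhausted", Outcome.PASSED),
    CaseResult("pro tier, credit limit raised", Outcome.PASSED),
    CaseResult("tier downgrade mid-session", Outcome.PASSED),
    CaseResult("export with feature flag on", Outcome.PASSED),
    CaseResult("payment provider webhook", Outcome.UNTESTED, "no CLI tool for webhooks"),
]

if __name__ == "__main__":
    print(summarize(SAMPLE_RUN))
```

Reporting `gaps` as its own field means a reviewer sees exactly which cases still need a human, rather than reading a green check that silently skipped them.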

Implementation

Tool/Service: Claude Code in a 15→80-engineer SaaS engineering org.

Setup (the six playbooks):

  1. Onboarding prompts (run in Claude Code from the repo root):

    • “Go over all the commits in the last N months and tell me what everyone in this org cares about by area.”
    • “Give me a mermaid chart of how <component> works.”
  2. Distill PR-review rules from comment history:

    • After 1-2 weeks of human comments accumulating, ask Claude: “Read every PR comment in this repo. Distill the most important review rules into a prompt I can run as an automated reviewer on new PRs.”
    • Wire the resulting prompt into a CI hook or /ultrareview-style scheduled check.
  3. Frustration-metric LLM-as-judge (replaces an eval suite at small scale):

    • Pull production conversation logs into a daily job.
    • Classify each user message: “high frustration | low frustration” via an IQ model (Haiku-class).
    • Roll new agent versions to N% of users → diff per-version frustration rate → ship/rollback.
  4. PostHog-MCP-distilled AB-test guidelines:

    • Connect Claude Code to PostHog MCP.
    • Prompt: “Read the last 100 experiments and matching PRs. Tell me when we just ship vs gradual-rollout vs AB test, how long we run, which KPIs we track.”
    • Wire output into a GitHub PR-comment bot that gives per-PR verdicts.
  5. Eval-suite-as-user-simulator (phase 3 scale only):

    • Real Base44 app spun up per CI run.
    • Stagehand drives real user actions through the actual UI.
    • Metric set: latency, turns, cost, credits. Reject ≠ fail — assert the agent recovers.
    • Canonical “hello world app” eval as the smoke test; scenarios for compaction mechanism + multi-edit flows on top.
  6. QA-on-PR via skills + CLI test-setup tools:

    • Wrap common Base44 user flows in Claude skills.
    • Build CLI tools that abstract API/DB writes specifically for test-environment setup (subscription tier overrides, credit-limit overrides, etc.).
    • Meta-skill that orchestrates: PR opens → agent generates test plan → spins up Base44 app → runs Playwright/browser tasks → reports back with screenshots + explicit “I couldn’t test X” gaps.
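Playbook 3 comes down to a per-version rate and a delta. The sketch below shows that arithmetic only. The judge is a keyword stand-in for the LLM call so the example runs offline, and `keyword_judge`, `frustration_rates`, `frustration_delta`, and the sample log are hypothetical names and data, not Base44's implementation:

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def keyword_judge(message: str) -> str:
    """Stand-in for the LLM-as-judge call; returns 'high' or 'low'."""
    cues = ("broken", "not working", "can't believe", "why is this")
    text = message.lower()
    return "high" if any(cue in text for cue in cues) else "low"


def frustration_rates(
    messages: Iterable[Tuple[str, str]],
    judge: Callable[[str], str] = keyword_judge,
) -> Dict[str, float]:
    """messages are (agent_version, user_message) pairs.

    Returns the share of messages judged 'high' for each agent version.
    """
    high: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for version, text in messages:
        total[version] += 1
        if judge(text) == "high":
            high[version] += 1
    return {version: high[version] / total[version] for version in total}


def frustration_delta(rates: Dict[str, float], baseline: str, candidate: str) -> float:
    """Negative means the candidate version frustrates users less than the baseline."""
    return rates[candidate] - rates[baseline]


# Hypothetical conversation log: v1 is the baseline, v2 the partial rollout.
SAMPLE_LOG = [
    ("v1", "Add a login page"),
    ("v1", "Why is this broken?"),
    ("v1", "Now add a dashboard"),
    ("v1", "I can't believe it's not working"),
    ("v2", "Add a login page"),
    ("v2", "Now add a dashboard"),
    ("v2", "Make the header blue"),
    ("v2", "Why is this broken?"),
]

if __name__ == "__main__":
    rates = frustration_rates(SAMPLE_LOG)
    print(rates, frustration_delta(rates, "v1", "v2"))
```

In a real job the `judge` argument would wrap a call to a small model. Keeping it injectable lets the rate and delta logic be tested without network access.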

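Cost: Not stated explicitly. The frustration metric uses a small “IQ model” (Haiku-class is inferred, not stated). The eval suite spins up a real app instance per change, so it presumably needs a cap or sampling to stay affordable, but the talk gave no numbers or policy.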
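One way a team might sample which changes get the full simulator run is deterministic hashing of the PR number. This is purely an assumption for illustration: the talk did not describe Base44's cap or sampling policy, and `should_run_full_eval` and its parameters are hypothetical.

```python
import hashlib


def should_run_full_eval(pr_number: int, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of PRs for the full eval run.

    Hashing the PR number (instead of calling random()) means re-running CI
    on the same PR always makes the same decision.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(str(pr_number).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


if __name__ == "__main__":
    selected = [n for n in range(1, 21) if should_run_full_eval(n, 0.25)]
    print(f"PRs selected for the full eval: {selected}")
```

A cheap smoke test such as the “hello world app” eval could still run on every change, with sampling applied only to the longer scenarios.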

Integration notes:

  • Stagehand (the user-action simulator) is the load-bearing dependency for the eval suite; it lets Claude drive a real Base44 app like a real user.
  • PostHog MCP is the experimentation-platform connector; same pattern would work for any AB-testing tool that exposes an MCP (Statsig, LaunchDarkly, etc.).
  • The dashboard for centralizing experimentation + eval + cost telemetry is itself built in Base44 (dogfooding).
  • The “past actions encode taste” pattern is the cross-cutting move — applies equally to onboarding prompts, PR review rules, AB-test guidelines. Don’t sit and articulate guidelines; let Claude read what you’ve already decided.
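The “past actions encode taste” move can be reduced to a small prompt builder: collect what was actually decided, number it, and ask for the rules that explain it. The sketch below assumes hypothetical names (`PastDecision`, `build_distillation_prompt`) and a prompt wording of my own. The talk quoted the idea, not this exact text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PastDecision:
    context: str   # e.g. a PR title or an experiment name
    decision: str  # what the reviewer or team actually did or said


def build_distillation_prompt(
    topic: str, decisions: List[PastDecision], max_rules: int = 10
) -> str:
    """Build a prompt asking a model to infer guidelines from past decisions."""
    if not decisions:
        raise ValueError("need at least one past decision to distill from")
    numbered = [
        f"{i}. [{d.context}] {d.decision}"
        for i, d in enumerate(decisions, start=1)
    ]
    return (
        f"Below are {len(decisions)} past decisions about {topic}.\n"
        f"Distill at most {max_rules} guidelines that explain them. "
        "Cite the decision numbers that support each guideline.\n\n"
        + "\n".join(numbered)
    )


if __name__ == "__main__":
    # Hypothetical review comments, for illustration only.
    history = [
        PastDecision("PR: add retry to agent loop", "Asked for a max-attempts cap."),
        PastDecision("PR: new billing endpoint", "Requested an integration test."),
    ]
    print(build_distillation_prompt("code review", history))
```

Asking the model to cite decision numbers makes the distilled guidelines auditable: a rule with no supporting decisions can be dropped.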

Try It

  1. Replace your onboarding doc with two prompts. Pick your messiest area of the codebase. Run Yav’s two prompts (org-by-area from commits, mermaid for the component) and compare what comes out vs your current README. If the Claude-generated version is fresher, retire the doc and put the prompts in the README.
  2. Distill review rules from your last 100 PR comments. Take any senior engineer’s PR comment history, run it through Claude Code with “What are the most important review rules to encode as a Claude-driven PR reviewer?” — wire the output into a /ultrareview or GitHub-app trigger. Measure: comments on new PRs that match the senior engineer’s pattern.
  3. Frustration-metric your AI feature. If you ship any conversational AI surface, classify the last week of production conversations into high/low frustration. Track the rate. Run an experiment that changes one variable (prompt, model, retrieval) and measure the delta. The talk presents this as cheaper than building an eval suite for a team of roughly 15 engineers that already has live traffic to measure.
  4. Steal the PostHog-MCP-for-AB-test-guidelines pattern even if you don’t run experiments yet. The general move — “Claude, read our last N decisions about $TOPIC and tell me what our guidelines are” — works for code review, AB test design, release-cut criteria, and on-call escalation policy.
  5. For WEO Marketly’s GSC SEO engine or Blog-Agent-Worker, apply the eval-suite-as-user-simulator pattern. Spin up a real Blog-Agent-Worker run per CI change, use Playwright/Stagehand to drive the surface as a real client would (search a query, click a result, screenshot, score the response), and treat per-run cost + latency + recovery-from-rejection as the metric set. Don’t build a synthetic-eval rig — drive the real product.
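The “reject ≠ fail” rule from the user-simulator pattern can be sketched as a loop that feeds each rejection back to the agent and only fails when the turn budget runs out. The agent and checker here are in-memory fakes so the example runs without a browser. `run_recovery_eval`, `EvalResult`, `make_flaky_agent`, and `hello_world_check` are hypothetical names, and a real harness would drive the UI through Stagehand or Playwright instead.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalResult:
    passed: bool
    turns: int
    rejections: int


def run_recovery_eval(
    agent: Callable[[str], str],
    check: Callable[[str], Optional[str]],
    first_prompt: str,
    max_turns: int = 3,
) -> EvalResult:
    """Run an eval where a rejected result is piped back to the agent.

    `check` returns None when the app is acceptable, else a rejection message
    that becomes the next prompt. The eval fails only if the agent has not
    recovered within `max_turns`.
    """
    prompt = first_prompt
    rejections = 0
    for turn in range(1, max_turns + 1):
        app = agent(prompt)
        complaint = check(app)
        if complaint is None:
            return EvalResult(passed=True, turns=turn, rejections=rejections)
        rejections += 1
        prompt = complaint  # simulate the user pushing back
    return EvalResult(passed=False, turns=max_turns, rejections=rejections)


def make_flaky_agent() -> Callable[[str], str]:
    """Fake agent: first answer is partially broken, later answers are fixed."""
    calls = {"n": 0}

    def agent(prompt: str) -> str:
        calls["n"] += 1
        return "<h1>Hello</h1>" if calls["n"] == 1 else "<h1>Hello world</h1>"

    return agent


def hello_world_check(app: str) -> Optional[str]:
    if "Hello world" in app:
        return None
    return "The page does not say 'Hello world'. Please fix it."


if __name__ == "__main__":
    print(run_recovery_eval(make_flaky_agent(), hello_world_check,
                            "build me a hello world app"))
```

Turn count and rejection count fall out of the loop for free; latency and cost would be recorded around the `agent` call in a real run.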

Open Questions

  • Cost of the eval-suite-as-user-simulator at scale. No numbers stated. Per-PR real-app instance + Stagehand session has real $$ implications; Base44 didn’t share their cap or sampling policy.
  • Frustration-metric model choice. Yav says “an IQ model” but doesn’t name it; Haiku-class inferred from “simple classifier” framing. ^[inferred]
  • PostHog MCP setup specifics. Not detailed in the talk; the wider claim (“hooked up Claude Code to PostHog MCP”) leaves the auth/scoping pattern open. Worth tracking PostHog’s MCP doc separately.
  • The Wix-side integration costs. Base44 was acquired by Wix; the talk doesn’t address how engineering processes integrate with Wix’s existing org, code review, or release gates. The “we doubled overnight from 40 → 80” framing suggests a parallel-track operating model rather than absorption.
  • Stagehand maturity. Stagehand is named as the user-simulator runtime but not benchmarked vs Playwright MCP / Browser Use / Computer Use. Worth a follow-up comparing the four browser-control runtimes for AI-driven QA specifically.