Don't sacrifice your pilot customers in the age of AI

Building a SaaS (Software as a Service) with multiple services from scratch takes a lot of blood, sweat, and tears. Even in an age where LLMs (Large Language Models; the technology behind Claude, ChatGPT, and DeepSeek) write most of the code, you’re still in charge of the architecture, the reviews, making sure the test suites exist, and obviously the integration between all of it.

Then the usual thing happens. You build the app, test a few paths, your product owner takes a look, everyone’s happy. You deploy. The first pilot customer tries it at 9pm and sends a message that nothing works. You check Grafana (monitoring dashboard; the flight recorder for software) and see errors in places you thought were solid. “But we tested it…” Yes. But they hit a scenario you never imagined.

As a software engineer you assume a perfect world. You test the happy path, maybe a few edge cases, call it a day. But a stressed bar manager closing at 2am doesn’t follow your happy path. They do things that make no sense to you but make perfect sense at 2am when three people called in sick and the dishwasher broke.

Most companies treat their first customers as unpaid testers. The pitch is “we’d love your feedback” but what that really means is “we’d love you to find our bugs for free.” In gastronomy this is worse than in most industries because every problem happens during service. A venue owner can’t pause dinner to file a bug report. They just think the software is half-baked, tell three other owners, and you’ve lost four customers before your second release.

We decided not to do that.

LLMs love to roleplay

LLMs are good at pretending to be someone. That’s the observation. The question was how to get one to actually use the product, not just talk about using it.

Teaching an LLM to navigate a frontend (the visual interface a person clicks through) seemed like overkill. But the backend (the server-side logic that processes requests) communicates through an API (a standardized way for software to talk to other software). Specifically, Gastova uses gRPC (a protocol for data exchange) with ConnectRPC (which lets web browsers talk to gRPC services) as a web gateway. Text in, text out. The API is already in a format an LLM can work with natively.

The missing piece was a bridge between the LLM and the services. MCP (Model Context Protocol) is an open standard for connecting LLMs to external tools: databases, APIs, file systems, anything. The MCP server exposes the real running system. Not a mock (a stand-in that imitates a service). Not a staging environment with fake data. The actual services, running on a Kubernetes cluster (a technology for managing cloud servers), with real databases and real authorization.

Because the MCP server returns a trace ID (a unique identifier that follows a request across every service it touches) with every response, every bug comes with full observability (the ability to see exactly what happened inside the system) in Grafana. No “can you reproduce that?” back and forth. Just a trace ID, open the dashboard, see exactly what happened across every service involved.

The personas

An LLM roleplaying as “a generic person” is boring. It tests the obvious paths because that’s what a generic person does. What was needed was chaos. Real gastronomy chaos.

A persona generator solves this. It takes a seed number and a venue size and deterministically creates an entire venue with staff. Not random test data; staff members with backgrounds, employment types, competencies, and most critically, personalities that drive how they interact with the system.

25 archetypes cover the spectrum of who actually works in gastronomy: the solo owner-operator running a 30-seat tavern, the apprentice in their first year, the seasonal worker who’s only here May through September, the multi-venue employee juggling two jobs, the external bookkeeper who only needs financial access. Each archetype carries realistic employment terms, availability patterns, and access requirements.

Then 20 personalities determine the chaos. A few of them:

Personality	What they do
The Flake	Calls in sick 20 minutes before a shift. No-shows. Submits the same form twice.
The Chaos Agent	Double-submits approvals. Approves the wrong things. Accidentally generates edge cases that no test plan would cover.
The Ghost	Minimal presence. Missed clock-outs that pile up until the manager has to fix them Monday morning.
The Hustler	Picks up every open shift. Risks max-hours violations because they need the money.
The Old-School	Resists the system entirely. Does everything through the manager instead.
The Stressed	Approves things at 3 AM. Overrides permissions to fix immediate problems because the alternative is a broken Friday night.
The Rookie	Learning the system. Incomplete profiles. Submits things to the wrong place.

These aren’t hypothetical. After years of talking to bar owners, the pattern is clear. Every venue has a Flake. Every venue has someone who picks up too many shifts because they need the money. Every venue has the person who just doesn’t clock out. The personas are real people compressed into behavior patterns.

The generator is seeded and deterministic. Seed 42 on “medium” always produces the same 30-person venue with the same personalities in the same roles. Same seed, same venue, same bugs. When the agent finds a bug at 2am in a simulated Friday service, the exact same scenario can be rerun the next morning to verify the fix.
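
To make the determinism concrete, here is a minimal Go sketch of a seeded generator. It is not the actual Gastova generator: GenerateVenue, Staff, and the shortened archetype and personality lists are assumptions for illustration, and only the 30-person "medium" venue comes from the text above. The idea it shows is that all randomness flows from one explicitly seeded source, so the same seed and size always yield the same staff.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Staff is one generated staff member (hypothetical shape).
type Staff struct {
	Archetype   string
	Personality string
}

// Shortened, illustrative lists; the real generator has 25 archetypes
// and 20 personalities.
var archetypes = []string{"owner-operator", "apprentice", "seasonal worker", "multi-venue employee", "external bookkeeper"}
var personalities = []string{"Flake", "Chaos Agent", "Ghost", "Hustler", "Old-School", "Stressed", "Rookie"}

// Only the medium head count is taken from the article.
var headcount = map[string]int{"medium": 30}

// GenerateVenue builds a venue's staff from a seed and a size.
// It uses a private generator, so no global state or clock leaks in.
func GenerateVenue(seed int64, size string) ([]Staff, error) {
	n, ok := headcount[size]
	if !ok {
		return nil, fmt.Errorf("unknown venue size %q", size)
	}
	rng := rand.New(rand.NewSource(seed))
	staff := make([]Staff, n)
	for i := range staff {
		staff[i] = Staff{
			Archetype:   archetypes[rng.Intn(len(archetypes))],
			Personality: personalities[rng.Intn(len(personalities))],
		}
	}
	return staff, nil
}

func main() {
	a, _ := GenerateVenue(42, "medium")
	b, _ := GenerateVenue(42, "medium")

	// Compare the two runs member by member.
	same := len(a) == len(b)
	for i := 0; same && i < len(a); i++ {
		same = a[i] == b[i]
	}
	fmt.Println(len(a), same)
}
```

Running it prints `30 true`: two independent runs with seed 42 produce the same 30 people. That property is what lets a 2am finding be replayed the next morning.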

The tools exist. The personas exist. But the agent still needs to know what to do with both. A Skill (a prompt loaded before the session starts) connects them: generate a venue with the persona generator, log in as a staff member, and interact with the system in character. The Skill handles the setup; the personality drives what happens.

Roleplay in action

Last minute sick leave for a Flake

The first session worth showing is an agent (LLM with a specific task) roleplaying as a line cook with the Flake personality at Zur Goldenen Gans. It’s a Thursday. The Flake wants the weekend off. Last minute, as usual.

The agent activates the venue, logs in as the line cook, and lists the leave types:

→ use_venue(venue_id="7a7be0a2", user_id="6bea31bc", roles="member")
← Active venue: 7a7be0a2, user: 6bea31bc, roles: member

→ ListLeaveTypes(venue_id="7a7be0a2")
← Annual Leave      requires_approval: true   track_against_allotment: true
   Compassionate     requires_approval: true   track_against_allotment: false
   Sick Leave        requires_approval: false   documentation_required: true
   Unpaid Leave      requires_approval: false

The Flake requests annual leave for Saturday and Sunday. Gets it. Immediately cancels it. Then re-requests it because he changed his mind. Classic Flake behavior; the agent is staying in character.

Then the Flake “suddenly feels ill.” He requests sick leave for today:

→ RequestLeave(leave_type_id="dda43274", start_date=2026-04-10, notes="Feeling really sick, can't come in today")
← state: LEAVE_REQUEST_STATE_PENDING
   trace_id: 50affdd88d65e532dae196e7957e9777

The roleplaying agent notices a bug right away:

POTENTIAL BUG FOUND: Sick leave has requires_approval: false in the leave type config,
yet the request was created in LEAVE_REQUEST_STATE_PENDING state instead of being
auto-approved. Sick leave is "a fact, not a request" per spec. It should not require
manager approval.

This is real. The leave type configuration says no approval required. But the code path creates the request in PENDING state regardless. In real life, a line cook who just texted “I’m sick, not coming in” isn’t going to wait for their manager to approve that fact in an app.

Manager deals with leave requests

Now the agent switches to the venue manager. Hustler personality. She sees the pending requests and tries to approve the annual leave:

→ use_venue(user_id="58f60499", roles="admin,manager,owner,member")
→ ApproveLeave(request_id="ba973cab", expected_version=1)
← ERROR code: INTERNAL
   message: "failed to write leave claims: rpc error: code = Internal
     desc = create claim: Error 1054 (42S22):
     Unknown column 'source_sick_notes' in 'INSERT INTO'"
   trace_id: 3d61c031f72dd4d6bdafef306238d757

Database schema mismatch. The code references a source_sick_notes column that doesn’t exist in the table. No leave request in this entire venue can be approved. The agent labels it correctly:

BUG FOUND, CRITICAL: This means no leave request can ever be approved
in the current deployment.

Then the manager tries to deny the Flake’s sick leave. Sick leave. The thing marked “requires_approval: false.” The thing the spec says is a fact, not a request:

→ DenyLeave(request_id="1910a542", reason="Suspicious timing.")
← state: LEAVE_REQUEST_STATE_DENIED
   trace_id: d0bcaf90628a6222aa012c0fc4bb0798

It works. A manager denied someone’s sick day. The agent’s summary:

BUG FOUND, DESIGN VIOLATION: The DenyLeave endpoint does not check the leave type's
requires_approval flag. A manager should not be able to deny sick leave.

Three bugs in two minutes. One session. One flaky line cook and one stressed manager.
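
The missing check behind the third bug is small. A minimal sketch of such a guard in Go, assuming a LeaveType struct and a CheckDeniable helper that are hypothetical names and not the real staff-service code:

```go
package main

import (
	"errors"
	"fmt"
)

// LeaveType mirrors the two fields from the ListLeaveTypes output
// that matter here (hypothetical struct).
type LeaveType struct {
	Name             string
	RequiresApproval bool
}

// ErrNotDeniable marks leave that is a fact, not a request.
var ErrNotDeniable = errors.New("leave type does not require approval and cannot be denied")

// CheckDeniable is the guard a DenyLeave handler could run before
// changing any state: only leave that needs approval can be denied.
func CheckDeniable(lt LeaveType) error {
	if !lt.RequiresApproval {
		return fmt.Errorf("%s: %w", lt.Name, ErrNotDeniable)
	}
	return nil
}

func main() {
	sick := LeaveType{Name: "Sick Leave", RequiresApproval: false}
	annual := LeaveType{Name: "Annual Leave", RequiresApproval: true}

	fmt.Println(CheckDeniable(sick))   // rejected with an error
	fmt.Println(CheckDeniable(annual)) // <nil>, denial may proceed
}
```

Wrapping a sentinel error with `%w` lets a caller map it to a specific status code with `errors.Is`; which gRPC code fits best is a design decision the sketch leaves open.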

End of contract for a seasonal worker

The second session is different. Not a short burst of targeted testing but the tail end of a 12-week simulation. The agent has been running autonomously, generating 84 daily schedules across bar, kitchen, and service stations. Processing leave requests. Verifying that the scheduling solver correctly excludes staff who are on leave. Week after week, for 504 shifts.

By week 12 the simulation has a planned event: the seasonal worker’s contract ends. He’s been working the summer rush and August 4th is his last day. A real venue needs to record this in the system so the scheduler stops assigning him shifts, his profile reflects the termination, and payroll knows when to stop.

The agent, running as the venue manager, tries to process the termination:

→ staff_profile(action="change_employment_state",
    staff_profile_id="76437e97", employment_state="TERMINATED",
    user_id="bfb60c1f")
← ERROR invalid_argument: termination_date is required
     when transitioning to TERMINATED
   trace_id: 6b8b33ac890fff5611fb4f1ba5a7921c

The server is right. You can’t terminate someone without recording when the termination takes effect. But the agent can’t comply because the MCP tool schema for change_employment_state doesn’t have a termination_date field. It has hire_year, hire_month, hire_day for onboarding. Nothing for the other direction. The agent inspects the tool schema, confirms the field genuinely doesn’t exist, and identifies the root cause:

The schema doesn't expose a termination_date parameter — but the server
requires it. This is a bug: the MCP tool schema for change_employment_state
is missing the termination_date field that the server validates.

The trace 6b8b33ac890fff5611fb4f1ba5a7921c tells the full story. The request enters the MCP server, gets translated into a gRPC call to staff-service, hits the validation check, and returns invalid_argument. Every hop is visible.

Grafana renders the trace as a waterfall; each row is one service the request passed through on its way to the rejection:

[Figure: Grafana trace showing MCP → staff-service → validation failure]

Then the agent files the bug itself. Not a vague “something broke.” A structured GitHub issue (a ticket in the project’s bug tracker) with the reproduction steps, both trace IDs, the staff profile ID, the venue ID, the timestamps, the root cause analysis, and a pointer to the tool schema definition that’s missing the field:

→ gh-gastova issue create Gastova/staff-service
    --title "change_employment_state: MCP tool schema missing termination_date field"
    --label bug --label staff
← Created: Gastova/staff-service#290

The issue body includes the exact tool call, the exact error, the exact trace IDs, and the fix: add termination_year / termination_month / termination_day to the MCP tool schema, analogous to the hire date fields that already exist. The agent figured out the pattern from the existing schema and proposed the right shape for the missing fields. The seasonal worker stays employed in the system because the interface literally can’t express the concept of firing him.
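
A rough sketch of what that proposed shape could look like in Go. The struct and function names are assumptions, not the real MCP server code; only the field pattern (year, month, day, analogous to the hire fields) and the server's error message come from the session above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ChangeEmploymentStateArgs is a hypothetical tool-argument struct.
type ChangeEmploymentStateArgs struct {
	StaffProfileID  string
	EmploymentState string
	// Proposed fields, mirroring hire_year / hire_month / hire_day.
	TerminationYear  int
	TerminationMonth int
	TerminationDay   int
}

// TerminationDate returns the date and whether any part of it was supplied.
func (a ChangeEmploymentStateArgs) TerminationDate() (time.Time, bool) {
	if a.TerminationYear == 0 && a.TerminationMonth == 0 && a.TerminationDay == 0 {
		return time.Time{}, false
	}
	return time.Date(a.TerminationYear, time.Month(a.TerminationMonth),
		a.TerminationDay, 0, 0, 0, 0, time.UTC), true
}

// Validate mirrors the server rule: TERMINATED needs a termination date.
func Validate(a ChangeEmploymentStateArgs) error {
	if a.EmploymentState != "TERMINATED" {
		return nil
	}
	d, ok := a.TerminationDate()
	if !ok {
		return errors.New("termination_date is required when transitioning to TERMINATED")
	}
	// time.Date normalizes out-of-range input (month 13 rolls into the
	// next year), so reject anything that does not round-trip.
	if d.Year() != a.TerminationYear || int(d.Month()) != a.TerminationMonth || d.Day() != a.TerminationDay {
		return errors.New("termination date is not a valid calendar date")
	}
	return nil
}

func main() {
	missing := ChangeEmploymentStateArgs{StaffProfileID: "76437e97", EmploymentState: "TERMINATED"}
	fmt.Println(Validate(missing))

	// 4 August is the last day from the simulation; the year is an assumption.
	dated := missing
	dated.TerminationYear, dated.TerminationMonth, dated.TerminationDay = 2026, 8, 4
	fmt.Println(Validate(dated))
}
```

The first call reproduces the rejection the agent ran into; the second prints `<nil>` because the interface can now express the date the server asks for.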

The gap isn’t in any single service; it’s in the interface between them. Nobody tested the full path from tool call to database write because that path only gets exercised when someone actually tries to fire someone. The roleplay doesn’t just find product bugs. It catches gaps in the entire integration chain.

What these sessions prove

The leave session found three bugs in a two-minute interaction. The 12-week simulation found one more. Four bugs total that unit tests (tests that check one piece of code in isolation) can’t catch because they only appear when multiple services interact under realistic conditions.

A unit test for the leave approval logic passes because the logic is correct. The bug is a missing database column on the claims table; a deployment problem, not a logic problem. You don’t find that by running go test. You find it when an agent pretending to be a stressed manager approves a leave request and the system fails trying to write the claim.

A unit test for sick leave state transitions passes because the state machine works. The bug is that the RequestLeave endpoint doesn’t check the leave type’s requires_approval flag when setting the initial state. The state machine is fine. The input to the state machine is wrong. You don’t find that in isolation. You find it when an agent playing a flaky line cook lies about being ill and the system puts the request in PENDING instead of auto-approving it.
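
In code, the missing decision is a single branch. A minimal sketch, assuming hypothetical Go names; LEAVE_REQUEST_STATE_APPROVED in particular is an assumed enum value, since the sessions only show PENDING and DENIED:

```go
package main

import "fmt"

// LeaveRequestState uses the enum naming seen in the tool output.
type LeaveRequestState string

const (
	StatePending  LeaveRequestState = "LEAVE_REQUEST_STATE_PENDING"
	StateApproved LeaveRequestState = "LEAVE_REQUEST_STATE_APPROVED" // assumed name
)

// LeaveType carries the flag the RequestLeave path ignored (hypothetical struct).
type LeaveType struct {
	Name             string
	RequiresApproval bool
}

// InitialState picks the state a new request starts in. Leave that
// needs no approval skips PENDING entirely.
func InitialState(lt LeaveType) LeaveRequestState {
	if lt.RequiresApproval {
		return StatePending
	}
	return StateApproved
}

func main() {
	// The four leave types from the ListLeaveTypes output.
	types := []LeaveType{
		{"Annual Leave", true},
		{"Compassionate", true},
		{"Sick Leave", false},
		{"Unpaid Leave", false},
	}
	for _, lt := range types {
		fmt.Printf("%-14s %s\n", lt.Name, InitialState(lt))
	}
}
```

The state machine itself is untouched; only its input changes, which is why a test of the transitions alone would never have flagged the bug.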

The termination bug is different from both. The server-side code is correct. The MCP server is correct. Both pass their own unit tests. The gap is in the interface between them; no unit test covers that path because no unit test exercises the full chain from tool call to database write.

From roleplay to permanent tests

The findings don’t stay in an issue tracker. Every bug gets a regression test (a test written specifically to catch that exact bug if it ever comes back) and every regression test gets mutation-tested.

Mutation testing takes your codebase and systematically changes it: deleting lines, flipping conditions, swapping operators. For each change, it runs the test suite. If a test catches the change, good. If all tests still pass after a line was deleted or a condition was inverted, that change is a survivor: the line either doesn’t matter or your tests don’t cover it.

Staff-service: 4,869 mutations tested. Zero survivors. Task-service: 1,206 mutations tested. Zero survivors.

This is a higher bar than code coverage (a metric that measures how many lines were executed during testing). 100% code coverage means every line ran during a test. It says nothing about whether the test actually checked what that line did. Mutation testing closes that gap. If you delete a line and no test notices, your coverage number was lying to you.
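
A toy illustration of that gap, written by hand in Go and not taken from the actual services or any mutation-testing tool. The max-hours rule and all function names are hypothetical. One check merely executes the code, which is all coverage measures; the other asserts on the boundary.

```go
package main

import "fmt"

// exceedsMaxHours is the original rule: only hours above the cap count.
func exceedsMaxHours(worked, max int) bool { return worked > max }

// exceedsMaxHoursMutant is what a mutation tool might generate:
// the operator swapped from > to >=.
func exceedsMaxHoursMutant(worked, max int) bool { return worked >= max }

// weakTest runs the function but never inspects the result.
// It gives full line coverage and can never fail.
func weakTest(f func(int, int) bool) bool {
	f(40, 40)
	return true
}

// strongTest pins the behavior on both sides of the boundary.
func strongTest(f func(int, int) bool) bool {
	return !f(39, 40) && !f(40, 40) && f(41, 40)
}

func main() {
	fmt.Println("weak test, original:", weakTest(exceedsMaxHours))
	fmt.Println("weak test, mutant:", weakTest(exceedsMaxHoursMutant))
	fmt.Println("strong test, original:", strongTest(exceedsMaxHours))
	fmt.Println("strong test, mutant:", strongTest(exceedsMaxHoursMutant))
}
```

The weak test passes on both versions, so the mutant survives despite full coverage. The strong test passes on the original and fails on the mutant, which is what "zero survivors" demands of every generated change.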

The marketing page says “comprehensive leave management.” That claim is now backed by a regression test that fails if anyone breaks the manager-side approval path or the sick leave auto-approve flow. If a future change reintroduces any of those bugs, CI (continuous integration; the automated build and test system) rejects it before it reaches any branch.

The cost of not doing this

The token cost of 13 roleplay sessions is a rounding error compared to what one churned pilot customer costs in credibility. Gastronomy runs on word of mouth. A venue owner who tells three other venue owners “the software didn’t work on our first Friday” is damage you can’t undo with a hotfix. You can’t undo it with a follow-up email either.

The transparency argument matters just as much. Yes, an LLM built a lot of this product. But the same technology is also the harshest QA process the product has ever seen. The flaky line cook lying about being sick. The manager denying someone’s sick day because the timing looks suspicious. The 12-week simulation that tries to offboard a seasonal worker and discovers the interface literally can’t express the concept of a termination date. Those are the bugs that would have been a real venue’s Tuesday morning headache. The roleplay caught them before any customer had to.

The roleplaying agent doesn’t need weekends, doesn’t get tired of testing, and genuinely enjoys pretending to be a chaos-agent bartender who double-submits everything at 11pm on a Friday.