Inside Quality Gates: How the Staff Actually Works

March 23, 2026

The introduction told you who works here. This is what they do all day.

The Problem, Precisely

Before we get into the system, it's worth being specific about the problem. Not "AI code quality is inconsistent" - everyone says that. The specific version of it we kept running into.

We build software with autonomous AI agents. An agent takes a task, implements it, commits the changes, and moves on. No human in the loop between "started" and "done." That's the whole point - it's faster, it's scalable, it compounds.

But we kept seeing the same failure pattern: the agents made the same mistakes across different projects.

Not random mistakes. The same ones. A method called on the wrong object. A missing index on a UUID column in a database migration. Security middleware skipped on a protected route. Each time: 20-30 minutes of debugging. Each time: something we'd already solved before.

The root cause is simple. AI agents have no memory between tasks. Each one starts fresh. There's no mechanism for "we fixed this before, don't do it again." There's no quality gate between implementation and completion. There's no one checking the work.

We needed institutional memory. We needed enforcement. We needed someone to check.

That's Quality Gates.

How a Task Moves Through the System

Before introducing each staff member individually, it helps to see the flow.

A task arrives. The Specialist reviews it and routes it to the right agent. The agent implements. The Teacher's lessons are already embedded in the task description - known gotchas, patterns to avoid, standards to follow. When the agent finishes, it requests a Judge review. The Judge evaluates against the original acceptance criteria and returns a verdict. If security is relevant, the Sheriff audits independently. Only after both sign off can the task be marked complete.

That last part is enforced at the system level. There is no "mark complete" without an approved review. It's not a guideline. It's a hard block.

The Secretary holds the project context across all of it - what's been built, why decisions were made, how each new task fits the whole. She's the briefing layer before every review.

That's the loop. Now let's go deeper.

The Teacher: Mistakes Made Once

This is the one that changes the compounding math.

When an agent hits a bug and fixes it, The Teacher records the learning. Structured, specific, actionable:
  • What component was affected (backend, frontend, database, testing)
  • What technology was involved (Laravel, MCP, PostgreSQL)
  • What went wrong - the specific issue, not a vague description
  • How it was fixed - the exact solution applied
  • How to prevent it - forward-looking guidance for future tasks
  • How severe it was - low, medium, high, critical
The next time any task touches that component and technology, The Teacher injects those lessons directly into the task description before the agent ever sees it. The injected description looks like this:

Create IngestController to handle POST /mobile/ingest requests.

Known Gotchas:
  • MCP Request objects use ->get(), ->string() NOT ->input()
  • Always add unique indexes on UUID columns
  • HMAC middleware must be applied to /mobile/* routes
The agent reads that before writing a single line. The mistake doesn't happen. The counter increments: times prevented.

That counter matters. It's how you know which learnings are pulling their weight. A learning that's prevented the same mistake eleven times across four projects is institutional knowledge. A learning that's never triggered is noise worth reviewing.
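Here's a minimal sketch of the record-and-inject loop described above - an illustration, not the actual Quality Gates code. The names (`Learning`, `inject_lessons`) and the matching rule are ours for this example:

```python
from dataclasses import dataclass

@dataclass
class Learning:
    """One Teacher entry; fields mirror the list above (names are illustrative)."""
    component: str         # e.g. "backend", "database"
    technology: str        # e.g. "Laravel", "PostgreSQL"
    went_wrong: str        # the specific issue
    fix: str               # the exact solution applied
    prevention: str        # forward-looking guidance for future tasks
    severity: str          # "low" | "medium" | "high" | "critical"
    times_prevented: int = 0

def inject_lessons(description: str, component: str, technology: str,
                   learnings: list[Learning]) -> str:
    """Find lessons matching this task's component and technology, append
    them as Known Gotchas, and bump each lesson's prevention counter."""
    matches = [l for l in learnings
               if l.component == component and l.technology == technology]
    if not matches:
        return description
    gotchas = "\n".join(f"  - {l.prevention}" for l in matches)
    for l in matches:
        l.times_prevented += 1
    return f"{description}\n\nKnown Gotchas:\n{gotchas}"
```

The counter increment happens at injection time, which is what makes "times prevented" cheap to track: every time a lesson rides along on a task description, it counts.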

What this changes: The system gets smarter with every project. Not gradually - compounding. The tenth project benefits from everything the first nine learned. That's not how AI-assisted development usually works. Usually every project starts from zero.

In 2026, the standard solutions are still AGENT.md files, memory files, context compaction - all of them just text for the agent to read at the start of a session. Better than nothing. But they're static. They don't update when a bug gets fixed mid-project. They don't know what the third task taught the agent that the first task didn't. The Teacher isn't a file. It's a living record that grows with every mistake made and every mistake prevented.

The Judge: No Approval, No Completion

The Judge is the enforcement layer.

When an agent completes implementation, it submits for review: the git diff, the test results, the original acceptance criteria, implementation notes. The Judge, running on Claude, evaluates the full picture.

Four things get assessed:

Acceptance criteria: Does the implementation satisfy every requirement? Not most of them. Every one.

Code quality: Does it follow standards? Proper error handling? Documentation where it matters?

Technical correctness: Right patterns, right database connections, right security approach?

Completeness: All required files created? Tests comprehensive?

The verdict comes back as one of three:
  • Approve - All criteria met. Safe to build on.
  • Needs Work - Most criteria met, fixable issues identified. Detailed feedback provided.
  • Reject - Fundamental approach wrong. Start over, here's why.
Every verdict includes feedback. Not just a decision but also an explanation. That feedback is what the agent learns from on the next attempt.
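The verdict-plus-mandatory-feedback structure can be sketched like this - a toy version under invented names (`Verdict`, `Review`), not the Judge's real data model:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    APPROVE = "approve"        # all criteria met - safe to build on
    NEEDS_WORK = "needs_work"  # fixable issues, detailed feedback provided
    REJECT = "reject"          # fundamental approach wrong - start over

@dataclass(frozen=True)
class Review:
    """A Judge review. Feedback is required for every verdict, not just failures."""
    verdict: Verdict
    feedback: str

    def __post_init__(self):
        # Enforce "every verdict includes feedback" at construction time.
        if not self.feedback.strip():
            raise ValueError("every verdict must include feedback")
```

Making feedback a constructor-level requirement rather than a convention is the same move as the completion hard block: the rule lives in the structure, so it can't be skipped.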

The result we didn't expect: Day one of running Quality Gates on a real project, 70% of code passed Judge review. Day two? 100%. The agents were adapting to the feedback in real time. We didn't design that behavior. We just enforced standards consistently and got out of the way.

The Sheriff: Security Isn't Optional

The Sheriff runs independently of the Judge. Different concern, different authority.

Where the Judge asks "is this correct," the Sheriff asks "is this safe." Authentication present and properly configured. Input validation in place. No sensitive data exposed. No injection vulnerabilities. The Sheriff carries a rulebook - our specific security standards - and checks implementation against it before signing off.
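A rulebook of this kind reduces to named predicates over a change set. The sketch below is invented for illustration - the rule names, the `change` shape, and the checks are ours, not the Sheriff's actual rulebook:

```python
# Hypothetical rulebook: each rule is a named predicate over a change summary.
RULEBOOK = {
    "auth_configured": lambda change: change.get("auth") == "hmac",
    "input_validated": lambda change: change.get("validated", False),
    "no_secrets_exposed": lambda change: not change.get("logs_secrets", False),
}

def sheriff_audit(change: dict) -> list[str]:
    """Return the name of every rule the change violates; empty means sign-off."""
    return [name for name, check in RULEBOOK.items() if not check(change)]
```

The useful property is that the output is a list of violated rules, not a boolean - the agent gets told exactly which standard it missed.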

Two separate approvals. Neither optional.

The reason they're separate is simple: code can be functionally correct and security-broken at the same time. A feature that works perfectly but skips HMAC authentication on a protected endpoint is both. The Judge might approve it. The Sheriff won't.

Separating the concerns means neither one covers for the other.

The Specialist: Right Model, Right Task

Not all tasks are the same. A complex architectural decision requires different capabilities than a routine migration. Running Claude on everything is expensive and unnecessary. Running a cheaper model on something that requires deep reasoning is a different kind of waste.

The Specialist reviews incoming tasks and routes them. The criteria: task complexity, required capabilities, agent availability, cost. Simple, well-defined tasks go to capable but cost-efficient models. Tasks requiring architectural judgment or complex logic go to the heavy hitters.

This isn't just cost optimization - it's quality optimization. The right model for the task means better output, not just cheaper output. Both at once.
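The routing decision boils down to a small heuristic over task traits. This is a sketch of the idea, not the Specialist's actual logic - the tier names and trait fields are placeholders:

```python
def route(task: dict, budget_sensitive: bool = True) -> str:
    """Pick a model tier from task traits; tier names are placeholders."""
    complex_task = task.get("complexity", "low") in ("high", "critical")
    needs_reasoning = task.get("architectural", False)
    if complex_task or needs_reasoning:
        return "heavy-tier"      # deep-reasoning models, regardless of cost
    if budget_sensitive:
        return "efficient-tier"  # capable but cost-efficient models
    return "default-tier"
```

In practice the real criteria also include agent availability, which a sketch like this leaves out.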

The Secretary: The Piece Still Being Built

The Secretary is the connective tissue we don't fully have yet.

Every other staff member operates at the task level - review this implementation, record this learning, route this assignment. The Secretary operates at the project level. She holds what's been built, why decisions were made, how the current task fits into the whole scope.

Before the Judge reviews a task, the Secretary provides a briefing: here's what this is for, here's how it connects to the three tasks before it, here's the decision that led to this implementation approach. That context changes what "correct" means.

Before the Teacher records a learning, the Secretary provides project scope: is this a global lesson or a project-specific edge case? The distinction matters.
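One possible heuristic for that global-versus-project-specific call - purely a sketch of how we're thinking about it, not shipped Secretary logic; `ScopedLearning` is an invented name:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScopedLearning:
    """Hypothetical wrapper pairing a lesson with the projects it was seen in."""
    prevention: str
    projects_seen_in: frozenset[str]

    @property
    def scope(self) -> str:
        # Observed in more than one project -> treat as a global lesson;
        # otherwise keep it project-scoped until it recurs elsewhere.
        return "global" if len(self.projects_seen_in) > 1 else "project"
```

A lesson starts project-scoped and earns global status by recurring - which keeps one project's quirks from polluting every other project's task descriptions.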

The Secretary is being built now. She's the last piece before Quality Gates is fully closed-loop.

The Warden: This Is Still On You

The Warden dashboard shows everything in real time. Tasks in progress. Review verdicts. Security audit results. Teacher learnings accumulated. Agent assignments. The full picture of what's happening inside Quality Gates.

The Warden is us. It's always us.

This matters enough to say directly: Quality Gates enforces standards. It catches mistakes. It makes autonomous development more reliable. It does not remove human accountability from the equation.

The human who sets the standards is responsible for whether those standards are right. The human who reviews the Warden dashboard is responsible for what they let pass. The human who deploys the output is responsible for what ships.

When the staff disagrees with an agent repeatedly - a task blocked three times because a required package was never installed, or a third-party service is down - the Warden can step in. Review what the Judge said. Resubmit with new information, or mark it complete with a note explaining why review was bypassed. The override exists because the system is designed to serve the project, not to protect its own authority.

The staff does their jobs. The Warden owns the outcome.

What the Data Actually Looks Like

This is not theoretical. Quality Gates has run on real projects. Here's what we observed.

The Teacher currently holds learnings across multiple components and technologies - each one captured from a real mistake, each one with a prevention counter showing how many times it's been applied since. The compound effect is already visible: projects that run later in the sequence encounter fewer raw bugs than projects that ran earlier, because the curriculum has grown.

The Judge processed roughly 50 task reviews in a single day on one project. 70% approved on first submission. 20% needed revisions. 10% were rejected outright. The next day: 100% approval. One day of consistent feedback changed agent behavior measurably.

The Sheriff catches fewer issues than the Judge. That's either because agents are reasonably good at security patterns, or because our security standards need to get more aggressive. Probably both. We're making the Sheriff's rulebook harder.

The Specialist has reduced model costs on projects where it's fully active. The exact reduction depends on task mix - projects heavy in routine implementation tasks see the biggest savings.

What We'd Do Differently

We built Teacher and Judge first, then Sheriff, then Specialist. Sheriff should have been planned from day one. Security as an afterthought, even a well-executed afterthought, is a habit worth breaking.

We'd also add agent performance tracking to the Specialist earlier. Right now it assigns well. It doesn't yet measure whether its assignments are producing quality results over time. That feedback loop would make it smarter.

The Teacher's learnings are currently global - a lesson learned on one project applies to all future projects, because most gotchas are technology-specific rather than project-specific. That was the right call. But as the knowledge base grows, we'll need confidence scoring: not all learnings are equally reliable, and the ones that fire most often should carry more weight than ones captured once.

If This Sounds Like Something You Need

If you're doing AI-assisted development and running into inconsistent quality, repeated mistakes, or no standards layer between implementation and deployment - this is built for that problem.

The next article introduces The Secretary properly. Her role is the most technically interesting piece of the whole system, and she changes what's possible with the rest of the staff.

If you have questions about whether AI is a good fit for your specific problem, reach out. We'll be happy to have a conversation and help you determine whether AI can be beneficial for you.