
Quality Gates

Two gates guard quality. Planning Review before code. Implementation Review after code. Both produce scores, both must pass. Nothing ships without clearance.

The Two Gates

Gate 1 — Planning Review

Runs before any code is written. Evaluates whether your specification is ready for implementation.

The question it answers: “If I hand this plan to agents right now, will they have everything they need to implement it correctly?”

A specification can be detailed and still fail Planning Review if dependencies are circular, tickets lack acceptance criteria, or entire functional areas are missing from the decomposition.

Gate 2 — Implementation Review

Runs after all tickets are completed. Evaluates whether the delivered code matches the original specification.

The question it answers: “Did the agents actually build what the specification asked for, to the quality standard we defined?”

A specification can have all tickets marked done and still fail Implementation Review if acceptance criteria weren’t met, files weren’t delivered, or test coverage is missing.

Scoring Dimensions

Each gate evaluates multiple dimensions and produces a score per dimension. The overall score is a weighted average. The default passing threshold is 80.

Planning Review Dimensions

Completeness — Does every epic have tickets? Does every ticket have a title, description, steps, and acceptance criteria? Are there empty epics or stub tickets with no actionable content?

  • Score 90: All tickets have steps and acceptance criteria. One ticket is missing a description but has enough context in its title and steps.
  • Score 60: Multiple tickets are stubs — title only, no steps, no acceptance criteria. Agents would have to guess what to build.

Dependencies — Is the dependency graph valid? Are there circular references? Do all cross-epic links resolve correctly? Are there orphaned tickets that should have dependencies but don’t?

  • Score 95: Clean DAG, no cycles, all cross-epic references resolve. Critical path is computable.
  • Score 50: Circular dependency detected between two tickets. Three tickets reference dependencies that don’t exist.
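The cycle check behind the Dependencies dimension amounts to a standard depth-first search over the ticket graph. A minimal sketch — this is an illustration of the technique, not SpecForge’s actual implementation, and the ticket IDs are hypothetical:

```python
def has_cycle(graph):
    """Return True if the ticket dependency graph contains a cycle.

    graph maps each ticket ID to the list of ticket IDs it depends on.
    """
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False          # already fully explored, no cycle through here
        if node in visiting:
            return True           # back edge: we re-entered a ticket on the current path
        visiting.add(node)
        for dep in graph.get(node, []):
            if visit(dep):
                return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(visit(node) for node in graph)

clean  = {"AUTH-1": [], "AUTH-2": ["AUTH-1"], "AUTH-3": ["AUTH-2"]}
cyclic = {"AUTH-1": ["AUTH-2"], "AUTH-2": ["AUTH-1"]}
print(has_cycle(clean))   # False
print(has_cycle(cyclic))  # True
```

A dependency pointing at an ID missing from the graph would surface here as a lookup against an unknown node — the “reference that doesn’t exist” finding in the score-50 example above.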

Coverage — Does the decomposition cover everything described in the specification? Are there goals or requirements mentioned in the description that no ticket addresses?

  • Score 85: All stated requirements have corresponding tickets. One edge case mentioned in the description (“account lockout after 5 failed attempts”) has no dedicated ticket but is covered as a step in another ticket.
  • Score 55: The specification mentions “rate limiting” in the description but no epic or ticket addresses it.

Ticket Quality — Are tickets atomic and implementable? Or are they vague epics disguised as tickets? Do steps describe concrete actions or abstract goals?

  • Score 90: Tickets are small, focused, with concrete implementation steps like “Create JwtService class in src/auth/jwt.service.ts.”
  • Score 45: A single ticket says “Implement the entire authentication system” with no steps. That’s an epic, not a ticket.

Acceptance Criteria — Are acceptance criteria specific and verifiable? Can an agent objectively determine whether each criterion is met? Or are they vague statements like “should work well”?

  • Score 88: Criteria like “Tokens are signed with RS256 algorithm” and “Invalid tokens return 401 with error message.”
  • Score 40: Criteria like “Authentication should be secure” and “Good user experience.”

Implementation Review Dimensions

Steps Completion — Were all implementation steps in each ticket marked as done?

Acceptance Criteria — Were all acceptance criteria in each ticket satisfied?

File Delivery — Were all expected file creations, modifications, and deletions actually performed?

Git Evidence — Are commits and/or pull requests linked to tickets? (Configurable — can be required, recommended, or disabled.)

Tests — Were test results submitted for tickets that require them? Did tests pass?

💡 Each dimension is independently toggleable in your project’s Quality Standards configuration. You can disable git evidence for prototyping or require verbose test output for production specs.

How Scoring Works

Each dimension produces a score from 0 to 100. The overall gate score is a weighted average across all active dimensions. The default passing threshold is 80, but you can configure this per project.
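The arithmetic can be sketched as follows — a minimal illustration only, with hypothetical dimension weights (SpecForge’s actual weights are not documented here):

```python
def gate_score(scores, weights):
    """Weighted average of per-dimension scores (each 0-100)."""
    total_weight = sum(weights[dim] for dim in scores)
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight

# Hypothetical planning-review result
scores  = {"completeness": 90, "dependencies": 95, "coverage": 85,
           "ticket_quality": 90, "acceptance_criteria": 88}
weights = {"completeness": 2, "dependencies": 2, "coverage": 2,
           "ticket_quality": 1, "acceptance_criteria": 1}

overall = gate_score(scores, weights)
print(overall, overall >= 80)  # 89.75 — clears the default threshold of 80
```

Note that with weighting, a single weak dimension can sink the overall score even when the others are strong — which is the intent: one missing dimension of readiness is enough to make a plan risky.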

Passed (≥ threshold): The specification advances to the next phase. Planning Review moves the spec to ready. Implementation Review moves it to reviewed.

Failed (< threshold): The gate produces a detailed feedback report. Not blind rejection — every failed dimension includes specific findings:

  • Which tickets are missing acceptance criteria
  • Which dependencies are circular or unresolved
  • Which files were expected but not delivered
  • Which acceptance criteria were not satisfied

The feedback is actionable. Fix the findings, run the review again. The spec doesn’t go back to square one — it stays in its current phase and you address the gaps.

ℹ️ maxReviewAttempts (default: 3) limits how many times a review can be retried before requiring manual intervention. This prevents infinite review loops on fundamentally flawed specifications.
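The retry ceiling amounts to a loop like the following. This is a hypothetical sketch of the control flow, not SpecForge internals — `run_review` stands in for whatever executes the gate:

```python
def review_with_retries(run_review, max_review_attempts=3, threshold=80):
    """Re-run a gate until it passes or the attempt budget is exhausted."""
    for attempt in range(1, max_review_attempts + 1):
        report = run_review()
        if report["score"] >= threshold:
            return {"passed": True, "attempts": attempt, "report": report}
        # Failed attempt: the findings are fixed before the next run.
    return {"passed": False, "attempts": max_review_attempts,
            "needs_manual_intervention": True, "report": report}

# Simulated gate runs: fails twice, passes on the third attempt after fixes.
results = iter([{"score": 62}, {"score": 74}, {"score": 86}])
outcome = review_with_retries(lambda: next(results))
print(outcome["passed"], outcome["attempts"])  # True 3
```

Had the third run also scored below 80, the loop would have exited with `needs_manual_intervention` set — the point at which a human, not another retry, has to look at the spec.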

Why Two Gates Instead of One

A single post-implementation review would catch problems — but only after agents already spent tokens implementing a bad plan. The Planning Review exists to catch structural problems before any code is written.

Think of it as: Gate 1 validates the blueprint. Gate 2 validates the building. You wouldn’t start construction on a blueprint with missing rooms.

In practice, Planning Review catches:

  • Tickets that are too vague for agents to implement without guessing
  • Missing dependencies that would cause agents to build in the wrong order
  • Gaps in coverage where the spec description promises something no ticket delivers
  • Structural issues like empty epics or duplicate tickets

Implementation Review catches:

  • Tickets marked done but with incomplete steps
  • Acceptance criteria that weren’t actually met
  • Files that should have been created but weren’t
  • Missing test evidence

Configuring the Gates

Both gates are configurable at the project level. You can adjust thresholds, enable/disable individual dimensions, and control how strictly evidence is enforced.

Quick examples:

# Raise the planning threshold for production specs
specforge configure planningConfig.readinessThreshold 85

# Disable git evidence for prototyping
specforge configure implementationConfig.gates.git_evidence false

# Require verbose test output
specforge configure implementationConfig.testEvidence verbose

For the complete configuration reference, see Quality Standards.

Practical Guidance

Starting out? Keep defaults. Threshold 80 is balanced — strict enough to catch real problems, lenient enough to not block you on minor gaps.

Prototyping? Lower the threshold to 60-70 and disable git evidence. Speed matters more than ceremony. You can always re-run reviews later with stricter settings.

Production specs? Raise the threshold to 85-90, require git evidence, and use verbose test evidence. This is where the gates pay for themselves — one caught issue in review is worth hours of debugging in production.

Large teams? Blueprint coverage becomes critical. With multiple people contributing to a spec, the Planning Review catches inconsistencies between contributors that manual review misses.

See Also