
Quality Gates

Two gates guard quality. Planning Review before code. Implementation Review after code. Both produce scores, both must pass. Nothing ships without clearance.

The Two Gates

Gate 1 — Planning Review

Runs before any code is written. Evaluates whether your specification is ready for implementation.

The question it answers: “If I hand this plan to agents right now, will they have everything they need to implement it correctly?”

A specification can be detailed and still fail Planning Review if dependencies are circular, tickets lack acceptance criteria, or entire functional areas are missing from the decomposition.

Gate 2 — Implementation Review

Runs after all tickets are completed. Evaluates whether the delivered code matches the original specification.

The question it answers: “Did the agents actually build what the specification asked for, to the quality standard we defined?”

A specification can have all tickets marked done and still fail Implementation Review if acceptance criteria weren’t met, files weren’t delivered, or test coverage is missing.

Scoring Dimensions

Each gate evaluates multiple dimensions and produces a score per dimension. The overall score is a weighted average. The default passing threshold is 80.

Planning Review Dimensions

Completeness — Does every epic have tickets? Does every ticket have a title, description, steps, and acceptance criteria? Are there empty epics or stub tickets with no actionable content?

  • Score 90: All tickets have steps and acceptance criteria. One ticket is missing a description but has enough context in its title and steps.
  • Score 60: Multiple tickets are stubs — title only, no steps, no acceptance criteria. Agents would have to guess what to build.

Dependencies — Is the dependency graph valid? Are there circular references? Do all cross-epic links resolve correctly? Are there orphaned tickets that should have dependencies but don’t?

  • Score 95: Clean DAG, no cycles, all cross-epic references resolve. Critical path is computable.
  • Score 50: Circular dependency detected between two tickets. Three tickets reference dependencies that don’t exist.
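The cycle check behind the Dependencies dimension amounts to a standard depth-first search over the ticket graph. A minimal sketch — this is an illustration of the technique, not SpecForge’s actual implementation, and the ticket IDs are hypothetical:

```python
def has_cycle(graph):
    """Return True if the ticket dependency graph contains a cycle.

    graph maps each ticket ID to the list of ticket IDs it depends on.
    """
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False          # already fully explored, no cycle through here
        if node in visiting:
            return True           # back edge: we re-entered a ticket on the current path
        visiting.add(node)
        for dep in graph.get(node, []):
            if visit(dep):
                return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(visit(node) for node in graph)

clean  = {"AUTH-1": [], "AUTH-2": ["AUTH-1"], "AUTH-3": ["AUTH-2"]}
cyclic = {"AUTH-1": ["AUTH-2"], "AUTH-2": ["AUTH-1"]}
print(has_cycle(clean))   # False
print(has_cycle(cyclic))  # True
```

A dependency pointing at an ID missing from the graph would surface here as a lookup against an unknown node — the “reference that doesn’t exist” finding in the score-50 example above.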

Coverage — Does the decomposition cover everything described in the specification? Are there goals or requirements mentioned in the description that no ticket addresses?

  • Score 85: All stated requirements have corresponding tickets. One edge case mentioned in the description (“account lockout after 5 failed attempts”) has no dedicated ticket but is covered as a step in another ticket.
  • Score 55: The specification mentions “rate limiting” in the description but no epic or ticket addresses it.

Ticket Quality — Are tickets atomic and implementable? Or are they vague epics disguised as tickets? Do steps describe concrete actions or abstract goals?

  • Score 90: Tickets are small, focused, with concrete implementation steps like “Create JwtService class in src/auth/jwt.service.ts.”
  • Score 45: A single ticket says “Implement the entire authentication system” with no steps. That’s an epic, not a ticket.

Acceptance Criteria — Are acceptance criteria specific and verifiable? Can an agent objectively determine whether each criterion is met? Or are they vague statements like “should work well”?

  • Score 88: Criteria like “Tokens are signed with RS256 algorithm” and “Invalid tokens return 401 with error message.”
  • Score 40: Criteria like “Authentication should be secure” and “Good user experience.”

Implementation Review Dimensions

Steps Completion — Were all implementation steps in each ticket marked as done?

Acceptance Criteria — Were all acceptance criteria in each ticket satisfied?

File Delivery — Were all expected file creations, modifications, and deletions actually performed?

Git Evidence — Are commits and/or pull requests linked to tickets? (Configurable — can be required, recommended, or disabled.)

Tests — Were test results submitted for tickets that require them? Did tests pass?

💡 Each dimension is independently toggleable in your project’s Quality Standards configuration. You can disable git evidence for prototyping or require verbose test output for production specs.

How Scoring Works

Each dimension produces a score from 0 to 100. The overall gate score is a weighted average across all active dimensions. The default passing threshold is 80, but you can configure this per project.
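The arithmetic can be sketched as follows — a minimal illustration only, with hypothetical dimension weights (SpecForge’s actual weights are not documented here):

```python
def gate_score(scores, weights):
    """Weighted average of per-dimension scores (each 0-100)."""
    total_weight = sum(weights[dim] for dim in scores)
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight

# Hypothetical planning-review result
scores  = {"completeness": 90, "dependencies": 95, "coverage": 85,
           "ticket_quality": 90, "acceptance_criteria": 88}
weights = {"completeness": 2, "dependencies": 2, "coverage": 2,
           "ticket_quality": 1, "acceptance_criteria": 1}

overall = gate_score(scores, weights)
print(overall, overall >= 80)  # 89.75 — clears the default threshold of 80
```

Note that with weighting, a single weak dimension can sink the overall score even when the others are strong — which is the intent: one missing dimension of readiness is enough to make a plan risky.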

Passed (≥ threshold): The specification advances to the next phase. Planning Review moves the spec to ready. Implementation Review moves it to reviewed.

Failed (< threshold): The gate produces a detailed feedback report. Not blind rejection — every failed dimension includes specific findings:

  • Which tickets are missing acceptance criteria
  • Which dependencies are circular or unresolved
  • Which files were expected but not delivered
  • Which acceptance criteria were not satisfied

The feedback is actionable. Fix the findings, run the review again. The spec doesn’t go back to square one — it stays in its current phase and you address the gaps.

ℹ️ maxReviewAttempts (default: 3) limits how many times a review can be retried before requiring manual intervention. This prevents infinite review loops on fundamentally flawed specifications.
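The retry ceiling amounts to a loop like the following. This is a hypothetical sketch of the control flow, not SpecForge internals — `run_review` stands in for whatever executes the gate:

```python
def review_with_retries(run_review, max_review_attempts=3, threshold=80):
    """Re-run a gate until it passes or the attempt budget is exhausted."""
    for attempt in range(1, max_review_attempts + 1):
        report = run_review()
        if report["score"] >= threshold:
            return {"passed": True, "attempts": attempt, "report": report}
        # Failed attempt: the findings are fixed before the next run.
    return {"passed": False, "attempts": max_review_attempts,
            "needs_manual_intervention": True, "report": report}

# Simulated gate runs: fails twice, passes on the third attempt after fixes.
results = iter([{"score": 62}, {"score": 74}, {"score": 86}])
outcome = review_with_retries(lambda: next(results))
print(outcome["passed"], outcome["attempts"])  # True 3
```

Had the third run also scored below 80, the loop would have exited with `needs_manual_intervention` set — the point at which a human, not another retry, has to look at the spec.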

Why Two Gates Instead of One

A single post-implementation review would catch problems — but only after agents already spent tokens implementing a bad plan. The Planning Review exists to catch structural problems before any code is written.

Think of it as: Gate 1 validates the blueprint. Gate 2 validates the building. You wouldn’t start construction on a blueprint with missing rooms.

In practice, Planning Review catches:

  • Tickets that are too vague for agents to implement without guessing
  • Missing dependencies that would cause agents to build in the wrong order
  • Gaps in coverage where the spec description promises something no ticket delivers
  • Structural issues like empty epics or duplicate tickets

Implementation Review catches:

  • Tickets marked done but with incomplete steps
  • Acceptance criteria that weren’t actually met
  • Files that should have been created but weren’t
  • Missing test evidence

Configuring the Gates

Both gates are configurable at the project level. You can adjust thresholds, enable/disable individual dimensions, and control how strictly evidence is enforced.

Quick examples:

# Raise the planning threshold for production specs
specforge configure planningConfig.readinessThreshold 85

# Disable git evidence for prototyping
specforge configure implementationConfig.gates.git_evidence false

# Require verbose test output
specforge configure implementationConfig.testEvidence verbose

For the complete configuration reference, see Quality Standards.

Practical Guidance

Starting out? Keep defaults. Threshold 80 is balanced — strict enough to catch real problems, lenient enough to not block you on minor gaps.

Prototyping? Lower the threshold to 60-70 and disable git evidence. Speed matters more than ceremony. You can always re-run reviews later with stricter settings.

Production specs? Raise the threshold to 85-90, require git evidence, and use verbose test evidence. This is where the gates pay for themselves — one caught issue in review is worth hours of debugging in production.

Large teams? Blueprint coverage becomes critical. With multiple people contributing to a spec, the Planning Review catches inconsistencies between contributors that manual review misses.

See Also