
Engineering Foundations

Deterministic control over probabilistic models.

This section explains why SpecForge works, not just how to use it. If you just want to ship, the Quickstart is that way. If you want to understand the engineering principles that guarantee SpecForge’s behavior under adversarial conditions — stay.


The Problem Statement

Large Language Models are probabilistic. Given the same input twice, they may produce different outputs. Given a complex task, they may hallucinate constraints that don’t exist, skip requirements that do, or make architectural decisions that contradict previous ones.

This is not a bug. It’s the fundamental nature of the system. Probabilities, not determinism.

Most AI coding tools accept this and compensate with human review. The human reads the output, catches the errors, corrects, re-runs. This works for one agent on one task. It does not work for five agents on fifty tasks. The human becomes the bottleneck — the very bottleneck the AI was supposed to eliminate.

The question SpecForge answers: how do you impose deterministic behavior on a probabilistic system without eliminating the probabilistic system’s strengths?

The answer comes from a discipline that has solved this exact class of problem for over 80 years.


Control Theory Applied to AI Agents

Control theory is the engineering discipline of making systems behave predictably despite disturbances. A thermostat is a controller. An autopilot is a controller. A PID loop in an industrial plant is a controller. In every case, the pattern is the same: a reference signal defines the target, a comparator measures the error, and a feedback loop drives the error toward zero.

The comparator receives two inputs: what should happen (reference) and what is actually happening (feedback). It computes the error. The plant receives the error and corrects. The loop repeats until the error is within tolerance.
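The pattern can be sketched in a few lines. This is a generic thermostat loop, not SpecForge code: the names `thermostatStep` and `runLoop`, the 0.5 gain, and the plant model are illustrative assumptions.

```typescript
// A minimal closed-loop controller: a thermostat (illustrative sketch).
type Plant = { temperature: number };

// One iteration: the comparator computes e = r - y, the plant corrects.
function thermostatStep(plant: Plant, reference: number, gain = 0.5): number {
  const error = reference - plant.temperature; // comparator output
  plant.temperature += gain * error;           // plant responds to the correction
  return error;
}

// The loop repeats until the error is within tolerance.
function runLoop(plant: Plant, reference: number, tolerance = 0.1): number {
  let iterations = 0;
  while (Math.abs(reference - plant.temperature) > tolerance) {
    thermostatStep(plant, reference);
    iterations++;
  }
  return iterations;
}
```

Every controller described below is a variation of this loop; only the reference, comparator, and plant change.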

The SpecForge Control Loop

SpecForge maps directly to this architecture:

  • r (Reference) → Specification — acceptance criteria, guardrails, expected files, implementation steps. Fixed.
  • Σ (Comparator) → MCP server — the logic inside complete_work_session. Receives the spec (r) and the validation results, computes what’s missing.
  • e (Error signal) → Prompt injection — “you’re not done. These acceptance criteria are unmet. These files are missing.” Specific, actionable.
  • Plant → AI Agent — the probabilistic element. Receives corrections, produces code. The only non-deterministic component.
  • D (Disturbance) → Hallucination — noise in the agent. It receives correct instructions but may produce incorrect output.
  • Output → Implemented code — the files, tests, commits produced by the agent.
  • H(s) (Feedback) → Validation checks — AC completion status, file existence checks, guardrail compliance, git diff.
SpecForge control loop: Spec (r) → MCP Comparator (Σ) → AI Agent → Code → Validation H(s) → back to MCP. Hallucination enters at the Agent.

Here’s how the loop works in practice:

  1. The agent receives a ticket with its specification: acceptance criteria, implementation steps, expected files, guardrails.
  2. The agent implements. Hallucination may introduce errors — wrong file structure, missing acceptance criteria, guardrail violations.
  3. The agent calls complete_work_session to finalize.
  4. The MCP server (comparator) checks: Are all acceptance criteria satisfied? Do all expected files exist? Are guardrails respected?
  5. If not: the MCP server rejects the completion and injects a prompt back to the agent with exactly what’s missing. This is the error signal.
  6. The agent corrects based on the specific feedback.
  7. The agent tries to complete again. The MCP server checks again.
  8. Loop repeats until all checks pass — or maxReviewAttempts triggers a safety shutoff.

The agent doesn’t know it’s in a control loop. It thinks it’s trying to complete a task. The MCP server is the invisible controller — measuring output against reference and injecting corrections when the error is non-zero.

Everything in this loop is deterministic except the agent. The spec is fixed. The comparator is algorithmic (boolean checks). The validation is structural (file existence, AC status). The code is code. The entire system exists to tame the one probabilistic element.
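The comparator step above can be sketched as follows. The type shapes and the name `computeErrorSignal` are illustrative assumptions, not SpecForge's actual internals; the point is that the error signal comes from boolean checks and is rendered as a specific, actionable message.

```typescript
// Hypothetical shapes for a spec and a validation pass (not SpecForge's real types).
interface Spec {
  acceptanceCriteria: { id: number; description: string; met: boolean }[];
  expectedFiles: string[];
}

interface ValidationResult {
  existingFiles: string[];
}

// The comparator: measure output against reference, return the error signal
// as a specific prompt, or null when the error is zero (completion accepted).
function computeErrorSignal(spec: Spec, result: ValidationResult): string | null {
  const unmetACs = spec.acceptanceCriteria.filter((ac) => !ac.met);
  const missingFiles = spec.expectedFiles.filter(
    (f) => !result.existingFiles.includes(f)
  );
  if (unmetACs.length === 0 && missingFiles.length === 0) return null;

  const lines = ["You're not done."];
  for (const ac of unmetACs) lines.push(`AC #${ac.id} is unmet: ${ac.description}`);
  for (const f of missingFiles) lines.push(`Expected file was not created: ${f}`);
  return lines.join("\n");
}
```

Because the checks are boolean and the spec is fixed, the same inputs always yield the same error signal: the comparator is deterministic.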


Open-Loop vs Closed-Loop Agent Systems

The Thermometer (open-loop)

Most AI coding tools are open-loop. They show you the output and let you decide what’s wrong. The agent has no mechanism to self-correct. The human is the feedback loop. This is a thermometer — it tells you the temperature but doesn’t control it. You read the number and adjust the dial yourself.

The Thermostat (closed-loop)

SpecForge is closed-loop. The MCP server measures the output against the specification automatically. The human defines the temperature (specification). The system maintains it. The human sets 22°C and walks away.

Open-loop (thermometer): human corrects manually, never converges. Closed-loop (thermostat): system self-corrects, converges in 3 iterations.

The specification is fixed. The agent implements. The MCP server validates: ACs met? Files exist? Guardrails respected? If not, the error signal goes back to the agent as a prompt injection with exactly what’s missing. The loop closes at the MCP server. The human intervenes only when the system needs a decision — not when it needs supervision.

Market taxonomy: Vibe Coders (no control) → Babysitter Agents (manual) → Spec-Driven Manual (semi-auto) → SpecForge (closed-loop deterministic).

Step Response — How Systems Converge

This is the chart that every control engineer recognizes. The step response shows what happens when a system receives a task and attempts to reach the target. How it gets there — or doesn’t — reveals everything about the system’s design.

In SpecForge’s context: the “step” is an agent receiving a ticket to implement. The target is zero deviation from the spec. The question is: how does the output converge to the target over successive iterations?

Step response comparison: unstable (vibe coding), underdamped (babysitter), overdamped (manual spec-driven), critically damped (SpecForge — converges in 3 iterations).

Four behaviors are possible. Each maps to a class of AI coding tool:

No Controller (Unstable) — The system oscillates without converging. Each correction introduces a new error. This is vibe coding: Lovable, Bolt, v0.

Underdamped (Babysitter Agent) — Oscillates but eventually converges after 12-15 iterations. Large overshoot: the agent rewrites an entire module when only a function needed changing. This is Devin, or Cursor Agent in autonomous mode.

Overdamped (Manual Spec-Driven) — No oscillation, but glacially slow convergence. 10+ iterations because the human reviews every line. This is Sweep, CodePlan, and the careful manual approach.

Critically Damped (SpecForge) — Minimal overshoot, converges in 2-3 iterations. The MCP server detects exactly which ACs are unmet, which files are missing. The error signal is specific: not “try again” but “AC #3 is unmet and auth.middleware.ts was not created.” The system converges fast because the feedback is precise.

maxReviewAttempts (default: 3) is the safety bound — the control theory equivalent of a safety shutoff on an unstable system.
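The bounded loop can be sketched like this. `validate` and `correct` are hypothetical stand-ins for the MCP server's checks and the agent's next attempt; only the shape of the bound is the point.

```typescript
// Sketch of a review loop with a safety shutoff (illustrative, not SpecForge internals).
interface LoopResult {
  converged: boolean;
  attempts: number;
}

function reviewLoop(
  validate: () => string | null,    // returns the error signal, or null when done
  correct: (error: string) => void, // the agent acts on the error signal
  maxReviewAttempts = 3             // convergence guard, default 3
): LoopResult {
  for (let attempt = 1; attempt <= maxReviewAttempts; attempt++) {
    const error = validate();
    if (error === null) return { converged: true, attempts: attempt };
    correct(error);
  }
  // Not converged within the bound: stop and escalate to a human.
  return { converged: false, attempts: maxReviewAttempts };
}
```

An unstable agent cannot loop forever: after the bound is hit, the system stops and hands the decision back to a human, exactly like a shutoff tripping on a runaway plant.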


The Dependency Graph as Constraint Space

In control theory, you don’t just control the output — you constrain the state space. An autopilot doesn’t just aim for the destination; it constrains the aircraft to safe altitude bands, speed envelopes, and approach corridors. The constraint space prevents the system from reaching states where recovery is impossible.

The dependency graph serves the same function. It’s not a task list. It’s a constraint space that prevents agents from reaching invalid states:

  • Agent B cannot start until Agent A’s output passes validation.
  • Circular dependencies are rejected at planning time.
  • Ticket isolation ensures each agent operates within its own bounded context.

Without the dependency graph, parallel agents are unconstrained — Agent A picks Prisma, Agent B picks Drizzle, and the code diverges.
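Both constraints can be sketched in a few lines. The shapes here are hypothetical (SpecForge's real planner will differ): cycle rejection via depth-first search, and a readiness check that gates each ticket on its validated dependencies.

```typescript
// deps maps each ticket to the tickets it depends on (hypothetical shape).
type Graph = Map<string, string[]>;

// Reject circular dependencies at planning time: a back edge to a node
// still being visited means the graph contains a cycle.
function hasCycle(deps: Graph): boolean {
  const state = new Map<string, "visiting" | "done">();
  const visit = (node: string): boolean => {
    if (state.get(node) === "visiting") return true; // back edge: cycle
    if (state.get(node) === "done") return false;
    state.set(node, "visiting");
    for (const dep of deps.get(node) ?? []) if (visit(dep)) return true;
    state.set(node, "done");
    return false;
  };
  return Array.from(deps.keys()).some(visit);
}

// A ticket is ready only when every one of its dependencies has passed validation.
function readyTickets(deps: Graph, validated: Set<string>): string[] {
  return Array.from(deps.keys()).filter(
    (t) => !validated.has(t) && (deps.get(t) ?? []).every((d) => validated.has(d))
  );
}
```

With these two checks, an agent can never be scheduled into a state the planner cannot recover from: cycles are rejected before work starts, and no ticket runs ahead of its validated inputs.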


Stability and Convergence

A control system is stable if the error decreases over time and converges to zero. SpecForge’s review loop is designed for convergence:

  • Planning Review prevents unstable initial conditions — catching flawed specs before any code is written.
  • maxReviewAttempts (default: 3) is a convergence guard — if the system hasn’t converged after 3 iterations, it stops and requires human intervention.
  • Acceptance criteria provide multi-axis error measurement — not “is it good?” but “is AC #1 met? Do expected files exist? Are guardrails respected?”


Why This Matters

Every AI coding tool claims to “orchestrate agents.” Most of them are open-loop systems with a dashboard. They show you what happened. You decide what to do about it.

SpecForge is a closed-loop control system. The specification is the reference signal. The MCP server is the comparator. The agent is the plant. The dependency graph is the constraint space. The prompt injection with specific findings is the error signal.

This isn’t a metaphor. It’s the literal engineering architecture.

And this is why the system scales. Open-loop systems scale linearly — more agents, more human review needed. Closed-loop systems scale by design — more agents, same comparator, same guarantees. The MCP server doesn’t care if there are 5 workers or 50. It measures output against reference and computes error. That’s what controllers do.

“He didn’t know it was impossible, so he did it.”


Further Reading

  • Quality Gates — The comparator in detail: scoring dimensions, thresholds, configuration
  • Lifecycles — The three cycles that structure the closed loop
  • Work Sessions — How the MCP server manages the control loop per ticket