TRW
LEDGER // REQUIREMENTS
PRD-EVAL-023 — Behavioral Transcript Scoring Calibration (AARE-F v1.1.0)

Requirements that govern the build, not just describe it

Every task gets a machine-readable PRD with acceptance criteria, quality gates, and FR-by-FR traceability. The agent verifies work against the spec — not against the conversation.

FR — Acceptance criterion — Implementation

FR01 [confidence 0.95]
Context compaction false positive eliminated when FRAMEWORK.md string appears inside JSON
pre_analyzers.py:_detect_compaction — line-start anchor regex

FR02 [confidence 0.90]
Error count derived from structured JSONL fields, not regex on file content
pre_analyzers.py:_count_errors_jsonl — state.status=="error" probe

FR03 [confidence 0.95]
OpenCode tool names normalised: write/edit/read counted alongside Write/Edit/Read
pre_analyzers.py:_parse_jsonl_ceremony — t.lower() normalisation

FR04 [confidence 0.85]
Claude-Code nested tool_use format parsed correctly from assistant.message.content
pre_analyzers.py:_parse_jsonl_ceremony — nested content array walk

FR05 [confidence 0.85]
Planning quality prompt clarified: exploration tool calls count as "starting work"
prompts.py:PASS2_SYSTEM — planning_quality_estimate anchor text

FR06 [confidence 0.80]
Retry-loop anchor present at 0.2–0.4 for 5+ retries of same failing command
prompts.py:PASS2_SYSTEM — tool_usage_efficiency anchor text

FR07 [confidence 0.85]
termination_cause field populated deterministically: none | context_compaction | timeout | …
schema.py:BehavioralAnalysis + pre_analyzers.py:apply_behavioral_overrides

FR08 [confidence 0.80]
planning_precedes_tools boolean populated from JSONL event index comparison
schema.py:BehavioralPreAnalysis + pre_analyzers.py:_parse_jsonl_ceremony
0 of 8 acceptance criteria satisfied — VERIFYING…
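The deterministic pre-analyzer checks behind FR01 and FR02 can be sketched roughly like this. This is a minimal sketch, not the shipped code: the marker string and helper names here mirror the ledger's citations but the bodies are illustrative assumptions.

```python
import json
import re

# FR01: anchor the match at line start so FRAMEWORK.md appearing inside a
# JSON payload mid-line no longer trips the compaction detector.
COMPACTION_RE = re.compile(r"^FRAMEWORK\.md", re.MULTILINE)

def detect_compaction(transcript_text: str) -> bool:
    """Illustrative stand-in for _detect_compaction."""
    return bool(COMPACTION_RE.search(transcript_text))

def count_errors_jsonl(jsonl_text: str) -> int:
    """FR02: count errors from structured JSONL fields, probing
    state.status == "error" instead of regexing raw file content."""
    errors = 0
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than miscounting
        if event.get("state", {}).get("status") == "error":
            errors += 1
    return errors
```

The line-start anchor is the whole fix for FR01: a marker quoted inside a JSON string never begins a line, so it can no longer be mistaken for the real compaction banner.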
LEDGER // PROBLEM

Vague prompts build vague features

A plain AI-coding session has no governing spec. The agent receives a sentence in chat, produces code that looks plausible, and reports done. There is no artefact that says what “done” means, no criteria to test against, and no record of the intent behind each decision.

AI fluency does not equal correctness. A fluent model can write confident-sounding code that satisfies the prompt's literal words while missing every unstated requirement that mattered. Without machine-verifiable acceptance criteria, “done” is just the agent stopped typing.

Engineering leads feel this as defect rework. Compliance managers feel it as audit exposure. Developers feel it as the sinking realisation that a four-hour session produced something that fails on the first edge case they forgot to mention.

LEDGER // AARE_F

AARE-F — the structured spec format

AARE-F v1.1.0 (AI-Augmented Requirements Engineering Framework) defines 12 mandatory sections, YAML frontmatter, confidence scores per FR (0.0–1.0), EARS-pattern compliance, and a bidirectional traceability matrix. Every PRD that passes validation carries the same structure so agents and humans can navigate any spec without a tutorial.

YAML frontmatter — PRD ID, version, status, priority, risk level
AARE-F components — C1–C9 mapped to the PRD
Evidence level — strong | moderate | weak + sources
Confidence scores — per-FR 0.0–1.0 float, EARS-pattern linked
Traceability — implements / depends_on / enables / conflicts_with
Metrics + SLOs — success criteria, measurement method, SLO targets
Quality gates — ambiguity ≤5%, completeness ≥85%, traceability ≥90%
Problem statement — background + measurable defects + impact
Goals + non-goals — explicit scope boundary
User stories — role / want / so-that + evidence required
Functional requirements — EARS pattern + acceptance criteria + confidence
Risk + dependencies — probability, impact, mitigation, residual risk
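A minimal sketch of the per-FR record this structure implies, assuming field names beyond those listed above (the class name and validation rule are illustrative, not the normative AARE-F schema):

```python
from dataclasses import dataclass, field

@dataclass
class FunctionalRequirement:
    """One FR row as AARE-F describes it: an EARS-pattern criterion,
    a per-FR confidence float, and traceability links."""
    fr_id: str                                           # e.g. "FR01"
    acceptance_criterion: str                            # EARS-pattern statement
    confidence: float                                    # 0.0–1.0 per FR
    implements: list[str] = field(default_factory=list)  # traceability verbs
    depends_on: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the documented confidence range at construction time.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be a 0.0–1.0 float")
```

Keeping confidence a validated field rather than free text is what lets quality gates compare it numerically across FRs.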
TERMINAL // trw_prd_create

From prompt to PRD in seconds

One tool call turns a natural-language description into a sprint-ready AARE-F PRD. The agent does not write vague bullets — it emits a versioned document with every section filled, every FR carrying a confidence score, and every quality gate pre-checked. The ledger at the top of this page is the live output of PRD-EVAL-023, created with the same call shape below.

TOOL CALL

trw_prd_create(
    input_text="Add rate limiting",
    category="INFRA"
)

OUTPUT

PRD-INFRA-042 created — 15 FRs, 5 acceptance criteria each

FR01 Rate limit per IP: 100 req/min default [confidence: 0.92]

FR02 429 response includes Retry-After header [confidence: 0.90]

FR03 Authenticated users exempt from IP-based limit [confidence: 0.88]

FR04 Limits configurable per route prefix [confidence: 0.85]

FR05 Sliding-window algorithm, Redis-backed [confidence: 0.82]

Quality gate: READY — structure 15/15, completeness 28/30, clarity 18/20, traceability 24/25

LATTICE // VALIDATION

Quality gates before IMPLEMENT

trw_prd_validate returns a READY / NEEDS WORK / BLOCK verdict with per-dimension scores. A BLOCK verdict prevents moving to the IMPLEMENT phase until the issue is resolved — the agent cannot self-approve past a failing gate.

15 pts — Structure — YAML frontmatter, 12 sections, ID format
30 pts — Completeness — All required fields populated, non-goals explicit
20 pts — Clarity — EARS patterns, no vague terms, measurable criteria
25 pts — Traceability — FR → code → test links bidirectional
10 pts — Quality gates — Ambiguity rate, completeness %, coverage %

READY / NEEDS WORK / BLOCK — three verdicts, one source of truth
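A minimal sketch of how a three-verdict gate could combine the five dimension scores above. The maxima come from the rubric on this page; the half-maximum BLOCK rule and 85% READY cutoff are assumptions for illustration, not the shipped scoring logic.

```python
# Dimension maxima from the rubric above (100 points total).
MAXIMA = {"structure": 15, "completeness": 30, "clarity": 20,
          "traceability": 25, "quality_gates": 10}

def verdict(scores: dict[str, int]) -> str:
    """Hypothetical rule: BLOCK if any dimension falls below half its
    maximum, READY if every dimension reaches 85% of its maximum,
    otherwise NEEDS WORK."""
    for dim, maximum in MAXIMA.items():
        if scores.get(dim, 0) < maximum / 2:
            return "BLOCK"
    if all(scores.get(d, 0) >= 0.85 * m for d, m in MAXIMA.items()):
        return "READY"
    return "NEEDS WORK"
```

Whatever the exact thresholds, the point of the design holds: the verdict is computed from scores, so the agent cannot talk its way past a failing gate.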
LEDGER // TRACEABILITY

Traceability that sticks

The traceability matrix in every AARE-F PRD links each FR to the source requirement, the implementation file, and the test that verifies it. The trw-traceability-checker agent grep-verifies that every FR has a corresponding code path and test before delivery is allowed.

Bidirectional means starting from either end works. An engineer browsing the code can jump to the FR that mandated a function. An auditor reading the PRD can see the exact commit that landed each requirement. The spec-ledger demo at the top of this page is that matrix in motion.
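The grep pass the checker performs can be sketched roughly as follows. The FR-id pattern, `*.py` scope, and function name here are assumptions; the actual checker's matching rules may differ.

```python
import re
from pathlib import Path

# Assumed FR-id convention: "FR" followed by two digits, e.g. FR01.
FR_ID_RE = re.compile(r"\bFR\d{2}\b")

def unverified_frs(prd_text: str, code_dir: str, test_dir: str) -> set[str]:
    """Return FR ids declared in the PRD that are not mentioned in both
    a code file and a test file -- a rough stand-in for the checker's
    grep verification."""
    declared = set(FR_ID_RE.findall(prd_text))

    def ids_under(root: str) -> set[str]:
        found: set[str] = set()
        for path in Path(root).rglob("*.py"):
            found |= set(FR_ID_RE.findall(path.read_text(errors="ignore")))
        return found

    # An FR counts as covered only when it appears on both ends.
    covered = ids_under(code_dir) & ids_under(test_dir)
    return declared - covered
```

An empty return set would be the precondition for allowing delivery; anything left over names exactly which requirement lacks a code path or a test.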

LEDGER // FAQ

Common questions

Do I have to write PRDs manually?

No. trw_prd_create drafts a full AARE-F PRD from a single input sentence. You review and refine; you do not author from scratch.

What happens if functional requirements change mid-build?

Update the FRs in the PRD and re-run trw_prd_validate. The quality gate re-checks traceability before IMPLEMENT resumes. No silent drift.

Does this work for small bug fixes?

Yes. Smaller tasks produce smaller PRDs — a single FR with two acceptance criteria is valid. The 12-section structure scales down gracefully.

How strict are the quality gates?

READY / NEEDS WORK / BLOCK verdicts are configurable per project. Defaults: ambiguity ≤5%, completeness ≥85%, traceability ≥90%, consistency ≥95%.
TERMINAL // SHIP_WITH_SPECS

Ship what the spec says, not what the chat remembered

Access is currently limited while onboarding hardens. Request access to start running machine-verifiable requirements in your repo.