KindLM — Epics & User Stories

Format: Each epic represents a slice of user value. Stories follow the "As a [role], I want [action], so that [benefit]" pattern. Acceptance criteria are testable. Estimates are T-shirt sizes: S (1–2 days), M (3–5 days), L (1–2 weeks), XL (2–4 weeks).


Epic 1: First Test in 10 Minutes

Goal: A developer can install KindLM, write a test, and see results in under 10 minutes.
Phase: 1 (MVP)
Priority: P0 — Nothing else matters if this doesn't work.

Story 1.1: Install and scaffold

As a developer, I want to run npm i -g @kindlm/cli && kindlm init and get a working config file, so that I don't have to write YAML from scratch.

Acceptance criteria:

  • npm i -g @kindlm/cli completes in < 30 seconds
  • kindlm init creates kindlm.yaml with a commented, runnable example
  • kindlm init --template agent creates an agent-focused template with tool definitions
  • kindlm init --template compliance creates a template with compliance section
  • Created config passes kindlm validate without edits
  • .kindlm/ directory is created, with a suggestion to add it to .gitignore

Size: M
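
A minimal sketch of the kind of config kindlm init might scaffold. The suite and test names, system_prompt_file, and the lookup_order call come from stories in this document; the remaining keys and the provider:model notation are illustrative assumptions, not the final schema.

```yaml
# Illustrative only; the real scaffold is produced by `kindlm init`.
model: openai:gpt-4o                     # provider:model combo (notation assumed)
system_prompt_file: ./prompts/support-agent.md

suites:
  - name: refund-agent
    tests:
      - name: happy-path
        prompt: "I want a refund for order 123."
        assertions:
          - type: tool_called
            tool: lookup_order
            args:
              order_id: "123"
          - type: keywords_absent
            keywords: ["I can't help with that"]
```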

Story 1.2: Validate config without running

As a developer, I want to run kindlm validate and see if my config is valid before burning API credits, so that I catch YAML mistakes early.

Acceptance criteria:

  • Validates YAML syntax, Zod schema, file references (system_prompt_file, schemaFile)
  • Lists all suites and test counts on valid config
  • Shows line-specific error messages on invalid config (e.g., "line 23: unknown assertion type 'tool_caled'")
  • Exit code 0 (valid) / 1 (invalid)
  • Runs in < 1 second (no API calls)

Size: M

Story 1.3: Run tests and see results

As a developer, I want to run kindlm test and see colored pass/fail output in my terminal, so that I know which tests passed and why failures happened.

Acceptance criteria:

  • Reads kindlm.yaml (or specified file), executes all suites
  • Shows progress indicator during execution
  • Terminal output shows: suite name, test name, pass/fail, assertion details, timing
  • Failed assertions show specific reason ("Expected tool lookup_order to be called, but agent called process_refund")
  • Summary line: pass rate, total tests, total time, total cost
  • Exit code 0 (all pass) / 1 (any fail)

Size: L

Story 1.4: Filter test execution

As a developer, I want to run a single suite or test without running everything, so that I can iterate quickly on a specific test.

Acceptance criteria:

  • kindlm test -s refund-agent runs only that suite
  • kindlm test -s refund-agent -t happy-path runs only that test
  • kindlm test --grep "refund" matches by pattern
  • Invalid suite/test name shows helpful error with available options

Size: S


Epic 2: Assert on Agent Behavior

Goal: Tests can verify what tools an agent calls, with what arguments, in what order — not just the text it outputs.
Phase: 1 (MVP)
Priority: P0 — This is the core differentiator.

Story 2.1: Tool call assertions

As a developer, I want to assert that my agent called lookup_order with order_id: "123", so that I catch when prompt changes break tool routing.

Acceptance criteria:

  • tool_called assertion verifies the tool name and, optionally, a partial argument match
  • tool_not_called asserts a tool was NOT invoked
  • tool_order asserts tools were called in a specific sequence
  • Supports nested argument matching (args.filters.status: "active")
  • Supports wildcard args (args.order_id: "*" — just check it was passed)
  • Failure message shows what was actually called vs expected

Size: L
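
For illustration, the criteria above could map onto YAML roughly like this; the exact key names are assumptions, and check_refund_policy is a made-up tool used only to show ordering.

```yaml
assertions:
  - type: tool_called
    tool: lookup_order
    args:
      order_id: "*"            # wildcard: only verify the argument was passed
      filters:
        status: "active"       # nested partial match
  - type: tool_not_called
    tool: process_refund
  - type: tool_order
    tools: [lookup_order, check_refund_policy]
```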

Story 2.2: Multi-turn tool simulation

As a developer, I want to define simulated tool responses in YAML, so that my tests run without calling real APIs.

Acceptance criteria:

  • Tools section in YAML defines: tool name, conditional responses (when/then), default response
  • Engine runs multi-turn loop: prompt → model response → if tool call, inject sim response → continue
  • Loop terminates when model produces final text (no more tool calls) or max turns reached
  • Supports multiple tool calls in a single turn
  • Timeout per turn (configurable, default 30s)

Size: L
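
One plausible shape for the simulated tools section, sketched under the assumption that conditional responses use when/then keys and that the turn cap and per-turn timeout are top-level config values.

```yaml
# Loop: prompt → model response → if tool call, inject the simulated response → continue
tools:
  - name: lookup_order
    responses:
      - when:
          order_id: "123"
        then:
          status: "shipped"
          total: 49.99
      - default:
          error: "order not found"
max_turns: 6            # illustrative cap; the loop also ends when the model returns final text
turn_timeout_ms: 30000  # illustrative key for the configurable per-turn timeout (default 30s)
```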

Story 2.3: Schema validation

As a developer, I want to validate that my agent's structured output matches a JSON Schema, so that downstream systems don't break on malformed responses.

Acceptance criteria:

  • schema assertion validates response text as JSON against a .json schema file (AJV)
  • Handles non-JSON responses gracefully (the assertion fails rather than crashing the run)
  • Failure message shows which schema fields failed and why
  • Supports $ref in schemas

Size: M
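
A minimal sketch of the assertion, reusing the schemaFile key mentioned in Story 1.2; the schema path is a placeholder.

```yaml
assertions:
  - type: schema
    schemaFile: ./schemas/refund-decision.json   # validated with AJV; $ref is resolved
```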

Story 2.4: LLM-as-judge

As a developer, I want to evaluate subjective quality ("Is this response empathetic?") using an LLM judge, so that I can catch tone and quality regressions.

Acceptance criteria:

  • judge assertion sends response + criteria to a configurable judge model
  • Returns score 0.0–1.0 with explanation
  • Configurable threshold (default 0.7)
  • Judge model can differ from test model (e.g., test with GPT-4o, judge with Claude)
  • Score and explanation included in report
  • Retry on judge API failure (up to 2 retries)

Size: M
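
A hedged sketch of how a judge assertion might be written; criteria, threshold, and model are assumed key names, and the model identifier is just one way to express "judge with Claude".

```yaml
assertions:
  - type: judge
    criteria: "Is this response empathetic and professional in tone?"
    threshold: 0.8                        # default 0.7
    model: anthropic:claude-3-5-sonnet    # judge model may differ from the test model
```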

Story 2.5: PII detection

As a developer, I want to automatically fail tests where the agent leaks personal data, so that I catch privacy violations before production.

Acceptance criteria:

  • no_pii assertion detects: SSNs, credit card numbers, email addresses, phone numbers, IBANs
  • Supports custom regex patterns via config
  • Zero tolerance by default (any match = fail)
  • Failure message identifies which PII type and position in output
  • Configurable allowlist (e.g., allow test email test@example.com)

Size: M
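
A possible configuration, reusing the allowlist example from the criteria; custom_patterns and allowlist are assumed key names.

```yaml
assertions:
  - type: no_pii                  # built-in detectors: SSN, credit card, email, phone, IBAN
    custom_patterns:
      - "EMP-[0-9]{6}"            # illustrative custom regex for internal employee IDs
    allowlist:
      - "test@example.com"
```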

Story 2.6: Keyword guardrails

As a developer, I want to ensure my agent never says certain phrases and always includes others, so that brand and safety guidelines are enforced.

Acceptance criteria:

  • keywords_absent fails if any denied phrase appears (case-insensitive)
  • keywords_present fails if any required phrase is missing
  • Supports regex patterns, not just literal strings
  • Failure message shows the matched/missing keyword and context

Size: S
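
A compact sketch of both keyword assertions; the regex notation shown is an assumption.

```yaml
assertions:
  - type: keywords_absent
    keywords: ["guaranteed outcome", "/legal advice/i"]   # literal phrases or regex, case-insensitive
  - type: keywords_present
    keywords: ["refund policy"]
```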

Story 2.7: Latency and cost assertions

As a developer, I want to fail tests where the agent is too slow or too expensive, so that I catch performance regressions.

Acceptance criteria:

  • latency assertion fails if response time exceeds threshold (ms)
  • cost assertion fails if token cost exceeds threshold (USD)
  • Cost calculated from token usage × provider pricing
  • Works with multi-run aggregation (p95 latency, mean cost)

Size: S
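
The thresholds might be expressed like this; max_ms and max_usd are illustrative names.

```yaml
assertions:
  - type: latency
    max_ms: 4000        # with multi-run aggregation this could apply to p95
  - type: cost
    max_usd: 0.05       # derived from token usage × provider pricing
```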


Epic 3: CI Pipeline Integration

Goal: KindLM runs in CI with zero configuration beyond what's already in the repo.
Phase: 1 (MVP)
Priority: P0

Story 3.1: JUnit XML output

As a QA engineer, I want to get JUnit XML from KindLM, so that my CI system (GitHub Actions, GitLab CI) shows test results in its native reporting UI.

Acceptance criteria:

  • kindlm test --reporter junit writes standard JUnit XML
  • Each test case is a <testcase> element; failures include assertion detail
  • Output file path configurable (--junit report.xml)
  • Compatible with GitHub Actions test reporter and GitLab JUnit artifacts

Size: M

Story 3.2: JSON report

As a developer, I want to get a structured JSON report, so that I can process results programmatically.

Acceptance criteria:

  • kindlm test --reporter json writes full report to file
  • Contains: config hash, git info, all suites, all tests, all assertion results, timing, cost
  • Schema is stable and documented
  • Usable as input to kindlm upload and kindlm baseline set

Size: M

Story 3.3: Exit codes for CI gating

As a CI engineer, I want KindLM to exit with code 0 on pass and 1 on fail, so that I can gate deployments on test results.

Acceptance criteria:

  • Exit 0 = all gates passed
  • Exit 1 = any gate failed OR any unhandled error
  • --gate 90 sets minimum pass rate (overrides config)
  • Git commit SHA and branch name included in report (auto-detected)
  • CI environment auto-detected (GitHub Actions, GitLab CI, Jenkins, CircleCI)

Size: S

Story 3.4: GitHub Actions workflow example

As a developer, I want a copy-paste GitHub Actions workflow, so that I can add KindLM to my CI in < 2 minutes.

Acceptance criteria:

  • .github/workflows/kindlm.yml template in docs
  • Uses npx @kindlm/cli test (no global install needed)
  • API key from secrets
  • JUnit XML uploaded as artifact
  • Optional: upload to KindLM Cloud

Size: S
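
A sketch of what the documented workflow template might contain. The reporter flags come from Story 3.1; the secret name and Node version are assumptions.

```yaml
# Illustrative .github/workflows/kindlm.yml; the canonical template lives in the docs.
name: kindlm
on: [push, pull_request]

jobs:
  kindlm:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npx @kindlm/cli test --reporter junit --junit report.xml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}   # secret name depends on your provider
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: kindlm-junit
          path: report.xml
```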


Epic 4: Drift Detection

Goal: Developers can save a "known good" baseline and detect when agent behavior drifts from it.
Phase: 1 (MVP)
Priority: P1

Story 4.1: Save baseline

As a developer, I want to save current test results as a baseline, so that future runs can compare against a known-good state.

Acceptance criteria:

  • kindlm baseline set saves latest JSON report to .kindlm/baselines/
  • Optional label: kindlm baseline set --label "v2.0-release"
  • Baseline contains: response texts, tool calls, scores, timing
  • Stored as JSON, human-readable

Size: S

Story 4.2: Compare against baseline

As a developer, I want to see what changed between current results and baseline, so that I catch unintended regressions.

Acceptance criteria:

  • kindlm baseline compare or kindlm test --baseline latest
  • drift assertion uses LLM-as-judge to score semantic similarity (0.0–1.0)
  • Field-level diff available (tool calls changed, score dropped, new PII)
  • Summary table: metric, baseline value, current value, delta
  • Configurable drift threshold (default 0.1 = 10% change triggers warning)

Size: L
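
A drift assertion could look roughly like this; the baseline and threshold key names are assumed.

```yaml
assertions:
  - type: drift
    baseline: latest            # or a saved label such as "v2.0-release"
    threshold: 0.1              # default: a 10% semantic change triggers a warning
```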

Story 4.3: List baselines

As a developer, I want to see all saved baselines, so that I can compare against a specific historical snapshot.

Acceptance criteria:

  • kindlm baseline list shows: label, date, test count, pass rate
  • kindlm baseline compare --label "v2.0-release" compares against specific baseline

Size: S


Epic 5: Compliance Documentation

Goal: Running kindlm test --compliance generates audit-ready documentation for EU AI Act.
Phase: 1 (MVP)
Priority: P1

Story 5.1: Compliance config section

As a CTO, I want to add compliance metadata to my existing kindlm.yaml, so that reports include system name, risk level, and operator info.

Acceptance criteria:

  • compliance section in YAML: framework, metadata (systemName, riskLevel, operator, version)
  • Validates risk level against allowed values (minimal, limited, high, unacceptable)
  • Optional outputDir for report files (default: ./compliance-reports/)

Size: S
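
A sketch using the metadata keys named above; the framework identifier and the system/operator values are placeholders.

```yaml
compliance:
  framework: eu-ai-act
  metadata:
    systemName: "Support Refund Agent"
    riskLevel: limited            # one of: minimal, limited, high, unacceptable
    operator: "Example GmbH"
    version: "2.0.0"
  outputDir: ./compliance-reports/
```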

Story 5.2: Generate compliance report

As a compliance officer, I want a markdown document mapping test results to EU AI Act articles, so that I can include it in our compliance package.

Acceptance criteria:

  • kindlm test --compliance generates markdown report
  • Report structure: system description, test methodology, results per article, artifact hashes
  • Maps assertions to Annex IV sections (see 06-COMPLIANCE_SPEC.md)
  • Includes SHA-256 hash of config file, JSON report, and the compliance document itself
  • Timestamp in ISO 8601
  • Understandable to readers without a technical background

Size: L

Story 5.3: Compliance PDF export (Cloud)

As an enterprise customer, I want branded PDF compliance reports stored in the cloud, so that auditors can access them without engineering involvement.

Acceptance criteria:

  • Cloud Team/Enterprise plan: PDF export from dashboard
  • Company logo, name, date on cover page
  • Stored with retention per plan (90 days Team, unlimited Enterprise)
  • Accessible via shareable link (with auth)
  • Enterprise: digitally signed with org key

Size: XL (Phase 2)


Epic 6: Multi-Model Comparison

Goal: Run the same tests against multiple models and compare quality, cost, and latency.
Phase: 1 (MVP)
Priority: P2

Story 6.1: Multiple providers in config

As a developer, I want to test the same suite against Claude and GPT-4o, so that I can compare which model works better for my use case.

Acceptance criteria:

  • models section lists multiple provider:model combos
  • All tests run against all listed models
  • Report shows side-by-side comparison: pass rate, judge scores, latency, cost per model
  • Each model's results are independent (one model failing doesn't skip the others)

Size: M
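
A possible models section; the provider:model identifiers are examples only.

```yaml
models:
  - openai:gpt-4o
  - anthropic:claude-3-5-sonnet
# every suite runs once per listed model; the report compares pass rate, judge scores, latency, and cost
```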


Epic 7: Cloud Dashboard

Goal: Teams can see test history, trends, and collaborate on results via a web dashboard.
Phase: 2 (Cloud)
Priority: P1

Story 7.1: Upload results

As a developer, I want to upload test results to KindLM Cloud from CLI, so that my team can see them in a dashboard.

Acceptance criteria:

  • kindlm login authenticates via GitHub OAuth
  • kindlm test --upload or kindlm upload report.json
  • Results stored in D1, associated with project and git metadata
  • Confirmation message with dashboard URL

Size: M

Story 7.2: Test history view

As an engineering lead, I want to see all test runs for a project over time, so that I can spot trends and regressions.

Acceptance criteria:

  • Dashboard shows: list of runs, pass rate chart over time, latest run detail
  • Filter by branch, date range, suite
  • Click into run for full assertion breakdown
  • Retention per plan (7d free, 90d team, unlimited enterprise)

Size: XL

Story 7.3: Team management

As an engineering lead, I want to invite team members to my org, so that they can view and upload test results.

Acceptance criteria:

  • Invite by GitHub username or email
  • Roles: owner, admin, member (member = read + upload, admin = manage projects, owner = billing)
  • Member limits per plan (1 free, 10 team, unlimited enterprise)

Size: L

Story 7.4: Slack notifications

As a team lead, I want to get a Slack message when tests fail in CI, so that I don't have to check the dashboard constantly.

Acceptance criteria:

  • Webhook URL configured per project in dashboard
  • Notification on: test run uploaded with failures, gate failed, drift threshold exceeded
  • Slack-formatted message with: project, suite, pass rate, link to run

Size: M


Epic 8: Enterprise Compliance

Goal: Regulated companies get audit-grade features for EU AI Act compliance.
Phase: 3 (Enterprise)
Priority: P2

Story 8.1: SSO / SAML

As an enterprise IT admin, I want SSO integration, so that our team authenticates through our identity provider.

Acceptance criteria:

  • SAML 2.0 support (Okta, Azure AD, OneLogin)
  • Auto-provisioning of users from SSO
  • Configurable in dashboard settings

Size: XL

Story 8.2: Audit log API

As a compliance officer, I want a queryable audit log, so that I can prove to auditors exactly which tests were run when.

Acceptance criteria:

  • GET /v1/audit-log with filters: date range, actor, event type
  • Events logged: run uploaded, report viewed, report exported, baseline set, member added/removed
  • Immutable (cannot be deleted by org members)
  • Enterprise plan only

Size: L

Story 8.3: Signed compliance reports

As a compliance officer, I want digitally signed reports, so that auditors can verify the report hasn't been tampered with.

Acceptance criteria:

  • Org generates a signing key pair in dashboard
  • Compliance reports include digital signature
  • Verification endpoint: GET /v1/reports/:id/verify
  • Public key downloadable for offline verification

Size: L


Epic 9: Billing & Plans

Goal: Self-serve plan management with Stripe.
Phase: 3 (Enterprise)
Priority: P1

Story 9.1: Stripe subscription

As a team lead, I want to upgrade to Team plan with my credit card, so that I don't need to talk to sales.

Acceptance criteria:

  • Stripe Checkout for Team plan ($49/mo)
  • Plan change reflected immediately (feature gates update)
  • Card management in dashboard
  • Invoices available for download

Size: L

Story 9.2: Enterprise contact flow

As an enterprise buyer, I want to request an Enterprise plan via a contact form, so that I can discuss custom terms.

Acceptance criteria:

  • "Contact us" button on pricing → contact form
  • Form fields: company, name, email, team size, use case
  • Notification to founder (Slack + email)
  • Manual setup via admin dashboard

Size: S