KindLM — Product Requirements Document
Version: 1.0
Last updated: February 2026
Author: Petr Kindlmann
Status: Active
1. Problem Statement
Teams building AI agents (support bots, fintech copilots, HR screeners, medical triage) have no reliable way to test agent behavior before deploying. Existing tools test text output quality, but agents make decisions — they call tools, route conversations, approve transactions, escalate tickets. A prompt change on Friday can silently break tool call logic by Monday. No errors fire. No alerts trigger. The output looks fine. The behavior is wrong.
Compounding this: the EU AI Act (Regulation 2024/1689) requires companies deploying high-risk AI systems to maintain test documentation by August 2, 2026. Penalties reach 7% of global annual revenue. Most companies have no tooling to generate this documentation from their existing test processes.
2. Product Vision
KindLM is a CLI-first testing framework that lets engineering teams define behavioral tests for AI agents in YAML and run them locally or in CI. It tests what agents do (tool calls, structured output, guardrail compliance), not just what they say (text quality). It also generates EU AI Act–aligned compliance documentation from test results.
One-liner: Regression tests for AI agents — tool calls, output quality, and compliance. Defined in YAML, run in CI.
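To make the shape concrete, here is a hedged sketch of what a minimal `kindlm.yaml` could look like. The key names are illustrative assumptions (the schema is still an open design question), but the concepts map directly to the v1.0 scope: provider adapters, tool call assertions, and LLM-as-judge scoring.

```yaml
# Illustrative sketch only — key names are assumptions, not a finalized schema.
provider: openai                  # v1.0 adapters: openai, anthropic, ollama
model: gpt-4o
system_prompt: ./prompts/support-agent.txt

tests:
  - name: refund-request-calls-tool
    input: "I want a refund for order 4412"
    assertions:
      - type: tool_called         # behavioral: what the agent did
        name: initiate_refund
      - type: llm_judge           # textual: what the agent said
        criteria: "Polite, and confirms the refund was initiated"
        threshold: 0.8
```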
3. Target Users
Primary: AI Engineer (Individual Contributor)
- Profile: Works at a Series A–C startup (20–200 people). Writes Python + TypeScript. Uses Claude or GPT-4o APIs. Builds agentic features — support bots, internal copilots, data pipelines with LLM steps.
- Pain: Prompt changes break agent behavior silently. No test framework understands tool calls. Existing eval tools (Promptfoo, Braintrust) test text, not agent decisions.
- Discovery: GitHub, Hacker News, Twitter/X, dev newsletters.
- Trigger: A production incident where an agent did the wrong thing after a prompt update.
- What they need: `npm install`, write YAML, run `kindlm test`, get pass/fail in CI. Under 10 minutes to first test.
Secondary: Compliance-Anxious CTO
- Profile: VP Engineering or CTO at a European (or EU-serving) company in fintech, healthtech, or HR-tech. 50–500 employees. Already uses AI in production. Aware of EU AI Act but hasn't started compliance work.
- Pain: August 2026 deadline approaching. Legal team flagged risk. No idea how to generate required test documentation. Compliance consultants quote €50K+.
- Discovery: LinkedIn, compliance newsletters, CTO peer groups, conferences.
- Trigger: Board meeting or legal review flags EU AI Act risk.
- What they need: A `--compliance` flag that generates auditor-ready documentation. Cloud dashboard their compliance officer can access.
Tertiary: QA Lead Adding AI Coverage
- Profile: Leads QA at a company with established test culture (Playwright, Cypress, Jest). Company is adding AI features. Responsible for "how do we test the AI?"
- Pain: Traditional test frameworks can't assert on non-deterministic LLM behavior. Can't write `expect(response).toBe(...)` for AI output.
- Discovery: Testing communities, QA conferences, team Slack channels.
- Trigger: Manager asks "how do we test the AI features?"
- What they need: Familiar patterns (YAML config, assertions, CI integration, JUnit XML). Feels like a testing tool, not an ML platform.
4. Product Scope
In Scope (v1.0 — MVP)
| Feature | Description |
|---|---|
| YAML config | Define test suites, providers, assertions in kindlm.yaml |
| Provider adapters | OpenAI, Anthropic, Ollama |
| Tool call assertions | Assert which tools were called, with what args, in what order |
| Schema assertions | Validate structured output against JSON Schema (AJV) |
| LLM-as-judge | Score responses against natural language criteria |
| PII detection | Regex-based SSN, credit card, email, phone, IBAN detection |
| Keyword guardrails | Required/forbidden phrases |
| Drift detection | Compare against stored baselines |
| Latency + cost assertions | Performance and budget guardrails |
| Multi-run aggregation | Run each test N times, aggregate scores |
| Pass/fail gates | Configurable thresholds for pass rate, schema failures |
| Terminal reporter | Colored, readable output |
| JSON reporter | Full structured report for programmatic use |
| JUnit XML reporter | Drop into any CI system |
| Compliance reporter | EU AI Act Annex IV markdown document |
| Baseline management | Save, compare, list baselines |
| CLI commands | init, validate, test, baseline |
| CI integration | Exit codes, env detection, JUnit output |
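As a hedged illustration of how several of these rows might compose in one config (the field names below are assumptions, not a committed schema):

```yaml
# Illustrative sketch — field names are assumptions; comments name the scope rows.
runs: 3                        # multi-run aggregation: execute each test 3 times
gates:
  pass_rate: 0.9               # pass/fail gate: fail CI below 90%
  schema_failures: 0           # pass/fail gate: any JSON Schema violation fails
defaults:
  assertions:
    - type: latency
      max_ms: 3000             # latency assertion
    - type: cost
      max_usd: 0.02            # cost assertion per test
reporters: [terminal, json, junit]   # terminal, JSON, and JUnit XML reporters
```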
In Scope (v2.0 — Cloud)
| Feature | Description |
|---|---|
| Cloud dashboard | Web UI for test history, trends, team visibility |
| Upload from CLI | kindlm upload or --upload flag |
| Team management | Org, members, roles |
| Plan gating | Free / Team / Enterprise feature limits |
| Compliance PDF | Branded PDF export of compliance reports |
| Webhooks | Slack and webhook notifications on failures |
| GitHub OAuth | Authentication via GitHub |
In Scope (v3.0 — Enterprise)
| Feature | Description |
|---|---|
| SSO / SAML | Enterprise auth integration |
| Audit log API | Queryable compliance audit trail |
| Signed reports | Digitally signed compliance documents |
| Stripe billing | Self-serve plan management |
| SLA | 99.9% uptime guarantee |
Out of Scope (Not Building)
| What | Why |
|---|---|
| Training data management | Out of domain — KindLM tests inference, not training |
| Prompt engineering / optimization | KindLM tests prompts, doesn't write them |
| Model fine-tuning | Not a training tool |
| Full GRC platform | KindLM generates test artifacts, not policies or risk assessments |
| Real-time monitoring / observability | KindLM runs discrete test suites, not continuous monitoring |
| A/B testing in production | KindLM tests in pre-deployment or CI, not in production traffic |
| Visual UI for writing tests | CLI + YAML is the interface. No drag-and-drop test builder. |
5. Success Metrics
North Star Metric
Weekly active test runs (opt-in anonymous telemetry). A test run = one execution of kindlm test. This measures real adoption, not vanity metrics.
Leading Indicators
| Metric | Month 1 | Month 3 | Month 6 |
|---|---|---|---|
| GitHub stars | 500 | 2,000 | 5,000 |
| npm weekly downloads | 100 | 500 | 2,000 |
| Blog monthly visitors | 1,000 | 5,000 | 20,000 |
| YouTube subscribers | 100 | 500 | 2,000 |
| Twitter/X followers | 300 | 1,000 | 3,000 |
| Discord/community members | 50 | 200 | 500 |
Revenue Metrics (Post-Cloud Launch)
| Metric | Month 6 | Month 12 |
|---|---|---|
| Free CLI users | 5,000 | 15,000 |
| Team plan subscribers | 100 | 300 |
| Enterprise subscribers | 10 | 30 |
| MRR | $7,890 | $23,670 |
Quality Metrics
| Metric | Target |
|---|---|
| Time to first test (new user) | < 10 minutes |
| CLI cold start time | < 2 seconds |
| Test execution overhead (vs raw API call) | < 15% |
| Issue response time | < 24 hours |
| PR review turnaround | < 48 hours |
6. User Journeys
Journey 1: First Test (AI Engineer)
1. Finds KindLM via GitHub search for "AI agent testing" or a Hacker News post
2. Reads README — sees YAML config example, recognizes testing patterns
3. Runs `npm i -g @kindlm/cli && kindlm init`
4. Edits `kindlm.yaml` — points it at their existing system prompt, writes 2–3 test cases
5. Runs `kindlm test` — sees green/red terminal output in 30 seconds
6. Adds `kindlm test --reporter junit` to their GitHub Actions workflow (see the workflow sketch below)
7. Commits; PR passes CI with the KindLM check — done
Success criteria: Steps 2–7 complete in under 10 minutes. No account needed. No API key for KindLM.
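A minimal GitHub Actions job for step 6 might look like the sketch below. The package name and CLI flags come from this document; the workflow layout and the secret name are assumptions.

```yaml
# Illustrative sketch — workflow layout and secret name are assumptions.
name: kindlm
on: [pull_request]
jobs:
  agent-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm i -g @kindlm/cli
      # JUnit XML drops into existing CI reporting; a non-zero exit fails the PR.
      - run: kindlm test --reporter junit
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```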
Journey 2: Compliance Report (CTO)
1. An engineer on their team is already using KindLM for testing
2. CTO learns about the `--compliance` flag from a blog post or LinkedIn article
3. Engineer adds a `compliance` section to the existing `kindlm.yaml` (see the sketch below)
4. Runs `kindlm test --compliance` — gets a markdown report mapping tests to EU AI Act articles
5. CTO shares the report with the legal team — legal says "this covers most of Annex IV, we need it stored and signed"
6. Team signs up for KindLM Cloud Enterprise ($299/mo)
7. `kindlm upload` sends reports to Cloud; the compliance officer gets dashboard access
Success criteria: Steps 3–4 add < 5 minutes to existing workflow. Report is useful to legal without modification.
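A hedged sketch of the `compliance` section from step 3. Every field name here is an assumption; only the EU AI Act / Annex IV framing comes from this document.

```yaml
# Illustrative sketch — all field names are assumptions, not a finalized schema.
compliance:
  framework: eu-ai-act            # target regulation for the generated report
  risk_category: high             # Annex III classification asserted by the team
  system_name: support-agent
  output: compliance/annex-iv.md  # where kindlm test --compliance writes the report
```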
Journey 3: CI Pipeline (QA Lead)
1. QA lead evaluates KindLM alongside Promptfoo and Braintrust
2. Reads the assertions doc — sees tool call assertions (unique to KindLM)
3. Writes a test suite for their customer support agent (10 test cases; see the sketch below)
4. Sets up a GitHub Actions job: `kindlm test --reporter junit --gate 90`
5. PR fails because the agent stopped calling `lookup_order` — exactly the kind of bug they were looking for
6. Team adopts KindLM as the standard AI test framework
7. Upgrades to the Team plan when they want a shared dashboard
Success criteria: Evaluation to adoption in < 1 week. JUnit XML integrates with existing CI reporting.
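A hedged sketch of the test case that would catch the step 5 regression. The assertion keys (`tool_called`, `args`) are illustrative assumptions.

```yaml
# Illustrative sketch — assertion keys are assumptions, not a finalized schema.
tests:
  - name: order-status-triggers-lookup
    input: "Where is my order #8123?"
    assertions:
      - type: tool_called       # fails the run if the agent stops calling the tool
        name: lookup_order
        args:
          order_id: "8123"
```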
7. Competitive Positioning
Positioning Statement
KindLM is the testing framework for AI agents that tests what agents do (tool calls, decisions, behavior), not just what they say (text output). Plus, it generates the compliance documentation your legal team will need by August 2026. One CLI. YAML config. Open source.
Competitive Matrix
| Capability | KindLM | Promptfoo | Braintrust | DeepEval | LangSmith |
|---|---|---|---|---|---|
| Tool call assertions | ✓ | — | — | — | — |
| YAML-first config | ✓ | ✓ | — | — | — |
| LLM-as-judge | ✓ | ✓ | ✓ | ✓ | ✓ |
| Schema validation | ✓ | ✓ | — | — | — |
| PII detection | ✓ | — | — | — | — |
| Drift detection | ✓ | — | — | — | — |
| EU AI Act compliance reports | ✓ | — | — | — | — |
| CI-native (JUnit XML) | ✓ | ✓ | — | — | — |
| Multi-model comparison | ✓ | ✓ | ✓ | ✓ | — |
| Open source CLI | ✓ (MIT) | ✓ (MIT) | — | ✓ (Apache) | — |
| TypeScript-first | ✓ | ✓ | — | — (Python) | — (Python) |
| No account required | ✓ | ✓ | — | ✓ | — |
Key Differentiators
- Tool call assertions — Only KindLM can assert which tools an agent called, with what arguments, in what order. This is the core differentiator.
- Compliance reports — Built-in EU AI Act Annex IV documentation generation. No competitor offers this.
- YAML + CLI simplicity — No SDK to learn. Any engineer can read the config and write tests.
- Open-core with generous free tier — Full CLI forever free. Cloud is additive, not gated.
8. Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| EU AI Act deadline delayed | Low | Medium | Testing value stands without compliance. Pivot messaging to "prepare regardless." |
| Promptfoo adds tool call assertions | Medium | High | Ship compliance features fast — that's harder to copy. Build community moat. |
| AI-generated content detected as "slop" | Medium | Medium | Every piece of marketing shows real terminal output and real code. Authenticity first. |
| Low GitHub adoption (< 200 stars in 6 weeks) | Medium | High | Validate PMF early. If < 200 stars, revisit positioning before investing in Cloud. |
| Cloud costs exceed revenue | Low | Medium | Cloudflare Workers + D1 is near-zero cost at low scale. Don't build Cloud until CLI has traction. |
| Provider API changes break adapters | Medium | Low | Adapter pattern isolates changes. Community can contribute fixes. |
| Open-source fork competes | Low | Medium | Stay ahead on compliance features. Build brand and community trust. AGPL on cloud prevents easy SaaS forks. |
9. Launch Plan
Phase 1: MVP Launch (Weeks 1–6)
- Ship CLI with all Phase 1 features
- GitHub repo public, npm published
- 5 blog posts, 3 YouTube tutorials
- Show HN post
- Target: 500 GitHub stars, 100 npm weekly downloads
Phase 2: Cloud Beta (Weeks 7–12)
- Cloud dashboard MVP (test history, trends)
- Invite 20 beta users from CLI community
- Product Hunt launch (week 10)
- Compliance content push (EU AI Act deadline approaching)
- Target: 2,000 GitHub stars, 50 Cloud signups
Phase 3: General Availability (Weeks 13–18)
- Cloud GA with billing (Stripe)
- Enterprise features (SSO, audit log, signed reports)
- Conference talks (local Prague/Berlin events)
- Target: 5,000 GitHub stars, 100 paying teams
10. Open Questions
| Question | Owner | Deadline |
|---|---|---|
| Should we support custom assertion plugins (user-defined)? | Petr | Before v1.0 |
| What's the right default for runs — 1 or 3? | Petr | Before v1.0 |
| Should compliance reports be a separate CLI command or always a flag? | Petr | Before v1.0 |
| Is AGPL the right license for Cloud, or BSL (Business Source License)? | Petr | Before Cloud launch |
| Should we offer annual pricing (discount)? | Petr | Before Cloud GA |
| Do we need a Discord or is GitHub Discussions sufficient? | Petr | Week 2 |