Testing for AI agents
Know what your agent will do before your users do
Regression tests for agentic workflows — tool calls, output quality, and compliance. Defined in YAML, run in CI.
Open source · MIT · No account needed
The problem
You deploy a prompt change on Friday. Monday, your agent approves refunds it shouldn't — it stopped calling lookup_order and started hallucinating.
No errors. No alerts. The output looked fine. The behavior was wrong.
KindLM catches this before it ships.
How it works
Describe what should happen. Run it.
You write the expectations in YAML; KindLM runs your agent and checks every assertion on every run.
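A minimal sketch of what such a test file could look like. The key names below (`tests`, `prompt`, `expect_tools`, `never_call`, `quality`) are illustrative assumptions, not KindLM's documented schema:

```yaml
# refund_agent.test.yaml (illustrative: every key name here is an assumption)
tests:
  - name: refund requires order lookup
    prompt: "I'd like a refund for order 18423, it arrived damaged."
    expect_tools:
      - name: lookup_order            # must be called, with the right arguments
        args:
          order_id: "18423"
    never_call:
      - issue_refund_without_approval
    quality:
      - criterion: "Confirms the order status before promising a refund"
        min_score: 0.8
```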
Capabilities
Every tool call. Every argument.
Assert which tools were called, in what order, with what arguments. Define tools that must never be called. Test the decisions, not just the output.
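A tool-call assertion block might look roughly like this; keys such as `order`, `expected`, and `forbidden` are assumptions used to illustrate the idea:

```yaml
# Tool-call assertions (sketch, not KindLM's documented schema)
tool_calls:
  order: strict                # assumed option: calls must happen in this sequence
  expected:
    - name: lookup_order
      args:
        order_id: "18423"
    - name: issue_refund
      args:
        amount: 42.50
  forbidden:
    - delete_customer          # any call to this tool fails the test
```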
Quality you can measure.
An LLM judge scores each criterion from 0 to 1 — and explains why. Set thresholds. When a score drops, you know exactly which criterion failed.
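A criteria block could be expressed along these lines; the key names and the `judge_model` option are assumptions:

```yaml
# Judge criteria (sketch; thresholds are per-criterion scores from 0 to 1)
quality:
  judge_model: gpt-4o          # assumed to be configurable per run
  criteria:
    - name: grounded_in_order_data
      description: "Only states facts returned by lookup_order"
      threshold: 0.8
    - name: policy_compliant_tone
      description: "No unconditional refund promises before policy checks"
      threshold: 0.9
```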
See exactly what changed.
Save a baseline. Run again after any change. KindLM compares semantically — not string diffs. Cost, latency, and quality tracked together.
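Driven from the same config, a baseline block might look like this; every key here is an assumption:

```yaml
# Baseline comparison (sketch)
baseline:
  path: .kindlm/baselines/refund_agent.json   # where a saved baseline might live
  compare: semantic                            # semantic comparison, not string diffs
  track: [cost, latency, quality]              # regressions in any of these are reported
```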
More
PII detection
SSNs, credit cards, emails. Custom patterns. Zero tolerance by default.
Schema validation
JSON Schema on structured outputs. Every run, automatically.
Keyword guardrails
Words your agent must never say. Phrases it must include.
Multi-model
Same tests against OpenAI, Anthropic, Gemini, Mistral, Cohere, and Ollama. Compare quality, cost, latency.
CI-native
JUnit XML, JSON, exit codes. GitHub Actions and GitLab CI ready.
No SDK
YAML config, CLI execution. Any engineer can read and contribute (see the sketch below).
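A combined sketch of how these checks might sit together in one config. All key names, the provider/model identifier format, and the file paths are assumptions:

```yaml
# Guardrails and multi-model runs (illustrative only)
guardrails:
  pii:
    builtin: [ssn, credit_card, email]
    custom:
      - name: internal_ticket_id
        regex: "TCK-[0-9]{6}"
    max_matches: 0                  # zero tolerance by default
  keywords:
    never: ["guaranteed refund"]
    must_include: ["return policy"]
output:
  json_schema: schemas/refund_response.schema.json   # validated on every run
models:                             # same tests, run against each provider
  - openai/gpt-4o
  - anthropic/claude-3-5-sonnet
  - ollama/llama3
```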
EU AI Act · August 2026
An auditor asks for test records. You have them.
Add --compliance to any run. Annex IV–mapped docs, timestamped and hashed.
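In CI, that could look something like the workflow below. The binary name `kindlm` and every flag except `--compliance` are assumptions; only the GitHub Actions boilerplate is standard:

```yaml
# .github/workflows/agent-tests.yml (illustrative)
name: agent-tests
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run KindLM with compliance records
        run: kindlm run tests/refund_agent.test.yaml --compliance
        # assumed: a non-zero exit code fails the job, and the Annex IV-mapped,
        # timestamped and hashed docs land next to the JUnit/JSON reports
```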