Testing for AI agents

Know what your agent will do before your users do

Regression tests for agentic workflows — tool calls, output quality, and compliance. Defined in YAML, run in CI.

Read the docs

Open source · MIT · No account needed

OpenAI · Anthropic · Gemini · Mistral · Cohere · Ollama

The problem

You deploy a prompt change on Friday. Monday, your agent approves refunds it shouldn't — it stopped calling lookup_order and started hallucinating.

No errors. No alerts. The output looked fine. The behavior was wrong.

KindLM catches this before it ships.

How it works

Describe what should happen. Run it.

You write this

- name: "refund-happy-path"
  input: "Charged twice for order #12345"
  assert:
    - type: tool_called
      value: lookup_order
      args: { order_id: "12345" }
    - type: tool_not_called
      value: escalate_to_human
    - type: judge
      criteria: "Empathetic tone"
      threshold: 0.8
    - type: no_pii

KindLM checks this

TOOL    lookup_order called correctly
TOOL    escalate_to_human not called
JUDGE   Empathetic tone · 0.92
PII     No personal data leaked
DRIFT   0.04 from baseline
COST    $0.003 per execution
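
Then run it. A sketch of a run; the test file name and the output format are illustrative assumptions, only the kindlm test command itself appears on this page:

$ kindlm test refunds.yaml     # hypothetical file name; output below is illustrative
✓ refund-happy-path · 6/6 assertions passed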

Capabilities

Every tool call. Every argument.

Assert which tools were called, in what order, with what arguments. Define tools that must never be called. Test the decisions, not just the output.

assert:
  # Look up the order first
  - type: tool_called
    value: lookup_order
    args: { order_id: "12345" }
  # Then issue the refund
  - type: tool_called
    value: issue_refund
  # Never escalate routine cases
  - type: tool_not_called
    value: escalate_to_human

Quality you can measure.

An LLM judge scores each criterion from 0 to 1 — and explains why. Set thresholds. When a score drops, you know exactly which criterion failed.

Empathetic, addresses the issue · 0.92
No promises about timeline · 0.71
Correct company terminology · 0.88
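
In YAML, that can be one judge assertion per criterion, each with its own threshold. A sketch using the syntax from the example above; the thresholds are illustrative:

assert:
  - type: judge
    criteria: "Empathetic, addresses the issue"
    threshold: 0.8
  - type: judge
    criteria: "No promises about timeline"
    threshold: 0.7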

See exactly what changed.

Save a baseline. Run again after any change. KindLM compares semantically — not string diffs. Cost, latency, and quality tracked together.

vs. baseline · Feb 10

Pass rate   87.5%   −12.5%
Judge avg   0.89    −0.02
Drift       0.04
Latency     940ms   −40ms
Cost        $0.12   +9%
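
A plausible workflow, sketched from the description above; the --save-baseline and --baseline flags are assumptions, not documented options:

$ kindlm test --save-baseline            # hypothetical flag: record today's behavior
$ kindlm test --baseline baseline.json   # hypothetical flag: diff a later run against it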

More

PII detection

SSNs, credit cards, emails. Custom patterns. Zero tolerance by default.
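
As YAML, a custom pattern might look like this; only the bare no_pii type is shown above, so the patterns key is an assumption:

assert:
  - type: no_pii
    # hypothetical key: add a custom detector on top of the built-ins
    patterns:
      - name: employee_id
        regex: "EMP-[0-9]{6}"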

Schema validation

JSON Schema on structured outputs. Every run, automatically.
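
A sketch of what that could look like; the schema assertion type and its shape are assumptions:

assert:
  - type: schema          # hypothetical assertion type
    value:
      type: object
      required: [order_id, refund_amount]
      properties:
        order_id: { type: string }
        refund_amount: { type: number }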

Keyword guardrails

Words your agent must never say. Phrases it must include.
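
In YAML, that might read as follows; the contains and not_contains type names are guesses:

assert:
  - type: not_contains    # hypothetical type: words the agent must never say
    value: "legal advice"
  - type: contains        # hypothetical type: phrases it must include
    value: "refund policy"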

Multi-model

Same tests against OpenAI, Anthropic, Gemini, Mistral, Cohere, and Ollama. Compare quality, cost, latency.
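
One way this could be configured; the providers key and the model identifier format are assumptions:

# hypothetical top-level config: one test suite, several providers
providers:
  - openai/gpt-4o
  - anthropic/claude-sonnet-4
  - ollama/llama3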

CI-native

JUnit XML, JSON, exit codes. GitHub Actions and GitLab CI ready.
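
A GitHub Actions step, sketched; the report flag is an assumption, the exit-code behavior is what CI-native implies:

# .github/workflows/agent-tests.yml (sketch)
- name: Agent regression tests
  run: kindlm test --report junit   # hypothetical flag; nonzero exit fails the job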

No SDK

YAML config, CLI execution. Any engineer can read and contribute.

EU AI Act · August 2026

An auditor asks for test records. You have them.

Add --compliance to any run. Annex IV–mapped docs, timestamped and hashed.

$ kindlm test --compliance
→ compliance-2026-02-15.md
  SHA-256: a1b2c3...e5f6

Three lines of YAML. One command.

Open source. No account required.

Read the docs