Testing for AI agents

Know what your agent will do before your users do

Regression tests for agentic workflows — tool calls, output quality, and compliance. Defined in YAML, run in CI.

Read the docs

Open source · MIT · No account needed

OpenAI · Anthropic · Gemini · Mistral · Cohere · Ollama

The problem

You deploy a prompt change on Friday. Monday, your agent approves refunds it shouldn't — it stopped calling lookup_order and started hallucinating.

No errors. No alerts. The output looked fine. The behavior was wrong.

KindLM catches this before it ships.

How it works

Describe what should happen. Run it.

You write this

- name: "refund-happy-path"
  input: "Charged twice for order #12345"
  assert:
    - type: tool_called
      value: lookup_order
      args: { order_id: "12345" }
    - type: tool_not_called
      value: escalate_to_human
    - type: judge
      criteria: "Empathetic tone"
      threshold: 0.8
    - type: no_pii

KindLM checks this

TOOL    lookup_order called correctly
TOOL    escalate_to_human not called
JUDGE   Empathetic tone · 0.92
PII     No personal data leaked
DRIFT   0.04 from baseline
COST    $0.003 per execution
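
Then run it. A sketch of a run; the test file name and the output format are illustrative assumptions, only the kindlm test command itself appears on this page:

$ kindlm test refunds.yaml     # hypothetical file name; output below is illustrative
✓ refund-happy-path · 6/6 assertions passed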

Capabilities

Every tool call. Every argument.

Assert which tools were called, in what order, with what arguments. Define tools that must never be called. Test the decisions, not just the output.

assert:
  # Look up the order first
  - type: tool_called
    value: lookup_order
    args: { order_id: "12345" }
  # Then issue the refund
  - type: tool_called
    value: issue_refund
  # Never escalate routine cases
  - type: tool_not_called
    value: escalate_to_human

Quality you can measure.

An LLM judge scores each criterion from 0 to 1 — and explains why. Set thresholds. When a score drops, you know exactly which criterion failed.

Empathetic, addresses the issue · 0.92
No promises about timeline · 0.71
Correct company terminology · 0.88
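
In YAML, that can be one judge assertion per criterion, each with its own threshold. A sketch using the syntax from the example above; the thresholds are illustrative:

assert:
  - type: judge
    criteria: "Empathetic, addresses the issue"
    threshold: 0.8
  - type: judge
    criteria: "No promises about timeline"
    threshold: 0.7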

See exactly what changed.

Save a baseline. Run again after any change. KindLM compares semantically — not string diffs. Cost, latency, and quality tracked together.

vs. baseline · Feb 10

Pass rate   87.5%   −12.5%
Judge avg   0.89    −0.02
Drift       0.04
Latency     940ms   −40ms
Cost        $0.12   +9%
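
A plausible workflow, sketched from the description above; the --save-baseline and --baseline flags are assumptions, not documented options:

$ kindlm test --save-baseline            # hypothetical flag: record today's behavior
$ kindlm test --baseline baseline.json   # hypothetical flag: diff a later run against it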

More

PII detection

SSNs, credit cards, emails. Custom patterns. Zero tolerance by default.
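
As YAML, a custom pattern might look like this; only the bare no_pii type is shown above, so the patterns key is an assumption:

assert:
  - type: no_pii
    # hypothetical key: add a custom detector on top of the built-ins
    patterns:
      - name: employee_id
        regex: "EMP-[0-9]{6}"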

Schema validation

JSON Schema on structured outputs. Every run, automatically.
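
A sketch of what that could look like; the schema assertion type and its shape are assumptions:

assert:
  - type: schema          # hypothetical assertion type
    value:
      type: object
      required: [order_id, refund_amount]
      properties:
        order_id: { type: string }
        refund_amount: { type: number }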

Keyword guardrails

Words your agent must never say. Phrases it must include.
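
In YAML, that might read as follows; the contains and not_contains type names are guesses:

assert:
  - type: not_contains    # hypothetical type: words the agent must never say
    value: "legal advice"
  - type: contains        # hypothetical type: phrases it must include
    value: "refund policy"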

Multi-model

Same tests against OpenAI, Anthropic, Gemini, Mistral, Cohere, and Ollama. Compare quality, cost, latency.
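
One way this could be configured; the providers key and the model identifier format are assumptions:

# hypothetical top-level config: one test suite, several providers
providers:
  - openai/gpt-4o
  - anthropic/claude-sonnet-4
  - ollama/llama3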

CI-native

JUnit XML, JSON, exit codes. GitHub Actions and GitLab CI ready.
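
A GitHub Actions step, sketched; the report flag is an assumption, the exit-code behavior is what CI-native implies:

# .github/workflows/agent-tests.yml (sketch)
- name: Agent regression tests
  run: kindlm test --report junit   # hypothetical flag; nonzero exit fails the job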

No SDK

YAML config, CLI execution. Any engineer can read and contribute.

EU AI Act · August 2026

An auditor asks for test records. You have them.

Add --compliance to any run. Annex IV–mapped docs, timestamped and hashed.

$ kindlm test --compliance
→ compliance-2026-02-15.md
  SHA-256: a1b2c3...e5f6

Three lines of YAML. One command.

Open source. No account required.

Read the docs