KindLM — Testing Strategy

Principle: KindLM is a testing tool. If our own tests are bad, nobody will trust us. Every package has clear testing boundaries, and we mock at provider boundaries — never deeper.


1. Test Pyramid

         ╱╲
        ╱  ╲       E2E Tests (5%)
       ╱    ╲      Real CLI → real API → real output
      ╱──────╲
     ╱        ╲    Integration Tests (25%)
    ╱          ╲   Module boundaries, config → engine → report
   ╱────────────╲
  ╱              ╲  Unit Tests (70%)
 ╱                ╲ Pure functions, parsers, assertion logic
╱──────────────────╲

Target coverage: 90%+ on core, 80%+ on cli, 70%+ on cloud. Measured with vitest --coverage.


2. Framework and Tooling

| Tool | Purpose |
| --- | --- |
| Vitest | Test runner for all packages. Fast, ESM-native, Jest-compatible API. |
| @vitest/coverage-v8 | Code coverage via the V8 engine |
| msw (Mock Service Worker) | HTTP-level mocking for provider API calls |
| miniflare | Local Cloudflare Workers runtime for cloud package tests |
| supertest | HTTP assertions for cloud API integration tests |

Why Vitest (Not Jest)

  • Native ESM support (our codebase is ESM-first)
  • Faster startup (no transform overhead)
  • Compatible API (easy for Jest users)
  • Built-in TypeScript support via esbuild
  • Workspace support for monorepo
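
As a concrete example of the workspace support, a minimal root workspace file (a sketch, assuming each package ships its own vitest.config.ts) can be as small as:

// vitest.workspace.ts (root): sketch, assumes per-package vitest.config.ts files
import { defineWorkspace } from 'vitest/config';

export default defineWorkspace(['packages/*']);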

3. Testing by Package

3.1 packages/core — Unit + Integration

Core is pure logic with zero side effects. Every function takes input and returns a Result type. This makes it highly testable.
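
The concrete Result type is internal to core; the pattern it follows is roughly this (names here are illustrative, not the actual KindLM exports):

// Illustrative sketch of the Result pattern; not the actual core types.
export type Result<T, E = Error> =
  | { ok: true; value: T }
  | { ok: false; error: E };

// Callers branch on `ok` instead of catching exceptions, which keeps
// parsers and assertion evaluators pure and easy to unit test.
export function parseThreshold(raw: string): Result<number> {
  const n = Number(raw);
  return Number.isFinite(n)
    ? { ok: true, value: n }
    : { ok: false, error: new Error(`not a number: ${raw}`) };
}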

Unit tests (packages/core/src/__tests__/):

| Module | What to test | Example |
| --- | --- | --- |
| config/parser.ts | Valid YAML → typed config, invalid YAML → specific errors | "unknown assertion type 'tool_caled'" error |
| config/schema.ts | Zod validation edge cases | Optional fields, nested defaults, string coercion |
| assertions/tool-called.ts | Match/no-match with various arg patterns | Partial args, nested args, wildcards, wrong tool |
| assertions/tool-order.ts | Sequence matching with extras allowed | [A, B, C] matches [A, X, B, C] |
| assertions/schema.ts | AJV validation with various schemas | Valid JSON, invalid JSON, non-JSON response, $ref |
| assertions/judge.ts | Score parsing, threshold comparison | Score 0.8 vs threshold 0.7 = pass |
| assertions/no-pii.ts | Regex detection per PII type | SSN, CC, email, phone, IBAN, custom patterns |
| assertions/keywords.ts | Present/absent, case sensitivity, regex | Substring match, whole word, regex pattern |
| assertions/drift.ts | Similarity scoring, threshold comparison | Identical = 1.0, different = low score |
| assertions/latency.ts | Threshold comparison | 500ms vs 1000ms limit = pass |
| assertions/cost.ts | Token cost calculation | 1000 tokens × $0.01/1K = $0.01 |
| providers/registry.ts | Provider string parsing, lookup | "openai:gpt-4o" → OpenAI adapter |
| engine/runner.ts | Test execution flow, multi-run aggregation | 3 runs, 2 pass = 66.7% rate |
| engine/gate.ts | Pass/fail gate evaluation | 92% vs 90% threshold = pass |
| reporters/terminal.ts | Output formatting | Colors, alignment, summary line |
| reporters/json.ts | Complete report structure | All fields present, valid JSON |
| reporters/junit.ts | Valid JUnit XML | Parseable by CI systems |
| reporters/compliance.ts | Annex IV section mapping | Each assertion type maps to correct article |
| baseline/manager.ts | Save, load, compare, list | File I/O, JSON round-trip, diff calculation |
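
A representative unit test for the tool-called assertion. The evaluateToolCalled name, its argument shape, and the result shape are assumptions for illustration:

// packages/core/src/__tests__/assertions/tool-called.test.ts (sketch)
// `evaluateToolCalled` and its signature are assumed for illustration.
import { describe, expect, it } from 'vitest';
import { evaluateToolCalled } from '../../assertions/tool-called';

describe('assertion.tool-called', () => {
  it('passes-when-tool-matches', () => {
    const result = evaluateToolCalled(
      { tool: 'lookup_order', argsMatch: { order_id: '123' } },
      [{ name: 'lookup_order', args: { order_id: '123', expand: true } }], // partial args still match
    );
    expect(result.pass).toBe(true);
  });

  it('fails-when-tool-not-invoked', () => {
    const result = evaluateToolCalled({ tool: 'lookup_order' }, []);
    expect(result.pass).toBe(false);
  });
});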

Integration tests (packages/core/src/__tests__/integration/):

| Test | What it covers |
| --- | --- |
| config-to-engine.test.ts | Parse YAML config → create engine → run with mock provider → get results |
| multi-provider.test.ts | Same suite against two mock providers → comparison report |
| compliance-flow.test.ts | Config with compliance section → run → compliance markdown generated |
| baseline-flow.test.ts | Run → save baseline → run again → compare → drift detected |
| multi-turn-tools.test.ts | Multi-turn tool simulation: prompt → tool call → sim response → final answer |
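
As an example, config-to-engine.test.ts roughly follows this shape. parseConfig, createEngine, the report fields, and the import paths are assumptions; the msw handlers come from section 4:

// packages/core/src/__tests__/integration/config-to-engine.test.ts (sketch)
// `parseConfig`, `createEngine`, and the report shape are assumptions.
import { readFileSync } from 'node:fs';
import { afterAll, beforeAll, expect, it } from 'vitest';
import { setupServer } from 'msw/node';
import { openaiHandlers } from '../../../test-utils/handlers';
import { parseConfig } from '../../config/parser';
import { createEngine } from '../../engine/runner';

// Mock the provider at the HTTP boundary only (see section 4).
const server = setupServer(...openaiHandlers);
beforeAll(() => server.listen({ onUnhandledRequest: 'error' }));
afterAll(() => server.close());

it('engine.runner.runs-parsed-config-against-mock-provider', async () => {
  const parsed = parseConfig(readFileSync('fixtures/configs/minimal.yaml', 'utf8'));
  if (!parsed.ok) throw parsed.error;

  const report = await createEngine(parsed.value).run();
  expect(report.summary.total).toBeGreaterThan(0);
});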

3.2 packages/cli — Integration + Snapshot

CLI tests verify that commands produce correct output and exit codes. We test the CLI as a user would experience it.

Integration tests (packages/cli/src/__tests__/):

| Test | How |
| --- | --- |
| init.test.ts | Run kindlm init in temp dir → verify kindlm.yaml created and valid |
| validate-valid.test.ts | Run kindlm validate on valid config → exit 0, lists suites |
| validate-invalid.test.ts | Run kindlm validate on broken config → exit 1, shows line-specific error |
| test-pass.test.ts | Run kindlm test with mock provider → exit 0, output contains "PASS" |
| test-fail.test.ts | Run kindlm test with failing assertions → exit 1, output contains failure details |
| test-filter.test.ts | Run kindlm test -s suite-name → only that suite executes |
| reporters.test.ts | Run with --reporter json → valid JSON file; --reporter junit → valid XML |
| baseline-commands.test.ts | baseline set → baseline list → baseline compare flow |

Snapshot tests:

Terminal output snapshots for consistent formatting. Update with vitest --update when output intentionally changes.

it('shows pass summary', async () => {
  const output = await runCLI(['test', '-c', 'fixtures/passing.yaml']);
  expect(output.stdout).toMatchSnapshot();
  expect(output.exitCode).toBe(0);
});

CLI test helper:

// test-utils/run-cli.ts
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
const execFileAsync = promisify(execFile);

// Spawns the built CLI as a child process with mock env vars, captures
// stdout/stderr, returns the exit code. Entry path and env are project-specific.
export async function runCLI(args: string[]): Promise<{
  stdout: string;
  stderr: string;
  exitCode: number;
}> {
  try {
    const { stdout, stderr } = await execFileAsync('node', ['dist/cli.js', ...args], {
      env: { ...process.env, OPENAI_API_KEY: 'test-key' },
    });
    return { stdout, stderr, exitCode: 0 };
  } catch (err: any) {
    return { stdout: err.stdout ?? '', stderr: err.stderr ?? '', exitCode: err.code ?? 1 };
  }
}

3.2.1 Scenario Testing

Purpose: Simulate the diverse usage patterns that real users create — config variety, workflow diversity, provider response shapes, and edge cases. These are not load tests; they verify that the CLI handles the full spectrum of valid (and invalid) inputs gracefully.

File: packages/cli/tests/integration/scenarios.test.ts

Categories (~47 tests in 5 describe blocks):

| Category | Count | What it covers |
| --- | --- | --- |
| Config Diversity | ~14 | Minimal config, all assertion types combined, multi-provider, unicode names, large vars, YAML comments, skip, repeat, many tests, every optional field, invalid configs |
| Workflow Diversity | ~8 | --reporter json/junit, --gate, --runs, --compliance, validate command on valid/broken configs |
| Provider Response Diversity | ~10 | Tool calls, shouldNotCall, multi-tool, PII detection, keyword allow/deny, empty response, large response |
| Edge Cases | ~10 | HTTP 429/500/401 errors, argsMatch partial, maxLength, JSON schema validation, gates in config, long test names, notContains |
| Real-World Patterns | ~5 | Refund agent, support bot, structured output API, multi-turn tool agent, CI pipeline |

How each test works:

  1. Creates a temp directory
  2. Writes a kindlm.yaml config
  3. Starts a mock HTTP server with a custom handler (or uses the default OpenAI handler)
  4. Spawns the real CLI binary as a child process
  5. Asserts exit code and/or stdout content
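
A single scenario, sketched with Node built-ins and the runCLI helper from 3.2; the fixture path, the mock-server wiring, and the stdout assertion are illustrative:

// One scenario from scenarios.test.ts (sketch); fixture path is illustrative.
import { mkdtemp, readFile, writeFile } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { expect, it } from 'vitest';
import { runCLI } from '../../test-utils/run-cli';

it('cli.test-command.passes-on-minimal-config', async () => {
  // Steps 1-2: temp dir + kindlm.yaml (copied from a fixture here)
  const dir = await mkdtemp(join(tmpdir(), 'kindlm-'));
  const config = await readFile('fixtures/configs/minimal.yaml', 'utf8');
  await writeFile(join(dir, 'kindlm.yaml'), config);

  // Steps 3-4: the shared setup starts the mock provider server,
  // then the real CLI binary is spawned against it
  const output = await runCLI(['test', '-c', join(dir, 'kindlm.yaml')]);

  // Step 5: assert exit code and stdout content
  expect(output.exitCode).toBe(0);
  expect(output.stdout).toContain('PASS');
});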

How to run:

# Run all scenario tests
cd packages/cli && npx vitest run tests/integration/scenarios.test.ts

# Run through turbo (includes all CLI tests)
npx turbo run test --filter=@kindlm/cli

3.3 packages/cloud — Integration with Miniflare

Cloud tests run against a local Workers runtime with an in-memory D1 database.

Integration tests (packages/cloud/src/__tests__/):

| Test | What it covers |
| --- | --- |
| auth.test.ts | GitHub OAuth flow mock, token generation, token verification |
| projects-crud.test.ts | Create, list, get, delete projects |
| runs-upload.test.ts | Upload JSON report → stored in D1 → retrievable |
| runs-list.test.ts | Pagination, branch filter, date range filter |
| runs-compare.test.ts | Compare two runs → deltas calculated correctly |
| baselines.test.ts | Set baseline from run, list baselines, compare against baseline |
| plan-gating.test.ts | Free plan hits project limit → 403; Team plan allows 5 projects |
| rate-limiting.test.ts | Exceed rate limit → 429 with Retry-After header |
| audit-log.test.ts | Actions create audit entries; query with filters; Enterprise-only gate |
| data-retention.test.ts | Cron deletes runs older than plan retention period |

Miniflare setup:

// test-utils/worker-env.ts
import { Miniflare } from 'miniflare';

export async function createTestEnv() {
  const mf = new Miniflare({
    modules: true,
    script: '', // loaded from built worker
    d1Databases: ['DB'],
    kvNamespaces: ['RATE_LIMITS'],
  });
  
  // Run migrations
  const db = await mf.getD1Database('DB');
  await db.exec(MIGRATIONS_SQL);
  
  return { mf, db };
}
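
Using the helper in a test looks roughly like this; the route, auth header, and expected status are assumptions about the Worker's API, while dispatchFetch and dispose are standard Miniflare calls:

// Sketch of a cloud integration test on top of createTestEnv().
import { afterAll, beforeAll, expect, it } from 'vitest';
import { createTestEnv } from '../test-utils/worker-env';

let env: Awaited<ReturnType<typeof createTestEnv>>;
beforeAll(async () => { env = await createTestEnv(); });
afterAll(() => env.mf.dispose());

it('cloud.projects.create-returns-201', async () => {
  // dispatchFetch sends the request straight into the local Workers runtime
  const res = await env.mf.dispatchFetch('http://localhost/api/projects', {
    method: 'POST',
    headers: { authorization: 'Bearer test-token', 'content-type': 'application/json' },
    body: JSON.stringify({ name: 'demo' }),
  });
  expect(res.status).toBe(201);
});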

4. Mocking Strategy

Rule: Mock at the HTTP boundary, never deeper.

Provider adapters make HTTP requests. We mock the HTTP responses, not the adapter internals. This means our tests verify that our code correctly handles real-shaped API responses.

msw handlers for providers:

// test-utils/handlers.ts
import { http, HttpResponse } from 'msw';

export const openaiHandlers = [
  http.post('https://api.openai.com/v1/chat/completions', () => {
    return HttpResponse.json({
      choices: [{
        message: {
          role: 'assistant',
          content: 'Test response',
          tool_calls: [{
            id: 'call_1',
            type: 'function',
            function: {
              name: 'lookup_order',
              arguments: '{"order_id": "123"}'
            }
          }]
        }
      }],
      usage: { prompt_tokens: 100, completion_tokens: 50, total_tokens: 150 }
    });
  })
];

export const anthropicHandlers = [
  http.post('https://api.anthropic.com/v1/messages', () => {
    return HttpResponse.json({
      content: [
        { type: 'text', text: 'Test response' },
        { type: 'tool_use', id: 'tu_1', name: 'lookup_order', input: { order_id: '123' } }
      ],
      usage: { input_tokens: 100, output_tokens: 50 }
    });
  })
];
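
These handlers are registered once per test run through msw's Node server, typically from a shared Vitest setup file (the file name here is an assumption; the msw calls are its documented API):

// test-utils/setup-msw.ts (registered via `test.setupFiles` in vitest.config.ts)
import { afterAll, afterEach, beforeAll } from 'vitest';
import { setupServer } from 'msw/node';
import { anthropicHandlers, openaiHandlers } from './handlers';

export const server = setupServer(...openaiHandlers, ...anthropicHandlers);

// Fail loudly if a test makes an HTTP call we did not explicitly mock.
beforeAll(() => server.listen({ onUnhandledRequest: 'error' }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());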

What we mock vs what we don't:

| Layer | Mock? | Why |
| --- | --- | --- |
| Provider HTTP APIs (OpenAI, Anthropic, Ollama) | Yes (msw) | Expensive, rate-limited, non-deterministic |
| LLM-as-judge calls | Yes (msw) | Same reasons — return fixed scores in tests |
| File system (baselines, reports) | No — use temp dirs | File I/O is fast and deterministic |
| YAML parsing | No | Pure function, deterministic |
| Zod validation | No | Pure function, deterministic |
| AJV schema validation | No | Pure function, deterministic |
| D1 database (cloud) | No — use Miniflare in-memory D1 | Tests real SQL queries |
| GitHub OAuth | Yes (msw) | External service |
| Stripe | Yes (msw) | External service |

5. Test Fixtures

All test fixtures live in packages/*/src/__tests__/fixtures/.

Config fixtures:

fixtures/
├── configs/
│   ├── minimal.yaml          # Bare minimum valid config
│   ├── full-featured.yaml    # Every option used
│   ├── multi-provider.yaml   # Multiple models
│   ├── compliance.yaml       # With compliance section
│   ├── invalid-syntax.yaml   # YAML parse error
│   ├── invalid-schema.yaml   # Valid YAML, fails Zod
│   └── missing-provider.yaml # References non-existent provider
├── schemas/
│   ├── order-response.json   # For schema assertion tests
│   └── ticket-response.json
├── baselines/
│   └── sample-baseline.json  # Stored baseline for compare tests
└── reports/
    ├── passing-report.json   # Full passing report
    └── failing-report.json   # Report with failures

Provider response fixtures:

fixtures/providers/
├── openai/
│   ├── text-response.json       # Simple text completion
│   ├── tool-call-response.json  # Response with tool calls
│   ├── multi-tool-response.json # Multiple tool calls
│   ├── error-429.json           # Rate limit error
│   └── error-500.json           # Server error
├── anthropic/
│   ├── text-response.json
│   ├── tool-use-response.json
│   └── error-overloaded.json
└── ollama/
    ├── text-response.json
    └── tool-call-response.json

6. CI Pipeline Testing

Tests run on every PR and every push to main.

# .github/workflows/test.yml
name: Test
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node: [20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npx turbo test -- --coverage
      - uses: codecov/codecov-action@v4
        with:
          files: packages/*/coverage/lcov.info

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx turbo lint
      - run: npx turbo typecheck

E2E tests (nightly):

Run against real provider APIs with a dedicated test API key. Small budget ($5/day cap). Tests a curated subset of 5 test cases against live OpenAI and Anthropic APIs.

# .github/workflows/e2e-nightly.yml
name: E2E Nightly
on:
  schedule:
    - cron: '0 3 * * *'  # 3 AM UTC daily

jobs:
  e2e:
    runs-on: ubuntu-latest
    environment: e2e
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci && npx turbo build
      - run: npx vitest run --project e2e
        env:
          OPENAI_API_KEY: ${{ secrets.E2E_OPENAI_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.E2E_ANTHROPIC_KEY }}

7. Coverage Requirements

| Package | Minimum Coverage | Enforced |
| --- | --- | --- |
| core | 90% lines, 85% branches | CI blocks merge below threshold |
| cli | 80% lines | CI warns, doesn't block |
| cloud | 70% lines | CI warns, doesn't block |

// vitest.config.ts (root)
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      thresholds: {
        'packages/core/src/**': { lines: 90, branches: 85 },
        'packages/cli/src/**': { lines: 80 },
        'packages/cloud/src/**': { lines: 70 },
      },
    },
  },
});

8. Test Naming Convention

[module].[behavior].[expected result]

Examples:
  config.parser.returns-error-on-invalid-yaml
  assertion.tool-called.passes-when-tool-matches
  assertion.tool-called.fails-when-tool-not-invoked
  assertion.no-pii.detects-credit-card-number
  engine.runner.aggregates-3-runs-correctly
  cli.test-command.exits-1-on-failure
  cloud.runs.upload-creates-run-in-d1
  cloud.plan-gate.blocks-6th-project-on-team-plan

Files: [module].test.ts in __tests__/ directory adjacent to source.


9. Performance Benchmarks

Track in CI to catch regressions:

| Operation | Target | How measured |
| --- | --- | --- |
| Config parse (full-featured.yaml) | < 50ms | performance.now() in test |
| Single assertion evaluation | < 5ms | Benchmark suite |
| Report generation (50 tests) | < 100ms | Benchmark suite |
| CLI cold start (kindlm --help) | < 500ms | Wall clock in CI |

Benchmarks run weekly in a separate CI job. A regression is any operation measuring more than 20% slower than the stored baseline.
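
The benchmark-suite rows above can be written with Vitest's built-in bench API; a sketch follows, where the evaluated function, its inputs, and the import path are assumptions:

// benchmarks/assertions.bench.ts (sketch); run with `vitest bench`.
import { bench, describe } from 'vitest';
import { evaluateToolCalled } from '@kindlm/core/assertions/tool-called';

describe('single assertion evaluation (< 5ms target)', () => {
  bench('tool-called with partial args', () => {
    evaluateToolCalled(
      { tool: 'lookup_order', argsMatch: { order_id: '123' } },
      [{ name: 'lookup_order', args: { order_id: '123', expand: true } }],
    );
  });
});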