KindLM — Testing Strategy

Principle: KindLM is a testing tool. If our own tests are bad, nobody will trust us. Every package has clear testing boundaries, and we mock at provider boundaries — never deeper.


1. Test Pyramid

         ╱╲
        ╱  ╲       E2E Tests (5%)
       ╱    ╲      Real CLI → real API → real output
      ╱──────╲
     ╱        ╲    Integration Tests (25%)
    ╱          ╲   Module boundaries, config → engine → report
   ╱────────────╲
  ╱              ╲  Unit Tests (70%)
 ╱                ╲ Pure functions, parsers, assertion logic
╱──────────────────╲

Target coverage: 90%+ on core, 80%+ on cli, 70%+ on cloud. Measured with vitest --coverage.


2. Framework and Tooling

| Tool | Purpose |
| --- | --- |
| Vitest | Test runner for all packages. Fast, ESM-native, Jest-compatible API. |
| @vitest/coverage-v8 | Code coverage via the V8 engine |
| msw (Mock Service Worker) | HTTP-level mocking for provider API calls |
| miniflare | Local Cloudflare Workers runtime for cloud package tests |
| supertest | HTTP assertions for cloud API integration tests |

Why Vitest (Not Jest)

  • Native ESM support (our codebase is ESM-first)
  • Faster startup (no transform overhead)
  • Compatible API (easy for Jest users)
  • Built-in TypeScript support via esbuild
  • Workspace support for monorepo
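
As a concrete example of the workspace support, a minimal root workspace file (a sketch, assuming each package ships its own vitest.config.ts) can be as small as:

// vitest.workspace.ts (root): sketch, assumes per-package vitest.config.ts files
import { defineWorkspace } from 'vitest/config';

export default defineWorkspace(['packages/*']);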

3. Testing by Package

3.1 packages/core — Unit + Integration

Core is pure logic with zero side effects. Every function takes input and returns a Result type. This makes it highly testable.
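
The concrete Result type is internal to core; the pattern it follows is roughly this (names here are illustrative, not the actual KindLM exports):

// Illustrative sketch of the Result pattern; not the actual core types.
export type Result<T, E = Error> =
  | { ok: true; value: T }
  | { ok: false; error: E };

// Callers branch on `ok` instead of catching exceptions, which keeps
// parsers and assertion evaluators pure and easy to unit test.
export function parseThreshold(raw: string): Result<number> {
  const n = Number(raw);
  return Number.isFinite(n)
    ? { ok: true, value: n }
    : { ok: false, error: new Error(`not a number: ${raw}`) };
}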

Unit tests (packages/core/src/__tests__/):

| Module | What to test | Example |
| --- | --- | --- |
| config/parser.ts | Valid YAML → typed config, invalid YAML → specific errors | "unknown assertion type 'tool_caled'" error |
| config/schema.ts | Zod validation edge cases | Optional fields, nested defaults, string coercion |
| assertions/tool-called.ts | Match/no-match with various arg patterns | Partial args, nested args, wildcards, wrong tool |
| assertions/tool-order.ts | Sequence matching with extras allowed | [A, B, C] matches [A, X, B, C] |
| assertions/schema.ts | AJV validation with various schemas | Valid JSON, invalid JSON, non-JSON response, $ref |
| assertions/judge.ts | Score parsing, threshold comparison | Score 0.8 vs threshold 0.7 = pass |
| assertions/no-pii.ts | Regex detection per PII type | SSN, CC, email, phone, IBAN, custom patterns |
| assertions/keywords.ts | Present/absent, case sensitivity, regex | Substring match, whole word, regex pattern |
| assertions/drift.ts | Similarity scoring, threshold comparison | Identical = 1.0, different = low score |
| assertions/latency.ts | Threshold comparison | 500ms vs 1000ms limit = pass |
| assertions/cost.ts | Token cost calculation | 1000 tokens × $0.01/1K = $0.01 |
| providers/registry.ts | Provider string parsing, lookup | "openai:gpt-4o" → OpenAI adapter |
| engine/runner.ts | Test execution flow, multi-run aggregation | 3 runs, 2 pass = 66.7% rate |
| engine/gate.ts | Pass/fail gate evaluation | 92% vs 90% threshold = pass |
| reporters/terminal.ts | Output formatting | Colors, alignment, summary line |
| reporters/json.ts | Complete report structure | All fields present, valid JSON |
| reporters/junit.ts | Valid JUnit XML | Parseable by CI systems |
| reporters/compliance.ts | Annex IV section mapping | Each assertion type maps to correct article |
| baseline/manager.ts | Save, load, compare, list | File I/O, JSON round-trip, diff calculation |
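
A representative unit test for the tool-called assertion. The evaluateToolCalled name, its argument shape, and the result shape are assumptions for illustration:

// packages/core/src/__tests__/assertions/tool-called.test.ts (sketch)
// `evaluateToolCalled` and its signature are assumed for illustration.
import { describe, expect, it } from 'vitest';
import { evaluateToolCalled } from '../../assertions/tool-called';

describe('assertion.tool-called', () => {
  it('passes-when-tool-matches', () => {
    const result = evaluateToolCalled(
      { tool: 'lookup_order', argsMatch: { order_id: '123' } },
      [{ name: 'lookup_order', args: { order_id: '123', expand: true } }], // partial args still match
    );
    expect(result.pass).toBe(true);
  });

  it('fails-when-tool-not-invoked', () => {
    const result = evaluateToolCalled({ tool: 'lookup_order' }, []);
    expect(result.pass).toBe(false);
  });
});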

Integration tests (packages/core/src/__tests__/integration/):

| Test | What it covers |
| --- | --- |
| config-to-engine.test.ts | Parse YAML config → create engine → run with mock provider → get results |
| multi-provider.test.ts | Same suite against two mock providers → comparison report |
| compliance-flow.test.ts | Config with compliance section → run → compliance markdown generated |
| baseline-flow.test.ts | Run → save baseline → run again → compare → drift detected |
| multi-turn-tools.test.ts | Multi-turn tool simulation: prompt → tool call → sim response → final answer |
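
As an example, config-to-engine.test.ts roughly follows this shape. parseConfig, createEngine, the report fields, and the import paths are assumptions; the msw handlers come from section 4:

// packages/core/src/__tests__/integration/config-to-engine.test.ts (sketch)
// `parseConfig`, `createEngine`, and the report shape are assumptions.
import { readFileSync } from 'node:fs';
import { afterAll, beforeAll, expect, it } from 'vitest';
import { setupServer } from 'msw/node';
import { openaiHandlers } from '../../../test-utils/handlers';
import { parseConfig } from '../../config/parser';
import { createEngine } from '../../engine/runner';

// Mock the provider at the HTTP boundary only (see section 4).
const server = setupServer(...openaiHandlers);
beforeAll(() => server.listen({ onUnhandledRequest: 'error' }));
afterAll(() => server.close());

it('engine.runner.runs-parsed-config-against-mock-provider', async () => {
  const parsed = parseConfig(readFileSync('fixtures/configs/minimal.yaml', 'utf8'));
  if (!parsed.ok) throw parsed.error;

  const report = await createEngine(parsed.value).run();
  expect(report.summary.total).toBeGreaterThan(0);
});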

3.2 packages/cli — Integration + Snapshot

CLI tests verify that commands produce correct output and exit codes. We test the CLI as a user would experience it.

Integration tests (packages/cli/src/__tests__/):

| Test | How |
| --- | --- |
| init.test.ts | Run kindlm init in temp dir → verify kindlm.yaml created and valid |
| validate-valid.test.ts | Run kindlm validate on valid config → exit 0, lists suites |
| validate-invalid.test.ts | Run kindlm validate on broken config → exit 1, shows line-specific error |
| test-pass.test.ts | Run kindlm test with mock provider → exit 0, output contains "PASS" |
| test-fail.test.ts | Run kindlm test with failing assertions → exit 1, output contains failure details |
| test-filter.test.ts | Run kindlm test -s suite-name → only that suite executes |
| reporters.test.ts | Run with --reporter json → valid JSON file; --reporter junit → valid XML |
| baseline-commands.test.ts | baseline set → baseline list → baseline compare flow |

Snapshot tests:

Terminal output snapshots for consistent formatting. Update with vitest --update when output intentionally changes.

it('shows pass summary', async () => {
  const output = await runCLI(['test', '-c', 'fixtures/passing.yaml']);
  expect(output.stdout).toMatchSnapshot();
  expect(output.exitCode).toBe(0);
});

CLI test helper:

// test-utils/run-cli.ts
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
const execFileAsync = promisify(execFile);

// Spawns the built CLI as a child process with mock env vars, captures
// stdout/stderr, returns the exit code. Entry path and env are project-specific.
export async function runCLI(args: string[]): Promise<{
  stdout: string;
  stderr: string;
  exitCode: number;
}> {
  try {
    const { stdout, stderr } = await execFileAsync('node', ['dist/cli.js', ...args], {
      env: { ...process.env, OPENAI_API_KEY: 'test-key' },
    });
    return { stdout, stderr, exitCode: 0 };
  } catch (err: any) {
    return { stdout: err.stdout ?? '', stderr: err.stderr ?? '', exitCode: err.code ?? 1 };
  }
}

3.2.1 Scenario Testing

Purpose: Simulate the diverse usage patterns that real users create — config variety, workflow diversity, provider response shapes, and edge cases. These are not load tests; they verify that the CLI handles the full spectrum of valid (and invalid) inputs gracefully.

File: packages/cli/tests/integration/scenarios.test.ts

Categories (~47 tests in 5 describe blocks):

| Category | Count | What it covers |
| --- | --- | --- |
| Config Diversity | ~14 | Minimal config, all assertion types combined, multi-provider, unicode names, large vars, YAML comments, skip, repeat, many tests, every optional field, invalid configs |
| Workflow Diversity | ~8 | --reporter json/junit, --gate, --runs, --compliance, validate command on valid/broken configs |
| Provider Response Diversity | ~10 | Tool calls, shouldNotCall, multi-tool, PII detection, keyword allow/deny, empty response, large response |
| Edge Cases | ~10 | HTTP 429/500/401 errors, argsMatch partial, maxLength, JSON schema validation, gates in config, long test names, notContains |
| Real-World Patterns | ~5 | Refund agent, support bot, structured output API, multi-turn tool agent, CI pipeline |

How each test works:

  1. Creates a temp directory
  2. Writes a kindlm.yaml config
  3. Starts a mock HTTP server with a custom handler (or uses the default OpenAI handler)
  4. Spawns the real CLI binary as a child process
  5. Asserts exit code and/or stdout content
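
A single scenario, sketched with Node built-ins and the runCLI helper from 3.2; the fixture path, the mock-server wiring, and the stdout assertion are illustrative:

// One scenario from scenarios.test.ts (sketch); fixture path is illustrative.
import { mkdtemp, readFile, writeFile } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { expect, it } from 'vitest';
import { runCLI } from '../../test-utils/run-cli';

it('cli.test-command.passes-on-minimal-config', async () => {
  // Steps 1-2: temp dir + kindlm.yaml (copied from a fixture here)
  const dir = await mkdtemp(join(tmpdir(), 'kindlm-'));
  const config = await readFile('fixtures/configs/minimal.yaml', 'utf8');
  await writeFile(join(dir, 'kindlm.yaml'), config);

  // Steps 3-4: the shared setup starts the mock provider server,
  // then the real CLI binary is spawned against it
  const output = await runCLI(['test', '-c', join(dir, 'kindlm.yaml')]);

  // Step 5: assert exit code and stdout content
  expect(output.exitCode).toBe(0);
  expect(output.stdout).toContain('PASS');
});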

How to run:

# Run all scenario tests
cd packages/cli && npx vitest run tests/integration/scenarios.test.ts

# Run through turbo (includes all CLI tests)
npx turbo run test --filter=@kindlm/cli

3.3 packages/cloud — Integration with Miniflare

Cloud tests run against a local Workers runtime with an in-memory D1 database.

Integration tests (packages/cloud/src/__tests__/):

| Test | What it covers |
| --- | --- |
| auth.test.ts | GitHub OAuth flow mock, token generation, token verification |
| projects-crud.test.ts | Create, list, get, delete projects |
| runs-upload.test.ts | Upload JSON report → stored in D1 → retrievable |
| runs-list.test.ts | Pagination, branch filter, date range filter |
| runs-compare.test.ts | Compare two runs → deltas calculated correctly |
| baselines.test.ts | Set baseline from run, list baselines, compare against baseline |
| plan-gating.test.ts | Free plan hits project limit → 403; Team plan allows 5 projects |
| rate-limiting.test.ts | Exceed rate limit → 429 with Retry-After header |
| audit-log.test.ts | Actions create audit entries; query with filters; Enterprise-only gate |
| data-retention.test.ts | Cron deletes runs older than plan retention period |

Miniflare setup:

// test-utils/worker-env.ts
import { Miniflare } from 'miniflare';

export async function createTestEnv() {
  const mf = new Miniflare({
    modules: true,
    script: '', // loaded from built worker
    d1Databases: ['DB'],
    kvNamespaces: ['RATE_LIMITS'],
  });
  
  // Run migrations
  const db = await mf.getD1Database('DB');
  await db.exec(MIGRATIONS_SQL);
  
  return { mf, db };
}
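
Using the helper in a test looks roughly like this; the route, auth header, and expected status are assumptions about the Worker's API, while dispatchFetch and dispose are standard Miniflare calls:

// Sketch of a cloud integration test on top of createTestEnv().
import { afterAll, beforeAll, expect, it } from 'vitest';
import { createTestEnv } from '../test-utils/worker-env';

let env: Awaited<ReturnType<typeof createTestEnv>>;
beforeAll(async () => { env = await createTestEnv(); });
afterAll(() => env.mf.dispose());

it('cloud.projects.create-returns-201', async () => {
  // dispatchFetch sends the request straight into the local Workers runtime
  const res = await env.mf.dispatchFetch('http://localhost/api/projects', {
    method: 'POST',
    headers: { authorization: 'Bearer test-token', 'content-type': 'application/json' },
    body: JSON.stringify({ name: 'demo' }),
  });
  expect(res.status).toBe(201);
});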

4. Mocking Strategy

Rule: Mock at the HTTP boundary, never deeper.

Provider adapters make HTTP requests. We mock the HTTP responses, not the adapter internals. This means our tests verify that our code correctly handles real-shaped API responses.

msw handlers for providers:

// test-utils/handlers.ts
import { http, HttpResponse } from 'msw';

export const openaiHandlers = [
  http.post('https://api.openai.com/v1/chat/completions', () => {
    return HttpResponse.json({
      choices: [{
        message: {
          role: 'assistant',
          content: 'Test response',
          tool_calls: [{
            id: 'call_1',
            type: 'function',
            function: {
              name: 'lookup_order',
              arguments: '{"order_id": "123"}'
            }
          }]
        }
      }],
      usage: { prompt_tokens: 100, completion_tokens: 50, total_tokens: 150 }
    });
  })
];

export const anthropicHandlers = [
  http.post('https://api.anthropic.com/v1/messages', () => {
    return HttpResponse.json({
      content: [
        { type: 'text', text: 'Test response' },
        { type: 'tool_use', id: 'tu_1', name: 'lookup_order', input: { order_id: '123' } }
      ],
      usage: { input_tokens: 100, output_tokens: 50 }
    });
  })
];
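
These handlers are registered once per test run through msw's Node server, typically from a shared Vitest setup file (the file name here is an assumption; the msw calls are its documented API):

// test-utils/setup-msw.ts (registered via `test.setupFiles` in vitest.config.ts)
import { afterAll, afterEach, beforeAll } from 'vitest';
import { setupServer } from 'msw/node';
import { anthropicHandlers, openaiHandlers } from './handlers';

export const server = setupServer(...openaiHandlers, ...anthropicHandlers);

// Fail loudly if a test makes an HTTP call we did not explicitly mock.
beforeAll(() => server.listen({ onUnhandledRequest: 'error' }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());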

What we mock vs what we don't:

| Layer | Mock? | Why |
| --- | --- | --- |
| Provider HTTP APIs (OpenAI, Anthropic, Ollama) | Yes (msw) | Expensive, rate-limited, non-deterministic |
| LLM-as-judge calls | Yes (msw) | Same reasons — return fixed scores in tests |
| File system (baselines, reports) | No — use temp dirs | File I/O is fast and deterministic |
| YAML parsing | No | Pure function, deterministic |
| Zod validation | No | Pure function, deterministic |
| AJV schema validation | No | Pure function, deterministic |
| D1 database (cloud) | No — use Miniflare in-memory D1 | Tests real SQL queries |
| GitHub OAuth | Yes (msw) | External service |
| Stripe | Yes (msw) | External service |

5. Test Fixtures

All test fixtures live in packages/*/src/__tests__/fixtures/.

Config fixtures:

fixtures/
├── configs/
│   ├── minimal.yaml          # Bare minimum valid config
│   ├── full-featured.yaml    # Every option used
│   ├── multi-provider.yaml   # Multiple models
│   ├── compliance.yaml       # With compliance section
│   ├── invalid-syntax.yaml   # YAML parse error
│   ├── invalid-schema.yaml   # Valid YAML, fails Zod
│   └── missing-provider.yaml # References non-existent provider
├── schemas/
│   ├── order-response.json   # For schema assertion tests
│   └── ticket-response.json
├── baselines/
│   └── sample-baseline.json  # Stored baseline for compare tests
└── reports/
    ├── passing-report.json   # Full passing report
    └── failing-report.json   # Report with failures

Provider response fixtures:

fixtures/providers/
├── openai/
│   ├── text-response.json       # Simple text completion
│   ├── tool-call-response.json  # Response with tool calls
│   ├── multi-tool-response.json # Multiple tool calls
│   ├── error-429.json           # Rate limit error
│   └── error-500.json           # Server error
├── anthropic/
│   ├── text-response.json
│   ├── tool-use-response.json
│   └── error-overloaded.json
└── ollama/
    ├── text-response.json
    └── tool-call-response.json

6. CI Pipeline Testing

Tests run on every PR and every push to main.

# .github/workflows/test.yml
name: Test
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node: [20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npx turbo test -- --coverage
      - uses: codecov/codecov-action@v4
        with:
          files: packages/*/coverage/lcov.info

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx turbo lint
      - run: npx turbo typecheck

E2E tests (nightly):

Run against real provider APIs with a dedicated test API key. Small budget ($5/day cap). Tests a curated subset of 5 test cases against live OpenAI and Anthropic APIs.

# .github/workflows/e2e-nightly.yml
name: E2E Nightly
on:
  schedule:
    - cron: '0 3 * * *'  # 3 AM UTC daily

jobs:
  e2e:
    runs-on: ubuntu-latest
    environment: e2e
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci && npx turbo build
      - run: npx vitest run --project e2e
        env:
          OPENAI_API_KEY: ${{ secrets.E2E_OPENAI_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.E2E_ANTHROPIC_KEY }}

7. Coverage Requirements

| Package | Minimum Coverage | Enforced |
| --- | --- | --- |
| core | 90% lines, 85% branches | CI blocks merge below threshold |
| cli | 80% lines | CI warns, doesn't block |
| cloud | 70% lines | CI warns, doesn't block |

// vitest.config.ts (root)
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      thresholds: {
        'packages/core/src/**': { lines: 90, branches: 85 },
        'packages/cli/src/**': { lines: 80 },
        'packages/cloud/src/**': { lines: 70 },
      },
    },
  },
});

8. Test Naming Convention

[module].[behavior].[expected result]

Examples:
  config.parser.returns-error-on-invalid-yaml
  assertion.tool-called.passes-when-tool-matches
  assertion.tool-called.fails-when-tool-not-invoked
  assertion.no-pii.detects-credit-card-number
  engine.runner.aggregates-3-runs-correctly
  cli.test-command.exits-1-on-failure
  cloud.runs.upload-creates-run-in-d1
  cloud.plan-gate.blocks-6th-project-on-team-plan

Files: [module].test.ts in __tests__/ directory adjacent to source.


9. Performance Benchmarks

Track in CI to catch regressions:

| Operation | Target | How measured |
| --- | --- | --- |
| Config parse (full-featured.yaml) | < 50ms | performance.now() in test |
| Single assertion evaluation | < 5ms | Benchmark suite |
| Report generation (50 tests) | < 100ms | Benchmark suite |
| CLI cold start (kindlm --help) | < 500ms | Wall clock in CI |

Benchmarks run weekly in a separate CI job. A regression is any operation measuring more than 20% slower than the stored baseline.
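
The benchmark-suite rows above can be written with Vitest's built-in bench API; a sketch follows, where the evaluated function, its inputs, and the import path are assumptions:

// benchmarks/assertions.bench.ts (sketch); run with `vitest bench`.
import { bench, describe } from 'vitest';
import { evaluateToolCalled } from '@kindlm/core/assertions/tool-called';

describe('single assertion evaluation (< 5ms target)', () => {
  bench('tool-called with partial args', () => {
    evaluateToolCalled(
      { tool: 'lookup_order', argsMatch: { order_id: '123' } },
      [{ name: 'lookup_order', args: { order_id: '123', expand: true } }],
    );
  });
});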