litmus - unit tests for AI prompts

Python CLI (installed from PyPI) that runs unit tests against prompts: compare models, check outputs, track cost. Treats prompts as code that needs CI.

Saved Apr 28, 2026 · 4 min read

#evals #python #cli #prompt-engineering #developer-tools

If REST APIs have Postman and frontends have Cypress, the LLM equivalent has been "send the prompt to the model and read the output by hand." Litmus is the missing layer: a YAML-config CLI that runs unit tests against prompts, with pass/fail assertions, cross-model comparison, and cost reports.

The README's framing of the three problems it solves is the right one:

No testing standard for prompts.
Prompt regression is invisible - a one-word change can silently break 15% of edge cases.
Model selection is vibes - "we use GPT-4o because it's good," but is it $15k/month better than Gemini Flash?

The project ships as the litmux CLI on PyPI. Naming aside, the CLI commands and config are what you'll actually use day to day.

Quick start

pip install litmux
cp .env.example .env
# Add at least one of OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, HF_TOKEN

litmux init    # scaffold a project
litmux run     # run tests against all configured models

No database, no cloud account, no Docker. Everything runs locally; the only network traffic is the calls to the model providers whose API keys you configured.

Tests as YAML, assertions as a closed list

A litmux.yaml is small enough to read in one breath:

models:
  - model: gpt-4o-mini
  - model: claude-haiku-4-5-20251001

tests:
  - name: summarize_earnings
    prompt: prompts/summarize.txt
    inputs:
      text: "Revenue grew 15% to $4.2 billion..."
    assert:
      - type: contains
        value: "revenue"
      - type: cost-less-than
        value: 0.01

The assertion types are deliberately a small fixed list, not a programming language:

Type	What it checks
`contains`	output contains substring
`not-contains`	output does not contain substring
`regex`	output matches regex pattern
`json-valid`	output is valid JSON
`json-schema`	output has required JSON keys
`cost-less-than`	cost below threshold (USD)
`latency-less-than`	latency below threshold (ms)
`llm-judge`	LLM scores output 1–10 against criteria

The mix of structural assertions (contains, regex, json-schema), economic assertions (cost-less-than, latency-less-than), and LLM-judge assertions covers the realistic shape of "did the prompt do the right thing." The llm-judge model is configurable via LITMUX_JUDGE_MODEL; the default is gpt-4o-mini.
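To make the table concrete, here is a minimal sketch of how the deterministic assertion types could be evaluated. It is not litmux's implementation: the function name `check`, its arguments, the case-sensitive matching, and the assumption that `json-schema` takes a list of required top-level keys are all hypothetical. `llm-judge` is left out because it needs a model call.

```python
import json
import re


def check(assertion: dict, output: str, cost_usd: float, latency_ms: float) -> bool:
    """Evaluate one assertion dict (shaped like the YAML entries above).

    Hypothetical helper for illustration; not part of litmux.
    """
    kind = assertion["type"]
    value = assertion.get("value")

    if kind == "contains":
        return value in output  # assumption: case-sensitive substring match
    if kind == "not-contains":
        return value not in output
    if kind == "regex":
        return re.search(value, output) is not None
    if kind == "json-valid":
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
    if kind == "json-schema":
        # Assumption: "value" is a list of required top-level keys.
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(key in data for key in value)
    if kind == "cost-less-than":
        return cost_usd < value  # threshold in USD
    if kind == "latency-less-than":
        return latency_ms < value  # threshold in milliseconds
    raise ValueError(f"unsupported assertion type: {kind}")
```

A test passes only when every assertion in its list returns True, which is why a closed list like this stays easy to reason about.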

Bulk evaluation against datasets

The other half of the tool is litmux eval, which runs the same prompt across a CSV of inputs and gates on aggregate quality:

evals:
  - name: ticket_classifier
    prompt: prompts/classify.txt
    dataset: datasets/support_tickets.csv
    input_mapping:
      ticket: text
    expected: expected_category
    assert:
      - type: json-valid
    judge:
      criteria: "Did the model correctly classify the ticket?"
      threshold: 7.0

This is the part that turns "the prompt feels good" into a number you can put in a CI gate.
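As a rough illustration of what such a gate might compute, the sketch below averages per-row judge scores and compares the mean to the threshold. How litmux actually aggregates scores is not stated in the README summary above, so the mean-based rule and the name `eval_gate` are assumptions.

```python
from statistics import mean


def eval_gate(scores: list[float], threshold: float) -> tuple[float, bool]:
    """Return (mean score, passed) for a list of 1-10 judge scores.

    Hypothetical helper: assumes the gate is "mean score >= threshold".
    """
    if not scores:
        raise ValueError("no scores to aggregate")
    average = mean(scores)
    return average, average >= threshold


# Illustrative scores only, not real litmux output.
average, passed = eval_gate([8, 9, 6, 7], threshold=7.0)
print(f"mean judge score {average:.2f}, gate passed: {passed}")
```

A CI job would then exit non-zero whenever `passed` is False.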

litmux generate rounds out the loop - feed it a prompt and a small seed CSV, get back a generated dataset with N rows synthesised by the model. The synthetic data is exactly the kind of test set you don't want to hand-write but you also can't ship without.

litmux generate \
  --prompt prompts/classify.txt \
  --seed datasets/sample_tickets.csv \
  --n 50 \
  --output datasets/support_tickets.csv

Cost projection - the underrated command

litmux cost --volume 50000 runs your tests against every configured model, then projects what 50,000 calls would cost on each. The cheapest model that passes your tests wins.

This is the bit most teams skip and most teams should run. The README's blunt example - "is GPT-4o $15k/month better than Gemini Flash for this prompt?" - has a real answer once you've expressed your quality gate as assertions. Cost projection turns model selection from "vibes" to "the cheapest model whose test results you accept."
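The arithmetic behind a projection like this is simple enough to sketch. The code below assumes the projection is the mean observed cost per call multiplied by the target volume, and that only models passing all tests are eligible; the class, function names, model names, and numbers are hypothetical placeholders, not litmux output.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str              # placeholder model label
    passed: bool           # did every assertion pass for this model?
    total_cost_usd: float  # cost summed over the test calls
    calls: int             # number of test calls made


def project(result: ModelResult, volume: int) -> float:
    """Projected cost: mean cost per call times the target volume."""
    return result.total_cost_usd / result.calls * volume


def cheapest_passing(results: list[ModelResult], volume: int):
    """Return (name, projected cost) of the cheapest passing model, or None."""
    passing = [r for r in results if r.passed]
    if not passing:
        return None
    best = min(passing, key=lambda r: project(r, volume))
    return best.name, project(best, volume)


# Illustrative numbers only.
results = [
    ModelResult("model-a", passed=True, total_cost_usd=0.50, calls=4),
    ModelResult("model-b", passed=True, total_cost_usd=0.25, calls=4),
    ModelResult("model-c", passed=False, total_cost_usd=0.05, calls=4),
]
print(cheapest_passing(results, volume=50_000))
```

Here model-c is cheapest per call but fails the quality gate, so it is excluded before the cost comparison.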

litmux compare is the qualitative cousin: side-by-side outputs from each model on the same input. Use it before the cost projection to make sure your assertions actually capture what you care about.

CI integration

- run: litmux run --ci
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

--ci switches to markdown output that renders cleanly in GitHub Actions logs. The exit code carries pass/fail so the action fails loudly when a regression slips in.

Cloud (optional, free, opt-in)

Run litmux login and litmux dashboard to sync results to a hosted dashboard for history and trends. Everything works without it; cloud is opt-in only and currently in private beta. The relevant env vars (LITMUX_API_URL, LITMUX_DASHBOARD_URL, LITMUX_CLOUD_ENABLED) let you self-host or point at staging if needed.

When to reach for it

You're shipping prompts to production and don't have a regression-test loop.
You're choosing between models and want a defensible answer instead of vibes.
You're optimising prompts and need to know whether your last edit broke an edge case.

When not to

Single-person projects where you can verify outputs by hand. The setup overhead is real.
Workloads where the prompt is one sentence and the output is structured. A few unit tests in your existing test suite cover the same ground.
Highly stochastic outputs (creative writing, long generation) where assertions don't compose. The llm-judge assertion partially fills the gap, but expect to invest in the criteria text.

Trade-offs and rough edges

The naming - GitHub repo is litmus, CLI and PyPI package are litmux - is a quirk you'll notice once. Use whichever matches the file you're looking at.

The response cache is on by default; set LITMUX_NO_CACHE=1 to bypass when you're explicitly testing nondeterminism. Cache is the right default for re-running the same suite without melting your credit card.
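For intuition, a response cache of this kind can be sketched as a dictionary keyed by a hash of the request. This is a guess at the general technique, not litmux's actual cache: the key layout, the in-memory dict, and the function names are assumptions. Only the LITMUX_NO_CACHE variable comes from the text above.

```python
import hashlib
import json
import os


def cache_key(model: str, prompt: str, inputs: dict) -> str:
    """Stable hash of everything that determines the response."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "inputs": inputs}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(cache: dict, model: str, prompt: str, inputs: dict, call):
    """Return a cached response, or invoke `call` and store the result.

    `call` is a hypothetical function (model, prompt, inputs) -> str.
    """
    if os.environ.get("LITMUX_NO_CACHE") == "1":
        return call(model, prompt, inputs)  # bypass: always hit the model
    key = cache_key(model, prompt, inputs)
    if key not in cache:
        cache[key] = call(model, prompt, inputs)
    return cache[key]
```

Because the key covers the model, prompt, and inputs, editing any of them triggers a fresh call, while an unchanged suite re-runs for free.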

The repo has 107 passing tests, requires Python 3.11+, and is MIT licensed. The three example projects in examples/ (01-quickstart, 02-multi-model, 03-generate-and-eval) are the fastest path to understanding the workflow end-to-end.
