
litmus - unit tests for AI prompts

Python CLI that runs unit tests against prompts: compare models, check outputs, track cost. Treats prompts as code that needs CI.


If REST APIs have Postman and frontends have Cypress, the LLM equivalent has been "send the prompt to the model and read the output by hand." Litmus is the missing layer: a YAML-config CLI that runs unit tests against prompts, with pass/fail assertions, cross-model comparison, and cost reports.

The README's framing of the three problems it solves is the right one:

  1. No testing standard for prompts.
  2. Prompt regression is invisible - a one-word change can silently break 15% of edge cases.
  3. Model selection is vibes - "we use GPT-4o because it's good," but is it $15k/month better than Gemini Flash?

The GitHub repo is named litmus, but the project ships as the litmux CLI on PyPI. Naming aside, the CLI commands and config are what you'll actually use day to day.

Quick start

pip install litmux
cp .env.example .env
# Add at least one of OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, HF_TOKEN

litmux init    # scaffold a project
litmux run     # run tests against all configured models

No database, no cloud account, no Docker. The CLI runs entirely on your machine; the only network calls go to the model providers whose API keys you configure.

Tests as YAML, assertions as a closed list

A litmux.yaml is small enough to read in one breath:

models:
  - model: gpt-4o-mini
  - model: claude-haiku-4-5-20251001

tests:
  - name: summarize_earnings
    prompt: prompts/summarize.txt
    inputs:
      text: "Revenue grew 15% to $4.2 billion..."
    assert:
      - type: contains
        value: "revenue"
      - type: cost-less-than
        value: 0.01

The assertion types are deliberately a small fixed list, not a programming language:

Type                 What it checks
contains             output contains substring
not-contains         output does not contain substring
regex                output matches regex pattern
json-valid           output is valid JSON
json-schema          output has required JSON keys
cost-less-than       cost below threshold (USD)
latency-less-than    latency below threshold (ms)
llm-judge            LLM scores output 1–10 against criteria

The mix of structural assertions (contains, not-contains, regex, json-valid, json-schema), budget assertions (cost-less-than, latency-less-than), and the llm-judge assertion covers the realistic shape of "did the prompt do the right thing." The llm-judge model is configurable via LITMUX_JUDGE_MODEL; the default is gpt-4o-mini.
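To make the structural assertion types concrete, here is a minimal Python sketch of how checks like these could be evaluated against a model output. It is not litmux's implementation: the function name check_assertion, the dictionary shape, and the "required" field for json-schema are assumptions made for illustration, and the matching here is case-sensitive, which may differ from the real tool.

```python
import json
import re


def check_assertion(assertion: dict, output: str) -> bool:
    """Evaluate one structural assertion against a model output (illustrative only)."""
    kind = assertion["type"]
    value = assertion.get("value")

    if kind == "contains":
        return value in output
    if kind == "not-contains":
        return value not in output
    if kind == "regex":
        return re.search(value, output) is not None
    if kind in ("json-valid", "json-schema"):
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False
        if kind == "json-valid":
            return True
        # Assumption: "json-schema" is reduced to "these top-level keys exist".
        required = assertion["required"]  # hypothetical field name
        return isinstance(parsed, dict) and all(key in parsed for key in required)
    raise ValueError(f"unsupported assertion type: {kind}")
```

A closed list like this keeps every test readable as data: each assertion is a type plus a value, with no user code to review or sandbox.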

Bulk evaluation against datasets

The other half of the tool is litmux eval, which runs the same prompt across a CSV of inputs and gates on aggregate quality:

evals:
  - name: ticket_classifier
    prompt: prompts/classify.txt
    dataset: datasets/support_tickets.csv
    input_mapping:
      ticket: text
    expected: expected_category
    assert:
      - type: json-valid
    judge:
      criteria: "Did the model correctly classify the ticket?"
      threshold: 7.0

This is the part that turns "the prompt feels good" into a number you can put in a CI gate.
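As a rough sketch of what such a gate amounts to, the function below passes an eval when the mean judge score meets the threshold. How litmux actually aggregates judge scores is not stated here, so the use of a mean, the function name eval_passes, and the empty-dataset behaviour are all assumptions.

```python
from statistics import mean


def eval_passes(judge_scores: list[float], threshold: float) -> bool:
    """Return True when the mean judge score meets the threshold (assumed aggregation)."""
    if not judge_scores:
        # Assumption: an empty dataset should never pass a quality gate.
        return False
    return mean(judge_scores) >= threshold
```

With threshold 7.0, a run scoring 8, 7, and 6.5 would pass (mean just above 7), while a run scoring 9, 4, and 6 would not.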

litmux generate rounds out the loop - feed it a prompt and a small seed CSV, get back a generated dataset with N rows synthesised by the model. The synthetic data is exactly the kind of test set you don't want to hand-write but you also can't ship without.

litmux generate \
  --prompt prompts/classify.txt \
  --seed datasets/sample_tickets.csv \
  --n 50 \
  --output datasets/support_tickets.csv

Cost projection - the underrated command

litmux cost --volume 50000 runs your tests against every configured model, then projects what 50,000 calls would cost on each. The model that passes your tests cheapest wins.

This is the bit most teams skip and most teams should run. The README's blunt example - "is GPT-4o $15k/month better than Gemini Flash for this prompt?" - has a real answer once you've expressed your quality gate as assertions. Cost projection turns model selection from "vibes" to "the cheapest model whose test results you accept."
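The arithmetic behind that selection rule is simple enough to sketch. The snippet below is an illustration, not litmux's code: ModelResult, project_cost, and cheapest_passing are hypothetical names, and it assumes the projection is just the measured average cost per call multiplied by the expected volume.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    passed: bool               # did the model pass every assertion?
    avg_cost_per_call: float   # measured average cost per call, in USD


def project_cost(avg_cost_per_call: float, volume: int) -> float:
    """Project total spend for `volume` calls (assumes cost scales linearly)."""
    return avg_cost_per_call * volume


def cheapest_passing(results: list[ModelResult], volume: int):
    """Return (name, projected_cost) for the cheapest model that passed, or None."""
    passing = [r for r in results if r.passed]
    if not passing:
        return None
    best = min(passing, key=lambda r: r.avg_cost_per_call)
    return best.name, project_cost(best.avg_cost_per_call, volume)
```

Note that a model which fails the assertions is excluded no matter how cheap it is, so the quality gate is decided first and price only breaks ties among acceptable models.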

litmux compare is the qualitative cousin: side-by-side outputs from each model on the same input. Use it before the cost projection to make sure your assertions actually capture what you care about.

CI integration

- run: litmux run --ci
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

--ci switches to markdown output that renders cleanly in GitHub Actions logs. The exit code carries pass/fail so the action fails loudly when a regression slips in.

Cloud (optional, free, opt-in)

Run litmux login and litmux dashboard to sync results to a hosted dashboard for history and trends. Everything works without it; cloud is opt-in only and currently a private beta. The relevant env vars (LITMUX_API_URL, LITMUX_DASHBOARD_URL, LITMUX_CLOUD_ENABLED) let you self-host or point at staging if needed.

When to reach for it

  • You're shipping prompts to production and don't have a regression-test loop.
  • You're choosing between models and want a defensible answer instead of vibes.
  • You're optimising prompts and need to know whether your last edit broke an edge case.

When not to

  • Single-person projects where you can verify outputs by hand. The setup overhead is real.
  • Workloads where the prompt is one sentence and the output is structured. A few unit tests in your existing test suite cover the same ground.
  • Highly stochastic outputs (creative writing, long generation) where assertions don't compose. The llm-judge assertion partially fills the gap, but expect to invest in the criteria text.

Trade-offs and rough edges

The naming - GitHub repo is litmus, CLI and PyPI package are litmux - is a quirk you'll notice once. Use whichever matches the file you're looking at.

The response cache is on by default; set LITMUX_NO_CACHE=1 to bypass when you're explicitly testing nondeterminism. Cache is the right default for re-running the same suite without melting your credit card.
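For intuition about why a cache makes re-runs cheap, here is a minimal sketch of a response cache keyed on the model, prompt, and inputs, with an environment-variable bypass. The only detail taken from the tool is the LITMUX_NO_CACHE variable name; the key scheme, the in-memory dict, and the functions cache_key and cached_call are assumptions for illustration.

```python
import hashlib
import json
import os


def cache_key(model: str, prompt: str, inputs: dict) -> str:
    """Stable key: the same model, prompt, and inputs always hash to the same value."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "inputs": inputs}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(cache: dict, model: str, prompt: str, inputs: dict, call_model):
    """Return a cached response, or call the model; LITMUX_NO_CACHE=1 skips the cache."""
    if os.environ.get("LITMUX_NO_CACHE") == "1":
        return call_model(model, prompt, inputs)
    key = cache_key(model, prompt, inputs)
    if key not in cache:
        cache[key] = call_model(model, prompt, inputs)
    return cache[key]
```

Under this scheme, editing a single word of the prompt changes the key, so only the tests touched by an edit would trigger new paid calls.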

The repo has 107 passing tests, requires Python 3.11+, and is MIT licensed. The three example projects in examples/ (01-quickstart, 02-multi-model, 03-generate-and-eval) are the fastest path to understanding the workflow end-to-end.
