
promptdiff.

compare llm outputs with a side-by-side diff.

promptdiff lets you run controlled experiments on your prompts. feed the same text to two models, or tweak a prompt and see how a single model's response changes.

what it is

promptdiff is a testing utility for llm outputs. you can feed the same prompt to two different models, or provide two variations of a prompt to a single model. it highlights the differences in the generated text down to the character level.

evaluating model performance by eyeballing two separate chat windows is useless. subtle shifts in formatting and tone slip past, and omitted details go unnoticed entirely. promptdiff aligns the responses side-by-side so regressions and improvements are immediately obvious.

this is for engineers trying to optimize prompts for production. when you need to know if switching from GPT-4 to Claude saves money without degrading the output quality, this tool gives you the visual proof.

core features

>

character-level diff

highlights exactly what changed between two outputs. spots missing punctuation and subtle phrasing shifts.

>

model comparison

send the same prompt to GPT-4 and Claude simultaneously. see which one follows instructions better.

>

prompt iteration

test two variations of a prompt against a single model. verify whether your tweaks actually improved the response.

>

side-by-side view

displays results in a synchronized split-screen interface. scrolling one side scrolls the other.

>

exportable reports

save your diffs as static html files. share proof of model regressions with your team.
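the character-level diff at the heart of these features can be approximated with python's standard difflib — a minimal sketch of the idea, not promptdiff's actual implementation (the `char_diff` helper name is illustrative):

```python
from difflib import SequenceMatcher

def char_diff(a: str, b: str):
    """return (op, text) pairs: ' ' unchanged, '-' removed, '+' added."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            ops.append((" ", a[i1:i2]))
        if tag in ("delete", "replace"):
            ops.append(("-", a[i1:i2]))
        if tag in ("insert", "replace"):
            ops.append(("+", b[j1:j2]))
    return ops

# a dropped negation shows up immediately instead of hiding in a wall of text:
print(char_diff("the model may not hallucinate", "the model may hallucinate"))
```

the same opcode walk drives the split-screen highlighting: equal spans render plain, deletes and inserts get colored.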

how to use it

compare how GPT-4 and Claude respond to the same prompt.

// input

promptdiff --prompt 'explain dns' --model-a gpt-4 --model-b claude-3

// output

gpt-4: dns is like a phonebook for the internet...
claude-3: the domain name system (dns) translates...

diff: -like a phonebook +translates human-readable names

the cli highlights the exact phrasing differences so you can evaluate the models objectively.

why we built it

evaluating llm outputs by staring at two different browser tabs is an exercise in futility. language models are non-deterministic, and spotting a missing negative word in a wall of text is nearly impossible.

we needed a way to prove that tweaking a system prompt actually improved the output. promptdiff strips away the subjective feeling of an evaluation and just shows you the raw character differences.

frequently asked questions

how do i provide my api keys?

the cli reads standard environment variables like OPENAI_API_KEY and ANTHROPIC_API_KEY. it does not store them anywhere.
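a sketch of that lookup, assuming only the two variable names above (the `load_keys` helper is illustrative, not promptdiff's real internals):

```python
import os

def load_keys():
    """read provider keys from the environment; never write them to disk."""
    keys = {
        "openai": os.environ.get("OPENAI_API_KEY"),
        "anthropic": os.environ.get("ANTHROPIC_API_KEY"),
    }
    missing = [name for name, key in keys.items() if not key]
    if missing:
        raise RuntimeError(f"missing api keys for: {', '.join(missing)}")
    return keys
```

failing early with the list of missing providers beats a cryptic 401 halfway through a comparison run.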

does it handle json outputs?

yes. if the models return json, promptdiff will format the payload before diffing it, so you are not comparing unformatted strings.
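formatting first matters because minified json puts the whole payload on one line, so any change flags everything. a stdlib sketch of the normalize-then-diff step (the `diff_json` name is illustrative):

```python
import difflib
import json

def diff_json(a: str, b: str) -> str:
    """pretty-print both payloads so each key gets its own line, then diff."""
    fmt = lambda raw: json.dumps(json.loads(raw), indent=2, sort_keys=True).splitlines()
    return "\n".join(difflib.unified_diff(fmt(a), fmt(b), lineterm=""))

# only the changed key surfaces, even though the raw strings differ everywhere:
print(diff_json('{"b": 1, "a": 2}', '{"a": 2, "b": 3}'))
```

sorting keys also keeps key-order noise out of the diff, since most apis make no ordering guarantees.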

can i run this locally with ollama?

yes. you can point the tool at any local endpoint that matches the standard completion api format.
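ollama serves an openai-compatible api under `/v1` by default, so a request against it can be built like any other completion call — a sketch with the stdlib, where the base url and model name are examples:

```python
import json
import urllib.request

def build_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """build a chat-completion request for any openai-compatible endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# point a promptdiff-style call at a local ollama server:
req = build_request("http://localhost:11434/v1", "llama3", "explain dns")
```

because the payload shape is identical across providers, swapping a hosted model for a local one is just a different base url.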

why not just use standard git diff?

git diff is line-based and breaks down on wrapped paragraphs of prose: rewrap the text and every line reads as changed. promptdiff uses a word- and character-level algorithm suited to natural language.
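the failure mode is easy to reproduce: rewrap a sentence and a line diff flags every line, while a word-level diff isolates the one real change. a sketch with difflib:

```python
import difflib

# the same sentence, rewrapped at a different width, with one word changed
a = ["the quick brown fox jumps", "over the lazy dog"]
b = ["the quick red fox", "jumps over the lazy dog"]

# line-based: every line differs, so the real change is buried
line_changes = sum(1 for l in difflib.ndiff(a, b) if l.startswith(("+", "-")))

# word-based: only the changed word stands out
word_changes = sum(
    1 for w in difflib.ndiff(" ".join(a).split(), " ".join(b).split())
    if w.startswith(("+", "-"))
)
print(line_changes, word_changes)  # far fewer word-level changes than line-level
```

llm responses are exactly this kind of reflowed prose, which is why a line-oriented tool drowns the signal.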

// stop guessing whether your prompt actually got better.