Arbiter
The only LLM evaluation framework that shows you exactly what your evaluations cost.
Running evals shouldn’t be a mystery. You need to know what you’re spending—not after the fact, but as it happens. Arbiter tracks every LLM interaction automatically.
What it does
Arbiter provides evaluation capabilities for large language model outputs (a usage sketch follows this list):
- Semantic similarity scoring between outputs and references
- Custom criteria evaluation for domain-specific assessment
- Pairwise comparison mode for A/B testing different model responses
- Built-in evaluators for factuality, groundedness, and relevance
- Batch processing with progress tracking and concurrency control
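To make these capabilities concrete, here is a minimal sketch of what a single evaluation call might look like. The import path, the `evaluate()` entry point, the evaluator names, and the result fields are illustrative assumptions, not Arbiter's documented API; consult the package documentation for the real interface.

```python
# Hypothetical usage sketch: evaluate(), the evaluator ids, and the result
# fields below are assumptions for illustration, not Arbiter's confirmed API.
from arbiter_ai import evaluate  # assumed top-level helper

result = evaluate(
    output="The Eiffel Tower is 330 metres tall.",
    reference="The Eiffel Tower stands at roughly 330 m.",
    evaluators=["semantic_similarity", "factuality"],  # assumed evaluator ids
    model="openai:gpt-4o-mini",                        # assumed provider:model string
)

print(result.scores)      # assumed: per-evaluator scores
print(result.total_cost)  # assumed: aggregated cost for this call
```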
Cost transparency
Every evaluation reports exactly what it cost: real-time calculations based on current pricing data, with expenses broken down by evaluator and by model. No manual instrumentation is required.
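As a rough illustration of that breakdown, the sketch below shows how inspecting costs after a batch run could look. Every name here (`evaluate_batch`, `max_concurrency`, `cost_by_evaluator`, `cost_by_model`) is an assumed placeholder rather than a documented attribute.

```python
# Sketch of cost inspection after a batch run. evaluate_batch() and the
# report fields below are assumed placeholders, not documented Arbiter APIs.
from arbiter_ai import evaluate_batch  # assumed batch helper

results = evaluate_batch(
    items=[
        {"output": "Answer A...", "reference": "Gold answer A..."},
        {"output": "Answer B...", "reference": "Gold answer B..."},
    ],
    evaluators=["groundedness", "relevance"],   # assumed evaluator ids
    model="anthropic:claude-3-5-haiku-latest",  # assumed provider:model string
    max_concurrency=4,                          # assumed concurrency control knob
)

# Assumed cost report: a total plus per-evaluator and per-model breakdowns.
print(results.total_cost)
for evaluator, cost in results.cost_by_evaluator.items():
    print(f"{evaluator}: ${cost:.4f}")
for model, cost in results.cost_by_model.items():
    print(f"{model}: ${cost:.4f}")
```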
The architecture
Arbiter is built on PydanticAI, with provider-agnostic support for OpenAI, Anthropic, Google, Groq, Mistral, and Cohere. It is a pure library: no platform signup, no external servers. Results can optionally persist to PostgreSQL or Redis.
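Because the provider support comes from PydanticAI, switching models should amount to changing a model identifier, and persistence would be opt-in configuration. The sketch below assumes a hypothetical `configure()` helper and a connection-string setting; neither name is taken from the library's documentation.

```python
# Hypothetical configuration sketch. configure(), default_model=, and storage=
# are illustrative assumptions, not confirmed Arbiter settings.
from arbiter_ai import configure, evaluate  # assumed names

configure(
    default_model="google:gemini-1.5-flash",              # assumed provider:model string
    storage="postgresql://user:pass@localhost/arbiter",   # assumed persistence DSN
)

result = evaluate(
    output="...",
    reference="...",
    evaluators=["relevance"],  # assumed evaluator id
)
print(result.scores)
```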
Status
Arbiter is available now. Install it with `pip install arbiter-ai`.