# Prompt Evaluation
Test and evaluate prompts using promptfoo with OpenRouter models
Test production prompts using promptfoo, an open-source LLM evaluation framework. The suite verifies prompt quality, routing accuracy, and output-format compliance across platforms.
## Quick Start
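A minimal way to run the suite locally with promptfoo's CLI; the config path `prompts-eval/promptfooconfig.yaml` is an assumption for this repo:

```shell
# Run the evaluation suite against the configured providers
# (requires an OPENROUTER_API_KEY in the environment)
npx promptfoo@latest eval -c prompts-eval/promptfooconfig.yaml

# Open the local web viewer to inspect results
npx promptfoo@latest view
```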
## Architecture

### Key Design: Production Prompts

This suite tests actual production prompts, not separate test copies:
| Evaluation | Production Source |
|---|---|
| Router | packages/prompts/src/agents/router.ts → getRouterPrompt() |
| Telegram | packages/prompts/src/platforms/telegram.ts → getTelegramPrompt() |
| GitHub | packages/prompts/src/platforms/github.ts → getGitHubBotPrompt() |
JavaScript helpers in `prompts/` import the real prompt functions, so tests validate actual production behavior.
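A helper in `prompts/` might look like the following sketch. In the real suite it would import `getRouterPrompt()` from `packages/prompts`; here the import is stubbed so the example is self-contained, and the `query` variable name is an assumption:

```javascript
// Stub standing in for the real import from packages/prompts/src/agents/router.ts
const getRouterPrompt = () =>
  'You are a router. Classify the user query to an agent.';

// promptfoo calls the exported function with { vars } for each test case
// and uses the returned string as the prompt.
function routerPrompt({ vars }) {
  return `${getRouterPrompt()}\n\nQuery: ${vars.query}`;
}

module.exports = routerPrompt;
```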
## Test Suites

### Router Classification (15 tests)

Verifies that queries are routed to the correct agent:
| Query Type | Expected Agent |
|---|---|
| Simple questions | simple-agent |
| Complex code tasks | orchestrator-agent |
| Research queries | lead-researcher-agent |
| Personal info about Duyet | duyet-info-agent |
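A router test case might be declared as in this promptfoo-style sketch; the variable name `query` and the assertion value are illustrative:

```yaml
tests:
  - vars:
      query: "What is Duyet's current role?"
    assert:
      - type: contains
        value: duyet-info-agent
```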
### Telegram Format (29 tests)
Validates mobile-optimized responses:
- Conciseness (no filler phrases)
- HTML/MarkdownV2 format compliance
- Code formatting (inline vs blocks)
- URL handling and understanding
### GitHub Format (6 tests)
Validates GitHub-flavored markdown:
- Code blocks with language identifiers
- Heading structure
- GitHub alerts for warnings/notes
### Response Quality (8 tests)
Cross-platform behavior validation:
- Telegram: brevity, progressive disclosure
- GitHub: comprehensive detail, structured sections
## LLM Providers
Uses promptfoo's built-in OpenRouter provider with state-of-the-art models:
| Model | ID | Use Case |
|---|---|---|
| Grok 4.1 Fast | openrouter:x-ai/grok-4.1-fast | Primary evaluation |
| Gemini 2.5 Flash Lite | openrouter:google/gemini-2.5-flash-lite | Fast alternative |
| Ministral 8B | openrouter:mistralai/ministral-8b-2512 | Small model comparison |
To change models, edit the `providers` section of the config files.
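Using the IDs from the table above, the `providers` section would look roughly like:

```yaml
providers:
  - openrouter:x-ai/grok-4.1-fast
  - openrouter:google/gemini-2.5-flash-lite
  - openrouter:mistralai/ministral-8b-2512
```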
## Adding Tests

1. Create test cases.
2. For router tests, use a custom assertion.
3. Choose from the available assertion types:
| Type | Description |
|---|---|
| `contains` | Check if output contains text |
| `not-contains` | Check output doesn't contain text |
| `contains-any` | Check for any of multiple values |
| `llm-rubric` | LLM-evaluated quality assertion |
| `javascript` | Custom JS function |
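A custom `javascript` assertion for the router tests might look like this sketch. promptfoo calls the exported function with the model output and a context object holding `vars`; the `expectedAgent` variable name is an assumption about how expected values are passed in:

```javascript
// Hypothetical custom assertion: passes when the model output names
// the expected agent for the query.
function assertRoutedAgent(output, { vars }) {
  const pass = output.toLowerCase().includes(vars.expectedAgent.toLowerCase());
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass
      ? `Routed to ${vars.expectedAgent}`
      : `Expected ${vars.expectedAgent}, got: ${output}`,
  };
}

module.exports = assertRoutedAgent;
```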
## Assertion Strategy

Assertions use progressive scoring (0.0–1.0):
| Score | Meaning |
|---|---|
| 1.0 | Perfect compliance |
| 0.8-0.9 | Minor issues (warnings) |
| 0.6-0.7 | Acceptable with reservations |
| 0.3-0.5 | Concerning but usable |
| 0.0-0.2 | Significant issues |
**Pass threshold:** 0.5
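In promptfoo the threshold can be attached per assertion; a sketch, with an illustrative rubric:

```yaml
assert:
  - type: llm-rubric
    value: Response is concise and mobile-friendly
    threshold: 0.5
```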
## CI Integration
The evaluation suite runs in CI to catch prompt regressions:
Results are uploaded as artifacts for review.
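A CI job for this could be sketched as the following GitHub Actions fragment; the job name, config path, and results filename are assumptions:

```yaml
jobs:
  prompt-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo@latest eval -c prompts-eval/promptfooconfig.yaml
        env:
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
      - uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: promptfoo-results.json
```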
## Related Documentation

- `prompts-eval/README.md` - Full technical details
- `prompts-eval/TESTING_GUIDE.md` - Testing procedures
- `prompts-eval/PROVIDERS.md` - Provider configuration
- promptfoo documentation - Official docs