
Prompt Evaluation

Test and evaluate prompts using promptfoo with OpenRouter models

Test production prompts with promptfoo, an open-source LLM evaluation framework. The suite verifies prompt quality, routing accuracy, and format compliance across platforms.

Quick Start

# Set OpenRouter API key
export OPENROUTER_API_KEY=your_key_here
 
# Run all evaluations
bun run prompt:eval
 
# Run specific suite
bun run prompt:eval:router     # Router classification
bun run prompt:eval:telegram   # Telegram format
bun run prompt:eval:github     # GitHub format
bun run prompt:eval:quality    # Response quality
 
# View results
bun run prompt:view            # Interactive web UI
bun run prompt:report          # Generate HTML dashboard

Architecture

prompts-eval/
├── configs/           # promptfoo YAML configurations
│   ├── router.promptfoo.yaml
│   ├── telegram.promptfoo.yaml
│   ├── github.promptfoo.yaml
│   └── quality.promptfoo.yaml
├── datasets/          # Test case definitions
│   ├── router-cases.yaml
│   ├── telegram-cases.yaml
│   ├── telegram-markdown-cases.yaml
│   ├── github-cases.yaml
│   └── quality-cases.yaml
├── prompts/           # JS helpers using REAL production prompts
│   ├── router-prompt.cjs
│   ├── telegram-prompt.cjs
│   └── github-prompt.cjs
├── assertions/        # Custom validation logic
│   ├── router-assertion.cjs
│   ├── telegram-format-assertion.cjs
│   ├── telegram-html.ts
│   ├── markdown-v2.ts
│   └── routing-accuracy.ts
└── results/           # JSON + HTML outputs

Key Design: Production Prompts

This suite tests actual production prompts, not separate test copies:

Evaluation   Production Source                             Function
Router       packages/prompts/src/agents/router.ts        getRouterPrompt()
Telegram     packages/prompts/src/platforms/telegram.ts   getTelegramPrompt()
GitHub       packages/prompts/src/platforms/github.ts     getGitHubBotPrompt()

JavaScript helpers in prompts/ import the real prompt functions, ensuring tests validate actual production behavior.
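
As a sketch of the pattern (assuming getRouterPrompt() takes no arguments and that the runtime resolves the TypeScript import from .cjs), a helper looks roughly like this:

// prompts/router-prompt.cjs (simplified sketch; import path is illustrative)
// promptfoo invokes this with the test's vars and expects the final prompt
// back, either as a string or a chat-message array.
const { getRouterPrompt } = require('../../packages/prompts/src/agents/router');

module.exports = async ({ vars }) => {
  return [
    { role: 'system', content: getRouterPrompt() }, // real production prompt
    { role: 'user', content: vars.query },
  ];
};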

Test Suites

Router Classification (15 tests)

Tests that each query type is routed to the correct agent:

Query Type                   Expected Agent
Simple questions             simple-agent
Complex code tasks           orchestrator-agent
Research queries             lead-researcher-agent
Personal info about Duyet    duyet-info-agent

# Example: router-cases.yaml
- description: "Simple greeting routes to simple-agent"
  vars:
    query: "Hello, how are you?"
    __expected: "simple-agent"
  assert:
    - type: javascript
      value: file://assertions/router-assertion.cjs
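
promptfoo calls a file://-referenced JavaScript assertion with the model output and a context holding the test's vars, and accepts a GradingResult in return. A minimal sketch of what assertions/router-assertion.cjs does (the real file may parse structured output or apply partial credit):

// assertions/router-assertion.cjs (simplified sketch)
module.exports = (output, { vars }) => {
  const expected = vars.__expected;                  // e.g. "simple-agent"
  const actual = String(output).trim().toLowerCase();
  const pass = actual.includes(expected);
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass
      ? `routed to ${expected}`
      : `expected ${expected}, got: ${actual}`,
  };
};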

Telegram Format (29 tests)

Validates mobile-optimized responses (a simplified assertion sketch follows this list):

  • Conciseness (no filler phrases)
  • HTML/MarkdownV2 format compliance
  • Code formatting (inline vs blocks)
  • URL handling and understanding
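
A minimal sketch of the kind of check telegram-format-assertion.cjs performs; the filler-phrase list, allowed-tag list, and penalty weights here are illustrative, and the real assertion covers more cases:

// Flags filler openers and HTML tags Telegram's parse mode cannot render.
const FILLER = [/^sure[,!]/i, /^certainly[,!]/i, /^great question/i];
const ALLOWED = ['b', 'i', 'u', 's', 'a', 'code', 'pre', 'blockquote'];

module.exports = (output) => {
  let score = 1;
  const issues = [];
  if (FILLER.some((re) => re.test(output))) {
    score -= 0.3;
    issues.push('starts with a filler phrase');
  }
  for (const [, tag] of output.matchAll(/<\/?([a-z]+)[^>]*>/gi)) {
    if (!ALLOWED.includes(tag.toLowerCase())) {
      score -= 0.3;
      issues.push(`unsupported tag <${tag}>`);
    }
  }
  score = Math.max(0, score);
  return { pass: score >= 0.5, score, reason: issues.join('; ') || 'ok' };
};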

GitHub Format (6 tests)

Validates GitHub-flavored markdown (example output follows this list):

  • Code blocks with language identifiers
  • Heading structure
  • GitHub alerts for warnings/notes
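
For reference, output that satisfies these checks looks roughly like the following; the > [!WARNING] blockquote form is GitHub's alert syntax:

## Summary

```typescript
const retries = 3;
```

> [!WARNING]
> Increasing retries also increases total request latency.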

Response Quality (8 tests)

Cross-platform behavior validation:

  • Telegram: brevity, progressive disclosure
  • GitHub: comprehensive detail, structured sections

LLM Providers

Uses promptfoo's built-in OpenRouter provider with state-of-the-art models:

Model                   ID                                        Use Case
Grok 4.1 Fast           openrouter:x-ai/grok-4.1-fast             Primary evaluation
Gemini 2.5 Flash Lite   openrouter:google/gemini-2.5-flash-lite   Fast alternative
Ministral 8B            openrouter:mistralai/ministral-8b-2512    Small model comparison

To change models, edit the providers section of the relevant config file.
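
For reference, a providers block using the model IDs above would look roughly like this (excerpt is illustrative, not copied from the repo):

# configs/router.promptfoo.yaml (excerpt)
providers:
  - openrouter:x-ai/grok-4.1-fast
  - openrouter:google/gemini-2.5-flash-lite
  - openrouter:mistralai/ministral-8b-2512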

Adding Tests

1. Create test cases

# datasets/your-cases.yaml
- description: "Your test description"
  vars:
    query: "Test query"
  assert:
    - type: contains
      value: "expected text"
    - type: llm-rubric
      value: "Natural language assertion evaluated by LLM"

2. For router tests, use the custom assertion

- description: "Router test"
  vars:
    query: "Test query"
    __expected: "simple-agent"
  assert:
    - type: javascript
      value: file://assertions/router-assertion.cjs

3. Available assertion types

Type           Description
contains       Check that output contains text
not-contains   Check that output doesn't contain text
contains-any   Check for any of multiple values
llm-rubric     LLM-evaluated quality assertion
javascript     Custom JS function

Assertion Strategy

Progressive scoring (0.0 - 1.0):

Score     Meaning
1.0       Perfect compliance
0.8-0.9   Minor issues (warnings)
0.6-0.7   Acceptable with reservations
0.3-0.5   Concerning but usable
0.0-0.2   Significant issues

Pass Threshold: 0.5
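
promptfoo also supports a per-assertion threshold property, so a scored assertion passes only when its score meets it; a test case pinning this explicitly would look roughly like:

- type: javascript
  value: file://assertions/telegram-format-assertion.cjs
  threshold: 0.5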

CI Integration

The evaluation suite runs in CI to catch prompt regressions:

# CI workflow
bun run prompt:eval --output results/ci-results.json

Results are uploaded as artifacts for review.
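
A GitHub Actions job for this could look roughly like the following sketch (job name, action versions, and secret name are illustrative; the actual workflow may differ):

# Sketch of a CI job running the evaluation suite
prompt-eval:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: oven-sh/setup-bun@v2
    - run: bun install
    - run: bun run prompt:eval --output results/ci-results.json
      env:
        OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
    - uses: actions/upload-artifact@v4
      with:
        name: prompt-eval-results
        path: results/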