Token Optimization Guide: duyetbot-agent Hybrid Router

A deep dive into hybrid classification, dual-batch queuing, and agent routing, achieving ~75% token reduction.

Overview

This guide demonstrates how duyetbot-agent achieves ~75% token reduction through intelligent routing and batch processing, even when handling the same volume of queries.

Key Statistics

Without Router:    100 queries × 300 tokens/query = 30,000 tokens
With Router:       100 queries × 75 tokens/query  = 7,500 tokens
Token Savings:     ~75% reduction (~22,500 tokens saved)

Cost Impact:
Without:           30,000 tokens × $3/1M tokens = $0.09
With Router:       7,500 tokens × $3/1M tokens  = $0.0225
Daily Savings:     $0.0675 per 100 queries (4x cost reduction)
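The arithmetic above generalizes to a simple cost model. A minimal sketch for estimating your own savings (the $3/1M-token price and per-query averages are illustrative assumptions, not measured values):

// Illustrative cost model; adjust PRICE_PER_TOKEN to your model's pricing.
const PRICE_PER_TOKEN = 3 / 1_000_000; // assumes $3 per 1M tokens

function dailyCost(queries: number, avgTokensPerQuery: number): number {
  return queries * avgTokensPerQuery * PRICE_PER_TOKEN;
}

const withoutRouter = dailyCost(100, 300); // $0.09
const withRouter = dailyCost(100, 75);     // $0.0225
console.log(`Savings: $${(withoutRouter - withRouter).toFixed(4)}/day`); // $0.0675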

Architecture: Four-Tier Token Optimization

Tier 1: Hybrid Classification (Pattern -> LLM)

Problem: Every query classification costs 200-300 LLM tokens.

Solution: Use pattern matching for common queries, only fall back to LLM for semantic analysis.

Pattern Matching (Zero Tokens)

// packages/cloudflare-agent/src/routing/classifier.ts
 
const QUICK_ROUTES = [
  // Greetings -> SimpleAgent (direct response)
  { pattern: /^(hi|hello|hey|greetings)/i, target: 'simple-agent' },
 
  // Help -> SimpleAgent (FAQ mode)
  { pattern: /help|\?|how do i/i, target: 'simple-agent' },
 
  // Approvals -> HITLAgent (confirmation flow)
  { pattern: /^(yes|no|approve|reject|cancel)/i, target: 'hitl-agent' },
 
  // Code-related -> OrchestratorAgent (dispatch workers)
  { pattern: /code|bug|fix|error|debug/i, target: 'orchestrator-agent' },
];
 
async function hybridClassify(query: string, context: ClassificationContext): Promise<QueryClassification> {
  // Phase 1: Check patterns (10-50ms, zero tokens)
  for (const rule of QUICK_ROUTES) {
    if (rule.pattern.test(query)) {
      return {
        type: 'simple',
        category: 'pattern_match',
        complexity: 'low',
        confidence: 0.99,
        reasoning: `Matched pattern: ${rule.pattern.source}`,
        target: rule.target, // carry the rule's route target forward
        skipLLM: true, // <-- Token savings!
      };
    }
  }
 
  // Phase 2: Fall back to LLM (200-500ms, ~300 tokens)
  // Only if pattern doesn't match (~20% of queries)
  return await llmClassify(query, context);
}

Token Savings:

  • 80% of queries (greetings, commands, approvals): 0 tokens
  • 20% of queries (semantic analysis): 300 tokens
  • Average per query: 0.8 × 0 + 0.2 × 300 = 60 tokens (vs 300 without the router)
  • Savings: 80% on classification alone

LLM Classification (When Needed)

async function llmClassify(query: string, context: ClassificationContext): Promise<QueryClassification> {
  const prompt = `Classify this user query:
"${query}"
 
Respond with JSON:
{
  "type": "simple" | "complex",
  "category": "code" | "research" | "github" | "duyet" | "general",
  "complexity": "low" | "medium" | "high",
  "requiresHumanApproval": boolean,
  "reasoning": "brief explanation"
}`;
 
  const response = await provider.generateJSON(prompt);
  return response as QueryClassification;
}

Cost per LLM classification: ~300 tokens (only 20% of the time)
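Because LLM-produced JSON is not guaranteed to match the expected shape, it can be worth validating before trusting the result. A defensive sketch (isQueryClassification is a hypothetical helper, not part of the codebase):

// Hypothetical runtime guard for the classifier's JSON output.
function isQueryClassification(value: unknown): value is QueryClassification {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    (v.type === 'simple' || v.type === 'complex') &&
    typeof v.category === 'string' &&
    ['low', 'medium', 'high'].includes(v.complexity as string)
  );
}

// Usage inside llmClassify:
//   const response = await provider.generateJSON(prompt);
//   if (!isQueryClassification(response)) {
//     // Hypothetical fallback: treat as simple and route cheaply
//     return { type: 'simple', category: 'general', complexity: 'low',
//              confidence: 0.5, reasoning: 'fallback', skipLLM: false };
//   }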


Tier 2: Dual-Batch Queuing (Message Combining)

Problem: Rapid messages create multiple LLM calls instead of one.

Solution: Collect messages in a 500ms window, combine them into a single LLM call.

Without Batching (3 Calls × Overhead)

// User sends 3 quick messages
T+0ms:    "What's the weather?"       -> LLM call #1 (150 tokens)
T+100ms:  "In New York?"              -> LLM call #2 (200 tokens)
T+200ms:  "Thanks"                    -> LLM call #3 (100 tokens)
 
Total: 3 LLM calls = 450 tokens

With Dual-Batch Queuing (1 Combined Call)

// packages/cloudflare-agent/src/batch-types.ts
 
export interface BatchState {
  batchId: string;
  status: 'collecting' | 'processing';
  pendingMessages: PendingMessage[];
  batchStartedAt: number;
  lastHeartbeat?: number;
  messageRef?: string; // For thinking message edits
}
 
// In CloudflareAgentState:
export interface CloudflareAgentState {
  activeBatch?: BatchState;    // Currently processing (immutable)
  pendingBatch?: BatchState;   // Collecting new messages (mutable)
  // ...
}

Processing Flow:

async queueMessage(ctx: AgentContext) {
  // Always add to pendingBatch (never blocks)
  if (!this.state.pendingBatch) {
    this.state.pendingBatch = createInitialBatchState();
  }
 
  this.state.pendingBatch.pendingMessages.push({
    text: ctx.text,
    userId: ctx.userId,
    timestamp: Date.now(),
  });
 
  // Schedule alarm only if not already scheduled
  if (!this.batchAlarmScheduled) {
    this.ctx.storage.setAlarm(Date.now() + 500); // 500ms window
    this.batchAlarmScheduled = true;
  }
 
  // Return immediately (fire-and-forget)
}
 
async onBatchAlarm() {
  // If a batch is already processing, re-check after another window
  if (this.state.activeBatch?.status === 'processing') {
    this.ctx.storage.setAlarm(Date.now() + 500); // try again later
    return;
  }
 
  // No pending messages?
  if (!this.state.pendingBatch?.pendingMessages.length) {
    return;
  }
 
  // Atomically promote pendingBatch to activeBatch
  this.state.activeBatch = {
    ...this.state.pendingBatch,
    status: 'processing',
    batchStartedAt: Date.now(),
  };
  this.state.pendingBatch = createInitialBatchState();
 
  // Process the batch (now immutable)
  await this.processBatch(this.state.activeBatch);
}
 
async processBatch(batch: BatchState) {
  // Combine all messages into one
  const combinedText = batch.pendingMessages
    .map((msg) => msg.text)
    .join('\n---\n');
 
  // Single LLM call instead of N calls!
  const response = await this.chat(combinedText);
 
  // ...rest of processing
}

Token Savings Example:

Scenario: 3 rapid messages

Without batching:
  msg1: "What's the weather?" (150 tokens)
  msg2: "In New York?" (200 tokens)
  msg3: "Thanks" (100 tokens)
  Total: 450 tokens (3 separate calls with overhead)

With batching (500ms window):
  Combined: "What's the weather?\n---\nIn New York?\n---\nThanks"
  Total: 200 tokens (1 call, no per-call overhead)

Savings: 55% reduction!

Benefits:

  1. No per-call overhead — Context setup happens once
  2. Better context — LLM sees full conversation at once
  3. Faster to first token — User sees thinking indicator sooner
  4. Automatic recovery — A stuck batch is detected after 30s without a heartbeat (see the sketch below)
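Benefit 4 relies on the lastHeartbeat field in BatchState. A minimal sketch of the recovery check, assuming the 30s threshold configured as stuckDetectionMs later in this guide (isBatchStuck is a hypothetical helper):

const STUCK_DETECTION_MS = 30_000; // matches stuckDetectionMs in RoutingFlags

function isBatchStuck(batch: BatchState, now: number = Date.now()): boolean {
  if (batch.status !== 'processing') return false;
  const lastActivity = batch.lastHeartbeat ?? batch.batchStartedAt;
  return now - lastActivity > STUCK_DETECTION_MS;
}

// In onBatchAlarm, before skipping because a batch is still active:
//   if (this.state.activeBatch && isBatchStuck(this.state.activeBatch)) {
//     logger.warn('[BATCH] Stuck batch detected, resetting', {
//       batchId: this.state.activeBatch.batchId,
//     });
//     this.state.activeBatch = undefined; // drop it; pendingBatch promotes on the next alarm
//   }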

Tier 3: Simple Agent Routing (Skip Planning)

Problem: Complex agents do planning + execution = more tokens.

Solution: Route simple queries to SimpleAgent for direct LLM response.

SimpleAgent: Direct LLM (No Planning)

// packages/cloudflare-agent/src/agents/simple-agent.ts
 
export class SimpleAgent extends AgentMixin {
  async execute(context: AgentContext): Promise<AgentResult> {
    // Single LLM call, no planning
    const systemPrompt = getSimpleAgentPrompt(this.config.platform);
 
    const response = await this.provider.generate({
      system: systemPrompt,
      messages: this.state.messages,
      tools: [], // No tools needed
    });
 
    return {
      type: 'text',
      content: response,
      usedTools: [],
    };
  }
}
 
// Usage statistics:
// - Query: "What's the capital of France?"
// - LLM tokens: ~100-150 (no planning, no tool setup)
// - vs Orchestrator: ~500-2000 (plan + tools + execution)

OrchestratorAgent: Planning + Execution (More Tokens)

// packages/cloudflare-agent/src/agents/orchestrator-agent.ts
 
export class OrchestratorAgent extends AgentMixin {
  async execute(context: AgentContext): Promise<AgentResult> {
    // 1. Plan phase (~300 tokens)
    const plan = await this.planPhase(context);
 
    // 2. Dispatch workers (~500-1500 tokens)
    const results = await Promise.all(
      plan.steps.map(step => this.dispatchWorker(step))
    );
 
    // 3. Aggregate results (~100-200 tokens)
    const response = await this.aggregatePhase(results);
 
    // Total: ~1000-2000 tokens
    return response;
  }
 
  private async planPhase(context: AgentContext) {
    // Creates execution plan for workers
    const prompt = getOrchestratorPrompt();
    const plan = await this.provider.generateJSON(prompt);
    return plan;
  }
 
  private async dispatchWorker(step: PlanStep) {
    // Calls WorkerAgent (stateless dispatcher)
    return await this.worker.execute(step);
  }
}

Token Cost Comparison:

Agent Type              Typical Tokens    Use Case
------------------------------------------------------------------
SimpleAgent             50-150            Greetings, FAQs, simple Q&A
HITLAgent               300-1000          Confirmations, approvals
DuyetInfoAgent          100-300           Personal info (with MCP)
LeadResearcherAgent     1000-3000         Multi-agent research
OrchestratorAgent       500-2000          Complex planning tasks

Routing Decision Tree:

// packages/cloudflare-agent/src/routing/index.ts
 
export function determineRouteTarget(
  classification: QueryClassification
): RouteTarget {
  // Skip expensive agent if not needed
 
  if (classification.skipLLM) {
    // Pattern matched -> honor the rule's target (defaults to simplest agent)
    return classification.target ?? 'simple-agent';
  }
 
  if (classification.requiresHumanApproval) {
    return 'hitl-agent'; // 300-1000 tokens (approval flow)
  }
 
  if (classification.category === 'duyet') {
    return 'duyet-info-agent'; // 100-300 tokens (MCP)
  }
 
  if (classification.complexity === 'high') {
    return 'orchestrator-agent'; // Plan + dispatch workers
  }
 
  if (
    classification.category === 'research' &&
    classification.complexity === 'medium' // 'high' is handled above
  ) {
    return 'lead-researcher-agent'; // Parallel research agents
  }
 
  // Default to simple for unknown
  return 'simple-agent'; // 50-150 tokens
}
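Tying the tiers together, the end-to-end request path could look like this sketch (dispatchToAgent is a hypothetical dispatcher; the real wiring lives in the agent classes above):

async function routeQuery(
  query: string,
  context: ClassificationContext
): Promise<AgentResult> {
  // Tier 1: pattern match first, LLM fallback only when needed
  const classification = await hybridClassify(query, context);

  // Tier 3: pick the cheapest agent that can handle the query
  const target = determineRouteTarget(classification);

  return dispatchToAgent(target, query, classification);
}

// Hypothetical dispatcher stub; real code would resolve the agent
// (e.g. a Durable Object stub) by route target and call execute().
async function dispatchToAgent(
  target: RouteTarget,
  query: string,
  classification: QueryClassification
): Promise<AgentResult> {
  throw new Error(`No agent bound for ${target}`);
}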

Tier 4: Deduplication (Skip Redundant Calls)

Problem: Telegram and GitHub retry webhooks when they don't receive an ACK in time.

Solution: Track request IDs, skip duplicate processing.

// packages/cloudflare-agent/src/cloudflare-agent.ts
 
export interface CloudflareAgentState {
  processedRequestIds?: string[]; // Rolling window
  // ...
}
 
async queueMessage(ctx: AgentContext) {
  const requestId = ctx.metadata?.requestId;
 
  // Check deduplication
  if (requestId && this.state.processedRequestIds?.includes(requestId)) {
    logger.info('[DEDUP] Skipped duplicate request', { requestId });
    return; // Don't process, save tokens!
  }
 
  // Mark as processed (only when the platform supplied a request ID)
  if (requestId) {
    if (!this.state.processedRequestIds) {
      this.state.processedRequestIds = [];
    }
    this.state.processedRequestIds.push(requestId);

    // Trim to rolling window (last 100 requests)
    if (this.state.processedRequestIds.length > 100) {
      this.state.processedRequestIds.shift();
    }
  }
 
  // Process normally...
  // ...
}

Token Savings:

  • Platform retries: ~5-10% of requests
  • Each retry avoided: 100-300 tokens
  • Daily impact (100 queries): 500-3000 tokens saved

Token Savings Summary

Breakdown by Mechanism

100 Queries / Day

Mechanism                     Savings    Details
------------------------------------------------------------------
1. Hybrid Classification      60%        Pattern match 80% (0 tokens)
                                         LLM only 20% (300 tokens)

2. Dual-Batch Queuing         55%        Combine 3-5 msgs into 1 call
                                         Saves per-call overhead

3. Simple Agent Routing       40%        No planning overhead
                                         Direct LLM for 70% of queries

4. Deduplication              10%        Skip 5-10% of retries

5. Heartbeat Edits            5%         Edit, don't send (no LLM cost)

------------------------------------------------------------------
COMBINED SAVINGS              ~75%       7,500 tokens vs 30,000

Real-World Impact

Scenario: 1000 queries/day

Without Router:
  1000 queries × 300 tokens = 300,000 tokens/day
  Cost: $0.90/day (at Claude 3.5 Sonnet pricing)

With Router:
  1000 queries × 75 tokens = 75,000 tokens/day
  Cost: $0.225/day

Monthly Savings: ~$20/month per 1000 queries/day
Annual Savings: ~$240/year per 1000 queries/day

Configuration & Tuning

Feature Flags

// packages/cloudflare-agent/src/feature-flags.ts
 
export interface RoutingFlags {
  // Enable/disable hybrid classification
  enableHybridClassifier: boolean;
 
  // Use pattern matching before LLM
  enablePatternMatch: boolean;
 
  // Batch messages together
  enableBatching: boolean;
 
  // Track request IDs
  enableDeduplication: boolean;
 
  // Batch window (ms)
  batchWindowMs: number; // 500ms default
 
  // Stuck detection (ms)
  stuckDetectionMs: number; // 30s default
}
 
// Usage:
const flags: RoutingFlags = {
  enableHybridClassifier: true,
  enablePatternMatch: true,
  enableBatching: true,
  enableDeduplication: true,
  batchWindowMs: 500,
  stuckDetectionMs: 30_000,
};
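One way these flags could gate the classification path, as a sketch (assuming hybridClassify and llmClassify from Tier 1):

async function classifyWithFlags(
  query: string,
  context: ClassificationContext,
  flags: RoutingFlags
): Promise<QueryClassification> {
  // Hybrid path disabled, or pattern phase disabled: pay the full LLM cost
  if (!flags.enableHybridClassifier || !flags.enablePatternMatch) {
    return llmClassify(query, context);
  }
  // Tier 1: patterns first (0 tokens), LLM only as fallback
  return hybridClassify(query, context);
}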

Tuning Batch Window

Smaller window (100ms):

  • Pro: Faster responses (lower perceived latency)
  • Con: Fewer messages collected, less batch efficiency
  • Use: Real-time chat apps where speed matters

Larger window (1000ms):

  • Pro: More messages collected (better batching)
  • Con: Slower first response
  • Use: Batch processing, background jobs

Optimal: 500ms

  • Balance between speed and efficiency
  • User doesn't perceive 500ms delay
  • Collects 3-5 rapid messages typically
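A rough heuristic for adjusting the window from observed batch sizes (illustrative thresholds, not measured; pairs with the [BATCH] Stats logging below):

// Illustrative tuning heuristic: widen the window when batches are mostly
// single messages, narrow it when they are consistently large.
function suggestBatchWindow(currentMs: number, avgMessagesPerBatch: number): number {
  if (avgMessagesPerBatch < 1.5) return Math.min(currentMs * 2, 1000); // little batching; collect longer
  if (avgMessagesPerBatch > 5) return Math.max(currentMs / 2, 100);    // heavy batching; favor latency
  return currentMs; // the 500ms default sits in the sweet spot
}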

Monitoring Token Usage

Key Metrics to Track

// Logging patterns in agent.ts
 
logger.info('[CLASSIFICATION] Complete', {
  queryId,
  classificationMethod: classification.skipLLM ? 'pattern' : 'llm',
  type: classification.type,
  category: classification.category,
  tokensUsed: classification.skipLLM ? 0 : 300,
  latencyMs: Date.now() - startTime,
});
 
logger.info('[BATCH] Processed', {
  batchId,
  messageCount: batch.pendingMessages.length,
  tokensEstimated: combinedText.length / 4, // Rough estimate (~4 chars/token)
  tokensSaved: (batch.pendingMessages.length - 1) * 100, // Per-call overhead
});
 
logger.info('[ROUTER] Dispatched', {
  queryId,
  routedTo: 'simple-agent',
  tokensEstimated: 150,
  estimatedCost: 150 * 0.000003, // Claude pricing
});

Dashboard Insights

Daily Token Summary:
  Pattern Matches (0 tokens):     80% of queries
  LLM Classifications (300 tok):  20% of queries
  Simple Agents (100 tok):        70% of queries
  Orchestrators (1500 tok):       10% of queries
  Duplicates Skipped (saved):     8% of requests

Cost Breakdown (per 100 queries/day):
  Total Tokens:  7,500/day
  Daily Cost:    $0.0225
  Monthly Cost:  $0.675
  Annual Cost:   ~$8.21

Savings vs Baseline:
  Tokens:        22,500 saved/day (75% reduction)
  Cost:          $0.0675 saved/day
  Annual:        ~$24.64 saved/year

Best Practices

1. Keep Patterns Updated

// Add new patterns as they appear in logs
const QUICK_ROUTES = [
  { pattern: /^thanks?$|^no problem|^got it/i, target: 'simple-agent' },
  // Add common phrases that don't need LLM classification
];

2. Monitor Misclassifications

// Track when pattern matches fail
logger.warn('[ROUTING] Pattern mismatch', {
  query,
  pattern: matchedPattern,
  actualCategory: classification.category, // From LLM
  shouldHaveMatched: classification.category === 'greetings',
});

3. Batch Window Tuning

// Monitor batch sizes to optimize window
logger.info('[BATCH] Stats', {
  windowMs: 500,
  avgMessagesPerBatch: 3.2,
  tokensPerQuery: 75,
  efficiency: 'good', // if messagesPerBatch >= 2
});

4. Deduplication Effectiveness

logger.info('[DEDUP] Daily Stats', {
  totalRequests: 1000,
  duplicates: 45,
  deduplicationRate: '4.5%',
  tokensSaved: 45 * 200, // 9000 tokens
});

Testing Token Savings

Unit Tests

// packages/cloudflare-agent/test/routing.test.ts
 
describe('Hybrid Classification', () => {
  const ctx = {} as ClassificationContext; // minimal stub context

  it('should match greeting patterns without LLM', async () => {
    const result = await hybridClassify('hello there!', ctx);
    expect(result.skipLLM).toBe(true);
    expect(result.confidence).toBeGreaterThan(0.95);
  });

  it('should fall back to LLM for semantic queries', async () => {
    const result = await hybridClassify(
      'what are the implications of quantum computing?',
      ctx
    );
    expect(result.skipLLM).toBe(false);
  });
});
 
describe('Batch Processing', () => {
  it('should combine 3 messages into 1 LLM call', async () => {
    const agent = new CloudflareChatAgent();

    await agent.queueMessage({ text: 'msg1' });
    await agent.queueMessage({ text: 'msg2' });
    await agent.queueMessage({ text: 'msg3' });

    await agent.onBatchAlarm();

    expect(provider.generateCalls).toBe(1); // Not 3!
  });
});
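Deduplication (Tier 4) can be verified the same way. A sketch, assuming the same mock agent setup:

describe('Deduplication', () => {
  it('should skip a retried request with the same requestId', async () => {
    const agent = new CloudflareChatAgent();
    const ctx = { text: 'hello', metadata: { requestId: 'req-1' } };

    await agent.queueMessage(ctx);
    await agent.queueMessage(ctx); // platform retry with the same ID

    // Only the first delivery should be queued for processing
    expect(agent.state.pendingBatch?.pendingMessages.length).toBe(1);
  });
});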

Load Testing

# Simulate 100 queries with batching
bun run test:load --queries 100 --batch-window 500
 
Expected Results:
  Without router: 30,000 tokens
  With router:    7,500 tokens
  Savings:        75%

Conclusion

The hybrid router architecture achieves ~75% token reduction through:

  1. Pattern matching (80% instant, 0 tokens)
  2. Dual-batch queuing (55% overhead reduction)
  3. Simple agent routing (40% less planning)
  4. Request deduplication (5-10% retry skipping)
  5. Smart heartbeat (message edits, not sends)

This makes duyetbot-agent 4x more efficient than a naive full-LLM-per-query approach.

Next Steps:

  • Enable Claude Prompt Caching (25% additional savings)
  • Add semantic caching for frequent queries
  • Monitor daily token usage with dashboards
  • Fine-tune patterns based on user data
  • Consider upgrading models when the cost/performance tradeoff warrants it

References

  • Architecture: docs/architecture.md
  • Interactive Dashboard: docs/multiagent-flows.html
  • Implementation: packages/cloudflare-agent/src/
  • Tests: packages/cloudflare-agent/test/