24/7 Long-Running Agent Server
Implementation plan for autonomous agent server using Claude Agent SDK with Ralph Loop integration
24/7 Long-Running Agent Server - Implementation Plan
Status: DESIGN PHASE
Priority: HIGH
Created: 2025-12-30
Iteration: 125
Table of Contents
- Overview
- Architecture
- Task Sources
- Ralph Loop Integration
- LLM Providers
- Implementation Phases
- Deployment
- Monitoring & Observability
Overview
The 24/7 Long-Running Agent Server is an autonomous agent system that continuously picks up tasks from multiple sources, executes them with the Claude Agent SDK, and uses Ralph Loop stop hooks as investigation checkpoints.
Key Design Principles
Value Proposition
| Feature | Benefit |
|---|---|
| 24/7 Operation | Continuous autonomous development |
| Multi-Source Tasks | Single agent for TODO.md, MCP todos, GitHub triggers |
| Ralph Loop Hooks | Structured investigation with checkpoint/recovery |
| Claude Agent SDK | Production-ready agent framework |
| LLM Flexibility | Switch between OpenRouter, AI Gateway, Claude API |
Architecture
System Diagram
Project Structure
Task Sources
1. Memory MCP REST API
Purpose: Poll todo items from the Memory MCP server.
Implementation:
Polling Strategy:
- Interval: 30 seconds
- Pagination: 100 items per request
- Backoff: Exponential on errors (30s → 60s → 120s → 300s)
- Timeout: 10 seconds per request
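The polling parameters above can be sketched roughly as follows. This is a minimal illustration, not the server's actual poller: `FetchPage`, `pollOnce`, and `nextPollDelayMs` are hypothetical names, and the Memory MCP request itself is left to the caller because its endpoint is not specified here.

```typescript
// Backoff schedule from the plan: 30s → 60s → 120s → 300s (values in ms).
// Index 0 is the normal polling interval; each consecutive failure moves one step right.
const BACKOFF_STEPS_MS: number[] = [30_000, 60_000, 120_000, 300_000];
const PAGE_SIZE = 100;             // items per request
const REQUEST_TIMEOUT_MS = 10_000; // per-request timeout

/** Delay before the next poll, given the number of consecutive failures so far. */
function nextPollDelayMs(consecutiveFailures: number): number {
  const last = BACKOFF_STEPS_MS.length - 1;
  const index = Math.min(Math.max(consecutiveFailures, 0), last);
  return BACKOFF_STEPS_MS[index] ?? 300_000;
}

/** Hypothetical fetcher: returns one page of todo items and honours the abort signal. */
type FetchPage = (limit: number, signal: AbortSignal) => Promise<unknown[]>;

interface PollResult {
  items: unknown[];
  failures: number;    // updated consecutive-failure count
  nextDelayMs: number; // how long to wait before polling again
}

/** Runs a single poll with the request timeout applied. */
async function pollOnce(fetchPage: FetchPage, failures: number): Promise<PollResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), REQUEST_TIMEOUT_MS);
  try {
    const items = await fetchPage(PAGE_SIZE, controller.signal);
    return { items, failures: 0, nextDelayMs: nextPollDelayMs(0) };
  } catch {
    const next = failures + 1;
    return { items: [], failures: next, nextDelayMs: nextPollDelayMs(next) };
  } finally {
    clearTimeout(timer);
  }
}
```

A lookup table is used instead of pure doubling because the listed schedule jumps from 120s to 300s rather than to 240s.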
2. TODO.md Files
Purpose: Watch project TODO.md files for task definitions.
File Format:
Implementation:
Watching Strategy:
- Use chokidar for cross-platform file watching
- Debounce: 500ms (ignore rapid changes)
- Initial scan: On startup
- Parse format: Flexible markdown-based
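One way to realize the 500ms debounce is to track the last change time per file and only release a path once it has been quiet for the full window. The sketch below is an assumption about how this could be done, independent of chokidar: `ChangeDebouncer`, `touch`, and `drain` are hypothetical names, and the watcher's change events are assumed to be fed into `touch`.

```typescript
const DEBOUNCE_MS = 500;

/** Coalesces rapid change events per file path. */
class ChangeDebouncer {
  private readonly lastChange = new Map<string, number>();

  /** Record a change event for `path` observed at time `nowMs`. */
  touch(path: string, nowMs: number): void {
    this.lastChange.set(path, nowMs);
  }

  /** Return (and forget) every path whose last change is at least DEBOUNCE_MS old. */
  drain(nowMs: number): string[] {
    const ready: string[] = [];
    this.lastChange.forEach((changedAt, path) => {
      if (nowMs - changedAt >= DEBOUNCE_MS) {
        ready.push(path);
        this.lastChange.delete(path);
      }
    });
    return ready;
  }
}
```

In a running server, `touch(path, Date.now())` would be called from the file watcher's change callback and `drain(Date.now())` from a short interval timer; passing the time in explicitly keeps the logic easy to test.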
3. GitHub Webhooks
Purpose: Receive task triggers from GitHub events.
Events:
- issue_comment.created: @duyetbot mentions in issues
- pull_request_review.submitted: Review requests
- workflow_run.completed: CI/CD trigger actions
Implementation:
Webhook Strategy:
- Verification: HMAC-SHA256 signature check
- Deduplication: By delivery ID
- Priority: HIGH (webhooks get immediate attention)
- Response: <200ms (return immediately, process async)
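The verification and deduplication steps above might look like the following sketch. GitHub sends the signature in the `X-Hub-Signature-256` header as `sha256=<hex digest>` and a unique ID in `X-GitHub-Delivery`; the function names and the in-memory `Set` are assumptions made for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

/** Checks the X-Hub-Signature-256 header value against the raw request body. */
function verifySignature(secret: string, rawBody: string, signatureHeader: string): boolean {
  const expected = "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const expectedBytes = Buffer.from(expected);
  const receivedBytes = Buffer.from(signatureHeader);
  // timingSafeEqual throws when lengths differ, so compare lengths first.
  return expectedBytes.length === receivedBytes.length && timingSafeEqual(expectedBytes, receivedBytes);
}

// In-memory store of seen delivery IDs (assumption: a real server would bound or persist this).
const seenDeliveries = new Set<string>();

/** Returns true only the first time a given X-GitHub-Delivery ID is seen. */
function isFirstDelivery(deliveryId: string): boolean {
  if (seenDeliveries.has(deliveryId)) return false;
  seenDeliveries.add(deliveryId);
  return true;
}
```

The signature must be computed over the raw request body, before any JSON parsing, since re-serialized JSON may not match byte for byte.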
Ralph Loop Integration
Stop Hook Architecture
Ralph Loop provides stop hooks that allow the agent to pause at specific points, investigate, and continue.
Hook Points:
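As a rough sketch of what a stop-hook interface could look like: the four hook names below are taken from the Phase 3 task list, while the payload shapes, the `HookDecision` type, and the helper function are assumptions and do not describe the actual Ralph Loop API.

```typescript
/** What a hook tells the agent loop to do next (hypothetical shape). */
type HookDecision = { action: "continue" } | { action: "pause"; reason: string };

/** Hook names follow the Phase 3 task list; the parameters are assumptions. */
interface StopHooks {
  onThinkingStart?(taskId: string): HookDecision;
  onToolComplete?(taskId: string, tool: string, ok: boolean): HookDecision;
  onError?(taskId: string, error: Error): HookDecision;
  onTaskComplete?(taskId: string): HookDecision;
}

const CONTINUE: HookDecision = { action: "continue" };

/** Example hook set: pause for investigation whenever a tool call fails or an error is raised. */
const investigateFailures: StopHooks = {
  onToolComplete: (_taskId, tool, ok) =>
    ok ? CONTINUE : { action: "pause", reason: `tool ${tool} failed` },
  onError: (_taskId, error) => ({ action: "pause", reason: error.message }),
};

/** Calls the tool-complete hook if one is registered; a missing hook means "continue". */
function runToolCompleteHook(hooks: StopHooks, taskId: string, tool: string, ok: boolean): HookDecision {
  return hooks.onToolComplete?.(taskId, tool, ok) ?? CONTINUE;
}
```

Making every hook optional lets a task register only the checkpoints it cares about, with the loop treating an absent hook as "continue".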
Implementation
Investigation Checkpoint Pattern
LLM Providers
Provider Abstraction
Environment Configuration
Provider Selection Strategy
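A minimal sketch of provider selection across the three providers named in the overview (OpenRouter, AI Gateway, Claude API). The environment variable names, the preference order, and the "first configured provider wins" rule are all assumptions for illustration, not the plan's confirmed strategy.

```typescript
type ProviderName = "openrouter" | "ai-gateway" | "claude-api";

interface ProviderConfig {
  name: ProviderName;
  apiKey: string;
}

// Hypothetical environment variable names, listed in an assumed preference order.
const PROVIDER_ENV: Array<[ProviderName, string]> = [
  ["claude-api", "ANTHROPIC_API_KEY"],
  ["openrouter", "OPENROUTER_API_KEY"],
  ["ai-gateway", "AI_GATEWAY_API_KEY"],
];

/** Picks the preferred provider if it is configured, otherwise the first configured one. */
function selectProvider(
  env: Record<string, string | undefined>,
  preferred?: ProviderName,
): ProviderConfig {
  const configured: ProviderConfig[] = PROVIDER_ENV
    .map(([name, key]) => ({ name, apiKey: env[key] ?? "" }))
    .filter((provider) => provider.apiKey !== "");
  const chosen = configured.find((provider) => provider.name === preferred) ?? configured[0];
  if (!chosen) throw new Error("No LLM provider is configured");
  return chosen;
}
```

Called as `selectProvider(process.env)`, this fails fast at startup when no key is present instead of failing on the first task.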
Implementation Phases
Phase 1: Foundation (Iteration 151-160)
Goal: Basic agent server with Claude Agent SDK and single task source
Tasks
- Set up project structure
  - Initialize apps/agent-server package.json
  - Configure TypeScript
  - Set up Vitest for testing
  - Create directory structure
- Implement HTTP server
  - Hono/Express server setup
  - Health check endpoint (GET /health)
  - GitHub webhook receiver (POST /webhook/github)
  - Metrics endpoint (GET /metrics)
- Claude Agent SDK integration
  - Install @anthropic-ai/claude-agent-sdk
  - Create basic agent loop
  - Implement tool execution
  - Add streaming responses
- Database setup
  - SQLite schema design
  - Migration scripts
  - Task repository implementation
  - Database connection pooling
- Memory MCP task source
  - REST API polling implementation
  - Task parsing and normalization
  - Status update callbacks
  - Error handling and retry logic
- Basic tools
  - Bash tool wrapper
  - File read/write tool
  - Git operations tool (clone, status, commit)
  - Todo task management tool
- Testing
  - Unit tests for task sources
  - Integration tests for agent loop
  - End-to-end tests with mock tasks
File Structure (Phase 1)
Phase 2: Multi-Source Tasks (Iteration 161-170)
Goal: Complete task source implementations
Tasks
- TODO.md file watching
  - Chokidar integration
  - Markdown parser
  - Task extraction and normalization
  - Debouncing and deduplication
- GitHub webhook receiver
  - Webhook signature verification
  - Event parsing (issues, PRs, comments)
  - @duyetbot mention detection
  - Task extraction from GitHub events
- Task queue improvements
  - Priority-based scheduling
  - Deduplication by content hash
  - Retry logic with exponential backoff
  - Task dependencies (wait for task X before Y)
- Workspace management
  - Per-task workspace directories
  - Automatic cleanup
  - Git repository isolation
  - Filesystem quotas
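The priority scheduling and content-hash deduplication listed under task queue improvements could be sketched as below. The `QueuedTask` shape, the three priority levels, and the in-memory storage are assumptions; retries and task dependencies are left out to keep the example short.

```typescript
import { createHash } from "node:crypto";

type Priority = "high" | "normal" | "low";
const PRIORITY_RANK: Record<Priority, number> = { high: 0, normal: 1, low: 2 };

/** Hypothetical normalized task shape shared by all sources. */
interface QueuedTask {
  source: string;
  content: string;
  priority: Priority;
}

class TaskQueue {
  private readonly tasks: QueuedTask[] = [];
  private readonly seenHashes = new Set<string>();

  /** Adds a task unless one with the same (trimmed) content hash was already enqueued. */
  enqueue(task: QueuedTask): boolean {
    const hash = createHash("sha256").update(task.content.trim()).digest("hex");
    if (this.seenHashes.has(hash)) return false;
    this.seenHashes.add(hash);
    this.tasks.push(task);
    return true;
  }

  /** Removes and returns the highest-priority task; FIFO among equal priorities. */
  dequeue(): QueuedTask | undefined {
    // Array.prototype.sort is stable, so insertion order is kept within a priority level.
    this.tasks.sort((a, b) => PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority]);
    return this.tasks.shift();
  }
}
```

Sorting on every dequeue is fine for a small queue; a heap would be the natural replacement if the queue grows large.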
Phase 3: Ralph Loop Integration (Iteration 171-180)
Goal: Complete stop hook implementation
Tasks
- Stop hook implementation
  - onThinkingStart hook
  - onToolComplete hook
  - onError hook
  - onTaskComplete hook
- Investigation system
  - Tool result analysis
  - Pattern detection for common issues
  - New task generation from findings
  - Investigation note storage
- Recovery strategies
  - Retry with backoff
  - Skip and continue
  - Escalate to human
  - Alternative approach
- Memory MCP checkpoint storage
  - Checkpoint schema
  - Investigation note storage
  - Query for past checkpoints
  - Resume from checkpoint
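To illustrate how the four recovery strategies might be chosen between, here is a small decision function. The strategy names mirror the list above; the `FailureContext` fields, the retry limit, the base delay, and the ordering of the checks are assumptions, not a confirmed policy.

```typescript
type Recovery =
  | { kind: "retry"; delayMs: number }
  | { kind: "alternative" }
  | { kind: "skip" }
  | { kind: "escalate" };

/** Hypothetical description of a failed step. */
interface FailureContext {
  attempts: number;        // attempts made so far, including the one that just failed
  transient: boolean;      // e.g. timeout or rate limit (classification is assumed)
  critical: boolean;       // whether the task is too important to skip
  hasAlternative: boolean; // whether another approach is known
}

const MAX_RETRIES = 3;      // assumed limit
const BASE_DELAY_MS = 1000; // assumed base delay

/** Retry transient failures with exponential backoff, then fall back to the other strategies. */
function chooseRecovery(ctx: FailureContext): Recovery {
  if (ctx.transient && ctx.attempts <= MAX_RETRIES) {
    return { kind: "retry", delayMs: BASE_DELAY_MS * Math.pow(2, ctx.attempts - 1) };
  }
  if (ctx.hasAlternative) return { kind: "alternative" };
  return ctx.critical ? { kind: "escalate" } : { kind: "skip" };
}
```

Returning a tagged union keeps the agent loop's handling exhaustive: a `switch` on `kind` will flag any strategy that is added later but not handled.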
Phase 4: Advanced Tools (Iteration 181-190)
Goal: Complete tool suite for autonomous development
Tasks
- Code analysis tools
  - AST-based code analysis
  - Dependency graph generation
  - Complexity metrics
  - Security vulnerability scanning
- Git operations
  - Advanced git operations (rebase, cherry-pick)
  - PR creation and management
  - Commit message generation
  - Branch management
- Testing tools
  - Test discovery and execution
  - Coverage reporting
  - Failure analysis
  - Test result storage
- Documentation tools
  - README generation
  - API documentation extraction
  - Changelog generation
  - Diagram generation (Mermaid)
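The dependency graph and Mermaid diagram items could be connected by a small renderer like the one sketched here. The input shape (a map from module name to the modules it depends on) and the function names are assumptions; how the graph itself is extracted is out of scope.

```typescript
/** Reduces a module name to a conservative character set for use as a Mermaid node ID. */
function nodeId(name: string): string {
  return name.replace(/[^A-Za-z0-9_]/g, "_");
}

/**
 * Renders a dependency map (module -> modules it depends on) as a Mermaid flowchart.
 * Assumes names contain no double quotes, since labels are emitted inside ["..."].
 */
function toMermaid(deps: Record<string, string[]>): string {
  const lines: string[] = ["graph TD"];
  for (const from of Object.keys(deps).sort()) {
    const targets = (deps[from] ?? []).slice().sort();
    for (const to of targets) {
      lines.push(`  ${nodeId(from)}["${from}"] --> ${nodeId(to)}["${to}"]`);
    }
  }
  return lines.join("\n");
}
```

Sorting both the modules and their targets makes the output deterministic, so regenerated diagrams only change when the graph does.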
Phase 5: Deployment & Operations (Iteration 191-200)
Goal: Production-ready deployment
Tasks
- Containerization
  - Multi-stage Dockerfile
  - Health check configuration
  - Volume mounting for workspace
  - Secret management
- Fly.io deployment
  - fly.toml configuration
  - Volume mounting
  - Auto-scaling rules
  - Deployment scripts
- Monitoring & alerting
  - Prometheus metrics
  - Grafana dashboards
  - Alert rules (PagerDuty, Slack)
  - Log aggregation (Loki, ELK)
- Security hardening
  - API authentication
  - Rate limiting
  - Request signing verification
  - Secret scanning
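For the rate limiting item, a token bucket is one common approach; the plan does not say which algorithm is intended, so the sketch below is an assumption. The class name and parameters are hypothetical, and time is passed in explicitly to keep the logic deterministic.

```typescript
/** Token bucket: allows bursts up to `capacity`, refilled at `refillPerSecond`. */
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full
    this.lastRefillMs = nowMs;
  }

  /** Returns true and consumes one token if a request is allowed at time `nowMs`. */
  tryConsume(nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

One bucket per client (for example, keyed by API token) would typically be kept in a map, with `tryConsume(Date.now())` called at the start of each request.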
Deployment
Fly.io Deployment
Configuration (fly.toml):
Dockerfile:
Deployment Commands:
Monitoring & Observability
Metrics
Prometheus Metrics:
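As an illustration of what the `GET /metrics` endpoint could return, the sketch below renders a counter in the Prometheus text exposition format. The metric name `agent_tasks_total`, its labels, and the `Counter` class are hypothetical; a production server would more likely use an existing Prometheus client library.

```typescript
/** Escapes a label value per the text exposition format (backslash, quote, newline). */
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

/** Minimal labelled counter that renders itself as Prometheus text. */
class Counter {
  private readonly name: string;
  private readonly help: string;
  private readonly samples = new Map<string, number>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  /** Increments the sample identified by `labels` (label order does not matter). */
  inc(labels: Record<string, string> = {}, by = 1): void {
    const key = Object.keys(labels)
      .sort()
      .map((label) => `${label}="${escapeLabel(labels[label] ?? "")}"`)
      .join(",");
    this.samples.set(key, (this.samples.get(key) ?? 0) + by);
  }

  /** Renders HELP/TYPE lines followed by one line per label combination. */
  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    this.samples.forEach((value, key) => {
      lines.push(key ? `${this.name}{${key}} ${value}` : `${this.name} ${value}`);
    });
    return lines.join("\n") + "\n";
  }
}
```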
Logging
Structured Logging:
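Structured logging here is assumed to mean one JSON object per line, which log aggregators such as Loki or ELK can index by field. The field names `ts`, `level`, and `msg` and the `formatLog` helper are assumptions chosen for the sketch.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

/** Formats one log entry as a single JSON line. */
function formatLog(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  // Spread caller fields first so they cannot overwrite the reserved keys.
  return JSON.stringify({ ...fields, ts: now.toISOString(), level, msg: message });
}
```

A call such as `console.log(formatLog("info", "task started", { taskId: "t1", source: "github" }))` would emit one line per event with the task context attached.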
Alerting
Alert Conditions:
Success Criteria
Phase 1 Success (Iteration 160)
- Agent server deployed and running
- Claude Agent SDK executing tasks
- Memory MCP task source operational
- Basic tools working (bash, fs, git)
- Task queue with persistence
- Health check endpoint responding
- 50+ tests passing
Phase 2 Success (Iteration 170)
- TODO.md file watching operational
- GitHub webhook receiver processing events
- Task queue handling all three sources
- Priority-based scheduling working
- Deduplication preventing duplicate tasks
- Workspace isolation per task
Phase 3 Success (Iteration 180)
- Ralph Loop stop hooks implemented
- Investigation checkpoints storing notes
- New tasks generated from investigations
- Recovery strategies handling errors
- Memory MCP checkpoint storage working
Phase 4 Success (Iteration 190)
- Complete tool suite operational
- Code analysis tools working
- Advanced Git operations working
- Testing tools executing and reporting
- Documentation tools generating docs
Phase 5 Success (Iteration 200)
- Production deployment on Fly.io
- Monitoring dashboards operational
- Alerting configured and tested
- Security hardening complete
- 24/7 operation verified
Next Actions
Immediate (Iteration 125-130)
- Review and approve this implementation plan
- Set up apps/agent-server package structure
- Begin Phase 1 foundation tasks
Short-term (Iteration 131-150)
- Continue with Ralph Loop autonomous development
- Complete remaining TODO.md items
- Prepare infrastructure for agent server
Medium-term (Iteration 151+)
- Begin Phase 1 implementation
- Deploy staging environment
- Test with real tasks from Memory MCP
Document Version: 1.0
Created: 2025-12-30
Last Updated: 2025-12-30
Author: Claude Code with duyetbot co-authorship