Monitoring & Debugging
This guide covers everything you need to monitor, debug, and troubleshoot your OmniDaemon system in production.
Overview
What You’ll Learn:
- ✅ Real-time health monitoring
- ✅ Metrics collection and analysis
- ✅ Agent monitoring (status, performance)
- ✅ Event bus monitoring (streams, consumers, DLQ)
- ✅ Storage monitoring (health, capacity)
- ✅ Debugging failed events
- ✅ Performance optimization
- ✅ Production best practices
Health Monitoring
Check Overall System Health
- System Status: RUNNING, READY, DEGRADED, or DOWN
- Runner ID: Unique identifier for this runner instance
- Uptime: How long the runner has been active
- Event Bus: Connection status and type
- Storage: Connection status and backend type
Health Status Meanings
| Status | Meaning | Action |
|---|---|---|
| RUNNING | Agent runner active, all systems healthy | ✅ Normal operation |
| READY | No runner, but event bus and storage healthy | ℹ️ Ready to start agents |
| DEGRADED | One system unhealthy (bus OR storage) | ⚠️ Investigate unhealthy component |
| DOWN | Both event bus and storage unhealthy | ❌ Critical: Check connections |
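The table can be read as a small decision rule. The sketch below is illustrative only and is not OmniDaemon's own code: the function name `derive_status` and its three boolean inputs are hypothetical. It also assumes that runner state only distinguishes RUNNING from READY, which the table implies but does not state.

```python
def derive_status(runner_active: bool, bus_healthy: bool, storage_healthy: bool) -> str:
    """Map component health onto the four statuses described in the table.

    Hypothetical helper: the inputs are assumed to come from whatever
    health report your deployment exposes.
    """
    if not bus_healthy and not storage_healthy:
        return "DOWN"          # both dependencies unhealthy
    if not (bus_healthy and storage_healthy):
        return "DEGRADED"      # exactly one dependency unhealthy
    # Both dependencies healthy: the runner decides RUNNING vs READY.
    return "RUNNING" if runner_active else "READY"


if __name__ == "__main__":
    print(derive_status(True, True, True))
    print(derive_status(False, True, True))
    print(derive_status(True, False, True))
    print(derive_status(True, False, False))
```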
Programmatic Health Check
- ✅ Kubernetes liveness probes
- ✅ Load balancer health checks
- ✅ Monitoring dashboards
- ✅ Alerting systems
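For the probe and load balancer use cases listed above, a status string usually has to be turned into an HTTP response code. The following is a minimal sketch under stated assumptions: `probe_response` is a hypothetical helper, and the choice to fail only on DOWN by default is a design decision for illustration, not documented OmniDaemon behavior.

```python
import json
from http import HTTPStatus

KNOWN_STATUSES = {"RUNNING", "READY", "DEGRADED", "DOWN"}


def probe_response(status, fail_on=frozenset({"DOWN"})):
    """Return (http_code, json_body) for a health endpoint.

    Assumption: `status` is one of the four status strings from this guide.
    Unknown values are treated as unhealthy.
    """
    healthy = status in KNOWN_STATUSES and status not in fail_on
    code = HTTPStatus.OK if healthy else HTTPStatus.SERVICE_UNAVAILABLE
    return int(code), json.dumps({"status": status, "healthy": healthy})


if __name__ == "__main__":
    # Liveness-style check: only DOWN fails.
    print(probe_response("DEGRADED"))
    # Stricter readiness-style check: DEGRADED also fails.
    print(probe_response("DEGRADED", fail_on={"DOWN", "DEGRADED"}))
```

Keeping DEGRADED as a passing liveness result avoids restart loops while a dependency is recovering. A stricter `fail_on` set suits checks that gate traffic.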
Agent Monitoring
List All Agents
- Shows topic hierarchy
- Agent details nested under topics
- Easy to scan
Get Agent Details
Programmatic Agent Monitoring
Metrics Monitoring
View All Metrics
- Received: Tasks delivered to agent
- Processed: Tasks completed successfully
- Failed: Tasks that errored (sent to DLQ)
- Avg Time: Average processing time per task
- Success Rate: (Processed / Received) × 100%
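As a worked example of the fields above, the sketch below derives the success rate and average time from raw counters. The function name and the `total_time_s` input are hypothetical. It assumes the average is taken over successfully processed tasks, which the source does not specify.

```python
def summarize(received: int, processed: int, failed: int, total_time_s: float) -> dict:
    """Compute derived metrics from raw per-topic counters (illustrative)."""
    # Success Rate = (Processed / Received) x 100%, guarding against division by zero.
    success_rate = (100.0 * processed / received) if received else 0.0
    # Assumption: total_time_s covers only the successfully processed tasks.
    avg_time_s = (total_time_s / processed) if processed else 0.0
    return {
        "received": received,
        "processed": processed,
        "failed": failed,
        "success_rate": success_rate,
        "avg_time_s": avg_time_s,
    }


if __name__ == "__main__":
    print(summarize(received=200, processed=190, failed=10, total_time_s=95.0))
```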
Filter Metrics by Topic
Export Metrics
Programmatic Metrics
- ✅ Performance dashboards
- ✅ SLA monitoring
- ✅ Capacity planning
- ✅ Bottleneck identification
Event Bus Monitoring
List All Streams
Inspect Stream Messages
List Consumer Groups
- Group: Consumer group name
- Consumers: Number of active consumers
- Pending: Messages in the Pending Entries List (delivered but not yet acknowledged)
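If the event bus is backed by Redis Streams, the same three fields can be read straight from Redis with `XINFO GROUPS`. The sketch below assumes a redis-py style client exposing `xinfo_groups(stream)`. The stream key is a placeholder, because the key names OmniDaemon uses are not given here.

```python
def _text(value):
    """Decode bytes to str so the helper works with or without decode_responses."""
    return value.decode() if isinstance(value, bytes) else str(value)


def group_backlog(client, stream: str) -> list:
    """Return group name, consumer count and pending count, largest backlog first.

    `client` is assumed to behave like redis.Redis (it must provide
    xinfo_groups). `stream` is whatever key your deployment uses.
    """
    rows = [
        {
            "group": _text(info["name"]),
            "consumers": int(info["consumers"]),
            "pending": int(info["pending"]),
        }
        for info in client.xinfo_groups(stream)
    ]
    return sorted(rows, key=lambda row: row["pending"], reverse=True)


# Usage (requires the redis package and a reachable server):
#   import redis
#   for row in group_backlog(redis.Redis(), "your-stream-key"):
#       print(row)
```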
Check Dead Letter Queue (DLQ)
- ❌ Max retries exceeded (default: 3)
- ❌ Callback raised exception repeatedly
- ❌ Message processing timeout
- ❌ Invalid message format
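To make the first two conditions concrete, here is an illustrative retry loop that gives up after a fixed number of attempts and produces a DLQ-style record. This is not OmniDaemon's implementation. `process_with_retries` is a hypothetical name, and treating the limit of 3 as the total number of attempts is an assumption.

```python
def process_with_retries(callback, message, max_retries: int = 3) -> dict:
    """Call `callback(message)` up to `max_retries` times.

    On success, return the result. After the final failure, return a record
    shaped like a DLQ entry (error, retry_count, failed_message).
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return {"ok": True, "result": callback(message), "attempts": attempt}
        except Exception as exc:  # broad on purpose: any callback error counts
            last_error = exc
    return {
        "ok": False,
        "dlq_entry": {
            "error": repr(last_error),
            "retry_count": max_retries,
            "failed_message": message,
        },
    }


if __name__ == "__main__":
    def always_fails(msg):
        raise ValueError("bad payload")

    print(process_with_retries(always_fails, {"id": 1}))
```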
Get Bus Statistics
Storage Monitoring
Check Storage Health
Programmatic Storage Monitoring
Debugging Failed Events
Step 1: Check Metrics
- ❌ High “Failed” count
- ❌ Low success rate (< 95%)
- ⏱️ High average processing time
Step 2: Inspect DLQ
- error: Why it failed
- retry_count: How many times it was tried
- failed_message: Original payload
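Once DLQ entries are exported, grouping them by their `error` field quickly shows whether failures share one cause. A minimal sketch, assuming the entries are available as a list of dicts with the fields named above (how you export them depends on your tooling):

```python
from collections import Counter


def failure_patterns(dlq_entries, top: int = 5) -> list:
    """Count DLQ entries per error string and return the most common ones."""
    counts = Counter(entry.get("error", "<unknown>") for entry in dlq_entries)
    return counts.most_common(top)


if __name__ == "__main__":
    sample = [
        {"error": "timeout", "retry_count": 3, "failed_message": {"id": 1}},
        {"error": "bad json", "retry_count": 3, "failed_message": "{oops"},
        {"error": "timeout", "retry_count": 3, "failed_message": {"id": 2}},
    ]
    for error, count in failure_patterns(sample):
        print(f"{count:>4}  {error}")
```

A single dominant error usually points to one root cause. A long tail of different errors suggests input quality problems.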
Step 3: Identify Root Cause
Common Failure Patterns:
1. Invalid Data
2. Transient errors: tune max_retries
3. Timeout: tune reclaim_idle_ms or optimize the callback
4. Bad Logic
Step 4: Test Fix
Step 5: Republish (if needed)
Performance Optimization
1. Monitor Processing Times
- ✅ Fast tasks: < 1 second
- ✅ Medium tasks: 1-10 seconds
- ⚠️ Slow tasks: > 10 seconds
To speed up slow tasks:
- Optimize callback logic
- Reduce external API calls
- Use caching
- Process asynchronously
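The processing-time bands above translate directly into a classifier you can apply to each topic's average time. This is a small sketch. The function name is hypothetical, and a value of exactly 10 seconds is treated as medium.

```python
def classify_processing_time(avg_time_s: float) -> str:
    """Bucket an average processing time using the bands from this guide."""
    if avg_time_s < 0:
        raise ValueError("processing time cannot be negative")
    if avg_time_s < 1:
        return "fast"
    if avg_time_s <= 10:
        return "medium"
    return "slow"


if __name__ == "__main__":
    for seconds in (0.2, 4.0, 25.0):
        print(seconds, classify_processing_time(seconds))
```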
2. Monitor Pending Messages
- ⚠️ Agents can’t keep up with load
- Fix: Increase consumer_count
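A rough way to choose a new value is to multiply the arrival rate by the average processing time, then add headroom. This back-of-the-envelope sketch assumes each consumer handles one message at a time. The function name and the `headroom` factor are hypothetical, and the result is a starting point rather than a tuned setting.

```python
import math


def suggest_consumer_count(arrival_rate_per_s: float,
                           avg_processing_s: float,
                           headroom: float = 1.25) -> int:
    """Estimate how many concurrent consumers keep pending from growing.

    Steady state needs roughly arrival_rate x processing_time busy consumers.
    `headroom` adds spare capacity for bursts.
    """
    if arrival_rate_per_s < 0 or avg_processing_s < 0 or headroom <= 0:
        raise ValueError("rates and times must be non-negative, headroom positive")
    return max(1, math.ceil(arrival_rate_per_s * avg_processing_s * headroom))


if __name__ == "__main__":
    # 20 messages/second at 0.5 s each, with 25% headroom.
    print(suggest_consumer_count(20, 0.5))
```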
3. Monitor DLQ Growth
- ❌ Systematic failures
- Fix: Identify and fix root cause (see Debugging section)
4. Monitor Memory Usage
- Clear old data
- Reduce result TTL
- Trim metrics
Production Monitoring Best Practices
1. Set Up Continuous Monitoring
2. Alert on Anomalies
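One simple way to define an anomaly is a fixed threshold per metric, using the targets listed in the Summary of this guide (success rate above 95%, processing time under 10s, pending under 100, DLQ growth near 0, memory under 80%). The sketch below is illustrative. The snapshot keys and the function name are hypothetical, and how you deliver the resulting alerts is left to your alerting system.

```python
def evaluate_alerts(snapshot: dict, *,
                    min_success_rate: float = 95.0,
                    max_avg_time_s: float = 10.0,
                    max_pending: int = 100,
                    max_dlq_growth: int = 0,
                    max_memory_pct: float = 80.0) -> list:
    """Return a human-readable alert for every threshold the snapshot breaks."""
    alerts = []
    if snapshot["success_rate"] < min_success_rate:
        alerts.append(f"success rate {snapshot['success_rate']:.1f}% below {min_success_rate}%")
    if snapshot["avg_time_s"] > max_avg_time_s:
        alerts.append(f"avg processing time {snapshot['avg_time_s']:.1f}s above {max_avg_time_s}s")
    if snapshot["pending"] >= max_pending:
        alerts.append(f"{snapshot['pending']} pending messages (limit {max_pending})")
    if snapshot["dlq_growth"] > max_dlq_growth:
        alerts.append(f"DLQ grew by {snapshot['dlq_growth']} since last check")
    if snapshot["memory_pct"] >= max_memory_pct:
        alerts.append(f"memory at {snapshot['memory_pct']:.0f}% (limit {max_memory_pct:.0f}%)")
    return alerts


if __name__ == "__main__":
    sample = {"success_rate": 91.0, "avg_time_s": 2.5, "pending": 340,
              "dlq_growth": 0, "memory_pct": 62.0}
    for line in evaluate_alerts(sample):
        print("ALERT:", line)
```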
3. Log Aggregation
4. Dashboards
Grafana/Prometheus Example:
5. Regular Maintenance
Troubleshooting Checklist
System Not Starting
Agents Not Processing
High Latency
DLQ Growing
Further Reading
- Agent Lifecycle - Understanding agent states
- Event Bus Architecture - How Redis Streams works
- Storage Architecture - Storage internals
- Configuration Guide - Tuning parameters
Summary
Key Monitoring Commands:
Key metrics to watch:
- ✅ Success rate (> 95%)
- ✅ Processing time (< 10s)
- ✅ Pending messages (< 100)
- ✅ DLQ growth (near 0)
- ✅ Memory usage (< 80%)
Production best practices:
- Monitor continuously (cron/systemd timers)
- Alert on anomalies
- Aggregate logs
- Create dashboards
- Perform regular maintenance
Debugging workflow:
1. Check metrics (identify the problem topic)
2. Inspect the DLQ (identify the failure pattern)
3. Analyze the root cause
4. Test the fix locally
5. Deploy and monitor