Enterprise Deployment Guide

Complete guide for deploying OmniDaemon in production enterprise environments.

Deployment Overview

OmniDaemon can be deployed across various environments:
  • Cloud (AWS, Azure, GCP)
  • On-Premise (Data centers, private cloud)
  • Hybrid (Mix of cloud and on-prem)
  • Edge (IoT, manufacturing, retail)

Pre-Deployment Checklist

1. Infrastructure Requirements

To Get Started (Minimum):
  • 1 vCPU
  • 1GB RAM
  • 10GB storage
  • Redis server (for event bus + storage)
Production (Depends on Your Use Case): The requirements vary significantly based on:
  • Agent workload type: I/O-bound (API calls, file ops) vs CPU-bound (ML inference, heavy computation)
  • Event throughput: Events per second your system processes
  • Number of consumers: how many consumers you run per agent for load balancing
Realistic ranges:
  • Light I/O workloads: 1-2 vCPU, 2-4GB RAM (handles most API-calling agents)
  • Heavy computation: 4-8 vCPU, 8-16GB RAM (ML inference, data processing)
  • Redis server: 2 vCPU, 4GB RAM (scales well for most workloads)
  • Storage: 20GB+ (depends on result retention and metrics volume)

2. Network Requirements

What You Need (Current Implementation):
  • Port 6379 - Redis (for event bus + storage)
  • Port 8765 (optional) - OmniDaemon API if you enable it
Connectivity:
  • Your agent runners need to connect to Redis
  • Low latency connection is preferred but not required
  • If using multiple servers, keep them in same VPC/region for best performance
That’s it! Redis handles both event streaming and storage in the current implementation.
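
A quick way to verify connectivity from a runner host before deploying is to ping Redis directly. A minimal sketch using redis-py (pip install redis):
# connectivity_check.py - minimal sketch using redis-py
import os
import redis

url = os.environ.get("REDIS_URL", "redis://localhost:6379")
client = redis.Redis.from_url(url)
client.ping()  # raises redis.exceptions.ConnectionError if Redis is unreachable
print(f"Redis reachable at {url}")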

3. Security Requirements

  • TLS certificates for encrypted connections
  • Credentials management (AWS Secrets Manager, HashiCorp Vault)
  • Network security groups/firewall rules
  • IAM roles/service accounts

Deployment Architectures

1. Single Server (Getting Started)

Best for:
  • Development
  • Testing
  • Small-to-medium workloads (most I/O-bound use cases)
What You Actually Need:
┌───────────────────────────────────────┐
│    Single Server (1-2 vCPU, 2-4GB)    │
│                                       │
│  ┌─────────────────────┐              │
│  │   Redis Server      │              │
│  │   (Event Bus +      │              │
│  │    Storage)         │              │
│  └─────────────────────┘              │
│            ↕                          │
│  ┌─────────────────────┐              │
│  │  Agent Runner       │              │
│  │  (Your Python app)  │              │
│  │  • Multiple         │              │
│  │    consumers        │              │
│  └─────────────────────┘              │
│                                       │
└───────────────────────────────────────┘
Setup:
# 1. Install Redis
docker run -d -p 6379:6379 --name redis redis:latest

# 2. Install OmniDaemon
uv add omnidaemon

# 3. Set environment
export REDIS_URL=redis://localhost:6379
export EVENT_BUS_TYPE=redis_stream
export STORAGE_BACKEND=redis

# 4. Run your agent runner
python agent_runner.py
Scaling on Single Server: You can handle significant load on one server by:
  • Adding more consumers per agent (scale based on your workload)
  • Running multiple agent types on same runner
  • Redis can handle 100K+ ops/sec on modest hardware

2. Multi-Server (Horizontal Scaling)

When to scale horizontally:
  • Single server hits resource limits (CPU/Memory)
  • Need high availability (failover protection)
  • Want to separate different agent types across servers
How horizontal scaling works:
  • Each agent runner connects to the same Redis
  • Redis consumer groups automatically distribute work
  • Just start more agent runner instances - that’s it!
Architecture:
┌────────────────┐   ┌────────────────┐   ┌────────────────┐
│  Runner 1      │   │  Runner 2      │   │  Runner N      │
│  (1-2 vCPU)    │   │  (1-2 vCPU)    │   │  (1-2 vCPU)    │
│  • FileAgent   │   │  • FileAgent   │   │  • EmailAgent  │
│    (consumers) │   │    (consumers) │   │    (consumers) │
└────────┬───────┘   └────────┬───────┘   └────────┬───────┘
         │                    │                    │
         └────────────────────┼────────────────────┘

                    ┌─────────▼──────────┐
                    │   Redis Server     │
                    │   (2 vCPU, 4GB)    │
                    │   Event Bus +      │
                    │   Storage          │
                    └────────────────────┘
Redis handles:
  • Consumer group coordination (automatic work distribution)
  • Message persistence and replay
  • Load balancing across all consumers
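
Under the hood this is standard Redis Streams consumer-group behavior. A self-contained sketch with plain redis-py (not the OmniDaemon API; the stream and group names are made up) showing how each message goes to exactly one consumer in the group:
# consumer_group_demo.py - plain redis-py illustration of work distribution
import redis

r = redis.Redis.from_url("redis://localhost:6379", decode_responses=True)

# Create the consumer group once; ignore the error if it already exists
try:
    r.xgroup_create("tasks.demo", "demo-group", id="$", mkstream=True)
except redis.exceptions.ResponseError as e:
    if "BUSYGROUP" not in str(e):
        raise

# Each runner reads with its own consumer name ("consumer-1", "consumer-2", ...).
# Redis delivers every message to exactly one consumer in the group.
for stream, entries in r.xreadgroup("demo-group", "consumer-1",
                                    {"tasks.demo": ">"}, count=10, block=1000):
    for msg_id, fields in entries:
        print(msg_id, fields)
        r.xack("tasks.demo", "demo-group", msg_id)  # acknowledge after processing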
Setup (Docker Compose):
version: '3.8'

services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes

  runner-1:
    build: .
    environment:
      - REDIS_URL=redis://redis:6379
      - EVENT_BUS_TYPE=redis_stream
      - STORAGE_BACKEND=redis
    depends_on:
      - redis
    restart: unless-stopped

  runner-2:
    build: .
    environment:
      - REDIS_URL=redis://redis:6379
      - EVENT_BUS_TYPE=redis_stream
      - STORAGE_BACKEND=redis
    depends_on:
      - redis
    restart: unless-stopped

volumes:
  redis-data:
Scaling strategy:
  1. Vertical first: Add consumers per agent (scale based on your workload)
  2. Horizontal next: Add more runner instances (just start them - Redis handles coordination)
  3. Redis scaling: Only when Redis itself becomes bottleneck (rare for most use cases)
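
A practical signal for step 2 is consumer-group backlog: if unacknowledged messages keep growing, add runners. A sketch (stream and group names are placeholders):
# backlog_check.py - watch backlog to decide when to scale
import redis

r = redis.Redis.from_url("redis://localhost:6379")
length = r.xlen("tasks.demo")                     # total entries in the stream
pending = r.xpending("tasks.demo", "demo-group")  # delivered but not yet acked
print(f"stream length={length}, unacked={pending['pending']}")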

3. Kubernetes (Cloud-Native)

Best for:
  • Auto-scaling
  • Cloud deployments (AWS/Azure/GCP)
  • Large-scale production
Architecture:
┌────────────────────────────────────────────┐
│         Kubernetes Cluster                 │
│                                            │
│  ┌──────────────────────────────────────┐ │
│  │  OmniDaemon Deployment               │ │
│  │  (ReplicaSet: 3 pods)                │ │
│  │                                      │ │
│  │  ┌────────┐ ┌────────┐ ┌────────┐  │ │
│  │  │ Pod 1  │ │ Pod 2  │ │ Pod 3  │  │ │
│  │  └───┬────┘ └───┬────┘ └───┬────┘  │ │
│  └──────┼──────────┼──────────┼────────┘ │
└─────────┼──────────┼──────────┼──────────┘
          │          │          │
          └──────────┴──────────┘

                     │
                     ▼
         ┌──────────────────────┐
         │     Managed Redis    │
         │(e.g. AWS ElastiCache)│
         │  Event Bus + Storage │
         └──────────────────────┘
Kubernetes Manifests:
# secrets.yaml - defines the Secret the Deployment below references
# (placeholder value; in production, populate it from your secrets manager)
apiVersion: v1
kind: Secret
metadata:
  name: omnidaemon-secrets
type: Opaque
stringData:
  redis-url: redis://redis.internal:6379
---
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: omnidaemon-runner
  labels:
    app: omnidaemon
spec:
  replicas: 2  # Start with 2, scale based on load
  selector:
    matchLabels:
      app: omnidaemon
  template:
    metadata:
      labels:
        app: omnidaemon
    spec:
      containers:
      - name: runner
        image: your-registry/omnidaemon-runner:latest
        env:
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: omnidaemon-secrets
              key: redis-url
        - name: EVENT_BUS_TYPE
          value: "redis_stream"
        - name: STORAGE_BACKEND
          value: "redis"
        resources:
          requests:
            memory: "1Gi"     # Start small, adjust based on workload
            cpu: "500m"       # 0.5 CPU
          limits:
            memory: "2Gi"     # Max for safety
            cpu: "1000m"      # 1 CPU max
        livenessProbe:
          exec:
            command:
            - python
            - -c
            - "from omnidaemon import OmniDaemonSDK; sdk = OmniDaemonSDK(); import asyncio; asyncio.run(sdk.health())"
          initialDelaySeconds: 30
          periodSeconds: 60
---
# hpa.yaml (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: omnidaemon-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: omnidaemon-runner
  minReplicas: 2
  maxReplicas: 10  # Adjust based on your needs
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Environment Configuration

Development

# .env.development
EVENT_BUS_TYPE=redis_stream
REDIS_URL=redis://localhost:6379
STORAGE_BACKEND=redis
OMNIDAEMON_LOG_LEVEL=DEBUG
OMNIDAEMON_RECLAIM_INTERVAL=30
OMNIDAEMON_DLQ_RETRY_LIMIT=3

Production

# .env.production
EVENT_BUS_TYPE=redis_stream
REDIS_URL=redis://redis-cluster.internal:6379
STORAGE_BACKEND=redis
OMNIDAEMON_LOG_LEVEL=INFO
OMNIDAEMON_RECLAIM_INTERVAL=60
OMNIDAEMON_DLQ_RETRY_LIMIT=5
OMNIDAEMON_METRICS_STREAM_MAXLEN=10000
OMNIDAEMON_DEFAULT_MAXLEN=100000

High Availability Setup

Redis Cluster (Event Bus + Storage)

# Redis Sentinel for HA (minimal sketch - all containers share one Docker
# network so they can resolve each other by name)
docker network create redis-ha
docker run -d --name redis-master --network redis-ha redis:7
docker run -d --name redis-replica-1 --network redis-ha redis:7 --replicaof redis-master 6379
docker run -d --name redis-replica-2 --network redis-ha redis:7 --replicaof redis-master 6379

# sentinel.conf needs at least (give each sentinel its own copy in practice,
# since Sentinel rewrites its config file):
#   sentinel monitor mymaster redis-master 6379 2
#   sentinel down-after-milliseconds mymaster 5000
#   sentinel failover-timeout mymaster 60000
docker run -d --name redis-sentinel-1 --network redis-ha -v "$PWD/sentinel.conf:/sentinel.conf" redis:7 redis-sentinel /sentinel.conf
docker run -d --name redis-sentinel-2 --network redis-ha -v "$PWD/sentinel.conf:/sentinel.conf" redis:7 redis-sentinel /sentinel.conf
docker run -d --name redis-sentinel-3 --network redis-ha -v "$PWD/sentinel.conf:/sentinel.conf" redis:7 redis-sentinel /sentinel.conf

# Connection string
export REDIS_URL=redis-sentinel://sentinel-host:26379/mymaster
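
From Python, redis-py can resolve the current master through Sentinel directly; a minimal sketch (host names are placeholders):
# sentinel_check.py - minimal sketch using redis-py
from redis.sentinel import Sentinel

sentinel = Sentinel([("sentinel-host", 26379)], socket_timeout=0.5)
print(sentinel.discover_master("mymaster"))   # current (host, port) of the master

master = sentinel.master_for("mymaster", socket_timeout=0.5)
master.ping()                                 # connections follow failovers automatically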

Load Balancing

# nginx.conf (for API)
upstream omnidaemon_api {
    least_conn;
    server runner-1:8765;
    server runner-2:8765;
    server runner-3:8765;
}

server {
    listen 80;
    location / {
        proxy_pass http://omnidaemon_api;
        proxy_set_header Host $host;
    }
}

Monitoring & Observability

Health Checks

# healthcheck.py
import asyncio
import sys

from omnidaemon import OmniDaemonSDK

async def health_check():
    sdk = OmniDaemonSDK()
    health = await sdk.health()

    if health["status"] != "running":
        sys.exit(1)  # Fail health check

    if health["event_bus"]["status"] != "healthy":
        sys.exit(1)

    if health["storage"]["status"] != "healthy":
        sys.exit(1)

    print("✅ Healthy")
    sys.exit(0)

asyncio.run(health_check())

Metrics Collection

# metrics_exporter.py (Prometheus format)
import asyncio
from omnidaemon import OmniDaemonSDK
from prometheus_client import Gauge, start_http_server

# Define Prometheus metrics
tasks_received = Gauge('omnidaemon_tasks_received', 'Total tasks received', ['topic', 'agent'])
tasks_processed = Gauge('omnidaemon_tasks_processed', 'Total tasks processed', ['topic', 'agent'])
tasks_failed = Gauge('omnidaemon_tasks_failed', 'Total tasks failed', ['topic', 'agent'])
avg_processing_time = Gauge('omnidaemon_avg_processing_time_ms', 'Average processing time', ['topic', 'agent'])

async def collect_metrics():
    sdk = OmniDaemonSDK()
    while True:
        metrics = await sdk.get_metrics()
        
        for topic, agents in metrics.items():
            for agent, stats in agents.items():
                tasks_received.labels(topic, agent).set(stats['tasks_received'])
                tasks_processed.labels(topic, agent).set(stats['tasks_processed'])
                tasks_failed.labels(topic, agent).set(stats['tasks_failed'])
                avg_processing_time.labels(topic, agent).set(stats['avg_processing_time_ms'])
        
        await asyncio.sleep(15)  # Update every 15 seconds

# Start Prometheus metrics server
start_http_server(9090)
asyncio.run(collect_metrics())

Security Best Practices

1. Credential Management

AWS Secrets Manager:
import boto3
import json
import os

def get_redis_url():
    client = boto3.client('secretsmanager', region_name='us-east-1')
    secret = client.get_secret_value(SecretId='prod/omnidaemon/redis')
    return json.loads(secret['SecretString'])['REDIS_URL']

# Use in your agent
os.environ['REDIS_URL'] = get_redis_url()
HashiCorp Vault:
import hvac
import os

client = hvac.Client(url='https://vault.internal:8200')
client.auth.approle.login(role_id='...', secret_id='...')

redis_url = client.secrets.kv.v2.read_secret_version(path='omnidaemon/redis')['data']['data']['url']
os.environ['REDIS_URL'] = redis_url

2. Network Security

# AWS Security Group (Terraform)
resource "aws_security_group" "omnidaemon_runner" {
  name = "omnidaemon-runner"

  # Allow outbound to Redis
  egress {
    from_port   = 6379
    to_port     = 6379
    protocol    = "tcp"
    cidr_blocks = ["10.0.1.0/24"]  # Redis subnet
  }

  # No ingress block needed: security groups deny all inbound by default,
  # and runners only make outbound connections.
}

3. TLS/SSL

# Redis with TLS
export REDIS_URL="rediss://redis.internal:6380?ssl_cert_reqs=required&ssl_ca_certs=/etc/ssl/certs/ca.pem"

# Kafka with SSL
export KAFKA_BOOTSTRAP_SERVERS=kafka.internal:9093
export KAFKA_SECURITY_PROTOCOL=SSL
export KAFKA_SSL_CAFILE=/etc/ssl/certs/ca.pem
export KAFKA_SSL_CERTFILE=/etc/ssl/certs/client.pem
export KAFKA_SSL_KEYFILE=/etc/ssl/private/client-key.pem
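
Before pointing runners at a TLS endpoint, it's worth verifying the handshake from Python. A sketch reusing the Redis host and CA path from the example above:
# tls_check.py - verify the TLS Redis endpoint
import redis

client = redis.Redis.from_url(
    "rediss://redis.internal:6380",
    ssl_cert_reqs="required",
    ssl_ca_certs="/etc/ssl/certs/ca.pem",
)
client.ping()
print("TLS connection OK")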

Performance Tuning

1. Agent Runner Optimization

# Increase reclaim interval for lower overhead
export OMNIDAEMON_RECLAIM_INTERVAL=120  # 2 minutes

# Increase message batch size (Redis)
export OMNIDAEMON_BATCH_SIZE=100

# Increase stream max length
export OMNIDAEMON_DEFAULT_MAXLEN=1000000  # 1M messages

2. Python Performance

# Use uvloop for faster asyncio (add to your runner's entrypoint)
import uvloop
uvloop.install()

# Optimize Python runtime (shell environment; note that
# PYTHONOPTIMIZE=2 also strips asserts and docstrings)
export PYTHONOPTIMIZE=2
export PYTHONDONTWRITEBYTECODE=1

3. Resource Limits

# Kubernetes resource limits
resources:
  requests:
    memory: "4Gi"
    cpu: "2"
  limits:
    memory: "8Gi"  # Allow headroom
    cpu: "4"       # Burst capacity

Disaster Recovery

1. Backup Strategy

# Redis backup (RDB + AOF)
redis-cli BGSAVE
redis-cli BGREWRITEAOF

# Automated backups (cron) - backup-redis.sh is your own script; see the sketch below
0 */6 * * * /usr/local/bin/backup-redis.sh  # Every 6 hours
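
What backup-redis.sh might do, sketched in Python (the data and destination paths are assumptions):
# backup_redis.py - sketch of a backup job
import shutil
import time

import redis

r = redis.Redis.from_url("redis://localhost:6379")
before = r.lastsave()
r.bgsave()
while r.lastsave() == before:  # wait for the background save to finish
    time.sleep(1)
shutil.copy("/var/lib/redis/dump.rdb", f"/backups/dump-{int(time.time())}.rdb")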

2. Restore Procedure

# 1. Stop agent runners
kubectl scale deployment omnidaemon-runner --replicas=0

# 2. Stop Redis before restoring (a running Redis overwrites dump.rdb on shutdown)
systemctl stop redis
cp backup.rdb /var/lib/redis/dump.rdb

# 3. Start Redis
systemctl start redis

# 4. Restart runners
kubectl scale deployment omnidaemon-runner --replicas=3

# 5. Verify health
omnidaemon health

3. Stream Replay

# Replay events after data loss
import asyncio
from datetime import datetime, timedelta

from omnidaemon import OmniDaemonSDK

async def replay():
    sdk = OmniDaemonSDK()
    # Replay the last 24 hours
    start_time = datetime.now() - timedelta(days=1)
    await sdk.replay_stream(
        topic="critical.events",
        start_time=start_time,
    )

asyncio.run(replay())

CI/CD Pipeline

GitHub Actions

# .github/workflows/deploy.yml
name: Deploy OmniDaemon

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - uses: astral-sh/setup-uv@v3   # installs uv so the next steps can use it
      - run: uv sync
      - run: uv run pytest tests/

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: docker/login-action@v2  # registry credentials; secret names are yours to define
        with:
          registry: myregistry
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_TOKEN }}
      - uses: docker/build-push-action@v4
        with:
          push: true
          tags: myregistry/omnidaemon-runner:${{ github.sha }}

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}
      - run: |
          kubectl set image deployment/omnidaemon-runner \
            runner=myregistry/omnidaemon-runner:${{ github.sha }}
      - run: kubectl rollout status deployment/omnidaemon-runner

Troubleshooting

Common Issues

1. Agent not receiving messages:
# Check consumer group exists
omnidaemon bus groups --topic your.topic

# Check stream has messages
omnidaemon bus inspect --stream your.topic

# Check agent registration
omnidaemon agent list
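If the CLI isn't available on the box, the same checks can be made directly against Redis; a redis-py sketch:
# group_inspect.py - inspect the consumer group directly
import redis

r = redis.Redis.from_url("redis://localhost:6379", decode_responses=True)
print(r.xinfo_groups("your.topic"))   # consumers, pending count, last-delivered-id per group
print(r.xinfo_stream("your.topic"))   # length, first/last entry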
2. High memory usage:
# Check stream max length
redis-cli XINFO STREAM your.topic

# Trim old messages
redis-cli XTRIM your.topic MAXLEN ~ 10000

# Or set env variable
export OMNIDAEMON_DEFAULT_MAXLEN=10000
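To trim every stream at once rather than one by one, a redis-py sketch (requires Redis 6.0+ for the TYPE option on SCAN):
# trim_all_streams.py - cap every stream at ~10k entries
import redis

r = redis.Redis.from_url("redis://localhost:6379")
for key in r.scan_iter(_type="stream"):
    r.xtrim(key, maxlen=10000, approximate=True)   # "~" trim, cheap for Redis
    print(f"trimmed {key}")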
3. Slow processing:
# Check metrics
omnidaemon metrics

# Check DLQ for failures
omnidaemon bus dlq --topic your.topic

# Profile your agent callback
python -m cProfile agent_runner.py

Production Checklist

Before going live:
  • Load testing completed (target: 2x expected load)
  • Health checks configured
  • Monitoring dashboards set up
  • Alerts configured (PagerDuty, Slack)
  • Backup/restore tested
  • Disaster recovery plan documented
  • Security audit completed
  • TLS/SSL enabled
  • Secrets management configured
  • CI/CD pipeline working
  • Runbook created for on-call team
  • Performance tuning applied
  • Auto-scaling tested
  • Documentation updated

Need help with deployment? Contact us for enterprise support.