The Heartbeat system provides health checks, monitoring, and operational commands for managing switchAILocal in production environments.

Overview

The Heartbeat system includes:
  • Health Checks: Continuous monitoring of system health
  • Steering Commands: Runtime configuration management
  • Hooks System: Custom event handling
  • Learning Commands: Model performance analysis
The Heartbeat, Steering, Hooks, and Learning commands are currently in development. This page documents the planned architecture based on the source code structure.

CLI Commands

switchAILocal provides several system management commands:
# Memory system (documented separately)
switchAILocal memory <command>

# Heartbeat monitoring
switchAILocal heartbeat <command>

# Steering (runtime configuration)
switchAILocal steering <command>

# Hooks system
switchAILocal hooks <command>

# Learning system
switchAILocal learning <command>

Health Monitoring

Streaming Configuration

Configure heartbeat intervals for streaming requests in config.yaml:
config.yaml
streaming:
  # SSE heartbeat interval (seconds)
  keepalive-seconds: 15
  
  # Number of retries before first byte
  bootstrap-retries: 2
Purpose:
  • keepalive-seconds: Send SSE heartbeat comments to prevent connection timeouts
  • bootstrap-retries: Retry authentication/connection before streaming starts
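To illustrate what keepalive-seconds controls, the sketch below emits an SSE comment line (a line starting with a colon, which SSE clients ignore) whenever no event has been produced within the configured interval. This is a minimal Python sketch under stated assumptions, not switchAILocal's implementation: the names with_keepalive and slow_producer and the queue-based design are hypothetical.

```python
import queue
import threading
import time
from typing import Iterator, Optional

SSE_KEEPALIVE = ": keepalive\n\n"  # SSE comment frame; clients discard it


def with_keepalive(events: "queue.Queue[Optional[str]]",
                   keepalive_seconds: float) -> Iterator[str]:
    """Yield SSE frames from `events`, emitting a comment frame when idle.

    A `None` item on the queue marks the end of the stream.
    """
    while True:
        try:
            item = events.get(timeout=keepalive_seconds)
        except queue.Empty:
            # Nothing arrived within the interval: send a heartbeat comment
            yield SSE_KEEPALIVE
            continue
        if item is None:
            return
        yield f"data: {item}\n\n"


def slow_producer(events: "queue.Queue[Optional[str]]", delay: float) -> None:
    """Hypothetical upstream that is slower than the keepalive interval."""
    time.sleep(delay)
    events.put("hello")
    events.put(None)


if __name__ == "__main__":
    q: "queue.Queue[Optional[str]]" = queue.Queue()
    threading.Thread(target=slow_producer, args=(q, 0.25), daemon=True).start()
    for frame in with_keepalive(q, keepalive_seconds=0.1):
        print(repr(frame))
```

With a producer slower than the interval, the stream starts with one or more comment frames before the first data frame, which is what keeps idle connections from being closed by intermediaries.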

Monitoring Endpoints

Use the management API for health checks. To get comprehensive system metrics:
curl http://localhost:18080/management/metrics
Example Response:
{
  "uptime_seconds": 86400,
  "requests_total": 15234,
  "requests_success": 14987,
  "requests_failed": 247,
  "avg_latency_ms": 234,
  "intelligence": {
    "routing_decisions": 12456,
    "cache_hit_rate": 0.35,
    "semantic_tier_usage": 0.30
  },
  "superbrain": {
    "healing_attempts": 142,
    "healing_success_rate": 0.90
  },
  "memory": {
    "total_decisions": 15234,
    "disk_usage_bytes": 47431680
  }
}
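The raw counters in this response are easier to act on once they are turned into rates. The following Python sketch derives a few values from the example response above; the helper name summarize and the choice of derived fields are assumptions for illustration, and only the field names come from the example.

```python
import json

EXAMPLE_RESPONSE = """
{
  "uptime_seconds": 86400,
  "requests_total": 15234,
  "requests_success": 14987,
  "requests_failed": 247,
  "avg_latency_ms": 234,
  "intelligence": {
    "routing_decisions": 12456,
    "cache_hit_rate": 0.35,
    "semantic_tier_usage": 0.30
  },
  "superbrain": {
    "healing_attempts": 142,
    "healing_success_rate": 0.90
  },
  "memory": {
    "total_decisions": 15234,
    "disk_usage_bytes": 47431680
  }
}
"""


def summarize(metrics: dict) -> dict:
    """Derive rates from the cumulative counters in a metrics response."""
    total = metrics["requests_total"]
    uptime = metrics["uptime_seconds"]
    return {
        # Guard against division by zero right after startup
        "failure_rate": metrics["requests_failed"] / total if total else 0.0,
        "requests_per_second": total / uptime if uptime else 0.0,
        "disk_usage_mib": metrics["memory"]["disk_usage_bytes"] / (1024 * 1024),
    }


if __name__ == "__main__":
    summary = summarize(json.loads(EXAMPLE_RESPONSE))
    print(f"failure rate: {summary['failure_rate']:.2%}")
    print(f"requests/s:   {summary['requests_per_second']:.3f}")
    print(f"disk usage:   {summary['disk_usage_mib']:.1f} MiB")
```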

Steering Commands

Steering allows runtime configuration changes without restarting the service.

Reload Configuration

# Reload config.yaml without restart
curl -X POST http://localhost:18080/v0/management/steering/reload \
  -H "Authorization: Bearer YOUR_MANAGEMENT_KEY"
Use cases:
  • Update intelligence matrix
  • Add/remove API keys
  • Adjust Superbrain settings
  • Modify routing strategies
Some configuration changes (like port or tls) require a full restart.
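For scripted reloads, the same request can be issued from Python's standard library. This is a sketch that mirrors the curl command above; the helper names build_reload_request and reload_config are hypothetical, and the response format of the endpoint is not documented here, so only the HTTP status is returned.

```python
import urllib.request


def build_reload_request(base_url: str, management_key: str) -> urllib.request.Request:
    """Build the POST request used to trigger a configuration reload."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v0/management/steering/reload",
        method="POST",
        headers={"Authorization": f"Bearer {management_key}"},
    )


def reload_config(base_url: str, management_key: str, timeout: float = 10.0) -> int:
    """Send the reload request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, so a wrong
    management key surfaces as an exception rather than a return value.
    """
    request = build_reload_request(base_url, management_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    req = build_reload_request("http://localhost:18080", "YOUR_MANAGEMENT_KEY")
    print(req.get_method(), req.full_url)
```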

Dynamic Matrix Updates

Update the intelligence matrix at runtime:
1. Edit config.yaml

config.yaml
intelligence:
  matrix:
    coding: "switchai-chat"      # Updated
    reasoning: "gemini-2.5-pro"  # Updated
    creative: "switchai-chat"
2. Reload configuration

curl -X POST http://localhost:18080/v0/management/steering/reload \
  -H "Authorization: Bearer YOUR_MANAGEMENT_KEY"
3. Verify changes

curl http://localhost:18080/management/metrics | jq '.intelligence.matrix'

Hooks System

Hooks allow custom event handling for routing decisions, failures, and other events.

Hook Types

Pre-request hooks execute before routing decisions:
plugins/hooks/pre-request.lua
function on_pre_request(request)
  -- Modify request before routing
  if request.user_id == "special_user" then
    request.priority = "high"
  end
  return request
end

Configuration

config.yaml
plugin:
  enabled: true
  hooks:
    enabled: true
    directory: "plugins/hooks"
    pre-request: true
    post-request: true
    on-error: true

Learning Commands

The Learning system analyzes model performance and optimizes routing.

Performance Analysis

# Analyze model performance over time
switchAILocal learning analyze --days 30

# Compare models for specific intent
switchAILocal learning compare --intent coding

# Generate optimization recommendations
switchAILocal learning recommend

Configuration

config.yaml
intelligence:
  learning:
    enabled: true
    min-samples: 100          # Minimum decisions before analysis
    confidence-threshold: 0.8 # Minimum confidence for recommendations
    update-interval: 86400    # Analyze daily (seconds)
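One way to read min-samples and confidence-threshold is as two gates that must both pass before a recommendation is emitted. The Python sketch below encodes that reading; it is an assumption based on the comments in the configuration above, not the Learning system's actual logic, and LearningConfig and should_recommend are hypothetical names. Whether the real comparisons are inclusive is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class LearningConfig:
    """Mirrors the learning settings shown above (defaults from the example)."""
    min_samples: int = 100
    confidence_threshold: float = 0.8


def should_recommend(samples: int, confidence: float, cfg: LearningConfig) -> bool:
    """Return True only when there is enough data and enough confidence."""
    has_enough_data = samples >= cfg.min_samples
    is_confident = confidence >= cfg.confidence_threshold
    return has_enough_data and is_confident


if __name__ == "__main__":
    cfg = LearningConfig()
    print(should_recommend(samples=250, confidence=0.85, cfg=cfg))
    print(should_recommend(samples=40, confidence=0.95, cfg=cfg))
```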

Operational Best Practices

1. Monitor Metrics Regularly

Set up automated monitoring:
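# Check every 5 minutes. A crontab entry must fit on one line (no backslash continuation).
# alert_team is a placeholder for your own notification command.
*/5 * * * * curl -s http://localhost:18080/management/metrics | jq -r '.requests_failed' | (read v; [ "$v" -gt 100 ] && alert_team)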
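The cron check compares requests_failed against a fixed number. If that counter is cumulative since startup (an assumption, suggested by it appearing next to uptime_seconds and requests_total), it will eventually exceed any fixed threshold. A sketch of an alternative is to compare two consecutive snapshots and alert on the failure rate for that interval; interval_failure_rate is a hypothetical helper.

```python
from typing import Optional


def interval_failure_rate(previous: dict, current: dict) -> Optional[float]:
    """Failure rate for requests served between two metrics snapshots.

    Returns None when no requests were served in the interval.
    """
    if current["requests_total"] < previous["requests_total"]:
        # Counters went backwards: the service restarted, so count from zero
        previous = {"requests_total": 0, "requests_failed": 0}
    total = current["requests_total"] - previous["requests_total"]
    failed = current["requests_failed"] - previous["requests_failed"]
    if total == 0:
        return None
    return failed / total


if __name__ == "__main__":
    earlier = {"requests_total": 15000, "requests_failed": 240}
    later = {"requests_total": 15234, "requests_failed": 247}
    rate = interval_failure_rate(earlier, later)
    print(f"failure rate over the interval: {rate:.2%}")
```

Storing the previous snapshot (for example in a small state file) is left out to keep the sketch short.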
2. Configure Health Checks

Use the health endpoint for load balancer health checks:
nginx.conf
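upstream switchailocal {
  zone switchailocal 64k;  # shared memory zone, needed for active health checks
  server localhost:18080;
}

# health_check is an NGINX Plus directive and is only valid inside a location.
# The match block assumes /health returns 200 when the service is healthy.
match ok_status { status 200; }

server {
  location / {
    proxy_pass http://switchailocal;
    health_check interval=10s uri=/health match=ok_status;
  }
}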
3. Enable Audit Logging

Track all management operations:
config.yaml
remote-management:
  allow-remote: false
  secret-key: "your-management-key"

superbrain:
  security:
    audit_log_enabled: true
    audit_log_path: "./logs/audit.log"
4. Set Up Alerts

Monitor critical metrics:
  • Request failure rate > 5%
  • Average latency > 5000ms
  • Disk usage > 90%
  • Superbrain healing success rate < 80%
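The thresholds above can be checked against a metrics response with a few comparisons. The Python sketch below assumes the field names from the example metrics response and interprets "disk usage" as the used fraction of the filesystem holding the data directory; evaluate_alerts and disk_used_fraction are hypothetical helpers.

```python
import shutil
from typing import List


def disk_used_fraction(path: str = ".") -> float:
    """Used fraction (0.0 to 1.0) of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def evaluate_alerts(metrics: dict, disk_fraction: float) -> List[str]:
    """Return a message for every threshold that is currently exceeded."""
    alerts = []
    total = metrics["requests_total"]
    if total and metrics["requests_failed"] / total > 0.05:
        alerts.append("request failure rate above 5%")
    if metrics["avg_latency_ms"] > 5000:
        alerts.append("average latency above 5000 ms")
    if disk_fraction > 0.90:
        alerts.append("disk usage above 90%")
    if metrics["superbrain"]["healing_success_rate"] < 0.80:
        alerts.append("Superbrain healing success rate below 80%")
    return alerts


if __name__ == "__main__":
    healthy = {
        "requests_total": 15234,
        "requests_failed": 247,
        "avg_latency_ms": 234,
        "superbrain": {"healing_success_rate": 0.90},
    }
    print(evaluate_alerts(healthy, disk_used_fraction(".")))
```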
5. Regular Backups

Back up configuration and memory data:
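#!/bin/bash
# Daily backup script
set -euo pipefail            # stop on the first failed command
DATE=$(date +%Y%m%d)
mkdir -p backups             # make sure the target directory exists
cp config.yaml "backups/config-${DATE}.yaml"
switchAILocal memory export --output "backups/memory-${DATE}.tar.gz"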

Monitoring Stack Integration

Prometheus

Export metrics to Prometheus:
docker-compose.yml
services:
  switchailocal:
    image: switchailocal:latest
    ports:
      - "18080:18080"
    environment:
      - PROMETHEUS_ENABLED=true
      - PROMETHEUS_PORT=9090

Grafana Dashboard

Key metrics to monitor:
  • Requests per second
  • Success rate
  • Average latency (p50, p95, p99)
  • Error rate by type
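Percentiles matter because an average such as avg_latency_ms can hide a slow tail. As an illustration of what p50, p95, and p99 mean, the Python sketch below computes them from a list of raw latency samples. It assumes you have per-request latencies available (the metrics endpoint shown earlier only exposes an average), and latency_percentiles is a hypothetical helper; a dashboard would normally derive these from histogram data instead.

```python
import statistics
from typing import Dict, Sequence


def latency_percentiles(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Compute p50, p95, and p99 from raw latency samples (at least two)."""
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    # Mostly fast requests with a few slow outliers
    samples = [120.0] * 95 + [900.0, 1500.0, 2500.0, 4000.0, 8000.0]
    result = latency_percentiles(samples)
    print(f"mean: {statistics.mean(samples):.0f} ms")
    for name, value in result.items():
        print(f"{name}: {value:.0f} ms")
```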

Troubleshooting

Symptom: Health check endpoint taking > 1s to respond
Possible causes:
  1. System under heavy load
  2. Database/disk I/O bottleneck
  3. Network connectivity issues
Solutions:
  • Reduce concurrent request limit
  • Increase server resources
  • Enable request queuing
Error: Failed to reload configuration
Check:
  1. Management key is correct
  2. config.yaml syntax is valid
  3. File permissions allow reading config.yaml
Debug:
# Validate YAML syntax
yamllint config.yaml

# Check file permissions
ls -la config.yaml
Symptom: Streaming requests disconnect after 15-30 seconds
Solution: Increase keepalive interval:
config.yaml
streaming:
  keepalive-seconds: 30  # Increase from 15

Emergency Procedures

Service Degradation

If performance degrades:
1. Check metrics

curl http://localhost:18080/management/metrics
2. Identify bottleneck

  • High latency? Scale horizontally
  • High error rate? Check provider status
  • High memory? Reduce retention period
3. Apply quick fixes

config.yaml
# Disable expensive features temporarily
intelligence:
  cascade:
    enabled: false  # Disable cascading
  semantic-cache:
    max-size: 5000  # Reduce cache size

superbrain:
  mode: "observe"   # Disable healing
4. Reload configuration

curl -X POST http://localhost:18080/v0/management/steering/reload \
  -H "Authorization: Bearer YOUR_MANAGEMENT_KEY"

Complete Outage

If switchAILocal stops responding:
1. Check process status

# The bracket keeps grep from matching its own process
ps aux | grep -i '[s]witchailocal'
2. Review logs

tail -100 logs/switchailocal.log
3. Restart service

systemctl restart switchailocal
# or
docker restart switchailocal
4. Verify health

curl http://localhost:18080/health
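After a restart the service may need a few seconds before it answers, so a single curl can fail spuriously. The Python sketch below polls until a probe succeeds or the attempts run out. It assumes that /health returns HTTP 200 when the service is healthy; http_probe and wait_until_healthy are hypothetical helpers.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response
        return False


def wait_until_healthy(probe: Callable[[], bool],
                       attempts: int = 10,
                       delay: float = 1.0) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    healthy = wait_until_healthy(
        lambda: http_probe("http://localhost:18080/health"),
        attempts=5,
        delay=1.0,
    )
    print("healthy" if healthy else "still not responding")
```

Passing the probe as a callable keeps the retry loop independent of HTTP, so the same loop can wrap a different check if the health endpoint behaves differently.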
