Heartbeat & Monitoring

The Heartbeat system provides health checks, monitoring, and operational commands for managing switchAILocal in production environments.

Overview

The Heartbeat system includes:

Health Checks: Continuous monitoring of system health
Steering Commands: Runtime configuration management
Hooks System: Custom event handling
Learning Commands: Model performance analysis

The Heartbeat, Steering, Hooks, and Learning commands are currently in development. This page documents the planned architecture based on the source code structure.

CLI Commands

switchAILocal provides several system management commands:

# Memory system (documented separately)
switchAILocal memory <command>

# Heartbeat monitoring
switchAILocal heartbeat <command>

# Steering (runtime configuration)
switchAILocal steering <command>

# Hooks system
switchAILocal hooks <command>

# Learning system
switchAILocal learning <command>

Health Monitoring

Streaming Configuration

Configure heartbeat intervals for streaming requests in config.yaml:

config.yaml

streaming:
  # SSE heartbeat interval (seconds)
  keepalive-seconds: 15
  
  # Number of retries before first byte
  bootstrap-retries: 2

Purpose:

keepalive-seconds: Send SSE heartbeat comments to prevent connection timeouts
bootstrap-retries: Retry authentication/connection before streaming starts
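To make the role of keepalive-seconds concrete, the following is a minimal Python sketch, not switchAILocal's implementation. It relies on the SSE rule that a line starting with a colon is a comment that clients ignore. The function names and the timestamped-event input are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple


def sse_comment(text: str = "keepalive") -> str:
    """Format an SSE comment frame; clients ignore lines that start with ':'."""
    return f": {text}\n\n"


def with_heartbeats(
    events: Iterable[Tuple[float, str]], keepalive_seconds: float
) -> Iterator[str]:
    """Yield SSE frames for (timestamp, data) pairs given in time order.

    A comment frame is inserted each time keepalive_seconds elapse
    without a data frame, which is what keeps an idle connection open.
    """
    last_sent = None
    for timestamp, data in events:
        if last_sent is not None:
            next_beat = last_sent + keepalive_seconds
            while next_beat < timestamp:
                yield sse_comment()
                next_beat += keepalive_seconds
        yield f"data: {data}\n\n"
        last_sent = timestamp


# Two data events 40 seconds apart with a 15-second heartbeat interval.
frames = list(with_heartbeats([(0, "first"), (40, "second")], keepalive_seconds=15))
```

With a 40-second gap and a 15-second interval, two comment frames (at 15 s and 30 s) are emitted between the two data frames.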

Monitoring Endpoints

Use the management API for health checks. Three endpoints are covered here: metrics, health check, and provider status.

Get comprehensive system metrics:

curl http://localhost:18080/management/metrics

Example Response:

{
  "uptime_seconds": 86400,
  "requests_total": 15234,
  "requests_success": 14987,
  "requests_failed": 247,
  "avg_latency_ms": 234,
  "intelligence": {
    "routing_decisions": 12456,
    "cache_hit_rate": 0.35,
    "semantic_tier_usage": 0.30
  },
  "superbrain": {
    "healing_attempts": 142,
    "healing_success_rate": 0.90
  },
  "memory": {
    "total_decisions": 15234,
    "disk_usage_bytes": 47431680
  }
}
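As a rough sketch of how a client might consume this response, the Python snippet below derives success and failure rates from the counters. It assumes the field names shown in the example response; the summarize_metrics helper is hypothetical and not part of switchAILocal.

```python
import json


def summarize_metrics(payload: str) -> dict:
    """Derive success and failure rates from a metrics JSON document."""
    metrics = json.loads(payload)
    total = metrics["requests_total"]
    if total == 0:
        # Avoid dividing by zero on a freshly started instance.
        return {"success_rate": 0.0, "failure_rate": 0.0}
    return {
        "success_rate": metrics["requests_success"] / total,
        "failure_rate": metrics["requests_failed"] / total,
    }


# Counters copied from the example response above.
sample = '{"requests_total": 15234, "requests_success": 14987, "requests_failed": 247}'
summary = summarize_metrics(sample)
print(f"failure rate: {summary['failure_rate']:.2%}")
```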

Health check: a simple endpoint that reports service status:

curl http://localhost:18080/health

Response:

{
  "status": "healthy",
  "timestamp": "2026-03-09T14:23:45Z",
  "version": "v2.5.0"
}

Provider status: check provider availability:

curl http://localhost:18080/management/providers

Response:

{
  "providers": [
    {
      "name": "geminicli",
      "status": "available",
      "models": 15,
      "last_check": "2026-03-09T14:23:45Z"
    },
    {
      "name": "claudecli",
      "status": "available",
      "models": 8,
      "last_check": "2026-03-09T14:23:45Z"
    },
    {
      "name": "ollama",
      "status": "available",
      "models": 12,
      "last_check": "2026-03-09T14:23:45Z"
    }
  ]
}
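A monitoring script could filter this response for providers that need attention. The sketch below assumes the response shape shown above; the unavailable_providers helper and the "unreachable" status value are hypothetical.

```python
import json
from typing import List


def unavailable_providers(payload: str) -> List[str]:
    """Return the names of providers whose status is not 'available'."""
    providers = json.loads(payload).get("providers", [])
    return [p["name"] for p in providers if p.get("status") != "available"]


sample = json.dumps({
    "providers": [
        {"name": "geminicli", "status": "available"},
        {"name": "ollama", "status": "unreachable"},  # hypothetical status value
    ]
})
down = unavailable_providers(sample)
```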

Steering Commands

Steering allows runtime configuration changes without restarting:

Reload Configuration

# Reload config.yaml without restart
curl -X POST http://localhost:18080/v0/management/steering/reload \
  -H "Authorization: Bearer YOUR_MANAGEMENT_KEY"

Use cases:

Update intelligence matrix
Add/remove API keys
Adjust Superbrain settings
Modify routing strategies

Some configuration changes (like port or tls) require a full restart.
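For scripted reloads, the same request can be built with Python's standard library. This is a sketch that mirrors the curl command above; build_reload_request is a hypothetical helper, and the snippet constructs the request without sending it.

```python
import urllib.request


def build_reload_request(base_url: str, management_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the steering reload endpoint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v0/management/steering/reload",
        method="POST",
        headers={"Authorization": f"Bearer {management_key}"},
    )


request = build_reload_request("http://localhost:18080", "YOUR_MANAGEMENT_KEY")
# To send it: urllib.request.urlopen(request, timeout=10)
```

Separating construction from sending makes it easy to inspect or log the request before it touches a production instance.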

Dynamic Matrix Updates

Update the intelligence matrix at runtime:

Edit config.yaml

config.yaml

intelligence:
  matrix:
    coding: "switchai-chat"      # Updated
    reasoning: "gemini-2.5-pro"  # Updated
    creative: "switchai-chat"

Reload configuration

curl -X POST http://localhost:18080/v0/management/steering/reload \
  -H "Authorization: Bearer YOUR_MANAGEMENT_KEY"

Verify changes

curl -s http://localhost:18080/management/metrics | jq '.intelligence.matrix'

Hooks System

Hooks allow custom event handling for routing decisions, failures, and other events.

Hook Types

Pre-Request
Post-Request
On-Error

Pre-Request hooks execute before routing decisions:

plugins/hooks/pre-request.lua

function on_pre_request(request)
  -- Modify request before routing
  if request.user_id == "special_user" then
    request.priority = "high"
  end
  return request
end

Post-Request hooks execute after request completion:

plugins/hooks/post-request.lua

function on_post_request(request, response)
  -- Log or process response
  if response.latency_ms > 5000 then
    log_slow_request(request, response)
  end
end

On-Error hooks execute when errors occur:

plugins/hooks/on-error.lua

function on_error(request, error)
  -- Custom error handling
  if error.type == "rate_limit" then
    notify_admin(request, error)
  end
end
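The order in which the three hook points fire can be sketched in Python. This is an illustration of the lifecycle described above, not the switchAILocal dispatcher: the handle function, the calls list, and the choice to skip the post-request hook when an error occurs are all assumptions.

```python
from typing import Callable, Dict, List

Request = Dict[str, object]
Response = Dict[str, object]
calls: List[str] = []  # records hook order for the demo


def on_pre_request(request: Request) -> Request:
    calls.append("pre")
    if request.get("user_id") == "special_user":
        request["priority"] = "high"
    return request


def on_post_request(request: Request, response: Response) -> None:
    calls.append("post")


def on_error(request: Request, error: Exception) -> None:
    calls.append("error")


def handle(request: Request, route: Callable[[Request], Response]) -> Response:
    """Run one request through the three hook points."""
    request = on_pre_request(request)
    try:
        response = route(request)
    except Exception as exc:
        on_error(request, exc)
        raise
    on_post_request(request, response)
    return response


# The stand-in router simply echoes the priority set by the pre-request hook.
result = handle({"user_id": "special_user"}, lambda req: {"priority": req["priority"]})
```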

Configuration

config.yaml

plugin:
  enabled: true
  hooks:
    enabled: true
    directory: "plugins/hooks"
    pre-request: true
    post-request: true
    on-error: true

Learning Commands

The Learning system analyzes model performance and optimizes routing:

Performance Analysis

# Analyze model performance over time
switchAILocal learning analyze --days 30

# Compare models for specific intent
switchAILocal learning compare --intent coding

# Generate optimization recommendations
switchAILocal learning recommend

Configuration

config.yaml

intelligence:
  learning:
    enabled: true
    min-samples: 100          # Minimum decisions before analysis
    confidence-threshold: 0.8 # Minimum confidence for recommendations
    update-interval: 86400    # Analyze daily (seconds)
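One plausible reading of min-samples and confidence-threshold is that both must be satisfied before a recommendation is produced. The Python sketch below encodes that reading. It is an assumption about the planned behavior, and LearningConfig and should_recommend are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LearningConfig:
    min_samples: int = 100              # mirrors min-samples
    confidence_threshold: float = 0.8   # mirrors confidence-threshold


def should_recommend(samples: int, confidence: float, cfg: LearningConfig) -> bool:
    """True only when there is enough data and enough confidence."""
    return samples >= cfg.min_samples and confidence >= cfg.confidence_threshold


cfg = LearningConfig()
decisions = [
    should_recommend(50, 0.95, cfg),   # too few samples
    should_recommend(250, 0.85, cfg),  # both gates pass
]
```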

Operational Best Practices

Monitor Metrics Regularly

Set up automated monitoring:

# Check every 5 minutes. A crontab entry must fit on one line; alert_team is a placeholder for your alerting command.
*/5 * * * * curl -s http://localhost:18080/management/metrics | jq -r '.requests_failed' | (read v; [ "$v" -gt 100 ] && alert_team)

Configure Health Checks

Use the health endpoint for load balancer health checks:

nginx.conf

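# Active health checks (health_check, match) require NGINX Plus.
upstream switchailocal {
  zone switchailocal 64k;  # shared memory zone required by health_check
  server localhost:18080;
}
match ok_status {
  status 200;
}
server {
  location / {
    proxy_pass http://switchailocal;
    health_check interval=10s uri=/health match=ok_status;
  }
}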

Enable Audit Logging

Track all management operations:

config.yaml

remote-management:
  allow-remote: false
  secret-key: "your-management-key"

superbrain:
  security:
    audit_log_enabled: true
    audit_log_path: "./logs/audit.log"

Set Up Alerts

Monitor critical metrics:

Request failure rate > 5%
Average latency > 5000ms
Disk usage > 90%
Superbrain healing success rate < 80%
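The thresholds above can be checked against a metrics payload with a few lines of Python. This is a sketch under stated assumptions: it uses the field names from the example metrics response, takes disk usage as a separately measured fraction (the metrics payload reports bytes, not a percentage), and evaluate_alerts is a hypothetical helper.

```python
from typing import Dict, List


def evaluate_alerts(metrics: Dict, disk_used_fraction: float) -> List[str]:
    """Return the names of alert conditions that are currently breached."""
    alerts = []
    total = metrics.get("requests_total", 0)
    if total and metrics.get("requests_failed", 0) / total > 0.05:
        alerts.append("failure_rate")
    if metrics.get("avg_latency_ms", 0) > 5000:
        alerts.append("latency")
    if disk_used_fraction > 0.90:
        alerts.append("disk_usage")
    healing = metrics.get("superbrain", {}).get("healing_success_rate")
    if healing is not None and healing < 0.80:
        alerts.append("healing_success_rate")
    return alerts


sample = {
    "requests_total": 1000,
    "requests_failed": 80,  # 8% failure rate
    "avg_latency_ms": 234,
    "superbrain": {"healing_success_rate": 0.90},
}
breached = evaluate_alerts(sample, disk_used_fraction=0.95)
```

The disk fraction can be measured with shutil.disk_usage(path), dividing its used field by its total field.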

Regular Backups

Back up configuration and memory data:

#!/bin/bash
# Daily backup script
set -euo pipefail
DATE=$(date +%Y%m%d)
mkdir -p backups
cp config.yaml "backups/config-${DATE}.yaml"
switchAILocal memory export --output "backups/memory-${DATE}.tar.gz"

Monitoring Stack Integration

Prometheus

Export metrics to Prometheus:

docker-compose.yml

services:
  switchailocal:
    image: switchailocal:latest
    ports:
      - "18080:18080"
    environment:
      - PROMETHEUS_ENABLED=true
      - PROMETHEUS_PORT=9090

Grafana Dashboard

Key metrics to monitor fall into four groups: Request Metrics, Intelligence Metrics, Superbrain Metrics, and System Metrics.

Request Metrics:

Requests per second
Success rate
Average latency (p50, p95, p99)
Error rate by type
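For reference, the latency percentiles listed above can be computed from raw samples with Python's statistics module. This is a sketch: latency_percentiles is a hypothetical helper, and the inclusive interpolation method is one choice among several that dashboards use.

```python
import statistics
from typing import Dict, List


def latency_percentiles(samples_ms: List[float]) -> Dict[str, float]:
    """Compute p50, p95, and p99 from raw latency samples (needs at least 2)."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# 101 evenly spaced samples: 1 ms, 2 ms, ..., 101 ms.
percentiles = latency_percentiles(list(range(1, 102)))
```

Percentiles describe tail behavior that an average hides, which is why p95 and p99 are usually tracked alongside p50.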

Troubleshooting

High latency on health checks

Symptom: Health check endpoint takes more than 1s to respond

Possible causes:

System under heavy load
Database/disk I/O bottleneck
Network connectivity issues

Solutions:

Reduce concurrent request limit
Increase server resources
Enable request queuing

Steering reload fails

Error: Failed to reload configuration

Check:

Management key is correct
config.yaml syntax is valid
File permissions allow reading config.yaml

Debug:

# Validate YAML syntax
yamllint config.yaml

# Check file permissions
ls -la config.yaml

Streaming requests timing out

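Symptom: Streaming requests disconnect after 15-30 seconds

Solution: keepalive-seconds is the interval between SSE heartbeats, so lower it to send heartbeats more often and stay under the idle timeout of any proxy or load balancer in the path:

config.yaml

streaming:
  keepalive-seconds: 10  # Decrease from 15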

Emergency Procedures

Service Degradation

If performance degrades:

Check metrics

curl http://localhost:18080/management/metrics

Identify bottleneck

High latency? Scale horizontally
High error rate? Check provider status
High memory? Reduce retention period

Apply quick fixes

config.yaml

# Disable expensive features temporarily
intelligence:
  cascade:
    enabled: false  # Disable cascading
  semantic-cache:
    max-size: 5000  # Reduce cache size

superbrain:
  mode: "observe"   # Disable healing

Reload configuration

curl -X POST http://localhost:18080/v0/management/steering/reload \
  -H "Authorization: Bearer YOUR_MANAGEMENT_KEY"

Complete Outage

If switchAILocal stops responding:

Check process status

ps aux | grep -i '[s]witchailocal'

Review logs

tail -100 logs/switchailocal.log

Restart service

systemctl restart switchailocal
# or
docker restart switchailocal

Verify health

curl http://localhost:18080/health
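After a restart, the service may need a few seconds before it answers. The verification step can be automated with a small polling loop, sketched here in Python. The helper names, attempt count, and delay are assumptions. The "healthy" status value comes from the example health response.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str = "http://localhost:18080/health") -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_healthy(
    fetch: Callable[[], dict] = fetch_health,
    attempts: int = 10,
    delay_seconds: float = 2.0,
) -> bool:
    """Poll until the status is 'healthy' or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "healthy":
                return True
        except (OSError, ValueError):
            pass  # connection refused or invalid JSON while the service starts
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Calling wait_until_healthy() with the defaults polls the local endpoint for roughly 20 seconds. Passing a different fetch callable makes the loop easy to test without a running service.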

Next Steps

Configuration Guide

Management API
