The Heartbeat system provides health checks, monitoring, and operational commands for managing switchAILocal in production environments.
Overview
The Heartbeat system includes:
Health Checks: Continuous monitoring of system health
Steering Commands: Runtime configuration management
Hooks System: Custom event handling
Learning Commands: Model performance analysis
The Heartbeat, Steering, Hooks, and Learning commands are currently in development. This page documents the planned architecture based on the source code structure.
CLI Commands
switchAILocal provides several system management commands:
# Memory system (documented separately)
switchAILocal memory <command>

# Heartbeat monitoring
switchAILocal heartbeat <command>

# Steering (runtime configuration)
switchAILocal steering <command>

# Hooks system
switchAILocal hooks <command>

# Learning system
switchAILocal learning <command>
Health Monitoring
Streaming Configuration
Configure heartbeat intervals for streaming requests in config.yaml:
streaming:
  # SSE heartbeat interval (seconds)
  keepalive-seconds: 15
  # Number of retries before first byte
  bootstrap-retries: 2
Purpose:
keepalive-seconds: Send SSE heartbeat comments to prevent connection timeouts
bootstrap-retries: Retry authentication/connection before streaming starts
Monitoring Endpoints
Use the management API for health checks. Three endpoints are relevant: system metrics, the health check, and provider status.
Get comprehensive system metrics:

curl http://localhost:18080/management/metrics
Example response:

{
  "uptime_seconds": 86400,
  "requests_total": 15234,
  "requests_success": 14987,
  "requests_failed": 247,
  "avg_latency_ms": 234,
  "intelligence": {
    "routing_decisions": 12456,
    "cache_hit_rate": 0.35,
    "semantic_tier_usage": 0.30
  },
  "superbrain": {
    "healing_attempts": 142,
    "healing_success_rate": 0.90
  },
  "memory": {
    "total_decisions": 15234,
    "disk_usage_bytes": 47431680
  }
}
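These counters lend themselves to a quick sanity check. The sketch below computes the success rate with plain shell arithmetic, using the example values above; in practice you would pull the numbers from the metrics endpoint with a JSON tool such as jq.

```shell
# Example counters from the metrics response above
total=15234
success=14987

# Integer success rate in percent (shell arithmetic truncates)
rate=$(( success * 100 / total ))
echo "success rate: ${rate}%"   # prints "success rate: 98%"
```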
Simple health check endpoint:

curl http://localhost:18080/health
Response:

{
  "status": "healthy",
  "timestamp": "2026-03-09T14:23:45Z",
  "version": "v2.5.0"
}
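For scripting, the status field can be extracted without a full JSON parser. A minimal sketch, assuming the response shape shown above:

```shell
# Pull the "status" value out of a /health response with sed
parse_status() {
  printf '%s' "$1" | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

resp='{"status":"healthy","timestamp":"2026-03-09T14:23:45Z","version":"v2.5.0"}'
parse_status "$resp"   # prints "healthy"
```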
Check provider availability:

curl http://localhost:18080/management/providers
Response:

{
  "providers": [
    {
      "name": "geminicli",
      "status": "available",
      "models": 15,
      "last_check": "2026-03-09T14:23:45Z"
    },
    {
      "name": "claudecli",
      "status": "available",
      "models": 8,
      "last_check": "2026-03-09T14:23:45Z"
    },
    {
      "name": "ollama",
      "status": "available",
      "models": 12,
      "last_check": "2026-03-09T14:23:45Z"
    }
  ]
}
Steering Commands
Steering allows runtime configuration changes without restarting:
Reload Configuration
# Reload config.yaml without restart
curl -X POST http://localhost:18080/v0/management/steering/reload \
-H "Authorization: Bearer YOUR_MANAGEMENT_KEY"
Use cases:
Update intelligence matrix
Add/remove API keys
Adjust Superbrain settings
Modify routing strategies
Some configuration changes (like port or tls) require a full restart.
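A reload wrapper can make this safer by keeping a copy of the last known-good configuration before asking the server to pick up changes. This is a sketch: the config.yaml.last-good filename is illustrative, while the endpoint and the management key come from the examples on this page.

```shell
# Keep a copy of the last known-good config before reloading, so a bad
# edit can be rolled back quickly. "config.yaml.last-good" is an
# illustrative name, not a switchAILocal convention.
reload_config() {
  cp config.yaml config.yaml.last-good
  curl -sf -X POST http://localhost:18080/v0/management/steering/reload \
    -H "Authorization: Bearer ${MANAGEMENT_KEY}" || {
      echo "reload failed; last good config kept in config.yaml.last-good" >&2
      return 1
    }
}

# Usage: edit config.yaml, then run:
# reload_config
```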
Dynamic Matrix Updates
Update the intelligence matrix at runtime:
Edit config.yaml
intelligence:
  matrix:
    coding: "switchai-chat"      # Updated
    reasoning: "gemini-2.5-pro"  # Updated
    creative: "switchai-chat"
Reload configuration
curl -X POST http://localhost:18080/v0/management/steering/reload \
-H "Authorization: Bearer YOUR_MANAGEMENT_KEY"
Verify changes
curl http://localhost:18080/management/metrics | jq '.intelligence.matrix'
Hooks System
Hooks allow custom event handling for routing decisions, failures, and other events.
Hook Types

Hooks run at three points in the request lifecycle: before routing (pre-request), after completion (post-request), and when errors occur (on-error).
Execute before routing decisions (plugins/hooks/pre-request.lua):

function on_pre_request(request)
  -- Modify request before routing
  if request.user_id == "special_user" then
    request.priority = "high"
  end
  return request
end
Execute after request completion (plugins/hooks/post-request.lua):

function on_post_request(request, response)
  -- Log or process response
  if response.latency_ms > 5000 then
    log_slow_request(request, response)
  end
end
Execute when errors occur (plugins/hooks/on-error.lua):

function on_error(request, error)
  -- Custom error handling
  if error.type == "rate_limit" then
    notify_admin(request, error)
  end
end
Configuration
plugin:
  enabled: true
  hooks:
    enabled: true
    directory: "plugins/hooks"
    pre-request: true
    post-request: true
    on-error: true
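For the configuration above to take effect, the hook files must exist under the configured directory. A sketch that scaffolds the layout with a pass-through pre-request stub (the stub body is a placeholder that simply returns the request unchanged):

```shell
# Create the hooks directory and a minimal pass-through pre-request hook
mkdir -p plugins/hooks

cat > plugins/hooks/pre-request.lua <<'EOF'
function on_pre_request(request)
  -- no-op: return the request unchanged
  return request
end
EOF

ls plugins/hooks   # lists pre-request.lua
```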
Learning Commands
The Learning system analyzes model performance and optimizes routing:
# Analyze model performance over time
switchAILocal learning analyze --days 30
# Compare models for specific intent
switchAILocal learning compare --intent coding
# Generate optimization recommendations
switchAILocal learning recommend
Configuration
intelligence:
  learning:
    enabled: true
    min-samples: 100            # Minimum decisions before analysis
    confidence-threshold: 0.8   # Minimum confidence for recommendations
    update-interval: 86400      # Analyze daily (seconds)
Operational Best Practices
Monitor Metrics Regularly
Set up automated monitoring:

# Check every 5 minutes (crontab entry)
*/5 * * * * curl -s http://localhost:18080/management/metrics | jq -r '.requests_failed' | ( read v; [ "$v" -gt 100 ] && alert_team )
Configure Health Checks
Use the health endpoint for load balancer health checks. For example, with NGINX Plus active health checks (the health_check directive belongs in a location context, and match=ok_status refers to a separately defined match block):

upstream switchailocal {
    server localhost:18080;
}

server {
    location / {
        proxy_pass http://switchailocal;
        health_check interval=10s uri=/health match=ok_status;
    }
}
Enable Audit Logging
Track all management operations:

remote-management:
  allow-remote: false
  secret-key: "your-management-key"

superbrain:
  security:
    audit_log_enabled: true
    audit_log_path: "./logs/audit.log"
Set Up Alerts
Monitor critical metrics:
Request failure rate > 5%
Average latency > 5000ms
Disk usage > 90%
Superbrain healing success rate < 80%
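The first two thresholds above can be wired into a small shell helper. This is a sketch: the inputs correspond to requests_failed, requests_total, and avg_latency_ms from the metrics response, and alerting itself is left to whatever paging hook you use.

```shell
# Evaluate alert thresholds against metrics values
# (failed/total from requests_failed/requests_total, latency in ms)
check_thresholds() {
  local failed=$1 total=$2 latency=$3
  # request failure rate above 5%?
  if [ $(( failed * 100 / total )) -gt 5 ]; then
    echo "ALERT: failure rate above 5%"
  fi
  # average latency above 5000 ms?
  if [ "$latency" -gt 5000 ]; then
    echo "ALERT: average latency above 5000 ms"
  fi
}

check_thresholds 247 15234 234    # healthy example values: no output
check_thresholds 2000 15234 6000  # prints both alerts
```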
Regular Backups
Back up configuration and memory data:

#!/bin/bash
# Daily backup script
DATE=$(date +%Y%m%d)
cp config.yaml "backups/config-${DATE}.yaml"
switchAILocal memory export --output "backups/memory-${DATE}.tar.gz"
Monitoring Stack Integration
Prometheus
Export metrics to Prometheus:
services:
  switchailocal:
    image: switchailocal:latest
    ports:
      - "18080:18080"
    environment:
      - PROMETHEUS_ENABLED=true
      - PROMETHEUS_PORT=9090
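On the Prometheus side, a matching scrape job would point at the exporter port. A sketch assuming metrics are exposed on port 9090 as configured above; the job name and scrape interval are illustrative:

```yaml
scrape_configs:
  - job_name: "switchailocal"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9090"]
```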
Grafana Dashboard
Key metrics to monitor:

Request Metrics
Requests per second
Success rate
Average latency (p50, p95, p99)
Error rate by type

Intelligence Metrics
Routing tier distribution
Cache hit rate
Semantic tier usage
Average confidence scores

Superbrain Metrics
Healing attempts
Healing success rate
Silence detections
Fallback routing count

System Metrics
Memory usage
Disk usage
CPU usage
Open connections
Troubleshooting
High latency on health checks
Symptom: Health check endpoint takes longer than 1 second to respond.

Possible causes:
System under heavy load
Database/disk I/O bottleneck
Network connectivity issues
Solutions:
Reduce concurrent request limit
Increase server resources
Enable request queuing
Configuration reload fails

Error: Failed to reload configuration

Check:
Management key is correct
config.yaml syntax is valid
File permissions allow reading config.yaml
Debug:

# Validate YAML syntax
yamllint config.yaml

# Check file permissions
ls -la config.yaml
Streaming requests timing out
Symptom: Streaming requests disconnect after 15-30 seconds.

Solution: Increase the keepalive interval:

streaming:
  keepalive-seconds: 30  # Increase from 15
Emergency Procedures
Service Degradation
If performance degrades:
Check metrics
curl http://localhost:18080/management/metrics
Identify bottleneck
High latency? Scale horizontally
High error rate? Check provider status
High memory? Reduce retention period
Apply quick fixes
# Disable expensive features temporarily
intelligence:
  cascade:
    enabled: false   # Disable cascading
  semantic-cache:
    max-size: 5000   # Reduce cache size

superbrain:
  mode: "observe"    # Disable healing
Reload configuration
curl -X POST http://localhost:18080/v0/management/steering/reload \
-H "Authorization: Bearer YOUR_MANAGEMENT_KEY"
Complete Outage
If switchAILocal stops responding:
Check process status
ps aux | grep switchAILocal
Review logs
tail -100 logs/switchailocal.log
Restart service
systemctl restart switchailocal
# or
docker restart switchailocal
Verify health
curl http://localhost:18080/health
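After a restart it helps to poll until the service actually reports healthy rather than checking once. A sketch: the host, port, retry count, and sleep interval are adjustable, and the status match assumes the /health response shape shown earlier.

```shell
# Poll /health until the service reports "healthy", up to N attempts
wait_for_health() {
  local url=${1:-http://localhost:18080/health} tries=${2:-30}
  local i
  for i in $(seq "$tries"); do
    if curl -sf "$url" | grep -q '"status"[[:space:]]*:[[:space:]]*"healthy"'; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep 2
  done
  echo "not healthy after ${tries} attempts" >&2
  return 1
}

# Usage after a restart:
# wait_for_health && echo "service is back"
```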