Cortex Router transforms switchAILocal from a static router into an intelligent orchestrator that automatically selects the optimal model for each request using multi-tier classification and semantic matching.

Overview

Cortex Router uses a four-tier routing architecture to match requests with the best available model:
1. Reflex Tier

Pattern matching for instant routing (< 1ms)
  • PII detection → secure models
  • Code blocks → coding models
  • Image URLs → vision models
2. Semantic Tier

Embedding-based intent matching (< 20ms)
  • Uses local embedding models
  • Bypasses LLM for high-confidence matches
  • Matches against 21 pre-built skills
3. Cognitive Tier

LLM-powered classification (200-500ms)
  • Uses lightweight router model
  • Returns confidence scores
  • Falls back to semantic verification
4. Cascade Tier

Quality-based model escalation
  • Detects incomplete responses
  • Automatically retries with stronger models
  • Preserves context across attempts
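
The first three tiers above form a fall-through decision: each tier either returns an intent or defers to the next (the cascade tier runs after a response is produced, so it is omitted here). A minimal Python sketch, with placeholder regexes standing in for the real reflex patterns and stub hooks for the semantic and cognitive tiers:

```python
import re

# Hypothetical reflex patterns: cheap regex checks that route instantly.
# The real patterns are internal to Cortex Router; these are illustrative.
REFLEX_PATTERNS = {
    "coding": re.compile(r"```"),                        # fenced code block
    "vision": re.compile(r"https?://\S+\.(png|jpe?g)"),  # image URL
    "secure": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-like PII
}

def reflex_tier(prompt):
    for intent, pattern in REFLEX_PATTERNS.items():
        if pattern.search(prompt):
            return intent
    return None  # defer to the next tier

def route(prompt, semantic_tier=None, cognitive_tier=None):
    """Fall through reflex -> semantic -> cognitive, as in the tier list."""
    for tier in (reflex_tier, semantic_tier, cognitive_tier):
        if tier is not None:
            intent = tier(prompt)
            if intent is not None:
                return intent
    return "fast"  # default slot when no tier is confident
```

In practice the semantic and cognitive tiers would be embedding matching and an LLM call; here they are pluggable callables so the fall-through order is visible.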

Quick Start

Basic Configuration

Add the intelligence section to your config.yaml:
config.yaml
intelligence:
  enabled: true
  
  # Core routing models
  router-model: "ollama:qwen:0.5b"
  router-fallback: "openai:gpt-4o-mini"
  
  # Intent-to-model mapping
  matrix:
    coding: "switchai-chat"
    reasoning: "switchai-reasoner"
    creative: "switchai-chat"
    fast: "switchai-fast"
    secure: "ollama:llama3.2"  # Local for privacy
    vision: "switchai-chat"
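
Once an intent is classified, the matrix above is a straightforward lookup from intent to model, with router-fallback covering anything unmapped. A minimal sketch (the dict mirrors the config above; resolve_model is an illustrative name, not a real API):

```python
# Mirrors the intelligence.matrix section above.
MATRIX = {
    "coding": "switchai-chat",
    "reasoning": "switchai-reasoner",
    "creative": "switchai-chat",
    "fast": "switchai-fast",
    "secure": "ollama:llama3.2",  # local for privacy
    "vision": "switchai-chat",
}
ROUTER_FALLBACK = "openai:gpt-4o-mini"

def resolve_model(intent):
    """Map a classified intent to a model, falling back for unknown intents."""
    return MATRIX.get(intent, ROUTER_FALLBACK)
```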

Enable Phase 2 Features

config.yaml
intelligence:
  enabled: true
  
  # Automatic model discovery
  discovery:
    enabled: true
    refresh-interval: 3600  # seconds
    cache-dir: "~/.switchailocal/cache/discovery"
  
  # Local embedding for semantic matching
  embedding:
    enabled: true
    model: "all-MiniLM-L6-v2"
  
  # Semantic tier routing
  semantic-tier:
    enabled: true
    confidence-threshold: 0.85
  
  # Skill-based prompt augmentation
  skills:
    enabled: true
    directory: "plugins/cortex-router/skills"
  
  skill-matching:
    enabled: true
    confidence-threshold: 0.80
  
  # Semantic caching
  semantic-cache:
    enabled: true
    similarity-threshold: 0.95
    max-size: 10000
  
  # Confidence scoring
  confidence:
    enabled: true
  
  # Cross-verification
  verification:
    enabled: true
    confidence-threshold-low: 0.60
    confidence-threshold-high: 0.90
  
  # Automatic quality-based cascading
  cascade:
    enabled: true
    quality-threshold: 0.70
  
  # Feedback collection
  feedback:
    enabled: true
    retention-days: 90
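
The verification thresholds above split classifier confidence into three bands: accept outright at or above the high threshold, cross-verify in between, and fall back below the low threshold. A hedged sketch of that decision (the band names are illustrative):

```python
# confidence-threshold-low / -high from the verification config above.
LOW, HIGH = 0.60, 0.90

def verification_action(confidence):
    """Decide what to do with a cognitive-tier classification."""
    if confidence >= HIGH:
        return "accept"    # trust the router model's answer
    if confidence >= LOW:
        return "verify"    # cross-check with semantic verification
    return "fallback"      # reroute via router-fallback
```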

Download Embedding Model

Before using semantic features, download the embedding model:
./scripts/download-embedding-model.sh

Usage

Use model: "auto" or model: "cortex" to enable intelligent routing:
curl http://localhost:18080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-test-123" \
  -d '{
    "model": "auto",
    "messages": [{
      "role": "user",
      "content": "Write a Python function to parse JSON"
    }]
  }'

Intent Classification

Cortex Router automatically detects request intent and routes to specialized models:
Intent | Description | Example Queries
coding | Code generation, debugging | "Write a Go function", "Fix this TypeScript error"
reasoning | Complex analysis, math | "Analyze these trends", "Solve this logic puzzle"
creative | Writing, brainstorming | "Write a blog post", "Generate product names"
fast | Quick factual questions | "What is the capital of France?", "Convert 100 USD to EUR"
secure | Sensitive data handling | "Analyze this medical record", "Review financial data"
vision | Image analysis | "Describe this image", "Extract text from screenshot"

Dynamic Matrix

Phase 2 introduces automatic model discovery that builds optimal routing tables based on available models:
config.yaml
auto-assign:
  enabled: true
  prefer-local: true       # Prefer local models for 'secure' slot
  cost-optimization: true  # Favor cheaper models when quality is similar
  overrides:
    secure: "ollama:llama3.2"  # Manual override

Capability Scoring

Models are scored and assigned to capability slots:
Slot | Priority Factors
coding | Coding capability, context window size, code quality
reasoning | Reasoning capability, accuracy, mathematical ability
creative | General capability, context window, creativity score
fast | Low latency, low cost, acceptable quality
secure | Local models preferred, privacy features
vision | Vision capability required, image understanding
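
Slot assignment can be pictured as scoring every candidate model per slot and taking the best, with prefer-local biasing the secure-style slots. A toy sketch with made-up scores (the model names, scores, and local bonus are illustrative, not values produced by discovery):

```python
# Hypothetical capability scores per model (0..1); real scores come from discovery.
MODELS = {
    "switchai-chat":   {"coding": 0.8, "fast": 0.5, "local": False},
    "switchai-fast":   {"coding": 0.4, "fast": 0.9, "local": False},
    "ollama:llama3.2": {"coding": 0.6, "fast": 0.6, "local": True},
}

def assign_slot(slot, prefer_local=False):
    """Pick the highest-scoring model for a slot, biasing toward local models if asked."""
    def score(item):
        name, caps = item
        bonus = 0.3 if (prefer_local and caps["local"]) else 0.0
        return caps.get(slot, 0.0) + bonus
    return max(MODELS.items(), key=score)[0]
```

With prefer-local set, the local model wins the slot even when a remote model scores slightly higher on raw capability.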

Pre-Built Skills

Cortex Router includes 21 domain-specific skills that augment prompts with expert instructions, including:
  • api-designer: REST API design, OpenAPI specifications
  • devops-expert: CI/CD, infrastructure as code, monitoring
  • docker-expert: Containerization, Dockerfile optimization
  • frontend-expert: React, TailwindCSS, modern frontend
  • go-expert: Go/Golang development for switchAILocal
  • k8s-expert: Kubernetes, Helm, cloud native
  • mcp-builder: Model Context Protocol server development
  • python-expert: Python with async, type hints, pytest
  • typescript-expert: TypeScript type system, advanced patterns
  • testing-expert: Testing methodologies, TDD, Vitest

Semantic Cache

The semantic cache stores routing decisions based on embedding similarity, enabling sub-millisecond routing for repeated queries:
config.yaml
semantic-cache:
  enabled: true
  similarity-threshold: 0.95  # Cache hit if similarity >= this
  max-size: 10000             # Maximum cache entries
Performance: Cache hits return in < 1ms vs 200-500ms for LLM classification.
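
A semantic cache lookup is an embedding similarity search against stored routing decisions: if the best match clears the similarity threshold, its route is reused. A minimal sketch using cosine similarity over toy vectors (real entries would hold all-MiniLM-L6-v2 embeddings, and eviction here is a naive FIFO, not the actual policy):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, similarity_threshold=0.95, max_size=10000):
        self.threshold = similarity_threshold
        self.max_size = max_size
        self.entries = []  # (embedding, route) pairs

    def lookup(self, embedding):
        """Return the cached route of the most similar entry, or None on a miss."""
        best = None
        best_sim = self.threshold  # cache hit if similarity >= threshold
        for vec, route in self.entries:
            sim = cosine(embedding, vec)
            if sim >= best_sim:
                best, best_sim = route, sim
        return best

    def store(self, embedding, route):
        if len(self.entries) >= self.max_size:
            self.entries.pop(0)  # naive FIFO eviction for the sketch
        self.entries.append((embedding, route))
```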

Quality-Based Cascading

Cortex automatically escalates to stronger models when response quality is insufficient:
config.yaml
cascade:
  enabled: true
  quality-threshold: 0.70  # Cascade if quality score < this

Cascade Flow

fast → standard → reasoning
  ↓        ↓          ↓
✗ Low   ✗ Low    ✓ Success
Quality signals detected:
  • Abrupt endings
  • Missing sections
  • Incomplete code blocks
  • Error patterns
  • Very short responses
Cascading increases cost and latency. Set quality-threshold carefully based on your requirements.
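
Each quality signal listed above can be approximated with a cheap heuristic, and the combined score compared against quality-threshold. A sketch of such a scorer (the weights and checks are illustrative, not the actual scoring logic):

```python
def quality_score(response):
    """Crude quality heuristic: start at 1.0 and subtract per detected signal."""
    score = 1.0
    text = response.rstrip()
    if len(text) < 40:
        score -= 0.4                        # very short response
    if text and text[-1] not in ".!?`)\"]}":
        score -= 0.2                        # abrupt ending (no closing punctuation)
    if text.count("```") % 2 == 1:
        score -= 0.3                        # unterminated code block
    if "error:" in text.lower():
        score -= 0.2                        # error pattern
    return max(score, 0.0)

def should_cascade(response, quality_threshold=0.70):
    """Escalate to a stronger model when the score falls below the threshold."""
    return quality_score(response) < quality_threshold
```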

Performance Tuning

config.yaml
intelligence:
  semantic-tier:
    confidence-threshold: 0.80  # Lower = more semantic routing
  semantic-cache:
    enabled: true
    max-size: 50000  # Larger cache
  cascade:
    enabled: false   # Disable for speed

Management API

Phase 2 adds management endpoints for monitoring and control:
Endpoint | Method | Description
/v0/management/skills | GET | List all loaded skills
/v0/management/feedback | GET | Get routing feedback statistics
/v0/management/feedback | POST | Submit explicit feedback
/v0/management/steering/reload | POST | Reload configuration without restart

Troubleshooting

Semantic routing not working

  1. Check the embedding model is downloaded:
    ls ~/.switchailocal/models/all-MiniLM-L6-v2/
  2. Verify embedding is enabled:
    embedding:
      enabled: true
  3. Check logs for initialization errors

Skills not matching

  1. Verify the skills directory exists:
    ls plugins/cortex-router/skills/
  2. Lower the confidence threshold:
    skill-matching:
      confidence-threshold: 0.70  # Lower from 0.80
  3. Check that skill descriptions are descriptive enough

Low cache hit rate

  1. Lower the similarity threshold for more hits:
    semantic-cache:
      similarity-threshold: 0.90  # Lower from 0.95
  2. Increase the cache size:
    semantic-cache:
      max-size: 50000  # Increase from 10000

Model discovery failing

  1. Check provider credentials are configured
  2. Verify network connectivity to providers
  3. Check the discovery cache directory is writable:
    mkdir -p ~/.switchailocal/cache/discovery
    chmod 0700 ~/.switchailocal/cache/discovery

Next Steps