Overview

The Cortex Router plugin provides intelligent, multi-tier routing that automatically selects the optimal model based on request content. This eliminates the need to manually choose providers and models for each request.

Quick Start

Enable Cortex Router

# config.yaml
plugin:
  enabled: true
  enabled-plugins:
    - "cortex-router"

intelligence:
  enabled: true
  router-model: "ollama:qwen:0.5b"  # Fast classification model
  matrix:
    coding: "switchai-chat"
    reasoning: "switchai-reasoner"
    fast: "switchai-fast"
    secure: "ollama:llama3.2"  # Local for sensitive data

Basic Usage

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123",
)

# Use "auto" or "cortex" to trigger intelligent routing
completion = client.chat.completions.create(
    model="auto",  # Let Cortex Router decide
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}]
)
# → Automatically routes to coding model

print(completion.choices[0].message.content)

How It Works

Cortex Router analyzes each request through up to three tiers, cheapest first:

1. Reflex Tier (<1ms)

Pattern matching for obvious cases:
  • Code blocks → coding model
  • PII detection → secure/local model
  • Image attachments → vision model
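Conceptually, the reflex tier is a set of cheap pattern checks run before anything else. A minimal sketch of the idea (the patterns and function below are illustrative, not the plugin's actual rules):

```python
import re

# Illustrative reflex-tier rules -- the plugin's real patterns are internal.
REFLEX_RULES = [
    (re.compile(r"```|\bdef |\bclass |\bfunction\b"), "coding"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b|[\w.+-]+@[\w-]+\.\w+"), "secure"),
]

def reflex_route(text: str):
    """Return a capability slot for obvious cases, or None to fall through."""
    for pattern, slot in REFLEX_RULES:
        if pattern.search(text):
            return slot
    return None  # no obvious match: hand off to the semantic tier

print(reflex_route("Debug this: def factorial(n): ..."))  # coding
print(reflex_route("My SSN is 123-45-6789"))              # secure
```

Because it is pure pattern matching, this tier is deterministic and costs well under a millisecond.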
2. Semantic Tier (<20ms)

Embedding-based intent matching (Phase 2):
  • Compute query embedding
  • Match against pre-computed intent vectors
  • Route if confidence > threshold
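In essence the semantic tier is a nearest-neighbor lookup over pre-computed intent embeddings. A toy sketch with made-up 3-dimensional vectors (real vectors come from the embedding model; all-MiniLM-L6-v2 produces 384 dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy pre-computed intent vectors; the real ones are MiniLM embeddings.
INTENT_VECTORS = {
    "coding":    [0.9, 0.1, 0.0],
    "reasoning": [0.1, 0.9, 0.2],
    "fast":      [0.0, 0.2, 0.9],
}

def semantic_route(query_vec, threshold=0.85):
    """Return the best-matching slot, or None to escalate to the cognitive tier."""
    slot, score = max(
        ((s, cosine(query_vec, v)) for s, v in INTENT_VECTORS.items()),
        key=lambda pair: pair[1],
    )
    return slot if score >= threshold else None

print(semantic_route([0.88, 0.12, 0.01]))  # coding
print(semantic_route([0.5, 0.5, 0.5]))     # None (ambiguous -> cognitive tier)
```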
3. Cognitive Tier (200-500ms)

LLM-based classification:
  • Analyze query with router model
  • Generate confidence scores
  • Select optimal capability slot
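The cognitive tier boils down to asking the small router model itself to classify the query. A sketch of that step (the prompt wording and fallback slot are assumptions, not the plugin's actual implementation):

```python
SLOTS = {"coding", "reasoning", "creative", "fast", "secure", "vision"}

# Hypothetical classification prompt for the router model (e.g. ollama:qwen:0.5b).
CLASSIFY_PROMPT = (
    "Classify the user query into exactly one capability slot: "
    "coding, reasoning, creative, fast, secure, vision. "
    "Answer with only the slot name.\n\nQuery: {query}"
)

def cognitive_route(llm_call, query: str) -> str:
    """llm_call(prompt) -> str wraps the configured router model."""
    answer = llm_call(CLASSIFY_PROMPT.format(query=query)).strip().lower()
    return answer if answer in SLOTS else "fast"  # assumed safe default

# Stub standing in for the router model:
print(cognitive_route(lambda p: "Reasoning\n", "Prove sqrt(2) is irrational"))
# → reasoning
```

The 200-500ms figure comes from this extra round trip to the router model, which is why the cheaper tiers run first.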

Routing Examples

Coding Tasks

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123",
)

# These are automatically routed to the coding model
coding_queries = [
    "Write a binary search algorithm in Go",
    "Debug this Python code: def factorial(n): return n * factorial(n)",
    "Explain how async/await works in JavaScript",
    "Generate unit tests for this function",
]

for query in coding_queries:
    completion = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": query}]
    )
    print(f"Query: {query[:50]}...")
    print(f"Response: {completion.choices[0].message.content[:100]}...\n")

Reasoning Tasks

# These route to the reasoning model
reasoning_queries = [
    "Solve this logic puzzle: If all bloops are razzies...",
    "What are the ethical implications of AI in healthcare?",
    "Prove that the square root of 2 is irrational",
    "Analyze the pros and cons of remote work",
]

for query in reasoning_queries:
    completion = client.chat.completions.create(
        model="cortex",  # Explicit Cortex routing
        messages=[{"role": "user", "content": query}]
    )
    print(f"Reasoning: {completion.choices[0].message.content[:100]}...\n")

Fast Queries

# These route to the fast model
fast_queries = [
    "What is 2+2?",
    "Define recursion",
    "What is the capital of France?",
    "Translate 'hello' to Spanish",
]

for query in fast_queries:
    completion = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": query}]
    )
    print(f"Fast response: {completion.choices[0].message.content}\n")

Secure/Private Data

# Queries with PII automatically route to local models
sensitive_queries = [
    "My email is john@example.com and I need help with...",
    "Analyze this customer data: SSN 123-45-6789",
    "Review this API key: sk_live_1234567890abcdef",
]

for query in sensitive_queries:
    completion = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": query}]
    )
    # → PII detected server-side; routed to ollama:llama3.2 (local)
    print("🔒 Sensitive query handled by local model")

Phase 2 Features

Semantic Tier

Enable faster intent matching with embeddings:
intelligence:
  enabled: true
  
  embedding:
    enabled: true
    model: "all-MiniLM-L6-v2"
  
  semantic-tier:
    enabled: true
    confidence-threshold: 0.85
Download the embedding model first:
./scripts/download-embedding-model.sh
Benefits:
  • 10-20x faster than LLM classification
  • Deterministic results
  • No API costs

Skill Matching

Match queries to domain-specific skills:
intelligence:
  skills:
    enabled: true
    directory: "plugins/cortex-router/skills"
  
  skill-matching:
    enabled: true
    confidence-threshold: 0.80
Example:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123",
)

# These match to specialized skills
completion = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Write a Kubernetes deployment YAML"}]
)
# → Matches "kubernetes" skill → augments prompt with K8s expertise

completion = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Optimize this Go code for concurrency"}]
)
# → Matches "golang" skill → adds Go best practices to context
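The documentation does not show the on-disk skill format; as a purely hypothetical illustration (every field name below is invented), a skill definition in the configured directory might look something like:

```yaml
# plugins/cortex-router/skills/kubernetes.yaml -- hypothetical schema
name: kubernetes
description: "Kubernetes manifests, Helm charts, cluster operations"
keywords: [kubernetes, k8s, deployment, kubectl, helm]
system-prompt: |
  You are a Kubernetes expert. Use current stable API versions and
  include resource requests/limits in example manifests.
```

When a query clears the 0.80 confidence threshold against a skill, its prompt material is added to the request context before routing.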

Dynamic Model Discovery

Automatically discover available models from all providers:
intelligence:
  discovery:
    enabled: true
    refresh-interval: 3600  # Re-discover every hour
    cache-dir: "~/.switchailocal/cache/discovery"
  
  auto-assign:
    enabled: true
    prefer-local: true  # Prefer local models for 'secure' slot
    cost-optimization: true  # Favor cheaper models when quality is similar
What it does:
  • Queries all providers for available models at startup
  • Assigns optimal models to capability slots based on:
    • Model capabilities (coding, reasoning, vision)
    • Context window size
    • Cost
    • Local vs cloud
  • Updates the matrix automatically
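The assignment criteria above can be pictured as a per-slot scoring function over discovered models. A sketch (the weights and model fields are invented for illustration):

```python
def score_model(model: dict, slot: str, prefer_local: bool = True) -> float:
    """Score a discovered model for a capability slot. The `model` fields
    (capabilities, context, cost_per_1k, local) are illustrative."""
    score = 0.0
    if slot in model["capabilities"]:
        score += 10.0                                  # capability match dominates
    score += min(model["context"], 200_000) / 200_000  # larger context helps a bit
    score -= model["cost_per_1k"]                      # cost-optimization
    if slot == "secure" and prefer_local:
        score += 5.0 if model["local"] else -5.0       # prefer-local for 'secure'
    return score

local = {"capabilities": {"coding"}, "context": 128_000, "cost_per_1k": 0.0, "local": True}
cloud = {"capabilities": {"coding"}, "context": 200_000, "cost_per_1k": 2.5, "local": False}
print(score_model(local, "secure") > score_model(cloud, "secure"))  # True
```

The highest-scoring model per slot would then be written into the matrix on each refresh.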

Quality-Based Cascading

Automatically escalate to better models if quality is insufficient:
intelligence:
  cascade:
    enabled: true
    quality-threshold: 0.70
  
  verification:
    enabled: true
Example:
# Start with fast model, escalate if needed
completion = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}]
)
# Flow:
# 1. Routes to switchai-fast
# 2. Verification scores the response below the 0.70 threshold
# 3. Cascades to switchai-reasoner
# 4. Returns the higher-quality response
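Under stated assumptions (an ask(model, prompt) call and a judge returning a 0-1 quality score), the cascade loop amounts to:

```python
# Sketch of quality-based cascading; the escalation order is an assumption
# based on the capability matrix, not the plugin's documented behavior.
ESCALATION_ORDER = ["switchai-fast", "switchai-chat", "switchai-reasoner"]

def cascade(ask, judge, prompt: str, threshold: float = 0.70) -> str:
    """ask(model, prompt) -> answer; judge(prompt, answer) -> quality in [0, 1]."""
    answer = ""
    for model in ESCALATION_ORDER:
        answer = ask(model, prompt)
        if judge(prompt, answer) >= threshold:
            return answer          # quality is good enough, stop escalating
    return answer                  # best effort from the strongest model

# Stubs standing in for real model calls and the verifier:
quality = {"switchai-fast": 0.55, "switchai-chat": 0.65, "switchai-reasoner": 0.92}
print(cascade(lambda m, p: m, lambda p, a: quality[a], "Explain quantum entanglement"))
# → switchai-reasoner
```

Note the latency trade-off: a cascaded request pays for every model it tries, so the threshold should be tuned against your latency budget.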

Capability Matrix

The matrix defines which models handle which types of requests:
intelligence:
  matrix:
    coding: "switchai-chat"          # Code generation, debugging
    reasoning: "switchai-reasoner"   # Complex analysis, math
    creative: "switchai-chat"        # Writing, brainstorming
    fast: "switchai-fast"            # Quick factual questions
    secure: "ollama:llama3.2"        # Sensitive data (local)
    vision: "switchai-chat"          # Image analysis

Slot Descriptions

| Slot | Use Case | Priority |
|---|---|---|
| coding | Code generation, debugging, code review | Coding capability, context window |
| reasoning | Complex analysis, math, logic puzzles | Reasoning capability, accuracy |
| creative | Writing, storytelling, brainstorming | General capability, creativity |
| fast | Quick questions, simple tasks | Low latency, low cost |
| secure | Sensitive data, PII | Local models preferred |
| vision | Image analysis, multimodal | Vision capability required |

Monitoring and Debugging

View Routing Decisions

Check logs to see how queries are routed:
tail -f ~/.switchailocal/logs/request.log | grep cortex
Example log:
{
  "timestamp": "2026-03-09T10:30:15Z",
  "level": "info",
  "plugin": "cortex-router",
  "tier": "reflex",
  "intent": "coding",
  "confidence": 1.0,
  "latency_ms": 0.5,
  "model": "switchai-chat"
}
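Given that each log line is a single JSON object like the one above, a few lines of Python can tally routing decisions (the helper name is ours, not part of the plugin):

```python
import json
from collections import Counter

def routing_summary(lines):
    """Count (tier, intent, model) triples from cortex-router log lines."""
    counts = Counter()
    for line in lines:
        entry = json.loads(line)
        if entry.get("plugin") == "cortex-router":
            counts[(entry["tier"], entry["intent"], entry["model"])] += 1
    return counts

# Usage: routing_summary(open("/path/to/request.log"))
sample = '{"plugin": "cortex-router", "tier": "reflex", "intent": "coding", "model": "switchai-chat"}'
print(routing_summary([sample]))
```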

Override Routing

Force a specific slot:
completion = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain machine learning"}],
    extra_body={
        "cortex": {
            "force_slot": "reasoning"  # Override automatic detection
        }
    }
)

Disable Cortex for Specific Requests

# Use a specific model directly (bypasses Cortex)
completion = client.chat.completions.create(
    model="geminicli:gemini-2.5-pro",  # Explicit provider
    messages=[{"role": "user", "content": "Hello"}]
)

Performance Comparison

| Tier | Latency | Accuracy | Use Case |
|---|---|---|---|
| Reflex | <1ms | 95%+ | Obvious patterns (code, PII) |
| Semantic | <20ms | 85%+ | Intent matching |
| Cognitive | 200-500ms | 90%+ | Complex classification |
| Manual | 0ms | 100% | Explicit provider selection |

Best Practices

1. Start with the Reflex Tier - most requests can be handled by fast pattern matching.
2. Enable the Semantic Tier - it adds 10-20ms but saves the 200-500ms of LLM classification.
3. Monitor routing decisions - check the logs to confirm queries land on the intended models.
4. Download the embedding model - the semantic tier requires the MiniLM model; run ./scripts/download-embedding-model.sh.

Complete Configuration Example

# Full Phase 2 configuration
plugin:
  enabled: true
  enabled-plugins:
    - "cortex-router"

intelligence:
  enabled: true
  
  # Core routing
  router-model: "ollama:qwen:0.5b"
  router-fallback: "openai:gpt-4o-mini"
  
  # Capability matrix
  matrix:
    coding: "switchai-chat"
    reasoning: "switchai-reasoner"
    creative: "switchai-chat"
    fast: "switchai-fast"
    secure: "ollama:llama3.2"
    vision: "switchai-chat"
  
  # Phase 2 features
  discovery:
    enabled: true
    refresh-interval: 3600
  
  embedding:
    enabled: true
    model: "all-MiniLM-L6-v2"
  
  semantic-tier:
    enabled: true
    confidence-threshold: 0.85
  
  skill-matching:
    enabled: true
    confidence-threshold: 0.80
  
  cascade:
    enabled: true
    quality-threshold: 0.70
  
  semantic-cache:
    enabled: true
    similarity-threshold: 0.95
    max-size: 10000
  
  verification:
    enabled: true
  
  feedback:
    enabled: true
    retention-days: 90

Next Steps