Overview
switchAILocal is a unified AI proxy server that provides OpenAI-compatible API interfaces for multiple AI service providers. It acts as an intelligent gateway that manages authentication, routing, and request translation across CLI tools, cloud APIs, and local models.

Core Components
The architecture is built around several key subsystems that work together to provide a seamless proxy experience:

Service Manager
Orchestrates the complete lifecycle including authentication, file watching, HTTP server, and provider integrations.
Auth Manager
Handles credential management, OAuth flows, API keys, and automatic token refresh for all providers.
Executor Layer
Provider-specific executors that handle request translation and execution against upstream APIs.
Intelligence Service
Powers Cortex Router Phase 2 with semantic matching, intent classification, and dynamic model allocation.
System Flow
The architecture supports hot-reloading of configuration changes without requiring a server restart.
Request Lifecycle
1. Request Reception
The API server (cmd/server/main.go) receives OpenAI-compatible requests and normalizes them for routing.
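As a sketch of this first step, the proxy only needs to decode the routing-relevant subset of an OpenAI-style chat request before choosing a credential and executor. The type and function names below are illustrative, not the project's actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChatMessage is one turn of an OpenAI-style conversation.
type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// ChatRequest mirrors the subset of the OpenAI chat-completions schema
// the proxy needs for routing; the field set here is illustrative.
type ChatRequest struct {
	Model    string        `json:"model"`
	Messages []ChatMessage `json:"messages"`
	Stream   bool          `json:"stream,omitempty"`
}

// parseChatRequest decodes the request body so the router can inspect
// the model field before selecting a credential and executor.
func parseChatRequest(body []byte) (ChatRequest, error) {
	var req ChatRequest
	err := json.Unmarshal(body, &req)
	return req, err
}

func main() {
	req, err := parseChatRequest([]byte(`{"model":"gpt-4","messages":[{"role":"user","content":"hi"}]}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Model, len(req.Messages))
}
```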
2. Authentication
The Auth Manager (sdk/switchailocal/auth/conductor.go) orchestrates credential selection:
Auth Manager Responsibilities
- Credential Selection: Choose appropriate credentials using routing strategies
- State Management: Track auth status, quota limits, and cooldown periods
- Auto-Refresh: Automatically refresh OAuth tokens before expiration
- Failure Handling: Mark credentials unavailable and implement backoff strategies
3. Routing & Execution
The system uses a Selector to pick credentials based on your routing strategy:

- RoundRobinSelector: Distributes requests evenly across credentials
- FillFirstSelector: Uses the first credential until exhausted
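The two strategies above can be sketched behind a common interface; the type and method signatures here are illustrative, not the project's actual Selector API:

```go
package main

import "fmt"

// Selector picks the next credential from the available set.
type Selector interface {
	Next(creds []string) string
}

// RoundRobinSelector distributes requests evenly across credentials.
type RoundRobinSelector struct{ i int }

func (s *RoundRobinSelector) Next(creds []string) string {
	c := creds[s.i%len(creds)]
	s.i++
	return c
}

// FillFirstSelector always returns the first credential; in this sketch
// the caller is assumed to drop exhausted credentials from the slice.
type FillFirstSelector struct{}

func (FillFirstSelector) Next(creds []string) string { return creds[0] }

func main() {
	creds := []string{"key-a", "key-b"}
	rr := &RoundRobinSelector{}
	fmt.Println(rr.Next(creds), rr.Next(creds), rr.Next(creds)) // key-a key-b key-a
}
```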
4. Provider Execution
Each provider has a dedicated Executor:

CLI Executors
- GeminiCLIExecutor
- OllamaExecutor
- OpenCodeExecutor

Cloud Executors
- GeminiExecutor
- ClaudeExecutor
- CodexExecutor

Compat Executors
- OpenAICompatExecutor
- LMStudioExecutor
- AntigravityExecutor
Token Translation Pipeline
Requests flow through a translation pipeline (sdk/translator/pipeline.go) that converts between provider formats.
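One plausible shape for such a pipeline is a chain of composable stages; the real pipeline in sdk/translator/pipeline.go likely differs, and the stage shown (a model-name rename) is purely illustrative:

```go
package main

import "fmt"

// Stage is one step of the translation pipeline, transforming a request.
type Stage func(map[string]any) map[string]any

// Pipeline composes stages so they run in order over each request.
func Pipeline(stages ...Stage) Stage {
	return func(req map[string]any) map[string]any {
		for _, s := range stages {
			req = s(req)
		}
		return req
	}
}

// renameModel maps an OpenAI-style model name to a provider alias.
func renameModel(from, to string) Stage {
	return func(req map[string]any) map[string]any {
		if req["model"] == from {
			req["model"] = to
		}
		return req
	}
}

func main() {
	translate := Pipeline(renameModel("gpt-4", "gemini-1.5-pro"))
	out := translate(map[string]any{"model": "gpt-4"})
	fmt.Println(out["model"])
}
```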
State Management
Auth State
Each authentication credential tracks its own state.

Model-Level State
Fine-grained tracking per model per credential. Model-level state allows one API key to serve gpt-4 while gpt-3.5-turbo is in cooldown.

Concurrency Model
The system uses Go’s concurrency primitives for safe operation:

- Read locks for credential selection (high throughput)
- Write locks only for state updates
- Atomic operations for retry counters
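The three-part locking pattern above can be sketched on a credential pool; the type is illustrative, but the division of labor (RLock for reads, Lock for mutation, atomics for counters) is the one described:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// credPool illustrates the locking pattern: read locks for selection,
// write locks for state updates, atomics for hot counters.
type credPool struct {
	mu      sync.RWMutex
	creds   []string
	retries atomic.Int64
}

// Pick runs under a read lock so many requests can select concurrently.
func (p *credPool) Pick(i int) string {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.creds[i%len(p.creds)]
}

// Add takes the write lock only when the credential set changes.
func (p *credPool) Add(c string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.creds = append(p.creds, c)
}

// Retry bumps the counter without taking any lock at all.
func (p *credPool) Retry() { p.retries.Add(1) }

func main() {
	p := &credPool{}
	p.Add("key-a")
	p.Retry()
	fmt.Println(p.Pick(0), p.retries.Load())
}
```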
Storage Layer
Multiple storage backends support different deployment scenarios:

File Store
Local filesystem storage for single-node deployments
Postgres Store
Distributed storage for multi-node deployments
Git Store
Version-controlled storage with remote sync
Object Store
S3-compatible storage for cloud deployments
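Swapping among four backends suggests a common interface they all implement. The interface and the in-memory stand-in below are assumptions for illustration, not the project's actual storage API:

```go
package main

import (
	"errors"
	"fmt"
)

// Store is an illustrative interface the file, Postgres, Git, and
// object backends could share.
type Store interface {
	Load(key string) ([]byte, error)
	Save(key string, data []byte) error
}

// memStore stands in for a real backend in this sketch.
type memStore struct{ m map[string][]byte }

func newMemStore() *memStore { return &memStore{m: map[string][]byte{}} }

func (s *memStore) Load(key string) ([]byte, error) {
	b, ok := s.m[key]
	if !ok {
		return nil, errors.New("not found: " + key)
	}
	return b, nil
}

func (s *memStore) Save(key string, data []byte) error {
	s.m[key] = data
	return nil
}

func main() {
	var st Store = newMemStore()
	st.Save("auth/gemini.json", []byte(`{"token":"..."}`))
	b, _ := st.Load("auth/gemini.json")
	fmt.Println(string(b))
}
```

Because callers depend only on the interface, moving from single-node file storage to Postgres or S3 is a configuration change rather than a code change.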
Hot-Reload Mechanism
The Config Watcher (internal/watcher/watcher.go) monitors configuration changes:
- Parse new configuration
- Calculate diff from current state
- Dispatch targeted updates (add/remove/update)
- Reload executors without dropping connections
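The "calculate diff" step above is the heart of targeted dispatch: compare the current and new provider configs and emit add/remove/update sets. A minimal sketch, with maps standing in for parsed config:

```go
package main

import "fmt"

// diff computes the add/remove/update sets between the current and
// next provider configs, keyed by provider name.
func diff(current, next map[string]string) (added, removed, updated []string) {
	for k, v := range next {
		old, ok := current[k]
		if !ok {
			added = append(added, k)
		} else if old != v {
			updated = append(updated, k)
		}
	}
	for k := range current {
		if _, ok := next[k]; !ok {
			removed = append(removed, k)
		}
	}
	return
}

func main() {
	cur := map[string]string{"gemini": "v1", "ollama": "v1"}
	next := map[string]string{"gemini": "v2", "claude": "v1"}
	a, r, u := diff(cur, next)
	fmt.Println(a, r, u)
}
```

Only the providers in the diff sets are touched, which is why unrelated executors keep serving in-flight connections during a reload.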
WebSocket Gateway
A WebSocket gateway handles real-time streaming providers.

Intelligence Layer (Cortex Phase 2)
When enabled, the Intelligence Service adds:

- Semantic Matching: Embed requests and match against known patterns
- Intent Classification: Use LLM to classify request intent
- Dynamic Routing: Route based on content, not just model name
- Skill Augmentation: Inject context from skill definitions
Security Architecture
Security is enforced at multiple layers:

Path Validation
Error Sanitization
File Permissions
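As an example of the path-validation layer, a common Go idiom is to join a user-supplied path under a root directory and reject anything that resolves outside it. This is a plausible shape for the check, not the project's actual code:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin resolves a user-supplied relative path under root and
// rejects traversal (e.g. "../") that escapes the root.
func safeJoin(root, userPath string) (string, error) {
	p := filepath.Join(root, userPath) // Join cleans ".." segments
	if p != root && !strings.HasPrefix(p, root+string(filepath.Separator)) {
		return "", fmt.Errorf("path escapes root: %q", userPath)
	}
	return p, nil
}

func main() {
	ok, _ := safeJoin("/data", "configs/auth.json")
	fmt.Println(ok)
	_, err := safeJoin("/data", "../etc/passwd")
	fmt.Println(err != nil)
}
```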
Extension Points
The architecture supports custom extensions:

Custom Executors
Custom Selectors
Usage Plugins
Performance Considerations
Connection Pooling
HTTP clients use connection pooling for reduced latency
Streaming Support
SSE streaming with heartbeat keepalives
Retry Logic
Exponential backoff with configurable limits
Quota Management
Per-model cooldown to respect rate limits