Overview
The switchAILocal embedding SDK provides ONNX-based local embedding generation using the MiniLM model. This enables advanced features like semantic caching, intelligent routing, and skill matching without requiring external API calls.
Key Features
- Local Processing: All embeddings computed on-device using ONNX Runtime
- 384-Dimensional Vectors: Standard MiniLM-L6-v2 model output
- Fast Inference: Optimized for real-time semantic matching (<20ms)
- No External Services: Fully offline after model download
- Thread-Safe: Concurrent embedding generation support
Architecture
When to Use Embeddings
Semantic Tier (Phase 2)
Match user queries to intents using embedding similarity instead of LLM classification:
- 10-20x faster than LLM classification
- Deterministic results
- No API costs
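The matching step described above can be sketched as a nearest-neighbor lookup with a cutoff. This is a minimal illustration in Python using only the standard library, not the SDK's actual API: the names `cosine_similarity`, `match_intent`, and `INTENTS` are hypothetical, the toy 3-dimensional vectors stand in for real 384-dimensional embeddings, and the 0.85 default is borrowed from the typical intent-match score quoted in the Accuracy section.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def match_intent(query_vec, intent_vecs, threshold=0.85):
    """Return (intent, score) for the best match, or (None, score) if below threshold."""
    best_intent, best_score = None, -1.0
    for intent, vec in intent_vecs.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_intent, best_score = intent, score
    if best_score < threshold:
        return None, best_score
    return best_intent, best_score


# Hypothetical intents with toy vectors (assumption: real vectors have 384 dimensions).
INTENTS = {
    "code_review": [0.9, 0.1, 0.0],
    "translate": [0.0, 0.2, 0.9],
}

intent, score = match_intent([0.8, 0.2, 0.1], INTENTS)
print(intent, round(score, 3))
```

Because the lookup is plain arithmetic over stored vectors, the same query always yields the same intent, which is where the deterministic behavior comes from. Returning `None` below the threshold leaves room to fall back to a slower classifier.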
Semantic Caching
Cache responses based on semantic similarity:
- Deduplicating similar queries
- Reducing API costs
- Faster response times for common questions
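One way to picture a semantic cache is as a list of (embedding, response) pairs searched by similarity instead of exact text. The sketch below is an assumption-laden illustration in Python, not the SDK's cache: the `SemanticCache` class, its linear scan, and the 0.95 threshold are all hypothetical choices, and the short vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class SemanticCache:
    """Hypothetical cache keyed by embedding similarity rather than exact text."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed cutoff; tune for the workload
        self._entries = []          # list of (embedding, response) pairs

    def put(self, query_vec, response):
        self._entries.append((list(query_vec), response))

    def get(self, query_vec):
        """Return the response of the most similar entry at or above the threshold."""
        best_response, best_score = None, self.threshold
        for vec, response in self._entries:
            score = cosine_similarity(query_vec, vec)
            if score >= best_score:
                best_response, best_score = response, score
        return best_response


cache = SemanticCache()
cache.put([1.0, 0.0, 0.0], "cached answer")
hit = cache.get([0.99, 0.05, 0.0])   # near-duplicate query
miss = cache.get([0.0, 1.0, 0.0])    # unrelated query
```

A linear scan is adequate for small caches; a larger cache would likely need an approximate nearest-neighbor index, which is outside the scope of this sketch.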
Skill Matching
Match queries to domain-specific skills:
- Language experts (Go, Python, TypeScript)
- Infrastructure (Docker, Kubernetes)
- Security, Testing, Debugging
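Unlike intent matching, a query may be relevant to several skills at once, so a ranked list is often more useful than a single winner. The following Python sketch assumes skill vectors are normalized once up front so that a dot product equals cosine similarity; `normalize`, `rank_skills`, `SKILLS`, and the toy vectors are hypothetical, and the 0.80 cutoff comes from the typical skill-match score quoted in the Accuracy section.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def rank_skills(query_vec, skill_vecs, threshold=0.80, top_k=3):
    """Return up to top_k (skill, score) pairs above threshold, best first.

    Assumes every vector in skill_vecs is already unit-normalized.
    """
    q = normalize(query_vec)
    scored = []
    for name, vec in skill_vecs.items():
        score = sum(a * b for a, b in zip(q, vec))
        if score > threshold:
            scored.append((name, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical skills with toy vectors, normalized once at load time.
SKILLS = {
    name: normalize(vec)
    for name, vec in {
        "go": [0.9, 0.1, 0.1],
        "docker": [0.1, 0.9, 0.1],
        "kubernetes": [0.2, 0.9, 0.3],
    }.items()
}

matches = rank_skills([0.1, 0.9, 0.1], SKILLS)
print([name for name, _ in matches])
```

Normalizing stored vectors once moves the square roots out of the per-query loop, which matters when the skill set is large.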
Model Details
all-MiniLM-L6-v2
| Property | Value |
|---|---|
| Dimensions | 384 |
| Max Sequence Length | 256 tokens |
| Model Size | ~23 MB (ONNX) |
| Vocabulary Size | ~30,000 tokens |
| Performance | ~5-10ms per embedding |
Download the Model
The model consists of two files:
- model.onnx: The ONNX model file
- vocab.txt: The tokenizer vocabulary
Both files belong in ~/.switchailocal/models/.
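Before initializing anything, it can help to confirm that both files are in place. The check below is a small standalone Python sketch, not part of the SDK: `missing_model_files` and `REQUIRED_FILES` are hypothetical names, and the default directory is the path given above.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ("model.onnx", "vocab.txt")


def missing_model_files(model_dir=None):
    """Return the required file names that are absent from model_dir.

    With no argument, looks in ~/.switchailocal/models/.
    """
    if model_dir is None:
        base = Path.home() / ".switchailocal" / "models"
    else:
        base = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

Calling `missing_model_files()` returns an empty list when both files are present, and otherwise the names of the files that still need to be downloaded.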
Quick Start
Configuration Options
Performance Characteristics
Latency
| Operation | Typical Latency |
|---|---|
| Single embedding | 5-10ms |
| Batch (10 texts) | 30-50ms |
| Cosine similarity | <1ms |
Memory Usage
- Model Loading: ~50 MB
- Per Request: ~1-2 MB (temporary)
- Cached Embeddings: 384 floats × 4 bytes = 1.5 KB per vector
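The per-vector figure above extends directly to a whole cache. This short Python sketch reproduces the arithmetic, assuming vectors are stored as 32-bit floats and ignoring container overhead; `cache_size_bytes` is a hypothetical helper, not an SDK function.

```python
from array import array

DIMENSIONS = 384  # output size of all-MiniLM-L6-v2


def cache_size_bytes(num_vectors, dimensions=DIMENSIONS, bytes_per_value=4):
    """Raw float32 storage for num_vectors embeddings, excluding container overhead."""
    return num_vectors * dimensions * bytes_per_value


# One vector packed as 32-bit floats.
vector = array("f", [0.0] * DIMENSIONS)
per_vector = len(vector) * vector.itemsize

print(per_vector, "bytes per vector")
print(round(cache_size_bytes(10_000) / (1024 * 1024), 1), "MiB for 10,000 vectors")
```

At 1,536 bytes per vector, 10,000 cached embeddings need roughly 14.6 MiB of raw storage, so cache size rather than the model itself becomes the dominant memory cost once the cache grows past a few tens of thousands of entries.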
Accuracy
- Semantic Similarity: Cosine scores typically range from 0.0 (unrelated) to 1.0 (identical)
- Typical Intent Match: >0.85 for correct matches
- Typical Skill Match: >0.80 for relevant skills
Comparison with Alternatives
| Feature | switchAILocal Embedding | OpenAI Embedding API | SentenceTransformers |
|---|---|---|---|
| Cost | Free (local) | $0.0001/1K tokens | Free (local) |
| Latency | 5-10ms | 50-200ms (network) | 10-20ms |
| Privacy | 100% local | Data sent to API | 100% local |
| Dimensions | 384 | 1536 (ada-002) | Varies |
| Dependencies | ONNX Runtime only | Internet required | Python + PyTorch |