Overview

The switchAILocal embedding SDK provides ONNX-based local embedding generation using the MiniLM model. This enables advanced features like semantic caching, intelligent routing, and skill matching without requiring external API calls.

Key Features

  • Local Processing: All embeddings computed on-device using ONNX Runtime
  • 384-Dimensional Vectors: Standard MiniLM-L6-v2 model output
  • Fast Inference: Optimized for real-time semantic matching (<20ms)
  • No External Dependencies: Fully offline after model download
  • Thread-Safe: Concurrent embedding generation support

Architecture

When to Use Embeddings

Semantic Tier (Phase 2)

Match user queries to intents using embedding similarity instead of LLM classification:

intelligence:
  embedding:
    enabled: true
    model: "all-MiniLM-L6-v2"
  semantic-tier:
    enabled: true
    confidence-threshold: 0.85

Benefits:
  • 10-20x faster than LLM classification
  • Deterministic results
  • No API costs
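The semantic tier's decision can be sketched in a few lines: compare the query embedding against precomputed intent embeddings and accept the best match only when it clears the confidence threshold. The `match_intent` helper and the toy 2-D vectors below are illustrative only; real vectors are 384-dimensional and come from the SDK's embedding engine.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_intent(query_vec, intent_vecs, threshold=0.85):
    """Return the best-matching intent name, or None if no intent
    clears the confidence threshold."""
    best_name, best_score = None, -1.0
    for name, vec in intent_vecs.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

Because the comparison is pure vector math, the same query always resolves to the same intent, which is where the determinism and speed advantages over LLM classification come from.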

Semantic Caching

Cache responses based on semantic similarity:

intelligence:
  semantic-cache:
    enabled: true
    similarity-threshold: 0.95
    max-size: 10000

Use Cases:
  • Deduplicating similar queries
  • Reducing API costs
  • Faster response times for common questions
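A minimal sketch of the lookup logic, assuming embeddings are L2-normalized so a dot product equals cosine similarity. The `SemanticCache` class and its FIFO eviction are illustrative, not the SDK's actual implementation:

```python
def dot(a, b):
    """Dot product; equals cosine similarity for L2-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

class SemanticCache:
    """Toy semantic cache: a lookup hits when the query embedding is
    within similarity-threshold of an already-cached embedding."""

    def __init__(self, threshold=0.95, max_size=10000):
        self.threshold = threshold
        self.max_size = max_size
        self.entries = []  # (embedding, response) pairs, oldest first

    def get(self, query_vec):
        for vec, response in self.entries:
            if dot(vec, query_vec) >= self.threshold:
                return response
        return None  # miss: caller computes a fresh response

    def put(self, query_vec, response):
        if len(self.entries) >= self.max_size:
            self.entries.pop(0)  # simple FIFO eviction for the sketch
        self.entries.append((query_vec, response))
```

A high threshold such as 0.95 keeps hits to near-paraphrases of a cached query; lowering it trades accuracy for a higher hit rate.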

Skill Matching

Match queries to domain-specific skills:

intelligence:
  skill-matching:
    enabled: true
    confidence-threshold: 0.80

Example Skills:
  • Language experts (Go, Python, TypeScript)
  • Infrastructure (Docker, Kubernetes)
  • Security, Testing, Debugging
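Unlike intent matching, skill matching may surface several relevant skills at once. A sketch of threshold-filtered ranking, again assuming L2-normalized embeddings so the dot product stands in for cosine similarity (the `rank_skills` helper is hypothetical):

```python
def rank_skills(query_vec, skill_vecs, threshold=0.80):
    """Return (skill, score) pairs at or above threshold, best first.
    Assumes L2-normalized vectors, so dot product = cosine similarity."""
    scores = {
        name: sum(q * s for q, s in zip(query_vec, vec))
        for name, vec in skill_vecs.items()
    }
    return sorted(
        ((name, score) for name, score in scores.items() if score >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
```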

Model Details

all-MiniLM-L6-v2

Property              Value
Dimensions            384
Max Sequence Length   256 tokens
Model Size            ~23 MB (ONNX)
Vocabulary Size       ~30,000 tokens
Performance           ~5-10ms per embedding

Download the Model

./scripts/download-embedding-model.sh

This downloads:
  • model.onnx - The ONNX model file
  • vocab.txt - The tokenizer vocabulary

Files are stored in ~/.switchailocal/models/.

Quick Start

1. Download Model

   ./scripts/download-embedding-model.sh

2. Enable in Config

   intelligence:
     enabled: true
     embedding:
       enabled: true
       model: "all-MiniLM-L6-v2"

3. Start Server

   ./ail.sh start

4. Verify

   Check logs for:

   INFO Embedding engine initialized with model: model.onnx

Configuration Options

intelligence:
  embedding:
    enabled: true
    model: "all-MiniLM-L6-v2"  # Model name
    model-path: "~/.switchailocal/models/model.onnx"  # Override path
    vocab-path: "~/.switchailocal/models/vocab.txt"   # Override vocab
    shared-library: ""  # ONNX Runtime library (auto-detected)

Performance Characteristics

Latency

Operation            Typical Latency
Single embedding     5-10ms
Batch (10 texts)     30-50ms
Cosine similarity    <1ms

Memory Usage

  • Model Loading: ~50 MB
  • Per Request: ~1-2 MB (temporary)
  • Cached Embeddings: 384 floats × 4 bytes = 1,536 bytes (~1.5 KB) per vector
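The per-vector figure follows directly from the model's output shape, and it makes cache sizing easy to budget. As a worked example (using the semantic cache's default max-size of 10,000):

```python
# Each embedding is 384 float32 values, 4 bytes each.
bytes_per_vector = 384 * 4               # 1536 bytes, i.e. ~1.5 KB

# Raw vector storage for a full cache of 10,000 entries:
cache_bytes = bytes_per_vector * 10_000  # 15,360,000 bytes, roughly 14.6 MiB
```

Actual memory use is somewhat higher once cached responses and bookkeeping structures are included.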

Accuracy

  • Semantic Similarity: cosine similarity score, from ~0.0 (unrelated) to 1.0 (identical meaning)
  • Typical Intent Match: >0.85 for correct matches
  • Typical Skill Match: >0.80 for relevant skills

Comparison with Alternatives

Feature        switchAILocal Embedding   OpenAI Embedding API   SentenceTransformers
Cost           Free (local)              $0.0001/1K tokens      Free (local)
Latency        5-10ms                    50-200ms (network)     10-20ms
Privacy        100% local                Data sent to API       100% local
Dimensions     384                       1536 (ada-002)         Varies
Dependencies   ONNX Runtime only         Internet required      Python + PyTorch

Next Steps