Embedding SDK Overview

Overview

The switchAILocal embedding SDK generates embeddings locally using ONNX Runtime and the all-MiniLM-L6-v2 model. These embeddings power semantic caching, intelligent routing, and skill matching without any external API calls.

Key Features

Local Processing: All embeddings computed on-device using ONNX Runtime
384-Dimensional Vectors: Standard MiniLM-L6-v2 model output
Fast Inference: Optimized for real-time semantic matching (<20ms)
No External Services: Runs fully offline once the model is downloaded; ONNX Runtime is the only runtime dependency
Thread-Safe: Concurrent embedding generation support

Architecture

When to Use Embeddings

Semantic Tier (Phase 2)

Match user queries to intents using embedding similarity instead of LLM classification:

intelligence:
  embedding:
    enabled: true
    model: "all-MiniLM-L6-v2"
  semantic-tier:
    enabled: true
    confidence-threshold: 0.85

Benefits:

10-20x faster than LLM classification
Deterministic results
No API costs
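
The matching step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not switchAILocal's implementation: the names cosine_similarity and match_intent, the intent labels, and the toy 3-dimensional vectors are hypothetical, and real embeddings would have 384 dimensions. Only the 0.85 threshold comes from the configuration shown above.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def match_intent(
    query_vec: Sequence[float],
    intent_vecs: Dict[str, Sequence[float]],
    threshold: float = 0.85,  # mirrors confidence-threshold in the config
) -> Optional[str]:
    """Return the best-scoring intent, or None if nothing reaches the threshold."""
    best_name, best_score = None, -1.0
    for name, vec in intent_vecs.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Hypothetical intents with toy vectors standing in for real embeddings.
intents = {"coding": [1.0, 0.0, 0.0], "chat": [0.0, 1.0, 0.0]}
print(match_intent([0.9, 0.1, 0.0], intents))  # coding
print(match_intent([0.6, 0.6, 0.0], intents))  # None (best score is about 0.71)
```

When the function returns None, a caller would presumably fall back to a slower classifier; that fallback is an assumption here and is not shown.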

Semantic Caching

Cache responses based on semantic similarity:

intelligence:
  semantic-cache:
    enabled: true
    similarity-threshold: 0.95
    max-size: 10000

Use Cases:

Deduplicating similar queries
Reducing API costs
Faster response times for common questions
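
To make the two cache settings concrete, here is a rough Python sketch of a similarity-threshold cache with a size cap. It is an illustration only: the SemanticCache class, the linear scan over entries, and the least-recently-used eviction policy are assumptions, since this page does not describe how switchAILocal looks up or evicts cached entries.

```python
import math
from collections import OrderedDict
from typing import Optional, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache keyed by embedding similarity instead of exact text."""

    def __init__(self, similarity_threshold: float = 0.95, max_size: int = 10000):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.similarity_threshold = similarity_threshold
        self.max_size = max_size
        self._entries = OrderedDict()  # key -> (vector, response), oldest first
        self._next_key = 0

    def get(self, query_vec: Sequence[float]) -> Optional[str]:
        """Return the cached response of the most similar entry, if close enough."""
        best_key, best_score = None, -1.0
        for key, (vec, _) in self._entries.items():
            score = _cosine(query_vec, vec)
            if score > best_score:
                best_key, best_score = key, score
        if best_key is None or best_score < self.similarity_threshold:
            return None
        self._entries.move_to_end(best_key)  # mark as recently used
        return self._entries[best_key][1]

    def put(self, query_vec: Sequence[float], response: str) -> None:
        """Store a response, evicting the least recently used entry when full."""
        if len(self._entries) >= self.max_size:
            self._entries.popitem(last=False)
        self._entries[self._next_key] = (list(query_vec), response)
        self._next_key += 1


cache = SemanticCache(similarity_threshold=0.95, max_size=2)
cache.put([1.0, 0.0], "answer A")
print(cache.get([0.99, 0.05]))  # answer A (near-duplicate query)
print(cache.get([0.0, 1.0]))    # None (unrelated query)
```

A high threshold such as 0.95 trades hit rate for safety: only near-duplicate queries reuse a cached response.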

Skill Matching

Match queries to domain-specific skills:

intelligence:
  skill-matching:
    enabled: true
    confidence-threshold: 0.80

Example Skills:

Language experts (Go, Python, TypeScript)
Infrastructure (Docker, Kubernetes)
Security, Testing, Debugging

Model Details

all-MiniLM-L6-v2

Property	Value
Dimensions	384
Max Sequence Length	256 tokens
Model Size	~23 MB (ONNX)
Vocabulary Size	~30,000 tokens
Performance	~5-10ms per embedding

Download the Model

./scripts/download-embedding-model.sh

This downloads:

model.onnx - The ONNX model file
vocab.txt - The tokenizer vocabulary

Files are stored in ~/.switchailocal/models/.
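
If you want to confirm the download before starting the server, a small check along these lines may help. It is a sketch that assumes only the directory and file names listed above; the function name missing_model_files is hypothetical and not part of the SDK.

```python
from pathlib import Path
from typing import List

# Directory and file names as listed in this section.
MODEL_DIR = Path("~/.switchailocal/models").expanduser()
REQUIRED_FILES = ("model.onnx", "vocab.txt")


def missing_model_files(model_dir: Path = MODEL_DIR) -> List[str]:
    """Return the required files that are absent or empty in model_dir."""
    missing = []
    for name in REQUIRED_FILES:
        path = model_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("Model files found in", MODEL_DIR)
```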

Quick Start

Download Model

./scripts/download-embedding-model.sh

Enable in Config

intelligence:
  enabled: true
  embedding:
    enabled: true
    model: "all-MiniLM-L6-v2"

Start Server

./ail.sh start

Verify

Check logs for:

INFO Embedding engine initialized with model: model.onnx

Configuration Options

intelligence:
  embedding:
    enabled: true
    model: "all-MiniLM-L6-v2"  # Model name
    model-path: "~/.switchailocal/models/model.onnx"  # Override path
    vocab-path: "~/.switchailocal/models/vocab.txt"   # Override vocab
    shared-library: ""  # ONNX Runtime library (auto-detected)

Performance Characteristics

Latency

Operation	Typical Latency
Single embedding	5-10ms
Batch (10 texts)	30-50ms
Cosine similarity	<1ms
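
These figures are typical values and will vary with hardware. One way to measure latency on your own machine is a small timing harness like the sketch below. The embed callable is an assumption: this page does not show the SDK's call signature, so a placeholder named fake_embed stands in for whatever function produces an embedding in your setup.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def measure_latency_ms(
    embed: Callable[[str], Sequence[float]],
    text: str,
    warmup: int = 5,
    runs: int = 50,
) -> Dict[str, float]:
    """Time repeated embed(text) calls and summarize them in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not measured
        embed(text)
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        embed(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def fake_embed(text: str) -> List[float]:
    """Placeholder that returns a 384-dimensional zero vector."""
    return [0.0] * 384


stats = measure_latency_ms(fake_embed, "How do I write a Dockerfile?")
print(stats)
```

Replace fake_embed with a real embedding call to get meaningful numbers; the placeholder only demonstrates the harness.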

Memory Usage

Model Loading: ~50 MB
Per Request: ~1-2 MB (temporary)
Cached Embeddings: 384 floats × 4 bytes = 1.5 KB per vector
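
The per-vector figure scales linearly with the number of cached entries. The short calculation below works this out for the max-size of 10000 used in the semantic cache example. It counts only the raw float32 vector data; stored responses and any container overhead are extra and are not estimated here. The function names are illustrative.

```python
DIMENSIONS = 384        # all-MiniLM-L6-v2 output size
BYTES_PER_FLOAT32 = 4


def vector_bytes(dimensions: int = DIMENSIONS) -> int:
    """Raw size of one float32 embedding vector in bytes."""
    return dimensions * BYTES_PER_FLOAT32


def cache_vectors_mib(entries: int, dimensions: int = DIMENSIONS) -> float:
    """Raw vector storage for a given number of cached entries, in MiB."""
    return entries * vector_bytes(dimensions) / (1024 * 1024)


print(vector_bytes())                       # 1536 (bytes, i.e. 1.5 KB)
print(round(cache_vectors_mib(10_000), 2))  # 14.65
```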

Accuracy

Semantic Similarity: Cosine similarity scores typically range from 0.0 (unrelated) to 1.0 (identical)
Typical Intent Match: >0.85 for correct matches
Typical Skill Match: >0.80 for relevant skills

Comparison with Alternatives

Feature	switchAILocal Embedding	OpenAI Embedding API	SentenceTransformers
Cost	Free (local)	$0.0001/1K tokens	Free (local)
Latency	5-10ms	50-200ms (network)	10-20ms
Privacy	100% local	Data sent to API	100% local
Dimensions	384	1536 (ada-002)	Varies
Dependencies	ONNX Runtime only	Internet required	Python + PyTorch

Next Steps

Usage Guide: Learn how to use the embedding SDK
Custom Providers: Integrate custom embedding models
Semantic Tier: Configure semantic intent matching
Semantic Cache: Enable semantic caching
