Endpoint

POST /v1/embeddings
Generate vector embeddings for one or more text inputs. Compatible with the OpenAI Embeddings API.
Embeddings support varies by provider. Gemini and Ollama provide the best embedding model availability.

Request Body

input
string | array
required
Input text or array of texts to generate embeddings for.
model
string
required
The embedding model to use. Examples:
  • gemini:text-embedding-004
  • ollama:nomic-embed-text
  • switchai:text-embedding-3-small
encoding_format
string
Format for the embeddings: float or base64 (default: float)
dimensions
integer
Number of dimensions for the embedding (model-dependent)
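Putting the fields together, a full request body using both optional parameters might look like this (the `dimensions` value of 256 is purely illustrative and must be supported by the chosen model):

```python
import json

# Full request body using all documented fields; the dimensions
# value 256 is illustrative and must be supported by the model.
payload = {
    "model": "gemini:text-embedding-004",
    "input": ["first text", "second text"],
    "encoding_format": "float",
    "dimensions": 256,
}

body = json.dumps(payload)
print(body)
```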

Response Format

object
string
Always list
data
array
Array of embedding objects
model
string
The model used to generate embeddings
usage
object
Token usage statistics

Examples

Basic Request

curl http://localhost:18080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-test-123" \
  -d '{
    "model": "gemini:text-embedding-004",
    "input": "The quick brown fox jumps over the lazy dog"
  }'

Batch Embeddings

Generate embeddings for multiple texts:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123"
)

response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=[
        "Machine learning is a subset of AI",
        "Deep learning uses neural networks",
        "Natural language processing enables text understanding"
    ]
)

for i, data in enumerate(response.data):
    print(f"Text {i}: {len(data.embedding)} dimensions")
Use embeddings for semantic similarity:
import numpy as np
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123"
)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Generate embeddings
response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=[
        "What is machine learning?",
        "How do I train a neural network?",
        "Best pizza toppings"
    ]
)

embeddings = [d.embedding for d in response.data]

# Compare first query to others
query = embeddings[0]
similarity_1 = cosine_similarity(query, embeddings[1])
similarity_2 = cosine_similarity(query, embeddings[2])

print(f"ML vs Neural Networks: {similarity_1:.3f}")
print(f"ML vs Pizza: {similarity_2:.3f}")

Supported Models

Gemini Embeddings

| Model | Dimensions | Max Input | Description |
|---|---|---|---|
| gemini:text-embedding-004 | 768 | 2048 tokens | Latest embedding model |
| gemini:text-embedding-preview-1009 | 768 | 2048 tokens | Preview model |
| gemini:embedding-001 | 768 | 2048 tokens | Legacy model |

Ollama Embeddings

Ollama provides various open-source embedding models:
# Pull embedding model
ollama pull nomic-embed-text

# Use in request
curl http://localhost:18080/v1/embeddings \
  -H "Authorization: Bearer sk-test-123" \
  -d '{
    "model": "ollama:nomic-embed-text",
    "input": "Sample text"
  }'

switchAI Embeddings

switchAI provides access to multiple embedding providers:
response = client.embeddings.create(
    model="switchai:text-embedding-3-small",
    input="Text to embed"
)

Response Example

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        0.0123,
        -0.0234,
        0.0345,
        // ... 765 more values
      ]
    }
  ],
  "model": "gemini:text-embedding-004",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}
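When `encoding_format` is `base64`, each embedding arrives as a base64 string rather than a float array. Assuming the same packing convention as the OpenAI API (little-endian float32), it can be decoded like this:

```python
import base64
import struct

def decode_embedding(b64: str) -> list:
    """Decode a base64 embedding string into a list of floats.

    Assumes the OpenAI packing convention: little-endian float32.
    """
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Round-trip demo with a known vector
vec = [0.0123, -0.0234, 0.0345]
encoded = base64.b64encode(struct.pack(f"<{len(vec)}f", *vec)).decode()
decoded = decode_embedding(encoded)
```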

Use Cases

Semantic Search

Find similar documents:
# 1. Embed all documents
documents = ["doc1 text", "doc2 text", "doc3 text"]
response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=documents
)
doc_embeddings = [d.embedding for d in response.data]

# 2. Embed query
query = "search query"
query_response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=query
)
query_embedding = query_response.data[0].embedding

# 3. Find most similar
similarities = [
    cosine_similarity(query_embedding, doc_emb)
    for doc_emb in doc_embeddings
]
best_match = documents[np.argmax(similarities)]

Clustering

Group similar texts:
from sklearn.cluster import KMeans

# Generate embeddings
texts = ["text1", "text2", "text3", ...]
response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=texts
)
embeddings = np.array([d.embedding for d in response.data])

# Cluster
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(embeddings)
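The `labels` array assigns a cluster index to each input text; grouping the texts back by cluster is then straightforward (shown here with hard-coded labels so the snippet runs standalone):

```python
from collections import defaultdict

# Hard-coded stand-ins for the texts and KMeans labels above
texts = ["text1", "text2", "text3", "text4"]
labels = [0, 1, 0, 2]

clusters = defaultdict(list)
for text, label in zip(texts, labels):
    clusters[label].append(text)

print(dict(clusters))  # {0: ['text1', 'text3'], 1: ['text2'], 2: ['text4']}
```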

Recommendation Systems

Recommend similar items:
# Embed user preferences and item descriptions
user_pref = "I like action movies with great CGI"
items = ["Movie A: Action-packed blockbuster", "Movie B: Romantic drama", ...]

response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=[user_pref] + items
)

user_emb = response.data[0].embedding
item_embs = [d.embedding for d in response.data[1:]]

# Rank by similarity
scores = [cosine_similarity(user_emb, item) for item in item_embs]
top_items = sorted(zip(items, scores), key=lambda x: x[1], reverse=True)

Error Handling

from openai import OpenAI, APIStatusError

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123"
)

try:
    response = client.embeddings.create(
        model="invalid-model",
        input="Test text"
    )
except APIStatusError as e:
    print(f"API error: {e.message}")
    print(f"Status: {e.status_code}")
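Transient failures (rate limits, network blips) are often worth retrying. A minimal backoff wrapper might look like this; the retry count and delays are arbitrary choices, not gateway requirements:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Demo with a function that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = with_retries(flaky, base_delay=0.01)
```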

Performance Tips

Process multiple texts in a single request for better throughput:
# Good: Single request for 10 texts
response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=texts  # List of 10 texts
)

# Avoid: 10 separate requests
for text in texts:
    response = client.embeddings.create(
        model="gemini:text-embedding-004",
        input=text
    )
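Providers typically cap how many inputs a single request can carry; the limit varies, and 100 below is just a placeholder. A small helper to split a large list into request-sized chunks:

```python
def chunked(items, size=100):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

texts = [f"text {i}" for i in range(250)]
batches = list(chunked(texts, size=100))
print([len(b) for b in batches])  # [100, 100, 50]
```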
Cache embeddings for frequently used texts:
import pickle

# Save embeddings
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

# Load cached embeddings
with open('embeddings.pkl', 'rb') as f:
    embeddings = pickle.load(f)
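For an in-process cache, keying by a hash of the text avoids re-embedding duplicates. A minimal sketch; the `fake_embed` function is a stand-in for a real API call:

```python
import hashlib

cache = {}

def embed_cached(text, embed_fn):
    """Return a cached embedding, computing it via embed_fn on a miss."""
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in cache:
        cache[key] = embed_fn(text)
    return cache[key]

# Stand-in embedder that counts how often it is actually called
calls = {"n": 0}

def fake_embed(text):
    calls["n"] += 1
    return [float(len(text))]

embed_cached("hello", fake_embed)
embed_cached("hello", fake_embed)  # second call served from cache
```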
Choose appropriate model for your use case:
  • Gemini: Best for multilingual and semantic search
  • Ollama: Best for privacy and offline usage
  • switchAI: Best for unified access to multiple providers

Limitations

| Provider | Max Tokens | Dimensions | Notes |
|---|---|---|---|
| Gemini | 2048 | 768 | Supports batch requests |
| Ollama | Model-dependent | Model-dependent | Local processing |
| Claude | N/A | N/A | No embedding support |
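The limits above can be checked client-side before sending a request. A rough pre-flight check, with token counts approximated by whitespace splitting (which undercounts real tokenization):

```python
# Per-provider limits from the table above; None means model-dependent
MAX_TOKENS = {"gemini": 2048, "ollama": None, "claude": 0}

def check_input(model: str, text: str) -> bool:
    """Return True if the input is likely within the provider's limit."""
    provider = model.split(":", 1)[0]
    limit = MAX_TOKENS.get(provider)
    if limit == 0:
        return False  # provider has no embedding support
    if limit is None:
        return True   # model-dependent; defer to the server
    # Crude token estimate: whitespace-split word count
    return len(text.split()) <= limit

ok = check_input("gemini:text-embedding-004", "The quick brown fox")
```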

Next Steps