Endpoint

POST /v1/embeddings
Generate vector embeddings for one or more text inputs. Compatible with the OpenAI Embeddings API.
Embeddings support varies by provider. Gemini and Ollama provide the best embedding model availability.

Request Body

input
string | array
required
Input text or array of texts to generate embeddings for.
model
string
required
The embedding model to use. Examples:
  • gemini:text-embedding-004
  • ollama:nomic-embed-text
  • switchai:text-embedding-3-small
encoding_format
string
Format for the embeddings: float or base64 (default: float)
dimensions
integer
Number of dimensions for the embedding (model-dependent)
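Putting the fields together, a full request body using both optional parameters might look like this (the `dimensions` value of 256 is purely illustrative and must be supported by the chosen model):

```python
import json

# Full request body using all documented fields; the dimensions
# value 256 is illustrative and must be supported by the model.
payload = {
    "model": "gemini:text-embedding-004",
    "input": ["first text", "second text"],
    "encoding_format": "float",
    "dimensions": 256,
}

body = json.dumps(payload)
print(body)
```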

Response Format

object
string
Always list
data
array
Array of embedding objects
model
string
The model used to generate embeddings
usage
object
Token usage statistics

Examples

Basic Request

curl http://localhost:18080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-test-123" \
  -d '{
    "model": "gemini:text-embedding-004",
    "input": "The quick brown fox jumps over the lazy dog"
  }'

Batch Embeddings

Generate embeddings for multiple texts:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123"
)

response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=[
        "Machine learning is a subset of AI",
        "Deep learning uses neural networks",
        "Natural language processing enables text understanding"
    ]
)

for i, data in enumerate(response.data):
    print(f"Text {i}: {len(data.embedding)} dimensions")
Use embeddings for semantic similarity:
import numpy as np
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123"
)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Generate embeddings
response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=[
        "What is machine learning?",
        "How do I train a neural network?",
        "Best pizza toppings"
    ]
)

embeddings = [d.embedding for d in response.data]

# Compare first query to others
query = embeddings[0]
similarity_1 = cosine_similarity(query, embeddings[1])
similarity_2 = cosine_similarity(query, embeddings[2])

print(f"ML vs Neural Networks: {similarity_1:.3f}")
print(f"ML vs Pizza: {similarity_2:.3f}")

Supported Models

Gemini Embeddings

| Model | Dimensions | Max Input | Description |
|---|---|---|---|
| gemini:text-embedding-004 | 768 | 2048 tokens | Latest embedding model |
| gemini:text-embedding-preview-1009 | 768 | 2048 tokens | Preview model |
| gemini:embedding-001 | 768 | 2048 tokens | Legacy model |

Ollama Embeddings

Ollama provides various open-source embedding models:
# Pull embedding model
ollama pull nomic-embed-text

# Use in request
curl http://localhost:18080/v1/embeddings \
  -H "Authorization: Bearer sk-test-123" \
  -d '{
    "model": "ollama:nomic-embed-text",
    "input": "Sample text"
  }'

switchAI Embeddings

switchAI provides access to multiple embedding providers:
response = client.embeddings.create(
    model="switchai:text-embedding-3-small",
    input="Text to embed"
)

Response Example

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        0.0123,
        -0.0234,
        0.0345,
        // ... 765 more values
      ]
    }
  ],
  "model": "gemini:text-embedding-004",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}
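When `encoding_format` is `base64`, each embedding arrives as a base64 string rather than a float array. Assuming the same packing convention as the OpenAI API (little-endian float32), it can be decoded like this:

```python
import base64
import struct

def decode_embedding(b64: str) -> list:
    """Decode a base64 embedding string into a list of floats.

    Assumes the OpenAI packing convention: little-endian float32.
    """
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Round-trip demo with a known vector
vec = [0.0123, -0.0234, 0.0345]
encoded = base64.b64encode(struct.pack(f"<{len(vec)}f", *vec)).decode()
decoded = decode_embedding(encoded)
```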

Use Cases

Semantic Search

Find similar documents:
# 1. Embed all documents
documents = ["doc1 text", "doc2 text", "doc3 text"]
response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=documents
)
doc_embeddings = [d.embedding for d in response.data]

# 2. Embed query
query = "search query"
query_response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=query
)
query_embedding = query_response.data[0].embedding

# 3. Find most similar
similarities = [
    cosine_similarity(query_embedding, doc_emb)
    for doc_emb in doc_embeddings
]
best_match = documents[np.argmax(similarities)]

Clustering

Group similar texts:
from sklearn.cluster import KMeans

# Generate embeddings
texts = ["text1", "text2", "text3", ...]
response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=texts
)
embeddings = np.array([d.embedding for d in response.data])

# Cluster
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(embeddings)
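The `labels` array assigns a cluster index to each input text; grouping the texts back by cluster is then straightforward (shown here with hard-coded labels so the snippet runs standalone):

```python
from collections import defaultdict

# Hard-coded stand-ins for the texts and KMeans labels above
texts = ["text1", "text2", "text3", "text4"]
labels = [0, 1, 0, 2]

clusters = defaultdict(list)
for text, label in zip(texts, labels):
    clusters[label].append(text)

print(dict(clusters))  # {0: ['text1', 'text3'], 1: ['text2'], 2: ['text4']}
```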

Recommendation Systems

Recommend similar items:
# Embed user preferences and item descriptions
user_pref = "I like action movies with great CGI"
items = ["Movie A: Action-packed blockbuster", "Movie B: Romantic drama", ...]

response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=[user_pref] + items
)

user_emb = response.data[0].embedding
item_embs = [d.embedding for d in response.data[1:]]

# Rank by similarity
scores = [cosine_similarity(user_emb, item) for item in item_embs]
top_items = sorted(zip(items, scores), key=lambda x: x[1], reverse=True)

Error Handling

from openai import OpenAI, APIStatusError

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123"
)

try:
    response = client.embeddings.create(
        model="invalid-model",
        input="Test text"
    )
except APIStatusError as e:
    print(f"API error: {e.message}")
    print(f"Status: {e.status_code}")
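Transient failures (rate limits, network blips) are often worth retrying. A minimal backoff wrapper might look like this; the retry count and delays are arbitrary choices, not gateway requirements:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Demo with a function that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = with_retries(flaky, base_delay=0.01)
```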

Performance Tips

Process multiple texts in a single request for better throughput:
# Good: Single request for 10 texts
response = client.embeddings.create(
    model="gemini:text-embedding-004",
    input=texts  # List of 10 texts
)

# Avoid: 10 separate requests
for text in texts:
    response = client.embeddings.create(
        model="gemini:text-embedding-004",
        input=text
    )
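Providers typically cap how many inputs a single request can carry; the limit varies, and 100 below is just a placeholder. A small helper to split a large list into request-sized chunks:

```python
def chunked(items, size=100):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

texts = [f"text {i}" for i in range(250)]
batches = list(chunked(texts, size=100))
print([len(b) for b in batches])  # [100, 100, 50]
```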
Cache embeddings for frequently used texts:
import pickle

# Save embeddings
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

# Load cached embeddings
with open('embeddings.pkl', 'rb') as f:
    embeddings = pickle.load(f)
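For an in-process cache, keying by a hash of the text avoids re-embedding duplicates. A minimal sketch; the `fake_embed` function is a stand-in for a real API call:

```python
import hashlib

cache = {}

def embed_cached(text, embed_fn):
    """Return a cached embedding, computing it via embed_fn on a miss."""
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in cache:
        cache[key] = embed_fn(text)
    return cache[key]

# Stand-in embedder that counts how often it is actually called
calls = {"n": 0}

def fake_embed(text):
    calls["n"] += 1
    return [float(len(text))]

embed_cached("hello", fake_embed)
embed_cached("hello", fake_embed)  # second call served from cache
```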
Choose appropriate model for your use case:
  • Gemini: Best for multilingual and semantic search
  • Ollama: Best for privacy and offline usage
  • switchAI: Best for unified access to multiple providers

Limitations

| Provider | Max Tokens | Dimensions | Notes |
|---|---|---|---|
| Gemini | 2048 | 768 | Supports batch requests |
| Ollama | Model-dependent | Model-dependent | Local processing |
| Claude | N/A | N/A | No embedding support |
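The limits above can be checked client-side before sending a request. A rough pre-flight check, with token counts approximated by whitespace splitting (which undercounts real tokenization):

```python
# Per-provider limits from the table above; None means model-dependent
MAX_TOKENS = {"gemini": 2048, "ollama": None, "claude": 0}

def check_input(model: str, text: str) -> bool:
    """Return True if the input is likely within the provider's limit."""
    provider = model.split(":", 1)[0]
    limit = MAX_TOKENS.get(provider)
    if limit == 0:
        return False  # provider has no embedding support
    if limit is None:
        return True   # model-dependent; defer to the server
    # Crude token estimate: whitespace-split word count
    return len(text.split()) <= limit

ok = check_input("gemini:text-embedding-004", "The quick brown fox")
```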

Next Steps