Endpoint
POST /v1/chat/completions
Send a conversation to an AI model and receive a text response. Compatible with the OpenAI Chat Completions API.
Request Body
model (string): The model to use for completion. Use a provider prefix for explicit routing (e.g., geminicli:gemini-2.5-pro) or omit the prefix for auto-routing.
messages (array): Array of message objects forming the conversation. Each message object has the following properties:
  role (string): The role of the message author: system, user, or assistant.
  content (string or array): The message content. Can be a string or an array of content parts for multimodal inputs.
  name (string): Optional name for the message author.
stream (boolean): If true, returns a stream of server-sent events instead of a single response.
temperature (number): Sampling temperature between 0 and 2. Higher values make output more random.
max_tokens (integer): Maximum number of tokens to generate in the completion.
top_p (number): Nucleus sampling parameter; an alternative to temperature.
frequency_penalty (number): Penalizes tokens based on their frequency in the text so far (-2.0 to 2.0).
presence_penalty (number): Penalizes tokens based on their presence in the text so far (-2.0 to 2.0).
stop (string or array): Up to 4 sequences where the API will stop generating tokens.
tools (array): List of tools the model may call. Currently supports function calling.
Provider-specific extensions are also accepted; see CLI Attachments for CLI provider options.
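As a sketch of the tools parameter, a request body that exposes one function to the model might look like this (the function name and schema are illustrative, not part of the API):

```json
{
  "model": "gemini-2.5-pro",
  "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string", "description": "City name"}
          },
          "required": ["city"]
        }
      }
    }
  ]
}
```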
Response Body
id (string): Unique identifier for the completion.
object (string): Object type, always chat.completion, or chat.completion.chunk for streaming.
created (integer): Unix timestamp of when the completion was created.
model (string): The model used for the completion.
choices (array): Array of completion choices. Each choice object has the following properties:
  index (integer): The index of this choice in the array.
  message (object): The generated message.
    content (string): The generated text content.
    tool_calls (array): Tool calls made by the model, if any.
  finish_reason (string): Why the completion stopped: stop, length, tool_calls, or content_filter.
usage (object): Token usage statistics.
  prompt_tokens (integer): Number of tokens in the prompt.
  completion_tokens (integer): Number of tokens in the completion.
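A non-streaming response built from the fields above might look like this (all values illustrative):

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1719000000,
  "model": "gemini-2.5-pro",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "The capital of France is Paris."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 14, "completion_tokens": 8}
}
```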
Examples
Basic Request
curl http://localhost:18080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-test-123" \
-d '{
"model": "gemini-2.5-pro",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}'
Streaming Response
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123",
)

stream = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
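If you consume the stream without the SDK, each server-sent event is a `data:` line carrying a chat.completion.chunk object, terminated by `data: [DONE]`. A minimal parser for one such line might look like this (the chunk layout follows the response schema above; the helper itself is illustrative):

```python
import json

def extract_delta(sse_line: str) -> str:
    """Pull the text delta out of one SSE data line, if any."""
    if not sse_line.startswith("data: "):
        return ""  # comments and blank keep-alive lines carry no payload
    payload = sse_line[len("data: "):]
    if payload.strip() == "[DONE]":
        return ""  # end-of-stream sentinel, not JSON
    chunk = json.loads(payload)
    # Role-only or empty deltas have no "content" key; normalize to "".
    return chunk["choices"][0]["delta"].get("content") or ""

line = 'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Once"}}]}'
print(extract_delta(line))  # → Once
```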
Multi-Turn Conversation
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to calculate fibonacci."},
    {"role": "assistant", "content": "Here's a recursive implementation..."},
    {"role": "user", "content": "Now make it iterative."},
]

response = client.chat.completions.create(
    model="geminicli:gemini-2.5-pro",
    messages=messages,
)
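The endpoint is stateless: the server keeps no conversation memory, so you re-send the full history on every request, appending the assistant's previous reply before the next user turn. A tiny helper (hypothetical, not part of any SDK) makes the pattern explicit:

```python
def add_turn(history, role, content):
    """Return a new history list with one more message appended.
    The API is stateless, so the whole list is re-sent each call."""
    return history + [{"role": role, "content": content}]

history = [{"role": "user", "content": "Write a fibonacci function."}]
history = add_turn(history, "assistant", "Here's a recursive implementation...")
history = add_turn(history, "user", "Now make it iterative.")
print(len(history))  # → 3
```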
Provider-Specific Routing
# Let switchAILocal choose the best provider
curl http://localhost:18080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-test-123" \
-d '{
"model": "gemini-2.5-pro",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Route to Gemini CLI (uses your CLI subscription)
curl http://localhost:18080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-test-123" \
-d '{
"model": "geminicli:gemini-2.5-pro",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Route to Claude CLI
curl http://localhost:18080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-test-123" \
-d '{
"model": "claudecli:claude-sonnet-4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Route to local Ollama
curl http://localhost:18080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-test-123" \
-d '{
"model": "ollama:llama3.2",
"messages": [{"role": "user", "content": "Hello!"}]
}'
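The routing scheme above is purely a naming convention in the model string: an optional provider prefix separated by a colon. A small helper (hypothetical, for illustration only) shows the rule:

```python
def routed_model(model, provider=None):
    """Build the model string: prefix with "provider:" for explicit
    routing, or leave bare for auto-routing."""
    return f"{provider}:{model}" if provider else model

print(routed_model("gemini-2.5-pro"))               # → gemini-2.5-pro
print(routed_model("llama3.2", provider="ollama"))  # → ollama:llama3.2
```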
Advanced Features
Temperature Control
Adjust randomness of responses:
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "Write a creative story"}],
    temperature=1.5,  # Higher = more creative
)
Token Limits
Constrain response length:
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "Summarize quantum physics"}],
    max_tokens=100,  # Limit to 100 tokens
)
Stop Sequences
Stop generation at specific strings:
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "List 5 colors:"}],
    stop=["\n\n", "6."],  # Stop at a double newline or "6."
)
Error Handling
from openai import OpenAI, APIError, RateLimitError

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123",
)

try:
    response = client.chat.completions.create(
        model="invalid-model",
        messages=[{"role": "user", "content": "Hello"}],
    )
except RateLimitError:
    print("Rate limit exceeded")
except APIError as e:
    print(f"API error: {e.message}")
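Rate-limit errors are usually transient, so callers often retry with exponential backoff. A generic wrapper along these lines (a sketch, not part of the openai SDK) is typically enough:

```python
import time

def with_retries(call, retries=3, base_delay=1.0, retry_on=(Exception,)):
    """Call `call()` and retry on the given exception types, doubling
    the delay after each failed attempt; re-raise after the last one."""
    for attempt in range(retries):
        try:
            return call()
        except retry_on:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage with the client above (illustrative):
# response = with_retries(
#     lambda: client.chat.completions.create(
#         model="gemini-2.5-pro",
#         messages=[{"role": "user", "content": "Hello"}],
#     ),
#     retry_on=(RateLimitError,),
# )
```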
Next Steps