Endpoint

POST /v1/chat/completions
Send a conversation to an AI model and receive a text response. Compatible with the OpenAI Chat Completions API.

Request Body

model
string
required
The model to use for completion. Use provider prefix for explicit routing (e.g., geminicli:gemini-2.5-pro) or omit prefix for auto-routing.
messages
array
required
Array of message objects forming the conversation.
stream
boolean
default:false
If true, returns a stream of server-sent events instead of a single response.
temperature
number
Sampling temperature between 0 and 2. Higher values make output more random.
max_tokens
integer
Maximum number of tokens to generate in the completion.
top_p
number
Nucleus sampling parameter. Generally adjust this or temperature, not both.
frequency_penalty
number
default:0
Penalize tokens based on frequency in the text so far (-2.0 to 2.0).
presence_penalty
number
default:0
Penalize tokens based on presence in the text so far (-2.0 to 2.0).
stop
string | array
Up to 4 sequences where the API will stop generating tokens.
tools
array
List of tools the model may call. Currently supports function calling.
extra_body
object
Provider-specific extensions. See CLI Attachments for CLI provider options.
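Because extra_body is a top-level object in the request, a raw payload can carry provider options directly. A minimal sketch building such a body — note the cli_attachments key is hypothetical; the real option names are listed under CLI Attachments:

```python
import json

# "cli_attachments" is a hypothetical key used for illustration only;
# see CLI Attachments for the actual CLI provider options.
body = {
    "model": "geminicli:gemini-2.5-pro",
    "messages": [{"role": "user", "content": "Summarize this file"}],
    "extra_body": {"cli_attachments": ["./notes.txt"]},
}

# Serialize exactly as it would be sent with curl -d
payload = json.dumps(body)
print(payload)
```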

Response Format

id
string
Unique identifier for the completion.
object
string
Object type: chat.completion, or chat.completion.chunk for streaming.
created
integer
Unix timestamp of when the completion was created.
model
string
The model used for the completion.
choices
array
Array of completion choices.
usage
object
Token usage statistics.
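Put together, a non-streaming completion body can be unpacked like this. A sketch using the standard json module, with illustrative values matching the fields above:

```python
import json

# Illustrative payload shaped like the response fields documented above
payload = json.loads("""
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "gemini-2.5-pro",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Paris"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 1, "total_tokens": 13}
}
""")

# The reply text lives on the first choice's message
answer = payload["choices"][0]["message"]["content"]
print(answer)
print(payload["usage"]["total_tokens"])  # total token count from usage
```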

Examples

Basic Request

curl http://localhost:18080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-test-123" \
  -d '{
    "model": "gemini-2.5-pro",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

Streaming Response

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123"
)

stream = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    # Guard: some chunks (e.g., the final one) may carry no choices
    # or a delta with no content
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Multi-Turn Conversation

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to calculate fibonacci."},
    {"role": "assistant", "content": "Here's a recursive implementation..."},
    {"role": "user", "content": "Now make it iterative."}
]

response = client.chat.completions.create(
    model="geminicli:gemini-2.5-pro",
    messages=messages
)

Provider-Specific Routing

# Let switchAILocal choose the best provider
curl http://localhost:18080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-test-123" \
  -d '{
    "model": "gemini-2.5-pro",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
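Routing can also be pinned to a specific provider by prefixing the model name, as described under the model parameter above:

```shell
# Route explicitly to the geminicli provider via the model prefix
curl http://localhost:18080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-test-123" \
  -d '{
    "model": "geminicli:gemini-2.5-pro",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```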

Advanced Features

Temperature Control

Adjust randomness of responses:
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "Write a creative story"}],
    temperature=1.5  # Higher = more creative
)

Token Limits

Constrain response length:
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "Summarize quantum physics"}],
    max_tokens=100  # Limit to 100 tokens
)

Stop Sequences

Stop generation at specific strings:
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "List 5 colors:"}],
    stop=["\n\n", "6."]  # Stop at double newline or "6."
)

Error Handling

from openai import OpenAI, APIError, RateLimitError

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123"
)

try:
    response = client.chat.completions.create(
        model="invalid-model",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError:
    print("Rate limit exceeded")
except APIError as e:
    print(f"API error: {e.message}")

Next Steps