Endpoint
POST /v1/chat/completions
Send a conversation to an AI model and receive a text response. Compatible with the OpenAI Chat Completions API.
Request Body
model (string): The model to use for completion. Use a provider prefix for explicit routing (e.g., geminicli:gemini-2.5-pro) or omit the prefix for auto-routing.
messages (array): Array of message objects forming the conversation. Each message object has the following properties:
  role (string): The role of the message author: system, user, or assistant.
  content (string or array): The message content. Can be a string or an array of content parts for multimodal inputs.
  name (string): Optional name for the message author.
stream (boolean): If true, returns a stream of server-sent events instead of a single response.
temperature (number): Sampling temperature between 0 and 2. Higher values make output more random.
max_tokens (integer): Maximum number of tokens to generate in the completion.
top_p (number): Nucleus sampling parameter; an alternative to temperature.
frequency_penalty (number): Penalizes tokens based on their frequency in the text so far (-2.0 to 2.0).
presence_penalty (number): Penalizes tokens based on their presence in the text so far (-2.0 to 2.0).
stop (string or array): Up to 4 sequences where the API will stop generating tokens.
tools (array): List of tools the model may call. Currently supports function calling.
Provider-specific extensions are also accepted; see CLI Attachments for CLI provider options.
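As a sketch of the tools parameter, a request body that exposes one function to the model might look like this (the function name and schema are illustrative, not part of the API):

```json
{
  "model": "gemini-2.5-pro",
  "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string", "description": "City name"}
          },
          "required": ["city"]
        }
      }
    }
  ]
}
```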
Response Body
id (string): Unique identifier for the completion.
object (string): Object type, always chat.completion, or chat.completion.chunk for streaming.
created (integer): Unix timestamp of when the completion was created.
model (string): The model used for the completion.
choices (array): Array of completion choices. Each choice object has the following properties:
  index (integer): The index of this choice in the array.
  message (object): The generated message.
    content (string): The generated text content.
    tool_calls (array): Tool calls made by the model, if any.
  finish_reason (string): Why the completion stopped: stop, length, tool_calls, or content_filter.
usage (object): Token usage statistics.
  prompt_tokens (integer): Number of tokens in the prompt.
  completion_tokens (integer): Number of tokens in the completion.
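A non-streaming response built from the fields above might look like this (all values illustrative):

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1719000000,
  "model": "gemini-2.5-pro",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "The capital of France is Paris."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 14, "completion_tokens": 8}
}
```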
Examples
Basic Request
curl http://localhost:18080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-test-123" \
-d '{
"model": "gemini-2.5-pro",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}'
Streaming Response
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123",
)

stream = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
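If you consume the stream without the SDK, each server-sent event is a `data:` line carrying a chat.completion.chunk object, terminated by `data: [DONE]`. A minimal parser for one such line might look like this (the chunk layout follows the response schema above; the helper itself is illustrative):

```python
import json

def extract_delta(sse_line: str) -> str:
    """Pull the text delta out of one SSE data line, if any."""
    if not sse_line.startswith("data: "):
        return ""  # comments and blank keep-alive lines carry no payload
    payload = sse_line[len("data: "):]
    if payload.strip() == "[DONE]":
        return ""  # end-of-stream sentinel, not JSON
    chunk = json.loads(payload)
    # Role-only or empty deltas have no "content" key; normalize to "".
    return chunk["choices"][0]["delta"].get("content") or ""

line = 'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Once"}}]}'
print(extract_delta(line))  # → Once
```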
Multi-Turn Conversation
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to calculate fibonacci."},
    {"role": "assistant", "content": "Here's a recursive implementation..."},
    {"role": "user", "content": "Now make it iterative."},
]

response = client.chat.completions.create(
    model="geminicli:gemini-2.5-pro",
    messages=messages,
)
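The endpoint is stateless: the server keeps no conversation memory, so you re-send the full history on every request, appending the assistant's previous reply before the next user turn. A tiny helper (hypothetical, not part of any SDK) makes the pattern explicit:

```python
def add_turn(history, role, content):
    """Return a new history list with one more message appended.
    The API is stateless, so the whole list is re-sent each call."""
    return history + [{"role": role, "content": content}]

history = [{"role": "user", "content": "Write a fibonacci function."}]
history = add_turn(history, "assistant", "Here's a recursive implementation...")
history = add_turn(history, "user", "Now make it iterative.")
print(len(history))  # → 3
```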
Provider-Specific Routing
# Let switchAILocal choose the best provider
curl http://localhost:18080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-test-123" \
-d '{
"model": "gemini-2.5-pro",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Route to Gemini CLI (uses your CLI subscription)
curl http://localhost:18080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-test-123" \
-d '{
"model": "geminicli:gemini-2.5-pro",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Route to Claude CLI
curl http://localhost:18080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-test-123" \
-d '{
"model": "claudecli:claude-sonnet-4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Route to local Ollama
curl http://localhost:18080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-test-123" \
-d '{
"model": "ollama:llama3.2",
"messages": [{"role": "user", "content": "Hello!"}]
}'
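The routing scheme above is purely a naming convention in the model string: an optional provider prefix separated by a colon. A small helper (hypothetical, for illustration only) shows the rule:

```python
def routed_model(model, provider=None):
    """Build the model string: prefix with "provider:" for explicit
    routing, or leave bare for auto-routing."""
    return f"{provider}:{model}" if provider else model

print(routed_model("gemini-2.5-pro"))               # → gemini-2.5-pro
print(routed_model("llama3.2", provider="ollama"))  # → ollama:llama3.2
```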
Advanced Features
Temperature Control
Adjust randomness of responses:
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "Write a creative story"}],
    temperature=1.5,  # Higher = more creative
)
Token Limits
Constrain response length:
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "Summarize quantum physics"}],
    max_tokens=100,  # Limit to 100 tokens
)
Stop Sequences
Stop generation at specific strings:
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "List 5 colors:"}],
    stop=["\n\n", "6."],  # Stop at a double newline or "6."
)
Error Handling
from openai import OpenAI, APIError, RateLimitError

client = OpenAI(
    base_url="http://localhost:18080/v1",
    api_key="sk-test-123",
)

try:
    response = client.chat.completions.create(
        model="invalid-model",
        messages=[{"role": "user", "content": "Hello"}],
    )
except RateLimitError:
    print("Rate limit exceeded")
except APIError as e:
    print(f"API error: {e.message}")
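Rate-limit errors are usually transient, so callers often retry with exponential backoff. A generic wrapper along these lines (a sketch, not part of the openai SDK) is typically enough:

```python
import time

def with_retries(call, retries=3, base_delay=1.0, retry_on=(Exception,)):
    """Call `call()` and retry on the given exception types, doubling
    the delay after each failed attempt; re-raise after the last one."""
    for attempt in range(retries):
        try:
            return call()
        except retry_on:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage with the client above (illustrative):
# response = with_retries(
#     lambda: client.chat.completions.create(
#         model="gemini-2.5-pro",
#         messages=[{"role": "user", "content": "Hello"}],
#     ),
#     retry_on=(RateLimitError,),
# )
```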
Next Steps