LLM Providers¶
Qanot AI supports four LLM providers out of the box, with automatic failover when multiple providers are configured.
Supported Providers¶
Anthropic (Claude)¶
Features:
- Native streaming via messages.stream()
- Prompt caching with cache_control: ephemeral on the system prompt
- OAuth token support (tokens starting with sk-ant-oat use Bearer auth)
- Cost tracking with per-model pricing
Available models:
| Model | Input $/MTok | Output $/MTok | Cache Read | Cache Write |
|---|---|---|---|---|
| claude-sonnet-4-6 | 3.00 | 15.00 | 0.30 | 3.75 |
| claude-opus-4-6 | 15.00 | 75.00 | 1.50 | 18.75 |
| claude-haiku-4-5-20251001 | 0.80 | 4.00 | 0.08 | 1.00 |
OAuth tokens: If your API key starts with sk-ant-oat, Qanot automatically switches to Bearer authentication with the anthropic-beta: oauth-2025-04-20 header.
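The detection rule above can be sketched as a small helper. This is illustrative only (the helper name is invented, and the x-api-key fallback for regular keys is an assumption based on Anthropic's standard header):

```python
def build_auth_headers(api_key: str) -> dict:
    # OAuth tokens (sk-ant-oat...) use Bearer auth plus the beta header;
    # regular keys use Anthropic's standard x-api-key header.
    if api_key.startswith("sk-ant-oat"):
        return {
            "authorization": f"Bearer {api_key}",
            "anthropic-beta": "oauth-2025-04-20",
        }
    return {"x-api-key": api_key}
```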
1M context window: Opus 4.6 and Sonnet 4.6 support a context window of up to 1M tokens. Qanot auto-detects these models and adjusts max_context_tokens accordingly.
Thinking display: Qanot sets thinking.display: "omitted" by default, which reduces time-to-first-token by not streaming the thinking content back.
Server-side features:
- Code execution (code_execution_20250825): Enable with "code_execution": true in config. Allows the agent to run Python code in Anthropic's sandbox. Free when used with web search.
- Memory tool (memory_20250818): Enable with "memory_tool": true in config. Adds trained memory behavior where the model auto-checks and creates structured memory notes.
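Combining both flags with a standard provider entry, a config enabling the server-side features might look like this (the flag placement alongside the other provider fields is an assumption based on the config examples elsewhere on this page):

```json
{
  "provider": "anthropic",
  "model": "claude-sonnet-4-6",
  "api_key": "sk-ant-...",
  "code_execution": true,
  "memory_tool": true
}
```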
OpenAI (GPT)¶
Features:
- Streaming via chat completions with stream: true
- Function calling format (tool definitions auto-converted from Anthropic format)
- Usage tracking with stream_options: include_usage
Available models:
| Model | Input $/MTok | Output $/MTok |
|---|---|---|
| gpt-4.1 | 2.00 | 8.00 |
| gpt-4.1-mini | 0.40 | 1.60 |
| gpt-4o | 2.50 | 10.00 |
| gpt-4o-mini | 0.15 | 0.60 |
Google Gemini¶
Features:
- Uses OpenAI-compatible API via generativelanguage.googleapis.com
- Automatic stripping of unsupported JSON Schema keys (patternProperties, additionalProperties, $ref)
- Synthetic user turn insertion (Gemini requires conversations to start with a user message)
- Free embedding tier for RAG (preferred embedder)
Available models:
| Model | Input $/MTok | Output $/MTok |
|---|---|---|
| gemini-3.1-pro-preview | 2.00 | 12.00 |
| gemini-3.1-flash-lite | 0.25 | 1.50 |
| gemini-3-flash-preview | 0.15 | 0.60 |
| gemini-2.5-pro | 1.25 | 10.00 |
| gemini-2.5-flash | 0.15 | 0.60 |
| gemini-2.0-flash | 0.10 | 0.40 |
Custom base URL: You can override the base URL for Gemini, which is useful for proxies or regional endpoints:
```json
{
  "provider": "gemini",
  "model": "gemini-2.5-flash",
  "api_key": "AIza...",
  "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/"
}
```
Groq¶
Features:
- Uses OpenAI-compatible API via api.groq.com
- Very fast inference (sub-second responses for smaller models)
- Generous free tier
Available models:
| Model | Input $/MTok | Output $/MTok |
|---|---|---|
| meta-llama/llama-4-scout-17b-16e-instruct | 0.11 | 0.18 |
| llama-3.3-70b-versatile | 0.59 | 0.79 |
| llama-3.1-8b-instant | 0.05 | 0.08 |
| qwen/qwen3-32b | 0.29 | 0.39 |
| moonshotai/kimi-k2-instruct | 0.20 | 0.20 |
| groq/compound | 0.59 | 0.79 |
| groq/compound-mini | 0.05 | 0.08 |
Limitation: Groq does not offer an embedding API. If Groq is your only provider, RAG will not function unless you also add a Gemini or OpenAI provider.
Message Format Conversion¶
Qanot uses Anthropic's message format internally (tool_use/tool_result blocks). The OpenAI, Gemini, and Groq providers automatically convert:
- Tool definitions: Anthropic input_schema format is converted to OpenAI function.parameters
- Messages: tool_use blocks become function tool calls; tool_result blocks become tool role messages
- System prompt: Moved from a dedicated field to a system role message
This conversion is transparent. You do not need to worry about format differences.
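For illustration, the tool-definition conversion amounts to relocating the JSON Schema from Anthropic's input_schema field into OpenAI's function.parameters field (the function name here is invented; the real converter lives inside each provider):

```python
def anthropic_tool_to_openai(tool: dict) -> dict:
    # Anthropic tools carry their JSON Schema under "input_schema";
    # OpenAI expects the same schema under function.parameters.
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],
        },
    }
```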
Multi-Provider Failover¶
When you configure multiple providers, Qanot creates a FailoverProvider that automatically switches between them on errors.
Configuration¶
```json
{
  "providers": [
    {
      "name": "claude-primary",
      "provider": "anthropic",
      "model": "claude-sonnet-4-6",
      "api_key": "sk-ant-..."
    },
    {
      "name": "gemini-secondary",
      "provider": "gemini",
      "model": "gemini-2.5-flash",
      "api_key": "AIza..."
    },
    {
      "name": "groq-fallback",
      "provider": "groq",
      "model": "llama-3.3-70b-versatile",
      "api_key": "gsk_..."
    }
  ]
}
```
How Failover Works¶
- The first provider in the list is the active provider
- On each API call, Qanot tries the active provider first
- If it fails with a classified error, the next available provider is tried
- Successful calls reset the failure state for that provider
- Failed providers enter a cooldown period
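The steps above can be sketched as a simple loop. Class and exception names here are illustrative, not Qanot's actual internals:

```python
import time

class TransientError(Exception):
    """Stands in for Qanot's classified transient errors (illustrative)."""

class FailoverSketch:
    def __init__(self, providers):
        self.providers = providers   # ordered: first entry is the active provider
        self.cooldown_until = {}     # provider name -> unix timestamp
        self.failures = {}           # provider name -> consecutive failure count

    def available(self):
        # Skip providers that are still cooling down
        now = time.time()
        return [p for p in self.providers
                if self.cooldown_until.get(p.name, 0) <= now]

    def call(self, request):
        last_error = None
        for p in self.available():
            try:
                result = p.chat(request)
                self.failures[p.name] = 0   # success resets failure state
                return result
            except TransientError as e:
                # Transient error: cool this provider down, try the next one
                self.failures[p.name] = self.failures.get(p.name, 0) + 1
                self.cooldown_until[p.name] = time.time() + min(
                    120 * self.failures[p.name], 600)
                last_error = e
        raise last_error or RuntimeError("no providers available")
```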
Error Classification¶
Errors are classified into categories that determine retry behavior:
| Error Type | HTTP Codes | Behavior |
|---|---|---|
| rate_limit | 429 | Transient -- try next provider, cooldown |
| overloaded | 503, 529 | Transient -- try next provider, cooldown |
| timeout | 408, 500, 502, 504 | Transient -- try next provider, cooldown |
| not_found | 404 | Transient -- try next provider |
| auth | 401, 403 | Permanent -- provider disabled until restart |
| billing | 402 | Permanent -- provider disabled until restart |
| unknown | Other | Not retried, error raised |
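As a sketch, the classification table maps HTTP status codes to categories like this (function name is an assumption, not Qanot's actual API):

```python
def classify_error(status: int) -> str:
    # Map an HTTP status code to the error categories in the table above.
    if status == 429:
        return "rate_limit"
    if status in (503, 529):
        return "overloaded"
    if status in (408, 500, 502, 504):
        return "timeout"
    if status == 404:
        return "not_found"
    if status in (401, 403):
        return "auth"
    if status == 402:
        return "billing"
    return "unknown"
```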
Cooldown Mechanism¶
- Transient failures: Provider enters cooldown for 120 * failure_count seconds (max 600s)
- Permanent failures: Provider is disabled for the session lifetime
- Success: Resets failure count and cooldown for that provider
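The cooldown math above reduces to a one-liner (function name is illustrative):

```python
def cooldown_seconds(failure_count: int) -> int:
    # 120s per consecutive failure, capped at 600s (10 minutes)
    return min(120 * failure_count, 600)
```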
Provider Initialization¶
Providers are lazily initialized: fallback providers are only created when first needed (on failover), reducing startup time and memory usage.
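The pattern can be sketched as follows, caching provider instances and only invoking a factory on first use (names are illustrative, not Qanot's internals):

```python
class LazyProviders:
    # Store zero-arg factories; build each provider object on first access.
    def __init__(self, factories):
        self._factories = factories
        self._instances = {}

    def get(self, index):
        if index not in self._instances:
            self._instances[index] = self._factories[index]()
        return self._instances[index]
```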
Adding Custom Providers¶
Any provider that speaks the OpenAI chat completions API can be used through the openai provider type with a custom base_url:
```json
{
  "provider": "openai",
  "model": "your-model-name",
  "api_key": "your-key",
  "base_url": "https://your-api.example.com/v1"
}
```
This works for:
- OpenRouter
- Azure OpenAI
- Local models (vLLM, llama.cpp server)
- Any OpenAI-compatible API
Ollama Native API¶
For Ollama, Qanot uses the native /api/chat endpoint with think=false instead of the OpenAI-compatible endpoint. This provides approximately 30x faster inference by disabling the thinking/reasoning step that Ollama's OpenAI compatibility layer does not support efficiently. The native API is automatically selected when Ollama is detected (by API key or base URL containing port 11434).
FastEmbed for RAG with Ollama¶
When Ollama is your LLM provider, Qanot automatically selects FastEmbed (CPU-based, ONNX runtime) for RAG embeddings instead of requiring a separate embedding API. This avoids GPU VRAM conflicts between the chat model and the embedding model. Install with pip install fastembed. If FastEmbed is not installed, Qanot falls back to using Ollama's own embedding endpoint via the OpenAI-compatible API.
For providers with significant API differences, you can subclass LLMProvider:
```python
from qanot.providers.base import LLMProvider, ProviderResponse, StreamEvent

class MyProvider(LLMProvider):
    def __init__(self, api_key: str, model: str):
        self.model = model
        # Initialize your client

    async def chat(self, messages, tools=None, system=None) -> ProviderResponse:
        # Implement chat
        return ProviderResponse(content="Hello", stop_reason="end_turn")

    async def chat_stream(self, messages, tools=None, system=None):
        # Optional: implement streaming
        # Default falls back to chat() if not overridden
        yield StreamEvent(type="text_delta", text="Hello")
        yield StreamEvent(type="done", response=ProviderResponse(content="Hello"))
```
Register it by modifying _create_single_provider in qanot/providers/failover.py.