LLM Providers¶
Qanot AI supports four LLM providers out of the box, with automatic failover when multiple providers are configured.
Supported Providers¶
Anthropic (Claude)¶
Features:
- Native streaming via messages.stream()
- Prompt caching with cache_control: ephemeral on the system prompt
- OAuth token support (tokens starting with sk-ant-oat use Bearer auth)
- Cost tracking with per-model pricing
Available models:
| Model | Input $/MTok | Output $/MTok | Cache Read | Cache Write |
|---|---|---|---|---|
| claude-sonnet-4-6 | 3.00 | 15.00 | 0.30 | 3.75 |
| claude-opus-4-6 | 15.00 | 75.00 | 1.50 | 18.75 |
| claude-haiku-4-5-20251001 | 0.80 | 4.00 | 0.08 | 1.00 |
OAuth tokens: If your API key starts with sk-ant-oat, Qanot automatically switches to Bearer authentication with the anthropic-beta: oauth-2025-04-20 header.
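The detection rule above can be sketched as a small helper. This is illustrative only (the helper name is invented, and the x-api-key fallback for regular keys is an assumption based on Anthropic's standard header):

```python
def build_auth_headers(api_key: str) -> dict:
    # OAuth tokens (sk-ant-oat...) use Bearer auth plus the beta header;
    # regular keys use Anthropic's standard x-api-key header.
    if api_key.startswith("sk-ant-oat"):
        return {
            "authorization": f"Bearer {api_key}",
            "anthropic-beta": "oauth-2025-04-20",
        }
    return {"x-api-key": api_key}
```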
1M context window: Opus 4.6 and Sonnet 4.6 support a context window of up to 1M tokens. Qanot auto-detects these models and adjusts max_context_tokens accordingly.
Thinking display: Qanot sets thinking.display: "omitted" by default, which reduces time-to-first-token by not streaming the thinking content back.
Server-side features:
- Code execution (code_execution_20250825): Enable with "code_execution": true in config. Allows the agent to run Python code in Anthropic's sandbox. Free when used with web search.
- Memory tool (memory_20250818): Enable with "memory_tool": true in config. Adds trained memory behavior where the model auto-checks and creates structured memory notes.
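Combining both flags with a standard provider entry, a config enabling the server-side features might look like this (the flag placement alongside the other provider fields is an assumption based on the config examples elsewhere on this page):

```json
{
  "provider": "anthropic",
  "model": "claude-sonnet-4-6",
  "api_key": "sk-ant-...",
  "code_execution": true,
  "memory_tool": true
}
```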
OpenAI (GPT)¶
Features:
- Streaming via chat completions with stream: true
- Function calling format (tool definitions auto-converted from Anthropic format)
- Usage tracking with stream_options: include_usage
Available models:
| Model | Input $/MTok | Output $/MTok |
|---|---|---|
| gpt-4.1 | 2.00 | 8.00 |
| gpt-4.1-mini | 0.40 | 1.60 |
| gpt-4o | 2.50 | 10.00 |
| gpt-4o-mini | 0.15 | 0.60 |
Google Gemini¶
Features:
- Uses OpenAI-compatible API via generativelanguage.googleapis.com
- Automatic stripping of unsupported JSON Schema keys (patternProperties, additionalProperties, $ref)
- Synthetic user turn insertion (Gemini requires conversations to start with a user message)
- Free embedding tier for RAG (preferred embedder)
Available models:
| Model | Input $/MTok | Output $/MTok |
|---|---|---|
| gemini-3.1-pro-preview | 2.00 | 12.00 |
| gemini-3.1-flash-lite | 0.25 | 1.50 |
| gemini-3-flash-preview | 0.15 | 0.60 |
| gemini-2.5-pro | 1.25 | 10.00 |
| gemini-2.5-flash | 0.15 | 0.60 |
| gemini-2.0-flash | 0.10 | 0.40 |
Custom base URL: You can override the base URL for Gemini, which is useful for proxies or regional endpoints:
```json
{
  "provider": "gemini",
  "model": "gemini-2.5-flash",
  "api_key": "AIza...",
  "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/"
}
```
Groq¶
Features:
- Uses OpenAI-compatible API via api.groq.com
- Very fast inference (sub-second responses for smaller models)
- Generous free tier
Available models:
| Model | Input $/MTok | Output $/MTok |
|---|---|---|
| meta-llama/llama-4-scout-17b-16e-instruct | 0.11 | 0.18 |
| llama-3.3-70b-versatile | 0.59 | 0.79 |
| llama-3.1-8b-instant | 0.05 | 0.08 |
| qwen/qwen3-32b | 0.29 | 0.39 |
| moonshotai/kimi-k2-instruct | 0.20 | 0.20 |
| groq/compound | 0.59 | 0.79 |
| groq/compound-mini | 0.05 | 0.08 |
Limitation: Groq does not offer an embedding API. If Groq is your only provider, RAG will not function unless you also add a Gemini or OpenAI provider.
Message Format Conversion¶
Qanot uses Anthropic's message format internally (tool_use/tool_result blocks). The OpenAI, Gemini, and Groq providers automatically convert:
- Tool definitions: Anthropic input_schema format is converted to OpenAI function.parameters
- Messages: tool_use blocks become function tool calls; tool_result blocks become tool role messages
- System prompt: Moved from a dedicated field to a system role message
This conversion is transparent. You do not need to worry about format differences.
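For illustration, the tool-definition conversion amounts to relocating the JSON Schema from Anthropic's input_schema field into OpenAI's function.parameters field (the function name here is invented; the real converter lives inside each provider):

```python
def anthropic_tool_to_openai(tool: dict) -> dict:
    # Anthropic tools carry their JSON Schema under "input_schema";
    # OpenAI expects the same schema under function.parameters.
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],
        },
    }
```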
Multi-Provider Failover¶
When you configure multiple providers, Qanot creates a FailoverProvider that automatically switches between them on errors.
Configuration¶
```json
{
  "providers": [
    {
      "name": "claude-primary",
      "provider": "anthropic",
      "model": "claude-sonnet-4-6",
      "api_key": "sk-ant-..."
    },
    {
      "name": "gemini-secondary",
      "provider": "gemini",
      "model": "gemini-2.5-flash",
      "api_key": "AIza..."
    },
    {
      "name": "groq-fallback",
      "provider": "groq",
      "model": "llama-3.3-70b-versatile",
      "api_key": "gsk_..."
    }
  ]
}
```
How Failover Works¶
- The first provider in the list is the active provider
- On each API call, Qanot tries the active provider first
- If it fails with a classified error, the next available provider is tried
- Successful calls reset the failure state for that provider
- Failed providers enter a cooldown period
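The steps above can be sketched as a simple loop. Class and exception names here are illustrative, not Qanot's actual internals:

```python
import time

class TransientError(Exception):
    """Stands in for Qanot's classified transient errors (illustrative)."""

class FailoverSketch:
    def __init__(self, providers):
        self.providers = providers   # ordered: first entry is the active provider
        self.cooldown_until = {}     # provider name -> unix timestamp
        self.failures = {}           # provider name -> consecutive failure count

    def available(self):
        # Skip providers that are still cooling down
        now = time.time()
        return [p for p in self.providers
                if self.cooldown_until.get(p.name, 0) <= now]

    def call(self, request):
        last_error = None
        for p in self.available():
            try:
                result = p.chat(request)
                self.failures[p.name] = 0   # success resets failure state
                return result
            except TransientError as e:
                # Transient error: cool this provider down, try the next one
                self.failures[p.name] = self.failures.get(p.name, 0) + 1
                self.cooldown_until[p.name] = time.time() + min(
                    120 * self.failures[p.name], 600)
                last_error = e
        raise last_error or RuntimeError("no providers available")
```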
Error Classification¶
Errors are classified into categories that determine retry behavior:
| Error Type | HTTP Codes | Behavior |
|---|---|---|
| rate_limit | 429 | Transient -- try next provider, cooldown |
| overloaded | 503, 529 | Transient -- try next provider, cooldown |
| timeout | 408, 500, 502, 504 | Transient -- try next provider, cooldown |
| not_found | 404 | Transient -- try next provider |
| auth | 401, 403 | Permanent -- provider disabled until restart |
| billing | 402 | Permanent -- provider disabled until restart |
| unknown | Other | Not retried, error raised |
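As a sketch, the classification table maps HTTP status codes to categories like this (function name is an assumption, not Qanot's actual API):

```python
def classify_error(status: int) -> str:
    # Map an HTTP status code to the error categories in the table above.
    if status == 429:
        return "rate_limit"
    if status in (503, 529):
        return "overloaded"
    if status in (408, 500, 502, 504):
        return "timeout"
    if status == 404:
        return "not_found"
    if status in (401, 403):
        return "auth"
    if status == 402:
        return "billing"
    return "unknown"
```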
Cooldown Mechanism¶
- Transient failures: Provider enters cooldown for 120 * failure_count seconds (max 600s)
- Permanent failures: Provider is disabled for the session lifetime
- Success: Resets failure count and cooldown for that provider
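The cooldown math above reduces to a one-liner (function name is illustrative):

```python
def cooldown_seconds(failure_count: int) -> int:
    # 120s per consecutive failure, capped at 600s (10 minutes)
    return min(120 * failure_count, 600)
```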
Provider Initialization¶
Providers are lazily initialized: fallback providers are only created when first needed (on failover), reducing startup time and memory usage.
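The pattern can be sketched as follows, caching provider instances and only invoking a factory on first use (names are illustrative, not Qanot's internals):

```python
class LazyProviders:
    # Store zero-arg factories; build each provider object on first access.
    def __init__(self, factories):
        self._factories = factories
        self._instances = {}

    def get(self, index):
        if index not in self._instances:
            self._instances[index] = self._factories[index]()
        return self._instances[index]
```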
Adding Custom Providers¶
Any provider that speaks the OpenAI chat completions API can be used through the openai provider type with a custom base_url:
```json
{
  "provider": "openai",
  "model": "your-model-name",
  "api_key": "your-key",
  "base_url": "https://your-api.example.com/v1"
}
```
This works for:
- OpenRouter
- Azure OpenAI
- Local models (vLLM, llama.cpp server)
- Any OpenAI-compatible API
Ollama Native API¶
For Ollama, Qanot uses the native /api/chat endpoint with think=false instead of the OpenAI-compatible endpoint. This provides approximately 30x faster inference by disabling the thinking/reasoning step that Ollama's OpenAI compatibility layer does not support efficiently. The native API is automatically selected when Ollama is detected (by API key or base URL containing port 11434).
FastEmbed for RAG with Ollama¶
When Ollama is your LLM provider, Qanot automatically selects FastEmbed (CPU-based, ONNX runtime) for RAG embeddings instead of requiring a separate embedding API. This avoids GPU VRAM conflicts between the chat model and the embedding model. Install with pip install fastembed. If FastEmbed is not installed, Qanot falls back to using Ollama's own embedding endpoint via the OpenAI-compatible API.
For providers with significant API differences, you can subclass LLMProvider:
```python
from qanot.providers.base import LLMProvider, ProviderResponse, StreamEvent

class MyProvider(LLMProvider):
    def __init__(self, api_key: str, model: str):
        self.model = model
        # Initialize your client

    async def chat(self, messages, tools=None, system=None) -> ProviderResponse:
        # Implement chat
        return ProviderResponse(content="Hello", stop_reason="end_turn")

    async def chat_stream(self, messages, tools=None, system=None):
        # Optional: implement streaming
        # Default falls back to chat() if not overridden
        yield StreamEvent(type="text_delta", text="Hello")
        yield StreamEvent(type="done", response=ProviderResponse(content="Hello"))
```
Register it by modifying _create_single_provider in qanot/providers/failover.py.