# AI Gateway
One OpenAI-compatible endpoint that fans out to OpenAI, Anthropic, and Google Gemini — with BYO API keys, automatic failover, and per-org usage tracking.
The AI Gateway is an OpenAI-compatible HTTP layer that sits in front of every LLM provider you’ve configured. Your applications, workflows, and agents call one endpoint, and the gateway routes each request to the right provider — falling back to alternates when one fails, enforcing rate limits, and emitting usage events for billing and analytics.
## How it works
The gateway is embedded in the main proxifai binary at `internal/llmgateway/embedded`. It's mounted under `/api/v1/llm` on the same HTTP server that serves the rest of the API — there's no separate gateway service to deploy. Every org configures providers in the `model_provider` table; the gateway resolves them at request time, caches the resolution for 60 seconds, and dispatches.
```
┌──────────────┐   POST /api/v1/llm/v1/chat/completions    ┌─────────────────┐
│ your client  │ ────────────────────────────────────────► │   AI Gateway    │
│ (curl, SDK,  │                                           │ (chi sub-router │
│  workflow,   │ ◄──────────────────────────────────────── │  in proxifai)   │
│  agent)      │         200 OK · streamed tokens          │                 │
└──────────────┘                                           └────────┬────────┘
                                                                    │
                                                                    ▼
                                                     ┌──────────┬──────────┬──────────┐
                                                     │  OpenAI  │ Anthropic│  Gemini  │
                                                     │   API    │   API    │   API    │
                                                     └──────────┴──────────┴──────────┘
```
When a request lands:
- Auth — `Authorization: Bearer <key>` (or Anthropic-style `x-api-key`) is verified against the configured static keys, or against an HMAC-signed `pfai_<execID>_<sig>` token issued by the workflow runtime.
- Org resolution — the org is read from the `X-Org-Id` header (set upstream by the main API) or defaults to `"default"`.
- Provider lookup — providers and model-to-provider mappings are loaded from the database (cached 60 s).
- BYO override — if the user has personal keys in `user_provider_key`, those get prepended to the routing chain so they're tried before org-level keys.
- Routing — for the requested model, the circuit breaker filters out providers currently in the open state; the rest are tried in order (see the sketch after this list).
- Retry — failed attempts retry up to 2 times (500 ms initial wait, 3 s max) before falling back to the next provider.
- Usage event — on success, an event is published to the `LLM_USAGE` NATS stream for downstream tracking.
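As a mental model, the routing portion of that flow looks roughly like the sketch below. This is a hedged illustration with made-up type and function names, not the actual code in `internal/llmgateway`:

```go
package gateway

import (
	"context"
	"errors"
	"time"
)

// Provider and Breaker are illustrative stand-ins for the real types in
// internal/llmgateway; this sketches the flow, not the actual implementation.
type Provider interface {
	Name() string
	Complete(ctx context.Context, req any) (any, error)
}

type Breaker interface{ Open(name string) bool }

func dispatch(ctx context.Context, br Breaker, userChain, orgChain []Provider, req any) (any, error) {
	// BYO override: user_provider_key entries are tried before org keys.
	chain := append(userChain, orgChain...)

	for _, p := range chain {
		if br.Open(p.Name()) {
			continue // circuit open: skip this provider
		}
		wait := 500 * time.Millisecond
		for attempt := 1; attempt <= 2; attempt++ {
			resp, err := p.Complete(ctx, req)
			if err == nil {
				return resp, nil // success: a usage event is published here
			}
			if attempt < 2 {
				time.Sleep(wait)
				wait *= 2
				if wait > 3*time.Second {
					wait = 3 * time.Second
				}
			}
		}
		// retries exhausted: fall back to the next provider in the chain
	}
	return nil, errors.New("gateway_error") // surfaces to the caller as 503
}
```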
## Supported providers
Four provider types are supported by the gateway runtime (`internal/llmgateway/dbprovider/resolver.go`):
| `provider_type` | Backed by | Notes |
|---|---|---|
| `openai` | api.openai.com (or any OpenAI-compatible endpoint via `base_url`) | Native OpenAI Chat Completions transport |
| `openai-compatible` | Any OpenAI-shaped API (vLLM, Ollama, OpenRouter, Together, Groq, …) | Same transport as `openai`, used semantically to flag a custom endpoint |
| `anthropic` | api.anthropic.com | Native Anthropic Messages transport |
| `gemini` | generativelanguage.googleapis.com | Google's Gemini API |
The default model catalog below is what each provider type can serve. To enable a subset, set the `models` JSONB column on the `model_provider` row to a JSON array of model IDs.
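For example, to expose only three OpenAI models from a given row, the `models` column would hold something like:

```json
["gpt-4o", "gpt-4o-mini", "o3-mini"]
```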
### Default model catalog
| Provider type | Models |
|---|---|
| `openai` | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o1, o1-mini, o3, o3-mini, o4-mini |
| `anthropic` | claude-haiku-4.5, claude-sonnet-4.5, claude-sonnet-4.6, claude-opus-4.6 |
| `gemini` | gemini-2.0-flash, gemini-2.0-pro, gemini-2.5-flash, gemini-2.5-pro |
| `openai-compatible` | Whatever you list in the `models` column — defined per row |
Out of the box that’s 19 models across three first-party providers. Adding an openai-compatible row lets you plug in any other model your endpoint exposes — Llama, Mistral, DeepSeek, Qwen, local Ollama instances, OpenRouter, etc. — under whatever model IDs you choose.
## Configuring a provider
Each row in `model_provider` represents one provider. Add them through the Settings → Model Providers UI or via the management API; the API key is encrypted at rest using the workspace's encryption key.
| Column | Purpose |
|---|---|
| `provider_type` | One of the four supported types above |
| `name` | Display name (e.g. “openai-prod”) |
| `api_key_encrypted` | Provider API key, encrypted via `internal/crypto` |
| `base_url` | Optional override; e.g. `https://openrouter.ai/api/v1` for OpenRouter |
| `models` | Optional JSON array of model IDs to expose; empty means “use default catalog” |
| `is_enabled` | The row is skipped if false |
After a write, call the management API’s invalidate endpoint or wait up to 60 seconds for the resolver cache to expire.
## Two API surfaces
The gateway accepts requests in either OpenAI or Anthropic format, regardless of which provider ultimately handles them. Pick the surface that matches the SDK you already have.
```bash
curl http://localhost:3000/api/v1/llm/v1/chat/completions \
  -H "Authorization: Bearer $PROXIFAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "stream": true
  }'
```

Any OpenAI SDK works by pointing `base_url` at `http://<your-host>/api/v1/llm/v1`.
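For instance, with the community Go SDK (`github.com/sashabaranov/go-openai`, shown here as one example; other OpenAI SDKs are configured the same way):

```go
package main

import (
	"context"
	"fmt"
	"os"

	openai "github.com/sashabaranov/go-openai"
)

func main() {
	// Point the SDK at the gateway instead of api.openai.com.
	cfg := openai.DefaultConfig(os.Getenv("PROXIFAI_API_KEY"))
	cfg.BaseURL = "http://localhost:3000/api/v1/llm/v1"
	client := openai.NewClientWithConfig(cfg)

	resp, err := client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
		Model: "claude-sonnet-4.6", // any configured model; the gateway routes it
		Messages: []openai.ChatCompletionMessage{
			{Role: openai.ChatMessageRoleUser, Content: "Explain the CAP theorem in two sentences."},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Choices[0].Message.Content)
}
```

The same request, sent to the Anthropic-format surface: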
```bash
curl http://localhost:3000/api/v1/llm/v1/messages \
  -H "x-api-key: $PROXIFAI_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "stream": true
  }'
```

The Anthropic SDK works by pointing `base_url` at `http://<your-host>/api/v1/llm`.
Both surfaces accept any model from any configured provider — you can ask for gpt-4o through the Anthropic-format endpoint and the gateway will translate. Streaming uses Server-Sent Events on both surfaces.
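For example, requesting an OpenAI model through the Anthropic-format surface with plain `net/http` (non-streaming here for brevity):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

func main() {
	// OpenAI model via the Anthropic-format endpoint; the gateway translates.
	body := `{
	  "model": "gpt-4o",
	  "max_tokens": 128,
	  "messages": [{"role": "user", "content": "One-line summary of Raft?"}]
	}`
	req, err := http.NewRequest("POST", "http://localhost:3000/api/v1/llm/v1/messages", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("x-api-key", os.Getenv("PROXIFAI_API_KEY"))
	req.Header.Set("anthropic-version", "2023-06-01")
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}
```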
## Authentication
| Key type | Header | Issued by | Use case |
|---|---|---|---|
| Static API key | `Authorization: Bearer …` or `x-api-key: …` | Workspace admin | Apps, scripts, manual testing |
| Workflow execution key (`pfai_<execID>_<sig>`) | Same | Workflow runtime | Auto-injected into agent containers as `PFAI_TOKEN`; HMAC-signed with `JWT_SECRET` |
Workflow keys carry the executing workflow’s identity, so usage events are attributed to the right run, and credit checks (Enterprise) can deduct against the right org.
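The exact token encoding lives in the gateway source; as a rough sketch, assuming the signature is an HMAC-SHA256 over the execution ID, hex-encoded (the real scheme may differ), minting and verifying would look like:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// mintToken builds a pfai_<execID>_<sig> token. The HMAC-SHA256-over-execID,
// hex-encoded scheme here is an assumption for illustration; check the
// gateway source for the real encoding.
func mintToken(execID string, secret []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(execID))
	return fmt.Sprintf("pfai_%s_%s", execID, hex.EncodeToString(mac.Sum(nil)))
}

// verifyToken recomputes the token and compares in constant time.
func verifyToken(token, execID string, secret []byte) bool {
	return hmac.Equal([]byte(token), []byte(mintToken(execID, secret)))
}

func main() {
	secret := []byte("value of LLM_GATEWAY_HMAC_SECRET or JWT_SECRET")
	tok := mintToken("exec_123", secret)
	fmt.Println(tok, verifyToken(tok, "exec_123", secret))
}
```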
## Resilience
The defaults wired up in `internal/llmgateway/embedded/embedded.go`:
| Mechanism | Setting | Behavior |
|---|---|---|
| Circuit breaker | 5 consecutive failures → open · 30 s reset · 1 half-open probe | Skips a provider that’s failing; auto-recovers when it succeeds again |
| Retry | 2 max attempts · 500 ms → 3 s exponential | Retries the same provider before falling back to the next in the chain |
| Rate limit | 120 req/min, burst 20, per API key | Returns 429 Too Many Requests past the limit |
| Response cache | 1000-entry LRU, 5 min TTL | Caches non-streaming completions keyed by the request body hash |
| Provider cache | 60 s TTL, per org | Avoids reading model_provider on every request |
If every provider for a model trips its circuit, the gateway returns 503 Service Unavailable with `{"error":{"type":"gateway_error"}}`. Apart from auth and rate-limit rejections, this is the only way a request fails outright — partial failures within the fallback chain are invisible to the caller.
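Callers that want to ride out a full outage can treat that 503 as retryable. A minimal client-side sketch (the backoff values here are arbitrary; the gateway's breakers reset after 30 s):

```go
package gatewayclient

import (
	"bytes"
	"net/http"
	"time"
)

// postWithRetry retries only on 503: the gateway answers with the
// gateway_error envelope once every circuit for the model is open.
func postWithRetry(url, key string, body []byte) (*http.Response, error) {
	for wait := time.Second; ; wait *= 2 {
		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Authorization", "Bearer "+key)
		req.Header.Set("Content-Type", "application/json")

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusServiceUnavailable || wait > 8*time.Second {
			return resp, nil // success, a non-retryable status, or out of patience
		}
		resp.Body.Close()
		time.Sleep(wait) // wait out the breaker reset window, then retry
	}
}
```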
## Bring your own key (BYOK)
BYOK is two-layered:
- Workspace-level — keys configured in `model_provider` apply to everyone in the org.
- User-level — a row in `user_provider_key` for `(user_id, org_id, provider_type)` overrides the workspace key for that user, and is prepended to the routing chain so it's tried first.
This lets individual contributors use their own enterprise OpenAI or Anthropic accounts (volume discounts, separate quotas, evaluation credits) while the team's shared key remains the fallback.
Set a user-level `openai-compatible` key with a `base_url` of `https://openrouter.ai/api/v1` and the gateway routes that user's traffic through OpenRouter — useful for trying a model that's not yet in the default catalog.
## Usage tracking
Successful requests publish a `usage.Event` to the `LLM_USAGE` JetStream stream:
```json
{
  "executionId": "exec_…",
  "provider": "anthropic_…",
  "model": "claude-sonnet-4.6",
  "promptTokens": 412,
  "completionTokens": 128,
  "totalTokens": 540,
  "streaming": true,
  "cacheHit": false,
  "estimatedCostUsd": 0,
  "timestamp": "2026-05-04T22:14:00Z"
}
```
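Any JetStream consumer can tap this stream alongside the built-in one (described below). A hedged subscriber sketch with `nats.go` — the subject naming is an assumption, but `nats.BindStream` pins the subscription to the `LLM_USAGE` stream regardless:

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// usageEvent mirrors the payload shown above (a subset of its fields).
type usageEvent struct {
	ExecutionID string `json:"executionId"`
	Model       string `json:"model"`
	TotalTokens int    `json:"totalTokens"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // or NATS_URL; embedded NATS works out of the box
	if err != nil {
		log.Fatal(err)
	}
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	// ">" matches every subject; the exact subject layout is an assumption.
	if _, err := js.Subscribe(">", func(m *nats.Msg) {
		var ev usageEvent
		if err := json.Unmarshal(m.Data, &ev); err == nil {
			log.Printf("%s (%s): %d tokens", ev.Model, ev.ExecutionID, ev.TotalTokens)
		}
	}, nats.BindStream("LLM_USAGE")); err != nil {
		log.Fatal(err)
	}
	select {} // block forever
}
```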
A consumer in `internal/llmusage` writes these to the database. The management API exposes them at `/api/v1/gateway/usage*`:
| Endpoint | Returns |
|---|---|
| `GET /api/v1/gateway/usage` | Totals per period |
| `GET /api/v1/gateway/usage/by-model` | Token + cost split by model |
| `GET /api/v1/gateway/usage/by-workflow` | Attribution to chat sessions, agent runs, workflow executions |
| `GET /api/v1/gateway/usage/timeline` | Time-series for charts |
| `GET /api/v1/gateway/usage/log` | Raw event log |
| `GET /api/v1/gateway/rate-limits` · `POST` · `DELETE /:id` | Manage per-team / per-user / per-project caps |
| `GET /api/v1/gateway/budgets` | Budget alerts and enforcement |
The Settings → AI Gateway page in the web UI is built on these endpoints.
## Pricing and cost reporting
In the OSS build, `pricing.CalculateCost` returns 0 because no cost calculator is registered (`internal/llmgateway/pricing/pricing.go`). Token counts are accurate; the `estimatedCostUsd` field will be zero. The Enterprise build registers a calculator with current per-model rates so the same field reflects real USD.
To add custom pricing in OSS, register a calculator at startup (e.g. from an `init` function):

```go
import "github.com/proxifai/proxifai-oss/internal/llmgateway/pricing"

pricing.CostCalculator = func(model string, prompt, completion int) float64 {
	// Your rate table; the per-token USD rates here are illustrative.
	if model == "gpt-4o" {
		return float64(prompt)*2.5e-6 + float64(completion)*10e-6
	}
	return 0 // unpriced models keep reporting zero cost
}
```
## Configuration reference

| Env var | Default | Effect |
|---|---|---|
| `LLM_GATEWAY_HMAC_SECRET` | falls back to `JWT_SECRET` | Used to verify `pfai_…` workflow tokens |
| `JWT_SECRET` | auto-generated on first boot | Doubles as the gateway HMAC secret when the dedicated var is unset |
| `NATS_URL` | embedded | Where usage events land; embedded NATS works out of the box |
Provider keys, model lists, and rate limits live in the database, not env vars — so configuration changes don’t require a restart.
## Endpoint reference

| Method · Path | Auth | Purpose |
|---|---|---|
| `GET /api/v1/llm/health` | none | Liveness probe |
| `GET /api/v1/llm/cache-stats` | none | Response cache hit/miss counters |
| `POST /api/v1/llm/v1/chat/completions` | required | OpenAI-format completions |
| `POST /api/v1/llm/v1/messages` | required | Anthropic-format messages |
## See also

- The user-facing chat modes (Ask, Plan, Code, Build) — all backed by this gateway.
- How the workflow runtime injects `pfai_` tokens into agent containers so they can call the gateway.
- The embedding pipeline that uses this gateway for embedding model calls.
- Where the gateway sits in the single-binary topology.