AI Gateway

One OpenAI-compatible endpoint that fans out to OpenAI, Anthropic, and Google Gemini — with BYO API keys, automatic failover, and per-org usage tracking.

The AI Gateway is an OpenAI-compatible HTTP layer that sits in front of every LLM provider you’ve configured. Your applications, workflows, and agents call one endpoint, and the gateway routes each request to the right provider — falling back to alternates when one fails, enforcing rate limits, and emitting usage events for billing and analytics.

How it works

The gateway is embedded in the main proxifai binary at internal/llmgateway/embedded. It’s mounted under /api/v1/llm on the same HTTP server that serves the rest of the API — there’s no separate gateway service to deploy. Every org configures providers in the model_provider table; the gateway resolves them at request time, caches the resolution for 60 seconds, and dispatches.

┌──────────────┐   POST /api/v1/llm/v1/chat/completions   ┌─────────────────┐
│ your client  │ ───────────────────────────────────────► │   AI Gateway    │
│ (curl, SDK,  │                                          │  (chi sub-router│
│  workflow,   │ ◄─────────────────────────────────────── │   in proxifai)  │
│  agent)      │   200 OK · streamed tokens               │                 │
└──────────────┘                                          └────────┬────────┘
                                                                   │
                                                                   │
                                             ┌──────────┬──────────┼──────────┐
                                             │  OpenAI  │ Anthropic│  Gemini  │
                                             │   API    │   API    │   API    │
                                             └──────────┴──────────┴──────────┘
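
There is no separate service to run: conceptually the gateway is just another sub-router mounted on the main API server. A minimal sketch of that pattern with chi (the route and handler below are placeholders, not the real internal/llmgateway/embedded API):

import (
    "net/http"

    "github.com/go-chi/chi/v5"
)

func main() {
    r := chi.NewRouter()

    // Placeholder sub-router standing in for whatever the embedded gateway exposes.
    gw := chi.NewRouter()
    gw.Get("/health", func(w http.ResponseWriter, _ *http.Request) {
        w.Write([]byte("ok"))
    })

    // Mounted under /api/v1/llm on the same HTTP server as the rest of the API.
    r.Mount("/api/v1/llm", gw)

    http.ListenAndServe(":3000", r)
}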

When a request lands:

  1. Auth — Authorization: Bearer <key> (or Anthropic-style x-api-key) is verified against the configured static keys, or against an HMAC-signed pfai_<execID>_<sig> token issued by the workflow runtime (a verification sketch follows this list).
  2. Org resolution — the org is read from the X-Org-Id header (set upstream by the main API) or defaults to "default".
  3. Provider lookup — providers + model-to-provider mappings are loaded from the database (cached 60s).
  4. BYO override — if the user has personal keys in user_provider_key, those get prepended to the routing chain so they’re tried before org-level keys.
  5. Routing — for the requested model, the circuit breaker filters out providers currently in the open state; the rest are tried in order.
  6. Retry — failed attempts retry up to 2 times (500 ms initial wait, 3 s max) before falling back to the next provider.
  7. Usage event — on success, an event is published to the LLM_USAGE NATS stream for downstream tracking.
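
For the workflow tokens in step 1, verification can be sketched roughly as follows. The HMAC-SHA256 hash, the hex encoding, and signing only the execution ID are assumptions for illustration, not the gateway's documented scheme; check the gateway source before relying on this:

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "strings"
)

// verifyWorkflowToken checks a pfai_<execID>_<sig> token against the HMAC
// secret (LLM_GATEWAY_HMAC_SECRET, falling back to JWT_SECRET).
func verifyWorkflowToken(token, secret string) (string, bool) {
    parts := strings.Split(token, "_")
    if len(parts) != 3 || parts[0] != "pfai" {
        return "", false
    }
    execID, sig := parts[1], parts[2]
    mac := hmac.New(sha256.New, []byte(secret))
    mac.Write([]byte(execID))
    want := hex.EncodeToString(mac.Sum(nil))
    return execID, hmac.Equal([]byte(sig), []byte(want))
}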

Supported providers

Four provider types are supported by the gateway runtime (internal/llmgateway/dbprovider/resolver.go):

| provider_type | Backed by | Notes |
| --- | --- | --- |
| openai | api.openai.com (or any OpenAI-compatible endpoint via base_url) | Native OpenAI Chat Completions transport |
| openai-compatible | Any OpenAI-shaped API (vLLM, Ollama, OpenRouter, Together, Groq, …) | Same transport as openai, used semantically to flag a custom endpoint |
| anthropic | api.anthropic.com | Native Anthropic Messages transport |
| gemini | generativelanguage.googleapis.com | Google’s Gemini API |

The default model catalog below lists what each provider type can serve out of the box. To expose only a subset, set the models JSONB column on the model_provider row to a JSON array of model IDs, e.g. ["gpt-4o", "gpt-4o-mini"].

Default model catalog

| Provider type | Models |
| --- | --- |
| openai | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o1, o1-mini, o3, o3-mini, o4-mini |
| anthropic | claude-haiku-4.5, claude-sonnet-4.5, claude-sonnet-4.6, claude-opus-4.6 |
| gemini | gemini-2.0-flash, gemini-2.0-pro, gemini-2.5-flash, gemini-2.5-pro |
| openai-compatible | Whatever you list in the models column — defined per row |

Out of the box that’s 19 models across three first-party providers. Adding an openai-compatible row lets you plug in any other model your endpoint exposes — Llama, Mistral, DeepSeek, Qwen, local Ollama instances, OpenRouter, etc. — under whatever model IDs you choose.

Configuring a provider

Each row in model_provider represents one provider. Add them through the Settings → Model Providers UI or via the management API; the API key is encrypted at rest using the workspace’s encryption key.

| Column | Purpose |
| --- | --- |
| provider_type | One of the four supported types above |
| name | Display name (e.g. “openai-prod”) |
| api_key_encrypted | Provider API key, encrypted via internal/crypto |
| base_url | Optional override; e.g. https://openrouter.ai/api/v1 for OpenRouter |
| models | Optional JSON array of model IDs to expose; empty means “use the default catalog” |
| is_enabled | The row is skipped when false |

After a write, call the management API’s invalidate endpoint or wait up to 60 seconds for the resolver cache to expire.

Two API surfaces

The gateway accepts requests in either OpenAI or Anthropic format, regardless of which provider ultimately handles them. Pick the surface that matches the SDK you already have.

curl http://localhost:3000/api/v1/llm/v1/chat/completions \
  -H "Authorization: Bearer $PROXIFAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "stream": true
  }'

Any OpenAI SDK works by pointing base_url at http://<your-host>/api/v1/llm/v1.
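
For example, a minimal sketch with the community github.com/sashabaranov/go-openai client; any OpenAI-compatible SDK behaves the same way, and the localhost URL assumes a local install:

import (
    "context"
    "fmt"
    "os"

    openai "github.com/sashabaranov/go-openai"
)

func main() {
    // Point a stock OpenAI client at the gateway instead of api.openai.com.
    cfg := openai.DefaultConfig(os.Getenv("PROXIFAI_API_KEY"))
    cfg.BaseURL = "http://localhost:3000/api/v1/llm/v1"
    client := openai.NewClientWithConfig(cfg)

    resp, err := client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
        Model: "claude-sonnet-4.6", // any model from any configured provider
        Messages: []openai.ChatCompletionMessage{
            {Role: openai.ChatMessageRoleUser, Content: "Explain the CAP theorem in two sentences."},
        },
    })
    if err != nil {
        panic(err)
    }
    fmt.Println(resp.Choices[0].Message.Content)
}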

curl http://localhost:3000/api/v1/llm/v1/messages \
  -H "x-api-key: $PROXIFAI_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "stream": true
  }'

The Anthropic SDK works by pointing base_url at http://<your-host>/api/v1/llm.

Both surfaces accept any model from any configured provider — you can ask for gpt-4o through the Anthropic-format endpoint and the gateway will translate. Streaming uses Server-Sent Events on both surfaces.
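
To consume the stream programmatically rather than as raw SSE, the same go-openai client shown above exposes a streaming variant; the model and prompt here are arbitrary:

import (
    "context"
    "errors"
    "fmt"
    "io"
    "os"

    openai "github.com/sashabaranov/go-openai"
)

func main() {
    cfg := openai.DefaultConfig(os.Getenv("PROXIFAI_API_KEY"))
    cfg.BaseURL = "http://localhost:3000/api/v1/llm/v1"
    client := openai.NewClientWithConfig(cfg)

    stream, err := client.CreateChatCompletionStream(context.Background(), openai.ChatCompletionRequest{
        Model: "gpt-4o",
        Messages: []openai.ChatCompletionMessage{
            {Role: openai.ChatMessageRoleUser, Content: "Explain the CAP theorem in two sentences."},
        },
    })
    if err != nil {
        panic(err)
    }
    defer stream.Close()

    // Each Recv() returns one SSE chunk; io.EOF marks the end of the stream.
    for {
        chunk, err := stream.Recv()
        if errors.Is(err, io.EOF) {
            break
        }
        if err != nil {
            panic(err)
        }
        fmt.Print(chunk.Choices[0].Delta.Content)
    }
}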

Authentication

| Key type | Header | Issued by | Use case |
| --- | --- | --- | --- |
| Static API key | Authorization: Bearer … or x-api-key: … | Workspace admin | Apps, scripts, manual testing |
| Workflow execution key (pfai_<execID>_<sig>) | Same | Workflow runtime | Auto-injected into agent containers as PFAI_TOKEN; HMAC-signed with JWT_SECRET |

Workflow keys carry the executing workflow’s identity, so usage events are attributed to the right run, and credit checks (Enterprise) can deduct against the right org.

Resilience

The defaults wired up in internal/llmgateway/embedded/embedded.go:

| Mechanism | Setting | Behavior |
| --- | --- | --- |
| Circuit breaker | 5 consecutive failures → open · 30 s reset · 1 half-open probe | Skips a provider that’s failing; auto-recovers when it succeeds again |
| Retry | 2 max attempts · 500 ms → 3 s exponential | Retries the same provider before falling back to the next in the chain |
| Rate limit | 120 req/min, burst 20, per API key | Returns 429 Too Many Requests past the limit |
| Response cache | 1000-entry LRU, 5 min TTL | Caches non-streaming completions keyed by the request body hash |
| Provider cache | 60 s TTL, per org | Avoids reading model_provider on every request |

If every provider for a model trips its circuit, the gateway returns 503 Service Unavailable with {"error":{"type":"gateway_error"}}. This is the only place a request fails outright — partial failures within the fallback chain are transparent to the caller.
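
Callers should still be prepared for those two status codes. A minimal client-side backoff sketch; the three-attempt budget and doubling sleep are arbitrary choices, not gateway behavior:

import (
    "bytes"
    "net/http"
    "time"
)

// postWithBackoff retries a gateway call on 429 (rate limited) and
// 503 (every provider circuit open), then gives up after three attempts.
func postWithBackoff(url, apiKey string, body []byte) (*http.Response, error) {
    wait := time.Second
    for attempt := 0; ; attempt++ {
        req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
        if err != nil {
            return nil, err
        }
        req.Header.Set("Authorization", "Bearer "+apiKey)
        req.Header.Set("Content-Type", "application/json")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        retryable := resp.StatusCode == http.StatusTooManyRequests ||
            resp.StatusCode == http.StatusServiceUnavailable
        if !retryable || attempt == 2 {
            return resp, nil
        }
        resp.Body.Close()
        time.Sleep(wait)
        wait *= 2
    }
}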

Bring your own key (BYOK)

BYOK is two-layered:

  • Workspace-level — keys configured in model_provider apply to everyone in the org.
  • User-level — a row in user_provider_key for (user_id, org_id, provider_type) overrides the workspace key for that user, and is prepended to the routing chain so it’s tried first.

This lets individual contributors use their own enterprise/Anthropic/OpenAI accounts (volume discounts, separate quotas, evaluation credits) while the team’s shared key remains the fallback.

Set a user-level OpenAI-compatible key with a base_url of https://openrouter.ai/api/v1 and the gateway routes that user’s traffic through OpenRouter — useful for trying a model that’s not yet in the default catalog.

Usage tracking

Successful requests publish a usage.Event to the LLM_USAGE JetStream stream:

{
  "executionId": "exec_…",
  "provider": "anthropic_…",
  "model": "claude-sonnet-4.6",
  "promptTokens": 412,
  "completionTokens": 128,
  "totalTokens": 540,
  "streaming": true,
  "cacheHit": false,
  "estimatedCostUsd": 0,
  "timestamp": "2026-05-04T22:14:00Z"
}
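
You can also tap the stream yourself, for example to ship token counts to an external metrics system. A sketch with the nats.go client; the durable name and the empty-subject bind are choices made for this sketch, not documented parts of the stream:

import (
    "fmt"
    "os"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(os.Getenv("NATS_URL"))
    if err != nil {
        panic(err)
    }
    defer nc.Drain()

    js, err := nc.JetStream()
    if err != nil {
        panic(err)
    }

    // Bind to the LLM_USAGE stream; the durable name is just an example.
    _, err = js.Subscribe("", func(m *nats.Msg) {
        fmt.Printf("usage event: %s\n", m.Data)
        m.Ack()
    }, nats.BindStream("LLM_USAGE"), nats.Durable("usage-export"))
    if err != nil {
        panic(err)
    }

    select {} // keep consuming until the process is stopped
}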

A consumer in internal/llmusage writes these to the database. The management API exposes them at /api/v1/gateway/usage*:

| Endpoint | Returns |
| --- | --- |
| GET /api/v1/gateway/usage | Totals per period |
| GET /api/v1/gateway/usage/by-model | Token + cost split by model |
| GET /api/v1/gateway/usage/by-workflow | Attribution to chat sessions, agent runs, workflow executions |
| GET /api/v1/gateway/usage/timeline | Time-series for charts |
| GET /api/v1/gateway/usage/log | Raw event log |
| GET /api/v1/gateway/rate-limits · POST · DELETE /:id | Manage per-team / per-user / per-project caps |
| GET /api/v1/gateway/budgets | Budget alerts and enforcement |

The Settings → AI Gateway page in the web UI is built on these endpoints.
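
To pull the same numbers programmatically, a minimal sketch; it assumes the management API accepts a bearer token, which you should confirm against your deployment's auth setup:

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

func main() {
    req, err := http.NewRequest(http.MethodGet, "http://localhost:3000/api/v1/gateway/usage", nil)
    if err != nil {
        panic(err)
    }
    // Assumption: the management API takes the same bearer-style credential.
    req.Header.Set("Authorization", "Bearer "+os.Getenv("PROXIFAI_API_KEY"))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body)) // usage totals per period, as JSON
}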

Pricing and cost reporting

In the OSS build, pricing.CalculateCost returns 0 because no cost calculator is registered (internal/llmgateway/pricing/pricing.go). Token counts are accurate; the estimatedCostUsd field will be zero. The Enterprise build registers a calculator with current per-model rates so the same field reflects real USD.

To add custom pricing in OSS, register a calculator at startup:

import "github.com/proxifai/proxifai-oss/internal/llmgateway/pricing"

pricing.CostCalculator = func(model string, prompt, completion int) float64 {
    // your rate table
}

Configuration reference

| Env var | Default | Effect |
| --- | --- | --- |
| LLM_GATEWAY_HMAC_SECRET | falls back to JWT_SECRET | Used to verify pfai_… workflow tokens |
| JWT_SECRET | auto-generated on first boot | Doubles as the gateway HMAC secret when the dedicated var is unset |
| NATS_URL | embedded | Where usage events land; embedded NATS works out of the box |

Provider keys, model lists, and rate limits live in the database, not env vars — so configuration changes don’t require a restart.

Endpoint reference

| Method · Path | Auth | Purpose |
| --- | --- | --- |
| GET /api/v1/llm/health | none | Liveness probe |
| GET /api/v1/llm/cache-stats | none | Response cache hit/miss counters |
| POST /api/v1/llm/v1/chat/completions | required | OpenAI-format completions |
| POST /api/v1/llm/v1/messages | required | Anthropic-format messages |

See also