# AI Gateway
One OpenAI-compatible endpoint that fans out to OpenAI, Anthropic, and Google Gemini — with BYO API keys, automatic failover, and per-org usage tracking.
The AI Gateway is an OpenAI-compatible HTTP layer that sits in front of every LLM provider you’ve configured. Your applications, workflows, and agents call one endpoint, and the gateway routes each request to the right provider — falling back to alternates when one fails, enforcing rate limits, and emitting usage events for billing and analytics.
## How it works
The gateway is embedded in the main proxifai binary at `internal/llmgateway/embedded`. It's mounted under `/api/v1/llm` on the same HTTP server that serves the rest of the API — there's no separate gateway service to deploy. Every org configures providers in the `model_provider` table; the gateway resolves them at request time, caches the resolution for 60 seconds, and dispatches.
```
┌──────────────┐   POST /api/v1/llm/v1/chat/completions    ┌─────────────────┐
│ your client  │ ────────────────────────────────────────► │   AI Gateway    │
│ (curl, SDK,  │                                           │ (chi sub-router │
│  workflow,   │ ◄──────────────────────────────────────── │  in proxifai)   │
│  agent)      │         200 OK · streamed tokens          │                 │
└──────────────┘                                           └────────┬────────┘
                                                                    │
                                                                    ▼
                                                     ┌──────────┬──────────┬──────────┐
                                                     │  OpenAI  │ Anthropic│  Gemini  │
                                                     │   API    │   API    │   API    │
                                                     └──────────┴──────────┴──────────┘
```
When a request lands:
- Auth — `Authorization: Bearer <key>` (or Anthropic-style `x-api-key`) is verified against the configured static keys, or against an HMAC-signed `pfai_<execID>_<sig>` token issued by the workflow runtime.
- Org resolution — the org is read from the `X-Org-Id` header (set upstream by the main API) or defaults to `"default"`.
- Provider lookup — providers and model-to-provider mappings are loaded from the database (cached 60 s).
- BYO override — if the user has personal keys in `user_provider_key`, those get prepended to the routing chain so they're tried before org-level keys.
- Routing — for the requested model, the circuit breaker filters out providers currently in the open state; the rest are tried in order (see the sketch after this list).
- Retry — failed attempts retry up to 2 times (500 ms initial wait, 3 s max) before falling back to the next provider.
- Usage event — on success, an event is published to the `LLM_USAGE` NATS stream for downstream tracking.
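As a mental model, the routing portion of that flow looks roughly like the sketch below. This is a hedged illustration with made-up type and function names, not the actual code in `internal/llmgateway`:

```go
package gateway

import (
	"context"
	"errors"
	"time"
)

// Provider and Breaker are illustrative stand-ins for the real types in
// internal/llmgateway; this sketches the flow, not the actual implementation.
type Provider interface {
	Name() string
	Complete(ctx context.Context, req any) (any, error)
}

type Breaker interface{ Open(name string) bool }

func dispatch(ctx context.Context, br Breaker, userChain, orgChain []Provider, req any) (any, error) {
	// BYO override: user_provider_key entries are tried before org keys.
	chain := append(userChain, orgChain...)

	for _, p := range chain {
		if br.Open(p.Name()) {
			continue // circuit open: skip this provider
		}
		wait := 500 * time.Millisecond
		for attempt := 1; attempt <= 2; attempt++ {
			resp, err := p.Complete(ctx, req)
			if err == nil {
				return resp, nil // success: a usage event is published here
			}
			if attempt < 2 {
				time.Sleep(wait)
				wait *= 2
				if wait > 3*time.Second {
					wait = 3 * time.Second
				}
			}
		}
		// retries exhausted: fall back to the next provider in the chain
	}
	return nil, errors.New("gateway_error") // surfaces to the caller as 503
}
```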
## Supported providers
Four provider types are supported by the gateway runtime (`internal/llmgateway/dbprovider/resolver.go`):
| `provider_type` | Backed by | Notes |
|---|---|---|
| `openai` | api.openai.com (or any OpenAI-compatible endpoint via `base_url`) | Native OpenAI Chat Completions transport |
| `openai-compatible` | Any OpenAI-shaped API (vLLM, Ollama, OpenRouter, Together, Groq, …) | Same transport as `openai`, used semantically to flag a custom endpoint |
| `anthropic` | api.anthropic.com | Native Anthropic Messages transport |
| `gemini` | generativelanguage.googleapis.com | Google's Gemini API |
The default model catalog below is what each provider type can serve. To enable a subset, set the `models` JSONB column on the `model_provider` row to a JSON array of model IDs.
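For example, to expose only three OpenAI models from a given row, the `models` column would hold something like:

```json
["gpt-4o", "gpt-4o-mini", "o3-mini"]
```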
### Default model catalog
| Provider type | Models |
|---|---|
| `openai` | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o1, o1-mini, o3, o3-mini, o4-mini |
| `anthropic` | claude-haiku-4.5, claude-sonnet-4.5, claude-sonnet-4.6, claude-opus-4.6 |
| `gemini` | gemini-2.0-flash, gemini-2.0-pro, gemini-2.5-flash, gemini-2.5-pro |
| `openai-compatible` | Whatever you list in the `models` column — defined per row |
Out of the box that’s 19 models across three first-party providers. Adding an openai-compatible row lets you plug in any other model your endpoint exposes — Llama, Mistral, DeepSeek, Qwen, local Ollama instances, OpenRouter, etc. — under whatever model IDs you choose.
## Configuring a provider
Each row in `model_provider` represents one provider. Add them through the Settings → Model Providers UI or via the management API; the API key is encrypted at rest using the workspace's encryption key.
| Column | Purpose |
|---|---|
| `provider_type` | One of the four supported types above |
| `name` | Display name (e.g. “openai-prod”) |
| `api_key_encrypted` | Provider API key, encrypted via `internal/crypto` |
| `base_url` | Optional override; e.g. `https://openrouter.ai/api/v1` for OpenRouter |
| `models` | Optional JSON array of model IDs to expose; empty means “use default catalog” |
| `is_enabled` | The row is skipped if false |
After a write, call the management API’s invalidate endpoint or wait up to 60 seconds for the resolver cache to expire.
## Two API surfaces
The gateway accepts requests in either OpenAI or Anthropic format, regardless of which provider ultimately handles them. Pick the surface that matches the SDK you already have.
```bash
curl http://localhost:3000/api/v1/llm/v1/chat/completions \
  -H "Authorization: Bearer $PROXIFAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "stream": true
  }'
```

Any OpenAI SDK works by pointing `base_url` at `http://<your-host>/api/v1/llm/v1`.
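For instance, with the community Go SDK (`github.com/sashabaranov/go-openai`, shown here as one example; other OpenAI SDKs are configured the same way):

```go
package main

import (
	"context"
	"fmt"
	"os"

	openai "github.com/sashabaranov/go-openai"
)

func main() {
	// Point the SDK at the gateway instead of api.openai.com.
	cfg := openai.DefaultConfig(os.Getenv("PROXIFAI_API_KEY"))
	cfg.BaseURL = "http://localhost:3000/api/v1/llm/v1"
	client := openai.NewClientWithConfig(cfg)

	resp, err := client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
		Model: "claude-sonnet-4.6", // any configured model; the gateway routes it
		Messages: []openai.ChatCompletionMessage{
			{Role: openai.ChatMessageRoleUser, Content: "Explain the CAP theorem in two sentences."},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Choices[0].Message.Content)
}
```

The same request, sent to the Anthropic-format surface: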
```bash
curl http://localhost:3000/api/v1/llm/v1/messages \
  -H "x-api-key: $PROXIFAI_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "stream": true
  }'
```

The Anthropic SDK works by pointing `base_url` at `http://<your-host>/api/v1/llm`.
Both surfaces accept any model from any configured provider — you can ask for gpt-4o through the Anthropic-format endpoint and the gateway will translate. Streaming uses Server-Sent Events on both surfaces.
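For example, requesting an OpenAI model through the Anthropic-format surface with plain `net/http` (non-streaming here for brevity):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

func main() {
	// OpenAI model via the Anthropic-format endpoint; the gateway translates.
	body := `{
	  "model": "gpt-4o",
	  "max_tokens": 128,
	  "messages": [{"role": "user", "content": "One-line summary of Raft?"}]
	}`
	req, err := http.NewRequest("POST", "http://localhost:3000/api/v1/llm/v1/messages", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("x-api-key", os.Getenv("PROXIFAI_API_KEY"))
	req.Header.Set("anthropic-version", "2023-06-01")
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}
```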
## Authentication
| Key type | Header | Issued by | Use case |
|---|---|---|---|
| Static API key | `Authorization: Bearer …` or `x-api-key: …` | Workspace admin | Apps, scripts, manual testing |
| Workflow execution key (`pfai_<execID>_<sig>`) | Same | Workflow runtime | Auto-injected into agent containers as `PFAI_TOKEN`; HMAC-signed with `JWT_SECRET` |
Workflow keys carry the executing workflow’s identity, so usage events are attributed to the right run, and credit checks (Enterprise) can deduct against the right org.
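The exact token encoding lives in the gateway source; as a rough sketch, assuming the signature is an HMAC-SHA256 over the execution ID, hex-encoded (the real scheme may differ), minting and verifying would look like:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// mintToken builds a pfai_<execID>_<sig> token. The HMAC-SHA256-over-execID,
// hex-encoded scheme here is an assumption for illustration; check the
// gateway source for the real encoding.
func mintToken(execID string, secret []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(execID))
	return fmt.Sprintf("pfai_%s_%s", execID, hex.EncodeToString(mac.Sum(nil)))
}

// verifyToken recomputes the token and compares in constant time.
func verifyToken(token, execID string, secret []byte) bool {
	return hmac.Equal([]byte(token), []byte(mintToken(execID, secret)))
}

func main() {
	secret := []byte("value of LLM_GATEWAY_HMAC_SECRET or JWT_SECRET")
	tok := mintToken("exec_123", secret)
	fmt.Println(tok, verifyToken(tok, "exec_123", secret))
}
```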
## Resilience
The defaults wired up in `internal/llmgateway/embedded/embedded.go`:
| Mechanism | Setting | Behavior |
|---|---|---|
| Circuit breaker | 5 consecutive failures → open · 30 s reset · 1 half-open probe | Skips a provider that’s failing; auto-recovers when it succeeds again |
| Retry | 2 max attempts · 500 ms → 3 s exponential | Retries the same provider before falling back to the next in the chain |
| Rate limit | 120 req/min, burst 20, per API key | Returns 429 Too Many Requests past the limit |
| Response cache | 1000-entry LRU, 5 min TTL | Caches non-streaming completions keyed by the request body hash |
| Provider cache | 60 s TTL, per org | Avoids reading model_provider on every request |
If every provider for a model trips its circuit, the gateway returns 503 Service Unavailable with `{"error":{"type":"gateway_error"}}`. Apart from auth and rate-limit rejections, this is the only way a request fails outright — partial failures within the fallback chain are invisible to the caller.
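Callers that want to ride out a full outage can treat that 503 as retryable. A minimal client-side sketch (the backoff values here are arbitrary; the gateway's breakers reset after 30 s):

```go
package gatewayclient

import (
	"bytes"
	"net/http"
	"time"
)

// postWithRetry retries only on 503: the gateway answers with the
// gateway_error envelope once every circuit for the model is open.
func postWithRetry(url, key string, body []byte) (*http.Response, error) {
	for wait := time.Second; ; wait *= 2 {
		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Authorization", "Bearer "+key)
		req.Header.Set("Content-Type", "application/json")

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusServiceUnavailable || wait > 8*time.Second {
			return resp, nil // success, a non-retryable status, or out of patience
		}
		resp.Body.Close()
		time.Sleep(wait) // wait out the breaker reset window, then retry
	}
}
```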
## Bring your own key (BYOK)
BYOK is two-layered:
- Workspace-level — keys configured in `model_provider` apply to everyone in the org.
- User-level — a row in `user_provider_key` for `(user_id, org_id, provider_type)` overrides the workspace key for that user, and is prepended to the routing chain so it's tried first.
This lets individual contributors use their own enterprise OpenAI or Anthropic accounts (volume discounts, separate quotas, evaluation credits) while the team's shared key remains the fallback.
Set a user-level `openai-compatible` key with a `base_url` of `https://openrouter.ai/api/v1` and the gateway routes that user's traffic through OpenRouter — useful for trying a model that's not yet in the default catalog.
## Usage tracking
Successful requests publish a `usage.Event` to the `LLM_USAGE` JetStream stream:
```json
{
  "executionId": "exec_…",
  "provider": "anthropic_…",
  "model": "claude-sonnet-4.6",
  "promptTokens": 412,
  "completionTokens": 128,
  "totalTokens": 540,
  "streaming": true,
  "cacheHit": false,
  "estimatedCostUsd": 0,
  "timestamp": "2026-05-04T22:14:00Z"
}
```
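Any JetStream consumer can tap this stream alongside the built-in one (described below). A hedged subscriber sketch with `nats.go` — the subject naming is an assumption, but `nats.BindStream` pins the subscription to the `LLM_USAGE` stream regardless:

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// usageEvent mirrors the payload shown above (a subset of its fields).
type usageEvent struct {
	ExecutionID string `json:"executionId"`
	Model       string `json:"model"`
	TotalTokens int    `json:"totalTokens"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // or NATS_URL; embedded NATS works out of the box
	if err != nil {
		log.Fatal(err)
	}
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	// ">" matches every subject; the exact subject layout is an assumption.
	if _, err := js.Subscribe(">", func(m *nats.Msg) {
		var ev usageEvent
		if err := json.Unmarshal(m.Data, &ev); err == nil {
			log.Printf("%s (%s): %d tokens", ev.Model, ev.ExecutionID, ev.TotalTokens)
		}
	}, nats.BindStream("LLM_USAGE")); err != nil {
		log.Fatal(err)
	}
	select {} // block forever
}
```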
A consumer in `internal/llmusage` writes these to the database. The management API exposes them at `/api/v1/gateway/usage*`:
| Endpoint | Returns |
|---|---|
| `GET /api/v1/gateway/usage` | Totals per period |
| `GET /api/v1/gateway/usage/by-model` | Token + cost split by model |
| `GET /api/v1/gateway/usage/by-workflow` | Attribution to chat sessions, agent runs, workflow executions |
| `GET /api/v1/gateway/usage/timeline` | Time-series for charts |
| `GET /api/v1/gateway/usage/log` | Raw event log |
| `GET /api/v1/gateway/rate-limits` · `POST` · `DELETE /:id` | Manage per-team / per-user / per-project caps |
| `GET /api/v1/gateway/budgets` | Budget alerts and enforcement |
The Settings → AI Gateway page in the web UI is built on these endpoints.
## Pricing and cost reporting
In the OSS build, `pricing.CalculateCost` returns 0 because no cost calculator is registered (`internal/llmgateway/pricing/pricing.go`). Token counts are accurate; the `estimatedCostUsd` field will be zero. The Enterprise build registers a calculator with current per-model rates so the same field reflects real USD.
To add custom pricing in OSS, register a calculator at startup (e.g. from an `init` function):

```go
import "github.com/proxifai/proxifai-oss/internal/llmgateway/pricing"

pricing.CostCalculator = func(model string, prompt, completion int) float64 {
	// Your rate table; the per-token USD rates here are illustrative.
	if model == "gpt-4o" {
		return float64(prompt)*2.5e-6 + float64(completion)*10e-6
	}
	return 0 // unpriced models keep reporting zero cost
}
```
## Configuration reference

| Env var | Default | Effect |
|---|---|---|
| `LLM_GATEWAY_HMAC_SECRET` | falls back to `JWT_SECRET` | Used to verify `pfai_…` workflow tokens |
| `JWT_SECRET` | auto-generated on first boot | Doubles as the gateway HMAC secret when the dedicated var is unset |
| `NATS_URL` | embedded | Where usage events land; embedded NATS works out of the box |
Provider keys, model lists, and rate limits live in the database, not env vars — so configuration changes don’t require a restart.
## Endpoint reference

| Method · Path | Auth | Purpose |
|---|---|---|
| `GET /api/v1/llm/health` | none | Liveness probe |
| `GET /api/v1/llm/cache-stats` | none | Response cache hit/miss counters |
| `POST /api/v1/llm/v1/chat/completions` | required | OpenAI-format completions |
| `POST /api/v1/llm/v1/messages` | required | Anthropic-format messages |
## See also

- The user-facing chat modes (Ask, Plan, Code, Build) — all backed by this gateway.
- How the workflow runtime injects `pfai_` tokens into agent containers so they can call the gateway.
- The embedding pipeline that uses this gateway for embedding model calls.
- Where the gateway sits in the single-binary topology.