# Rate limits
The core API is unmetered; the LLM gateway runs a 120 req/min token bucket per API key. Standard `X-RateLimit-*` headers are not emitted.

Most ProxifAI endpoints are not rate-limited in OSS — Postgres is the only backpressure. The exception is the LLM gateway, which runs a token bucket per API key to prevent runaway agent loops from saturating provider quotas. There’s no `X-RateLimit-*` header surface; the gateway returns 429 with a JSON body when the bucket is empty and passes the request through otherwise.
## Where rate limits exist
| Surface | Limit | Configurable | Enforced by |
|---|---|---|---|
| Core REST API | None | — | — |
| LLM gateway | 120 req/min, burst 20, per API key | Yes — see below | `internal/llmgateway/middleware/ratelimit.go` |
| Inbound webhooks | None at platform level | — | Provider-specific (e.g. Slack signing rate) |
| MCP server | None | — | — |
| Git protocol | None | — | Postgres-bound |
If you’re rate-limited during a `POST /api/v1/issues` storm, the bottleneck is your database, not a middleware. The right answer is to scale Postgres or batch differently; there’s no platform knob to turn.
## LLM gateway specifics
Defaults are configured in `internal/llmgateway/embedded/embedded.go`:

```go
ratelimit.New(120, 20) // 120 requests/minute, burst 20
```
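Those defaults translate to a refill rate of 2 tokens/sec (120/min) with a burst capacity of 20. A minimal sketch of the same semantics — hypothetical code, not the actual middleware in `internal/llmgateway/middleware/ratelimit.go`:

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a toy token bucket: capacity = burst, refilling at perMin/60 tokens per second.
type bucket struct {
	tokens   float64
	capacity float64
	perSec   float64
	last     time.Time
}

func newBucket(perMin, burst int) *bucket {
	return &bucket{
		tokens:   float64(burst),
		capacity: float64(burst),
		perSec:   float64(perMin) / 60,
		last:     time.Now(),
	}
}

// allow refills based on elapsed time, then spends one token if available.
func (b *bucket) allow(now time.Time) bool {
	b.tokens += now.Sub(b.last).Seconds() * b.perSec
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	b := newBucket(120, 20) // gateway defaults
	start := time.Now()
	ok := 0
	for i := 0; i < 30; i++ { // 30 back-to-back requests with no time elapsed
		if b.allow(start) {
			ok++
		}
	}
	fmt.Println(ok) // the burst of 20 passes; the remaining 10 are rejected
}
```

The practical takeaway: you can fire 20 requests instantly, after which you’re paced at 2/sec.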
The bucket is keyed on the API key (or `pfai_<execID>` for workflow tokens) — every distinct key gets its own bucket. When the bucket is empty, the gateway returns:
```
HTTP/1.1 429 Too Many Requests
Content-Type: application/json

{"error":{"message":"rate limit exceeded","type":"rate_limit_error"}}
```
There’s no `Retry-After` header by default. Back off ~500 ms and retry; the bucket refills at 2 tokens/sec.
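A client-side retry wrapper matching that guidance might look like the following sketch. The `do` callback is a stand-in for your actual HTTP call; the wrapper itself only looks at the status code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryOn429 calls do() and retries on HTTP 429 with a fixed backoff
// (~500 ms in production, matching the 2 tokens/sec refill).
// Transport errors and non-429 statuses are returned immediately.
func retryOn429(do func() (int, error), maxRetries int, backoff time.Duration) (int, error) {
	var status int
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		status, err = do()
		if err != nil || status != 429 {
			return status, err
		}
		time.Sleep(backoff)
	}
	return status, errors.New("rate limited: retries exhausted")
}

func main() {
	// Stub transport: rejected twice, then accepted.
	calls := 0
	do := func() (int, error) {
		calls++
		if calls <= 2 {
			return 429, nil
		}
		return 200, nil
	}
	status, err := retryOn429(do, 5, time.Millisecond) // short backoff for the demo
	fmt.Println(status, err, calls)                    // succeeds on the 3rd call
}
```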
## Configurable per-key limits
Beyond the gateway-wide default, you can set finer-grained limits via the gateway management API:
```shell
pfai gateway rate-limits set \
  --scope user --user [email protected] \
  --requests-per-min 60 --burst 10
```
| Scope | Granularity |
|---|---|
| `org` | Whole org’s traffic share (additive on top of gateway-wide) |
| `team` | Specific team |
| `project` | Specific project |
| `user` | Specific user — useful for individual agents/contributors |
| `key` | Specific PAT |
Limits compose conjunctively: every request must clear every applicable bucket, so the tightest scope wins.
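That composition rule can be sketched as a conjunction over all applicable buckets (hypothetical types; the real enforcement lives in the gateway middleware):

```go
package main

import "fmt"

// limiter is any per-scope bucket (org, team, project, user, key).
type limiter interface{ Allow() bool }

// fixed is a toy bucket with a flat remaining count, standing in for a real token bucket.
type fixed struct{ remaining int }

func (f *fixed) Allow() bool {
	if f.remaining <= 0 {
		return false
	}
	f.remaining--
	return true
}

// allowAll admits a request only if every applicable scope's bucket has
// capacity — the tightest scope is always the effective limit.
func allowAll(buckets ...limiter) bool {
	for _, b := range buckets {
		if !b.Allow() {
			return false
		}
	}
	return true
}

func main() {
	org := &fixed{remaining: 100}
	user := &fixed{remaining: 1}     // tightest scope
	fmt.Println(allowAll(org, user)) // true: both buckets have capacity
	fmt.Println(allowAll(org, user)) // false: the user bucket is exhausted
}
```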
REST surface:
| Method · Path | Purpose |
|---|---|
| `GET /api/v1/gateway/rate-limits` | List all configured limits |
| `POST /api/v1/gateway/rate-limits` | Create — `{scope, scopeId, requestsPerMin, burst}` |
| `DELETE /api/v1/gateway/rate-limits/{id}` | Remove |
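Building the create call from Go might look like this sketch. The base URL and the bearer-token auth scheme are placeholders (assumptions, not confirmed by this page); only the path and body shape come from the table above:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// rateLimitSpec mirrors the documented create body: {scope, scopeId, requestsPerMin, burst}.
type rateLimitSpec struct {
	Scope          string `json:"scope"`
	ScopeID        string `json:"scopeId"`
	RequestsPerMin int    `json:"requestsPerMin"`
	Burst          int    `json:"burst"`
}

// newCreateRequest marshals the spec and builds the POST request without sending it.
func newCreateRequest(baseURL, token string, spec rateLimitSpec) (*http.Request, error) {
	body, err := json.Marshal(spec)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/gateway/rate-limits", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token) // assumed auth scheme
	return req, nil
}

func main() {
	spec := rateLimitSpec{Scope: "user", ScopeID: "[email protected]", RequestsPerMin: 60, Burst: 10}
	req, err := newCreateRequest("https://pfai.example.com", "TOKEN", spec)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path) // POST /api/v1/gateway/rate-limits
}
```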
## Headers
The gateway does not emit standard `X-RateLimit-Limit` / `X-RateLimit-Remaining` / `X-RateLimit-Reset` headers in OSS. Track quota client-side (you know your key’s limit) or query `GET /api/v1/gateway/usage/timeline` for retrospective analysis.
CORS allows `X-Request-ID` and `Idempotency-Key` for the instances API, but neither relates to rate-limit signaling.
## Best practices
- Cache reads. Issues, projects, and workflows rarely change; cache the response with the `updatedAt` timestamp and refetch only on staleness.
- Use streaming for chat. SSE streams from `/api/v1/llm/v1/chat/completions` count as one request — much better than chunked polling.
- Batch where supported. `DELETE /api/v1/issues` (no ID) accepts a body of IDs to delete in bulk, saving N round-trips.
- Use `pfai wait` for async. Polling a workflow execution every 100 ms wastes both client and server cycles; `pfai wait` and the SSE event stream do the right thing without a timer.
- Respect circuit breakers. When the gateway returns `503` because every provider for a model has tripped, don’t retry the same model — try a different one or back off.