Core Concepts

Rate Limits

Alloovium runs a per-credential token bucket. Each capability has a cost in tokens; each tier has a fill rate and a maximum burst capacity. Stay within your bucket and you'll never see a 429.

How the bucket works

Every API key and OAuth credential has its own token bucket. Tokens refill continuously at the tier's per-minute rate, up to the tier's burst capacity. When a request arrives, the capability's cost is deducted from the bucket. If the bucket does not have enough tokens, the request is rejected with a 429 status code and must be retried later.
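To make the mechanics concrete, here is a minimal client-side model of such a bucket. It is a sketch for reasoning about your own traffic, not Alloovium's server implementation; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Illustrative client-side model of a per-credential bucket."""
    rate_per_minute: float   # tier fill rate
    capacity: float          # tier burst capacity
    tokens: float            # current balance
    updated_at: float = 0.0  # time of the last refill, in seconds

    def refill(self, now: float) -> None:
        # Tokens accrue continuously and are capped at the burst capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.rate_per_minute / 60.0)
        self.updated_at = now

    def try_spend(self, cost: float, now: float) -> bool:
        # True if the call fits in the bucket; False models a 429.
        self.refill(now)
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


# standard tier: 60 tokens/minute, burst 120; chat.ask costs 10
bucket = TokenBucket(rate_per_minute=60, capacity=120, tokens=120)
accepted = sum(bucket.try_spend(10, now=0.0) for _ in range(15))
print(accepted)  # number of back-to-back chat.ask calls that fit in a full bucket
```

Passing the clock in as `now` keeps the model deterministic, which makes it easy to replay a planned call schedule before sending real traffic.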

Token-bucket, not fixed window

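A fixed-window limit of 60/min would allow 120 calls at the edge of a window: 60 at the end of one window and 60 more at the start of the next. A token bucket with rate 60 and burst 120 makes that allowance explicit and bounded: you can spend up to 120 tokens at once, but sustained traffic averages out to 60 tokens per minute.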
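The averaging claim follows from simple arithmetic: starting from a full bucket, the most you can spend in a period is the burst plus whatever refills during it. The helper below is a hypothetical illustration of that bound.

```python
def max_tokens_spendable(rate_per_minute: float, burst: float, minutes: float) -> float:
    """Upper bound on tokens spendable in `minutes`, starting from a full bucket."""
    return burst + rate_per_minute * minutes


# standard tier: rate 60, burst 120
for minutes in (1, 10, 60):
    total = max_tokens_spendable(60, 120, minutes)
    print(f"{minutes:>2} min: {total:.0f} tokens, {total / minutes:.0f} per minute on average")
```

Over one minute the bound is 180 tokens, over ten minutes it is 720 (72 per minute), and over an hour it is 3720 (62 per minute), so the long-run average approaches the fill rate.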

Tiers

Tier	Tokens / minute	Burst capacity	Typical use
free	10	20	Hobby projects and prototyping
standard	60	120	Most SaaS integrations
pro	300	600	High-volume automations and ETL
enterprise	1500	3000	Platform integrations, custom SLAs

The tier is set on each API key at creation time and can be upgraded from the dashboard. OAuth credentials inherit the tier of the user's tenant plan.

Capability costs

Cheap reads cost 1 token, and creating a project costs 2. Searches and structured queries cost 5. Chat and other expensive LLM calls cost 10, as do document uploads. Workflow runs and template fills cost 25. Use this table to budget:

Capability	Cost
meta.whoami	1
vault.list_projects	1
vault.get_project	1
vault.create_project	2
vault.list_documents	1
vault.get_document	1
vault.upload_document	10
vault.search	5
chat.list_conversations	1
chat.get_conversation	1
chat.ask	10
chat.ask_stream	10
templates.start_fill	25
templates.get_fill_status	1
workflows.list	1
workflows.run	25
workflows.get_run_status	1
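As a rough budgeting aid, the sketch below totals the cost of a planned workload and estimates how long a tier needs to absorb it. The costs and tier figures are copied from the tables above; the function names are hypothetical, and the estimate assumes a full bucket at the start and no other traffic on the credential.

```python
# Subset of the capability cost table above.
COSTS = {
    "vault.get_document": 1,
    "vault.upload_document": 10,
    "vault.search": 5,
    "chat.ask": 10,
    "workflows.run": 25,
    "workflows.get_run_status": 1,
}

# tier -> (tokens per minute, burst capacity), from the tier table above.
TIERS = {
    "free": (10, 20),
    "standard": (60, 120),
    "pro": (300, 600),
    "enterprise": (1500, 3000),
}


def workload_cost(calls: dict[str, int]) -> int:
    """Total tokens for a mapping of capability name -> number of calls."""
    return sum(COSTS[name] * count for name, count in calls.items())


def minutes_to_finish(total_cost: int, tier: str) -> float:
    """Estimated minutes until the last call can be sent, starting from a full bucket."""
    rate, burst = TIERS[tier]
    return max(0.0, (total_cost - burst) / rate)


plan = {"vault.upload_document": 50, "vault.search": 20}
total = workload_cost(plan)
print(total, minutes_to_finish(total, "standard"), minutes_to_finish(total, "pro"))
```

For this plan the total is 600 tokens: the standard tier needs about 8 minutes beyond its initial burst, while the pro tier's burst of 600 covers it outright.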

Response headers

Every response — success or rate-limited — carries these headers:

Header	Meaning
X-RateLimit-Limit	Burst capacity of your bucket
X-RateLimit-Remaining	Tokens left after this call was accounted for
X-RateLimit-Reset	Unix timestamp when the bucket is expected to be full
X-RateLimit-Cost	Tokens this particular call deducted
Retry-After	Seconds until you can retry (only on 429)

Watch X-RateLimit-Remaining proactively and back off before you hit zero. You will get cleaner behavior than by reacting to 429s.
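One way to act on these headers is to estimate, before each call, how long to pause so the next call fits. The sketch below infers the refill rate from how far the bucket is from full and when it is expected to be full; that inference and the function name are assumptions, not documented behavior.

```python
from typing import Mapping


def pause_before_next_call(headers: Mapping[str, str], next_cost: float, now: float) -> float:
    """Seconds to wait so that a call costing `next_cost` tokens should fit."""
    limit = float(headers["X-RateLimit-Limit"])
    remaining = float(headers["X-RateLimit-Remaining"])
    reset_at = float(headers["X-RateLimit-Reset"])

    if remaining >= next_cost:
        return 0.0  # enough tokens already

    missing = limit - remaining
    seconds_to_full = reset_at - now
    if missing <= 0 or seconds_to_full <= 0:
        return 0.0  # bucket is (about to be) full; nothing to wait for

    # Assumed: refill is linear between now and the reset timestamp.
    refill_per_second = missing / seconds_to_full
    return (next_cost - remaining) / refill_per_second


now = 1_700_000_000.0
headers = {
    "X-RateLimit-Limit": "120",
    "X-RateLimit-Remaining": "4",
    "X-RateLimit-Reset": "1700000116",
    "X-RateLimit-Cost": "10",
}
print(pause_before_next_call(headers, next_cost=10, now=now))
```

In this example 116 tokens are missing and the bucket is full in 116 seconds, so the inferred rate is one token per second and a cost-10 call needs a pause of about 6 seconds.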

Hitting the limit

When you run out of tokens, the API returns 429 with this envelope:

json
{
  "type": "https://api.alloovium.com/errors/rate_limited",
  "title": "Rate limit exceeded",
  "status": 429,
  "detail": "Rate limit exceeded. Retry in 12 seconds.",
  "code": "rate_limited",
  "retry_after_seconds": 12
}

The Retry-After header carries the same value. Respect it. Do not retry immediately in a tight loop: it will just extend the backoff and waste your quota.
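A small helper can pick the wait time from a 429 response, preferring the Retry-After header and falling back to the retry_after_seconds field of the envelope. This is a sketch: the function name and the 1-second default are assumptions, and it expects Retry-After to be a number of seconds, as described in the header table.

```python
import json
from typing import Mapping


def retry_delay_seconds(status_code: int, headers: Mapping[str, str],
                        body_text: str, default: float = 1.0) -> float:
    """Seconds to wait before retrying; 0.0 if the response is not a 429."""
    if status_code != 429:
        return 0.0

    header_value = headers.get("Retry-After")
    if header_value is not None:
        try:
            return max(0.0, float(header_value))
        except ValueError:
            pass  # malformed header: fall through to the body

    try:
        payload = json.loads(body_text)
        return max(0.0, float(payload["retry_after_seconds"]))
    except (ValueError, KeyError, TypeError):
        return default  # neither source was usable


envelope = json.dumps({
    "type": "https://api.alloovium.com/errors/rate_limited",
    "title": "Rate limit exceeded",
    "status": 429,
    "detail": "Rate limit exceeded. Retry in 12 seconds.",
    "code": "rate_limited",
    "retry_after_seconds": 12,
})
print(retry_delay_seconds(429, {"Retry-After": "12"}, envelope))
```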

Retry strategy

The recommended strategy for any client:

On 429, sleep for Retry-After seconds and retry.
On 5xx, retry with exponential backoff: start at 1s, double up to 30s, add 10% jitter. Cap at five attempts.
On every retry, keep the same Idempotency-Key header so the API can replay the original response instead of executing twice. See Idempotency.
Do not retry 4xx errors other than 429; those are not transient.

Example retry loop (Python, using httpx)

python
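import random
import time

import httpx

API_KEY = "YOUR_API_KEY"  # placeholder: load this from your own configuration


def call_with_retry(client: httpx.Client, method, url, *, body=None, idem_key=None, max_retries=5):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    if idem_key:
        # Reuse the same key on every attempt so a retry is replayed, not re-executed.
        headers["Idempotency-Key"] = idem_key

    for attempt in range(max_retries):
        resp = client.request(method, url, json=body, headers=headers)
        if resp.status_code < 400:
            return resp.json()
        if resp.status_code == 429:
            # Honor the server's hint; fall back to 1s if the header is missing or malformed.
            try:
                delay = float(resp.headers.get("Retry-After", "1"))
            except ValueError:
                delay = 1.0
            time.sleep(max(0.0, delay))
            continue
        if 500 <= resp.status_code < 600:
            # Exponential backoff: 1s, 2s, 4s, ... capped at 30s, plus up to 10% jitter.
            delay = min(30, 2 ** attempt)
            time.sleep(delay + random.uniform(0, 0.1 * delay))
            continue
        # 4xx other than 429: not transient, do not retry
        resp.raise_for_status()
    raise RuntimeError("max retries exhausted")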

Quota engineering tips

Batch with search (cost 5) instead of multiple get calls (cost 1 each) when you need content. The trade pays off once one search replaces more than five gets.
For workflows, poll status every 2–5 seconds — do not tight-loop.
If you're fan-out ingesting, run upload calls concurrently up to formula, then rest until the bucket refills.
Use meta.whoami (cost 1) as a health probe, not the more expensive endpoints.
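The polling tip can be sketched as a fixed-interval loop. Everything here is illustrative: `poll_until`, its parameters, and the status values in the usage example are hypothetical, and the caller supplies the function that actually fetches the run status. Note that each status check costs 1 token, so polling every 3 seconds spends 20 tokens per minute, which is more than the free tier refills.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(get_status: Callable[[], T], is_done: Callable[[T], bool], *,
               interval: float = 3.0, timeout: float = 300.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call get_status every `interval` seconds until is_done(status) or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if is_done(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("run did not finish before the timeout")
        sleep(interval)  # fixed pause between checks; never tight-loop


# Usage with a fake status source and a recording sleep, so nothing actually waits.
statuses = iter(["running", "running", "done"])
pauses = []
result = poll_until(lambda: next(statuses), lambda s: s == "done",
                    interval=3.0, sleep=pauses.append)
print(result, pauses)
```

Injecting `sleep` keeps the loop testable; in production the default `time.sleep` applies.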