How NeuronGate Routes Every Request

A technical walkthrough of the NeuronGate proxy architecture, from balance reservation to upstream dispatch to settlement.

NeuronGate team · April 18, 2026 · 3 min read

When you call POST /v1/chat/completions on NeuronGate, a small but specific sequence of operations happens before your prompt ever reaches a model. This post walks through that sequence, not to impress you, but because understanding the path helps you reason about latency, errors, and billing.

The Pipeline in One Diagram

Client → Nginx → FastAPI auth middleware
              → balance reservation
              → upstream dispatch (OpenRouter)
              → streaming proxy
              → settlement (actual cost deducted)

Each arrow is a distinct step with its own failure modes. Let's go through them.

Step 1: Nginx Terminates TLS and Rate-Limits

Every request hits Nginx first. Nginx terminates TLS, applies gzip, and enforces per-IP connection limits before the request even reaches Python. This is cheap protection: it filters obviously malformed traffic without spending FastAPI cycles on it.

Nginx then proxies to FastAPI on the internal Docker network. The client never talks directly to the application server.

Step 2: Auth Middleware Validates the API Key

FastAPI's first middleware layer pulls the Authorization: Bearer header and hashes the key with SHA-256. That hash is looked up in the api_keys table.

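import hashlib

from sqlalchemy import select

# Only the SHA-256 digest of the key is compared against the database.
key_hash = hashlib.sha256(raw_key.encode()).hexdigest()
result = await db.execute(
    select(APIKey).where(
        APIKey.key_hash == key_hash,
        APIKey.is_active.is_(True),
    )
)
record = result.scalar_one_or_none()  # None means unknown or revoked key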

If the key doesn't exist or is revoked, the request stops here with a 401. No balance is touched.
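The header parsing that precedes the lookup can be sketched as follows. This is a minimal illustration, not the production middleware: the function names extract_bearer_token and hash_api_key are hypothetical, and the assumption is that any missing, malformed, or non-Bearer header is treated the same as an unknown key (a 401).

```python
import hashlib
from typing import Optional


def extract_bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' value, or None."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    token = token.strip()
    # The scheme name is matched case-insensitively; an empty token is rejected.
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def hash_api_key(raw_key: str) -> str:
    """SHA-256 hex digest of the raw key, the value compared against key_hash."""
    return hashlib.sha256(raw_key.encode()).hexdigest()
```

A caller would return 401 whenever extract_bearer_token yields None, and otherwise pass hash_api_key(token) to the database lookup.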

If the key has a model allowlist configured, the requested model is validated against it now. Requesting a model not on your allowlist returns a 403. See the API key settings to configure per-key model access.
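A minimal sketch of the allowlist check, under two assumptions that the post does not confirm: a key with no allowlist is represented as None and may use any model, and an allowlist that exists but is empty blocks every model. The function name check_model_allowed is hypothetical.

```python
from typing import Iterable, Optional


def check_model_allowed(requested_model: str,
                        allowlist: Optional[Iterable[str]]) -> int:
    """Return the HTTP status the allowlist check would produce: 200 or 403."""
    # Assumption: None means no allowlist is configured, so any model passes.
    if allowlist is None:
        return 200
    # Exact string match against the configured model identifiers.
    return 200 if requested_model in set(allowlist) else 403
```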

Step 3: Balance Reservation

Before we forward the request, we reserve an estimated cost. This prevents concurrent requests from overdrawing a balance.

UPDATE users
SET balance_reserved_usd = balance_reserved_usd + :estimate
WHERE id = :user_id
  AND (balance_usd - balance_reserved_usd) >= :estimate
RETURNING balance_usd, balance_reserved_usd;

The estimate is based on the model's max context window and our per-token pricing. It's intentionally conservative: we'd rather over-reserve than under-reserve.
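One way such a conservative estimate could be computed is to price every token in the context window at the more expensive of the two rates. This is only an assumed formula for illustration; the post does not state the exact calculation, and estimate_reservation and the rates in the usage note are hypothetical.

```python
from decimal import Decimal


def estimate_reservation(max_context_tokens: int,
                         prompt_rate: Decimal,
                         completion_rate: Decimal) -> Decimal:
    """Upper bound on cost: the whole window billed at the higher per-token rate."""
    # Prompt plus completion tokens cannot exceed the context window, so this
    # value can never be lower than the actual cost at these rates.
    return max_context_tokens * max(prompt_rate, completion_rate)
```

With an 8,000-token window and hypothetical rates of $0.000001 per prompt token and $0.000002 per completion token, this would reserve 8000 × 0.000002 = $0.016.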

If this UPDATE returns zero rows, the user has insufficient available balance and we return a 402 immediately. No upstream call is made; no credits are spent.
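To see why the conditional UPDATE prevents overdraw, here is a self-contained sketch using an in-memory SQLite database. The table layout, the integer-cents columns, and the reserve function are simplifications for illustration; they are not the production schema, and the sketch checks the affected row count instead of using RETURNING.

```python
import sqlite3


def reserve(conn: sqlite3.Connection, user_id: int, estimate_cents: int) -> bool:
    """Try to reserve funds; False corresponds to the zero-rows (402) case."""
    cur = conn.execute(
        """
        UPDATE users
        SET balance_reserved = balance_reserved + :estimate
        WHERE id = :user_id
          AND (balance - balance_reserved) >= :estimate
        """,
        {"estimate": estimate_cents, "user_id": user_id},
    )
    conn.commit()
    # The check and the increment happen in one statement, so a second
    # request cannot observe the balance between them.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, balance INTEGER, balance_reserved INTEGER)"
)
conn.execute("INSERT INTO users VALUES (1, 100, 0)")

first = reserve(conn, 1, 60)   # 100 cents available, so this succeeds
second = reserve(conn, 1, 60)  # only 40 cents available, so this is refused
```

The second call fails even though balance itself is untouched, because the first reservation already reduced the available amount.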

Step 4: Upstream Dispatch to OpenRouter

With balance reserved, we forward the request to OpenRouter's /v1/chat/completions endpoint. We pass through the user's messages, model selection, and parameters. We do not pass through the user's NeuronGate key; we use our own OpenRouter key at the infra level.

Outbound headers are scrubbed: no Cookie, no X-Forwarded-For, no Referer. We add our own X-Request-Id for correlation.

headers = {
    "Authorization": f"Bearer {settings.OPENROUTER_API_KEY}",
    "X-Request-Id": request_id,
    "Content-Type": "application/json",
}

For non-streaming requests, we wait for the full response. For streaming requests (when the caller sets "stream": true), we proxy the SSE chunks back to the client as they arrive.

Step 5: Settlement

Once OpenRouter responds (or when the stream completes), we receive the actual token usage in the response body. OpenRouter surfaces this via usage.prompt_tokens and usage.completion_tokens.
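For streamed responses, pulling the usage object out of the SSE frames might look like the sketch below. It assumes OpenAI-style frames (lines beginning with "data:", a final "data: [DONE]" marker, and a usage object carried on one of the chunks); the function name usage_from_sse is hypothetical and this is not the production proxy code.

```python
import json
from typing import Dict, Iterable, Optional


def usage_from_sse(lines: Iterable[str]) -> Optional[Dict[str, int]]:
    """Return the last usage object seen in an SSE stream, or None if absent."""
    usage = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # tolerate a truncated frame from an aborted stream
        if isinstance(chunk, dict) and chunk.get("usage"):
            usage = chunk["usage"]
    return usage
```

A None result corresponds to the case where no usage metadata arrived before the stream ended.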

We then:

  1. Calculate the exact cost at our published per-token rate for that model
  2. Deduct the exact cost from balance_usd
  3. Release the reservation from balance_reserved_usd
  4. Write a row to usage_logs with model, tokens, cost, latency, and request_id

from sqlalchemy import update

# Charge the actual cost and release the full reservation in one statement.
actual_cost = (prompt_tokens * prompt_rate) + (completion_tokens * completion_rate)
await db.execute(
    update(User)
    .where(User.id == user_id)
    .values(
        balance_usd=User.balance_usd - actual_cost,
        balance_reserved_usd=User.balance_reserved_usd - estimate,
    )
)

The net effect: your balance decreases by the actual cost, not the estimate. If the estimate was higher, the difference is released back to your available balance.
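A worked example with made-up numbers shows the arithmetic. The rates, token counts, and the settle helper are all hypothetical; Decimal is used here only to keep the illustration free of floating-point rounding.

```python
from decimal import Decimal


def settle(balance, reserved, estimate,
           prompt_tokens, completion_tokens,
           prompt_rate, completion_rate):
    """Return (new_balance, new_reserved, actual_cost) after settlement."""
    actual_cost = prompt_tokens * prompt_rate + completion_tokens * completion_rate
    # The balance drops by the actual cost; the reservation is released in full.
    return balance - actual_cost, reserved - estimate, actual_cost


new_balance, new_reserved, cost = settle(
    balance=Decimal("10.00"),
    reserved=Decimal("0.50"),
    estimate=Decimal("0.50"),
    prompt_tokens=1200,
    completion_tokens=300,
    prompt_rate=Decimal("0.000002"),      # hypothetical $ per prompt token
    completion_rate=Decimal("0.000006"),  # hypothetical $ per completion token
)
```

Here the cost is 1200 × 0.000002 + 300 × 0.000006 = $0.0042. Available balance was $9.50 while the request was in flight (10.00 minus the 0.50 reservation) and becomes $9.9958 after settlement.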

What Happens on Errors

OpenRouter returns a 5xx: We release the reservation, log the error, and return a 502 to the client. Your balance is unchanged.

Stream aborts mid-response: We settle for whatever tokens were consumed up to that point. If we can't determine usage (no usage metadata in the partial stream), we settle for the full estimate. This is a known limitation that we plan to address in a future release.
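The abort rule reduces to a small decision, sketched below. The function name amount_to_settle and the shape of the usage dictionary are assumptions for illustration, modelled on the usage.prompt_tokens and usage.completion_tokens fields mentioned earlier.

```python
from decimal import Decimal
from typing import Mapping, Optional


def amount_to_settle(usage: Optional[Mapping[str, int]],
                     estimate: Decimal,
                     prompt_rate: Decimal,
                     completion_rate: Decimal) -> Decimal:
    """Charge for reported tokens, or fall back to the full estimate."""
    if usage is None:
        # No usage metadata arrived before the stream ended.
        return estimate
    return (usage["prompt_tokens"] * prompt_rate
            + usage["completion_tokens"] * completion_rate)
```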

NeuronGate process crashes: The reservation stays locked for up to 24 hours, then expires. Your effective available balance is temporarily reduced, but no money is lost. The usage_logs row may be missing for that request.

What This Means for You

The reserve → settle pattern means:

  • Concurrent requests are safe. Each request reserves before dispatching, so two requests can't both see the same balance and both proceed.
  • Exact cost is what you pay. The estimate is internal accounting. Your invoice shows actual token counts.
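  • Upstream failures don't cost you money. If OpenRouter is down, your reservation is released. You pay only for tokens that were actually consumed, with one exception: a stream that aborts without usage metadata settles at the full estimate.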

You can inspect every settled request in the console under usage history. Each row has the model, token counts, cost, and timestamp.


For a deeper look at which models are available and their per-token rates, see the model catalogue. If you want to start routing requests, the docs cover authentication setup end-to-end.