Virtual key management

Issue API keys with per-key token budgets and cost tracking (also known as virtual keys).

About

Virtual key management allows you to issue API keys to users or applications, each with independent tracking and cost controls. Agentgateway achieves this by composing existing capabilities:

API key authentication: Identify incoming requests by API key
Token-based rate limiting: Enforce token budgets
Observability metrics: Track per-key spending and usage

How virtual keys work

flowchart TD
  A[Request arrives with API key] --> B[Validate API key]
  B --> C{Key valid?}
  C -->|Yes| D[Check token budget]
  D --> E{Budget available?}
  E -->|Yes| F[Forward to LLM]
  F --> G[Track token usage]
  G --> H[Deduct from budget]
  E -->|No| I[Reject with 429]
  C -->|No| J[Reject with 401]
  subgraph refill["Budget refills periodically"]
    H
  end
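
The flow in the diagram can be sketched in a few lines of Python. This is only an illustrative model, not agentgateway's implementation: `Budget`, `handle_request`, and the in-memory key map are hypothetical names, and the sketch assumes that usage is deducted after the response and that the balance may drop below zero.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Remaining tokens in one budget (hypothetical in-memory state)."""
    remaining: int


def handle_request(api_key, valid_keys, budget, tokens_used_by_llm):
    """Return the HTTP status that the flowchart produces for one request.

    valid_keys maps API key -> metadata. tokens_used_by_llm stands in for
    the usage that the provider reports in its response.
    """
    if api_key not in valid_keys:
        return 401  # Key invalid: reject before any budget check
    if budget.remaining <= 0:
        return 429  # Budget exhausted: reject without calling the LLM
    # Forwarding to the LLM is simulated; usage is deducted after the response
    budget.remaining -= tokens_used_by_llm
    return 200
```

With a 10-token budget, a first request that uses 260 tokens still succeeds, and the next request with the same budget is rejected with 429 until the budget refills.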

Before you begin

Install the agentgateway binary.

Set up virtual keys

Step 1: Configure API key authentication

Create a configuration with API key authentication. This example creates two virtual keys for Alice and Bob.

cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config

llm:
  policies:
    apiKey:
      mode: strict
      keys:
      - key: sk-alice-abc123def456
        metadata:
          user: alice
      - key: sk-bob-xyz789uvw012
        metadata:
          user: bob
  models:
  - name: "*"
    provider: openAI
    params:
      apiKey: "$OPENAI_API_KEY"
EOF

Setting	Description
`apiKey.mode`	Set to `strict` to require a valid API key for all requests. Use `optional` to allow unauthenticated requests.
`apiKey.keys`	List of API keys. Each key has a `key` value and optional `metadata`.
`key`	The API key value that users include in the `Authorization: Bearer <key>` header.
`metadata`	Optional metadata associated with the key, such as a user identifier or tier.

Step 2: Start agentgateway

agentgateway -f config.yaml

Step 3: Test the virtual keys

Send a request with Alice’s API key. Verify that the request succeeds.

curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-alice-abc123def456" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }' | jq .

Example successful response:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    }
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 9,
    "total_tokens": 19
  }
}

Send a request without a valid API key. Verify that the request is rejected with a 401 status.

curl -s -o /dev/null -w "%{http_code}" http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer invalid-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Expected output. Because the command uses -o /dev/null and -w "%{http_code}", curl prints only the status code:

401
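
An application can send the same request from code. The following sketch uses only the Python standard library and assumes the gateway address and example key from the curl commands in this step; `build_chat_request` and `send` are hypothetical helper names, not part of agentgateway.

```python
import json
import urllib.error
import urllib.request

# Address used by the curl examples in this guide
GATEWAY_URL = "http://localhost:4000/v1/chat/completions"


def build_chat_request(api_key, prompt, model="gpt-3.5-turbo"):
    """Build (but do not send) a chat completion request that carries a virtual key."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        GATEWAY_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the HTTP status code, including 4xx codes."""
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # For example 401 for an invalid key
```

Calling `send(build_chat_request("sk-alice-abc123def456", "Hello!"))` requires a running gateway. Building the request separately makes it possible to inspect the headers without any network access.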

Configure token budgets

LLMs typically charge per input and output token. Without spending control, users can quickly generate large bills by submitting long prompts, streaming or retrying requests, or running recursive agent loops. To protect against unexpected bills, scaling surprises, and abuse, use token-based rate limits to cap the number of tokens that can be used.

How rate limiting works

Agentgateway checks token-based rate limits in two phases:

At request time:

When tokenize: true is not set on the AI backend (the default is false), agentgateway cannot calculate the number of tokens in the request before forwarding it. The request is therefore always allowed, unless the rate limit is set to 0 tokens. The LLM typically reports the number of tokens that were used when it sends the response, and agentgateway counts the request and response tokens against the rate limit at that point.
When tokenize: true is set, agentgateway estimates the number of tokens at request time. The request is allowed only if the estimate does not exceed the rate limit.

At response time:

When the LLM returns a response, it typically provides the number of tokens that were used during the request and response. Agentgateway uses these numbers to determine if the rate limit was reached.

Note that this determination happens after the response is returned. Even if the number of tokens used in the response exceeds the number of allowed tokens, the response is still returned to the user. Only subsequent requests are rate limited. If tokenize: true is set on the AI backend and tokens were estimated at request time, agentgateway compares the estimate with the actual number of tokens when the LLM returns its response. If the initial estimate was off, agentgateway adjusts the count so that the actual usage is counted against the rate limit.
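
The two phases can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not agentgateway's code: `TokenBudget`, `on_request`, and `on_response` are hypothetical names, and the sketch assumes that an estimate is reserved at request time and corrected once the LLM reports actual usage.

```python
class TokenBudget:
    """Simplified model of the two-phase check (hypothetical, for illustration)."""

    def __init__(self, max_tokens):
        self.remaining = max_tokens

    def on_request(self, estimated_tokens=None):
        """Request-time check. estimated_tokens is None when tokenize is off."""
        if estimated_tokens is None:
            # No estimate available: allow unless the budget is already used up
            return self.remaining > 0
        if estimated_tokens > self.remaining:
            return False  # Estimate exceeds the budget: reject up front
        self.remaining -= estimated_tokens  # Reserve the estimated tokens
        return True

    def on_response(self, actual_tokens, estimated_tokens=None):
        """Response-time accounting with the usage that the LLM reports."""
        already_charged = estimated_tokens or 0
        # Charge only the difference between actual usage and the reservation
        self.remaining -= actual_tokens - already_charged
```

Without an estimate, a 10-token budget admits a request that later turns out to use 260 tokens, and only the following request is rejected. With an estimate of 12 tokens against the same budget, the request is rejected before it reaches the LLM.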

Step 1: Add a token budget

Update your configuration to include a localRateLimit policy. The following example builds on the virtual keys configuration from the previous section and adds a token budget.

cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config

llm:
  policies:
    apiKey:
      mode: strict
      keys:
      - key: sk-alice-abc123def456
        metadata:
          user: alice
      - key: sk-bob-xyz789uvw012
        metadata:
          user: bob
    localRateLimit:
    - maxTokens: 10
      tokensPerFill: 1
      fillInterval: 60s
      type: tokens
  models:
  - name: "*"
    provider: openAI
    params:
      apiKey: "$OPENAI_API_KEY"
EOF

Setting	Description
`localRateLimit`	Applies a token-based rate limit to all incoming LLM requests.
`maxTokens`	The maximum number of tokens that are available to use.
`tokensPerFill`	The number of tokens that are added during a refill.
`fillInterval`	The number of seconds after which the token bucket is refilled.
`type`	The type of rate limiting to apply. Use `tokens` for token-based rate limiting, or `requests` for request-based rate limiting.
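
To see how these settings interact, the refill arithmetic can be sketched as below. The function name `tokens_available` is hypothetical, the defaults mirror the example values above, and the sketch assumes that refills happen in whole intervals and never exceed `maxTokens`.

```python
def tokens_available(current, elapsed_seconds, max_tokens=10,
                     tokens_per_fill=1, fill_interval_seconds=60):
    """Tokens in the bucket after elapsed_seconds, capped at max_tokens."""
    # Only completed intervals add tokens to the bucket
    fills = elapsed_seconds // fill_interval_seconds
    return min(max_tokens, current + fills * tokens_per_fill)


print(tokens_available(0, 300))   # 5: five refills of one token each
print(tokens_available(0, 3600))  # 10: capped at max_tokens
```

Under these assumptions, an empty bucket with the example values takes 10 minutes to return to its full 10-token budget.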

Step 2: Verify rate limits

Start agentgateway with the updated configuration.
```
agentgateway -f config.yaml
```

Send a prompt to the LLM. At the time the prompt is sent, the number of tokens required for the completion is unknown. Because tokenize: true is not set on the model, the prompt token count is not estimated either. As a result, the request is allowed.

The LLM typically returns the number of tokens required for completion in its response. Agentgateway uses this number and counts it against the rate limit.

curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-alice-abc123def456" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "Tell me a short story"
      }
    ]
  }'

Example output:

{
  "choices": [
    {
      "message": {
        "content": "Once upon a time, in a small village nestled between towering mountains...",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 248,
    "total_tokens": 260
  }
}

Repeat the same request. This time, the request is rate limited because the 260 tokens used by the first request exceeded the 10-token budget.

curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-alice-abc123def456" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "Tell me a short story"
      }
    ]
  }'

Example output:

rate limit exceeded

Step 3: Enable request-time token estimation

By default, agentgateway does not estimate token counts at request time. To reject requests before they reach the LLM, set tokenize: true on your model.

cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config

llm:
  policies:
    apiKey:
      mode: strict
      keys:
      - key: sk-alice-abc123def456
        metadata:
          user: alice
      - key: sk-bob-xyz789uvw012
        metadata:
          user: bob
    localRateLimit:
    - maxTokens: 10
      tokensPerFill: 1
      fillInterval: 60s
      type: tokens
  models:
  - name: "*"
    provider: openAI
    params:
      apiKey: "$OPENAI_API_KEY"
      tokenize: true
EOF

With this setting, requests are denied immediately if the estimated prompt token count exceeds the available budget.

Add a global token budget

localRateLimit is a gateway-wide limit, not a per-key limit. It enforces a single shared token budget across all requests and API keys.

To enforce a shared token budget on total token usage while using more advanced routing options, use the routing-based configuration format with localRateLimit.

Rate limiting requires the binds/listeners/routes configuration format because localRateLimit is an HTTP-level policy. For more information, see the Routing-based configuration guide.

cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config

binds:
- port: 4000
  listeners:
  - routes:
    - backends:
      - ai:
          name: openai
          provider:
            openAI:
              model: gpt-3.5-turbo
      policies:
        apiKey:
          mode: strict
          keys:
          - key: sk-alice-abc123def456
            metadata:
              user: alice
          - key: sk-bob-xyz789uvw012
            metadata:
              user: bob
        backendAuth:
          key: "$OPENAI_API_KEY"
        localRateLimit:
        - maxTokens: 100000
          tokensPerFill: 100000
          fillInterval: 86400s
          type: tokens
EOF

Setting	Description
`backendAuth`	The API key used to authenticate with the LLM provider backend. For configuration options, see Manage API keys.
`localRateLimit`	Token-based rate limiting applied globally to all requests through this route, regardless of which API key is used.
`maxTokens`	The maximum number of tokens available in the shared budget.
`tokensPerFill`	The number of tokens added during each refill.
`fillInterval`	The interval between refills. Use `86400s` for a daily budget.
`type`	Set to `tokens` for token-based limits. Use `requests` for request-based limits.

For more information about rate limiting configuration options, see Rate limits.

Monitor per-key spending

Track token usage and spending for each virtual key using Prometheus metrics exposed by agentgateway.

Access the agentgateway metrics endpoint.
```
curl http://localhost:15000/metrics
```

Query token usage metrics.

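# Total tokens consumed over the last 24 hours
# Each side is aggregated first, because input and output series carry
# different gen_ai_token_type labels and would not match in a direct addition
sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="input"}[24h])) +
sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output"}[24h]))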

Calculate costs by multiplying token counts by your provider’s pricing. For example, with OpenAI GPT-3.5:

# Estimated cost over the last 24 hours
# (assuming $0.50 per 1M input tokens, $1.50 per 1M output tokens)
# increase() returns the token total for the window; rate() would return tokens per second
((sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="input"}[24h])) / 1000000) * 0.50) +
((sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output"}[24h])) / 1000000) * 1.50)
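
The same arithmetic can be checked outside of Prometheus. The sketch below uses the example prices assumed in this section. The function name `estimate_cost` is hypothetical, and the prices are placeholders that must be replaced with your provider's current pricing.

```python
INPUT_PRICE_PER_MILLION = 0.50   # USD per 1M input tokens (example price)
OUTPUT_PRICE_PER_MILLION = 1.50  # USD per 1M output tokens (example price)


def estimate_cost(input_tokens, output_tokens):
    """Return the estimated cost in USD for the given token counts."""
    input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_MILLION
    output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


# 2M input tokens and 1M output tokens: 1.00 + 1.50 = 2.50 USD
print(estimate_cost(2_000_000, 1_000_000))
```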

What’s next

Manage API keys for detailed authentication configuration
Rate limits for advanced rate limiting configuration
Set up observability to view token usage metrics and logs
