Virtual key management

Verified: Code examples on this page have been automatically tested and verified.

Issue API keys to users or applications and control token usage (also known as virtual keys).

About

Virtual key management allows you to issue API keys to users or applications, each with independent tracking and cost controls. Agentgateway achieves this by composing existing capabilities:

  • API key authentication: Identify incoming requests by API key
  • Token-based rate limiting: Enforce token budgets
  • Observability metrics: Track per-key spending and usage

How virtual keys work

flowchart TD
  A[Request arrives with API key] --> B[Validate API key]
  B --> C{Key valid?}
  C -->|Yes| D[Check token budget]
  C -->|No| J[Reject with 401]
  D --> E{Budget available?}
  E -->|Yes| F[Forward to LLM]
  E -->|No| I[Reject with 429]
  F --> G[Track token usage]
  G --> H[Deduct from budget]
  subgraph refill["Budget refills periodically"]
    H
  end
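
The decision flow in the diagram can be sketched in a few lines of Python. This is an illustrative model only, not agentgateway's implementation: the `Budget` class, the `handle_request` function, and the in-memory key table are hypothetical names, and the sketch assumes a single shared budget that may go negative when a response uses more tokens than remain.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-in for the configured virtual keys.
VALID_KEYS = {"sk-alice-abc123def456": "alice", "sk-bob-xyz789uvw012": "bob"}


@dataclass
class Budget:
    remaining: int  # tokens still available before the next refill


def handle_request(api_key: str, budget: Budget, tokens_used_by_llm: int) -> int:
    """Return the HTTP status that the flowchart produces for one request."""
    if api_key not in VALID_KEYS:
        return 401  # "Key valid?" -> No
    if budget.remaining <= 0:
        return 429  # "Budget available?" -> No
    # Forward to the LLM, then track usage and deduct it from the budget.
    budget.remaining -= tokens_used_by_llm
    return 200


budget = Budget(remaining=10)
print(handle_request("invalid-key", budget, 0))              # 401
print(handle_request("sk-alice-abc123def456", budget, 260))  # 200
print(handle_request("sk-bob-xyz789uvw012", budget, 19))     # 429
```

The third call is rejected even though Bob's key is valid, because the first successful request already used up the shared budget in this model.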
  

Before you begin

Install the agentgateway binary.

Set up virtual keys

Step 1: Configure API key authentication

Create a configuration with API key authentication. This example creates two virtual keys for Alice and Bob.

cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config

llm:
  policies:
    apiKey:
      mode: strict
      keys:
      - key: sk-alice-abc123def456
        metadata:
          user: alice
      - key: sk-bob-xyz789uvw012
        metadata:
          user: bob
  models:
  - name: "*"
    provider: openAI
    params:
      apiKey: "$OPENAI_API_KEY"
EOF

Setting       Description
apiKey.mode   Set to strict to require a valid API key for all requests. Use optional to allow unauthenticated requests.
apiKey.keys   List of API keys. Each key has a key value and optional metadata.
key           The API key value that users include in the Authorization: Bearer <key> header.
metadata      Optional metadata associated with the key, such as a user identifier or tier.
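
The example keys above are fixed strings for readability. In practice you would likely generate key values randomly. The following sketch assumes the `sk-<user>-<random>` shape used in the examples, which is a convention of this page rather than a format that agentgateway requires. The helper names `new_virtual_key` and `yaml_entry` are hypothetical.

```python
import secrets


def new_virtual_key(user: str) -> str:
    """Build a key of the form sk-<user>-<random>, mirroring the example keys."""
    # token_hex(16) returns 32 hex characters from a cryptographically secure source.
    return f"sk-{user}-{secrets.token_hex(16)}"


def yaml_entry(user: str, key: str) -> str:
    """Render one list item in the shape used under apiKey.keys."""
    return f"- key: {key}\n  metadata:\n    user: {user}\n"


key = new_virtual_key("alice")
print(yaml_entry("alice", key))
```

The printed entry can be pasted under the keys list, indented to match the surrounding configuration.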

Step 2: Start agentgateway

agentgateway -f config.yaml

Step 3: Test the virtual keys

  1. Send a request with Alice’s API key. Verify that the request succeeds.

    curl -s http://localhost:4000/v1/chat/completions \
      -H "Authorization: Bearer sk-alice-abc123def456" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello!"}]
      }' | jq .

    Example successful response:

    {
      "choices": [{
        "message": {
          "role": "assistant",
          "content": "Hello! How can I help you today?"
        }
      }],
      "usage": {
        "prompt_tokens": 10,
        "completion_tokens": 9,
        "total_tokens": 19
      }
    }
  2. Send a request without a valid API key. Verify that the request is rejected with a 401 status.

    curl -s -o /dev/null -w "%{http_code}" http://localhost:4000/v1/chat/completions \
      -H "Authorization: Bearer invalid-key" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'

    Expected output. Because of the -o /dev/null and -w "%{http_code}" options, curl prints only the status code:

    401
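
The same two checks can be scripted. The sketch below uses only the Python standard library and assumes the gateway address and request body from the curl examples. The function names `build_request` and `status_for` are hypothetical.

```python
import json
import urllib.error
import urllib.request

GATEWAY_URL = "http://localhost:4000/v1/chat/completions"  # taken from the curl examples


def build_request(api_key: str) -> urllib.request.Request:
    """Build the chat completion request with the virtual key as a bearer token."""
    body = json.dumps({
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello!"}],
    }).encode("utf-8")
    return urllib.request.Request(
        GATEWAY_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def status_for(api_key: str) -> int:
    """Send the request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(build_request(api_key), timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises HTTPError for 4xx and 5xx responses
```

With the gateway running, calling status_for("sk-alice-abc123def456") should return 200 and status_for("invalid-key") should return 401, matching the curl checks above.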

Configure token budgets

LLMs typically charge per input and output token. Without spending control, users can quickly generate large bills by submitting long prompts, streaming or retrying requests, or running recursive agent loops. To protect against unexpected bills, scaling surprises, and abuse, use token-based rate limits to cap the number of tokens that can be used.

How rate limiting works

Agentgateway checks token-based rate limits in two phases:

At request time:

  • When tokenize is not set or is set to false on the AI backend (the default), agentgateway cannot calculate the number of tokens in the request up front. The request is therefore always allowed, unless the rate limit is set to 0 tokens. Usage is counted at response time instead, based on the token counts that the LLM typically returns with its response.
  • When tokenize: true is set, agentgateway estimates the number of tokens at request time. The request is allowed only if the estimated number of tokens does not exceed the set rate limit.

At response time:

When the LLM returns a response, it typically provides the number of tokens that were used during the request and response. Agentgateway uses these numbers to determine if the rate limit was reached.

Note that this determination happens after the response is returned. Even if the number of tokens used in the response exceeds the number of allowed tokens, the response is still returned to the user. Only subsequent requests are rate limited.

If tokenize: true is set on the AI backend and tokens were estimated during the request, agentgateway verifies the actual number of tokens used when the LLM returns its response. If the initial estimate was off, agentgateway adjusts the count so that the actual usage is what counts against the rate limit.
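
The two phases can be modeled as follows. This is a simplified sketch under stated assumptions, not agentgateway's implementation: the `TokenBudget` class and its methods are hypothetical, refills are left out, and the remaining budget is allowed to go negative when a response uses more tokens than were available.

```python
class TokenBudget:
    """Simplified model of the request-time and response-time checks."""

    def __init__(self, max_tokens: int, tokenize: bool = False):
        self.remaining = max_tokens
        self.tokenize = tokenize

    def on_request(self, estimated_prompt_tokens: int) -> bool:
        """Request time: decide whether to forward the request."""
        if not self.tokenize:
            # No estimate is available: allow unless the budget is already used up.
            return self.remaining > 0
        if estimated_prompt_tokens > self.remaining:
            return False
        # Charge the estimate now; it is corrected at response time.
        self.remaining -= estimated_prompt_tokens
        return True

    def on_response(self, prompt_tokens: int, completion_tokens: int,
                    estimated_prompt_tokens: int = 0) -> None:
        """Response time: charge actual usage, minus anything charged earlier."""
        already_charged = estimated_prompt_tokens if self.tokenize else 0
        self.remaining -= (prompt_tokens + completion_tokens) - already_charged


budget = TokenBudget(max_tokens=10)       # tokenize defaults to False
print(budget.on_request(0))               # True: allowed without an estimate
budget.on_response(12, 248)               # actual usage arrives with the response
print(budget.remaining)                   # -250
print(budget.on_request(0))               # False: later requests are rate limited

strict = TokenBudget(max_tokens=10, tokenize=True)
print(strict.on_request(12))              # False: the estimate exceeds the budget
```

The first request goes through even though it ends up using far more than 10 tokens, which is the behavior that request-time estimation is meant to reduce.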

Step 1: Add a token budget

Update your configuration to include a localRateLimit policy. The following example builds on the virtual keys configuration from the previous section and adds a token budget.

cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config

llm:
  policies:
    apiKey:
      mode: strict
      keys:
      - key: sk-alice-abc123def456
        metadata:
          user: alice
      - key: sk-bob-xyz789uvw012
        metadata:
          user: bob
    localRateLimit:
    - maxTokens: 10
      tokensPerFill: 1
      fillInterval: 60s
      type: tokens
  models:
  - name: "*"
    provider: openAI
    params:
      apiKey: "$OPENAI_API_KEY"
EOF

Setting          Description
localRateLimit   Applies a token-based rate limit to all incoming LLM requests.
maxTokens        The maximum number of tokens that are available to use.
tokensPerFill    The number of tokens that are added during a refill.
fillInterval     The interval after which the token bucket is refilled, such as 60s.
type             The type of rate limiting to apply. Use tokens for token-based rate limiting, or requests for request-based rate limiting.
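
To get a feel for how these three settings interact, the following sketch computes how long an empty bucket takes to refill. It assumes standard token bucket behavior, where tokensPerFill tokens are added every fillInterval up to maxTokens. The function name `seconds_to_refill` is hypothetical.

```python
import math


def seconds_to_refill(deficit_tokens: int, tokens_per_fill: int, fill_interval_s: int) -> int:
    """Seconds until deficit_tokens have been added back to the bucket."""
    fills_needed = math.ceil(deficit_tokens / tokens_per_fill)
    return fills_needed * fill_interval_s


# Settings from the example: maxTokens 10, tokensPerFill 1, fillInterval 60s.
print(seconds_to_refill(10, 1, 60))  # 600
```

With the example values, an empty bucket needs 10 refills of 1 token each, so it is full again after 600 seconds.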

Step 2: Verify rate limits

  1. Start agentgateway with the updated configuration.

    agentgateway -f config.yaml
  2. Send a prompt to the LLM. At the time the prompt is sent, the number of tokens required for the completion is unknown. Because tokenize: true is not set on the model, the prompt count is not estimated. As a result, the prompt is allowed.

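    The LLM typically returns the number of tokens used for the prompt and the completion in its response. Agentgateway counts these tokens against the rate limit. Because API key authentication is set to strict, include one of the virtual keys in the request.

    curl http://localhost:4000/v1/chat/completions \
      -H "Authorization: Bearer sk-alice-abc123def456" \
      -H 'Content-Type: application/json' \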
      -d '{
        "model": "gpt-3.5-turbo",
        "messages": [
          {
            "role": "user",
            "content": "Tell me a short story"
          }
        ]
      }'

    Example output:

    {
      "choices": [
        {
          "message": {
            "content": "Once upon a time, in a small village nestled between towering mountains...",
            "role": "assistant"
          },
          "finish_reason": "stop"
        }
      ],
      "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 248,
        "total_tokens": 260
      }
    }
  3. Repeat the same request. This time, the request is rate limited because the tokens used in the first request exceeded the budget.

    curl http://localhost:4000/v1/chat/completions \
      -H "Authorization: Bearer sk-alice-abc123def456" \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "gpt-3.5-turbo",
        "messages": [
          {
            "role": "user",
            "content": "Tell me a short story"
          }
        ]
      }'

    Example output:

    rate limit exceeded

Step 3: Enable request-time token estimation

By default, agentgateway does not estimate token counts at request time. To reject requests before they reach the LLM, set tokenize: true on your model.

cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config

llm:
  policies:
    apiKey:
      mode: strict
      keys:
      - key: sk-alice-abc123def456
        metadata:
          user: alice
      - key: sk-bob-xyz789uvw012
        metadata:
          user: bob
    localRateLimit:
    - maxTokens: 10
      tokensPerFill: 1
      fillInterval: 60s
      type: tokens
  models:
  - name: "*"
    provider: openAI
    params:
      apiKey: "$OPENAI_API_KEY"
      tokenize: true
EOF

With this setting, requests are denied immediately if the estimated prompt token count exceeds the available budget.

Add a global token budget

localRateLimit is a gateway-wide limit, not a per-key limit. It enforces a single shared token budget across all requests and API keys.

To enforce a shared token budget together with more advanced routing options, use the routing-based configuration format with localRateLimit.

Rate limiting requires the binds/listeners/routes configuration format because localRateLimit is an HTTP-level policy. For more information, see the Routing-based configuration guide.

cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config

binds:
- port: 4000
  listeners:
  - routes:
    - backends:
      - ai:
          name: openai
          provider:
            openAI:
              model: gpt-3.5-turbo
      policies:
        apiKey:
          mode: strict
          keys:
          - key: sk-alice-abc123def456
            metadata:
              user: alice
          - key: sk-bob-xyz789uvw012
            metadata:
              user: bob
        backendAuth:
          key: "$OPENAI_API_KEY"
        localRateLimit:
        - maxTokens: 100000
          tokensPerFill: 100000
          fillInterval: 86400s
          type: tokens
EOF

Setting          Description
backendAuth      The API key used to authenticate with the LLM provider backend. For configuration options, see Manage API keys.
localRateLimit   Token-based rate limiting applied globally to all requests through this route, regardless of which API key is used.
maxTokens        The maximum number of tokens available in the shared budget.
tokensPerFill    The number of tokens added during each refill.
fillInterval     The interval between refills. Use 86400s for a daily budget.
type             Set to tokens for token-based limits. Use requests for request-based limits.

For more information about rate limiting configuration options, see Rate limits.

Monitor per-key spending

Track token usage and spending for each virtual key using Prometheus metrics exposed by agentgateway.

  1. Access the agentgateway metrics endpoint.

    curl http://localhost:15000/metrics
  2. Query token usage metrics.

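    # Total tokens consumed over the last 24 hours.
    # Each side is summed first so that the differing gen_ai_token_type labels do not prevent the addition from matching.
    sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="input"}[24h]))
    +
    sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output"}[24h]))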
  3. Calculate costs by multiplying token counts by your provider’s pricing. For example, with OpenAI GPT-3.5:

    # Estimated cost over the last 24 hours (assuming $0.50 per 1M input tokens, $1.50 per 1M output tokens).
    # increase() returns the total over the window; rate() would return a per-second average instead.
    sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="input"}[24h])) / 1000000 * 0.50
    +
    sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output"}[24h])) / 1000000 * 1.50
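
The same arithmetic can be checked offline against the usage block of a single response. The sketch below uses the assumed prices from the query above ($0.50 per 1M input tokens, $1.50 per 1M output tokens). Substitute your provider's actual pricing. The function name `estimate_cost_usd` is hypothetical.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_million: float = 0.50,
                      output_price_per_million: float = 1.50) -> float:
    """Multiply token counts by per-million-token prices (assumed values)."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_million
    output_cost = (output_tokens / 1_000_000) * output_price_per_million
    return input_cost + output_cost


# Usage from the short story example: 12 prompt tokens, 248 completion tokens.
print(f"{estimate_cost_usd(12, 248):.6f}")               # 0.000378
print(f"{estimate_cost_usd(1_000_000, 1_000_000):.2f}")  # 2.00
```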
