Contents Menu Expand Light mode Dark mode Auto light/dark, in light mode Auto light/dark, in dark mode Skip to content
veeksha
veeksha

Getting Started

  • Getting Started
    • Why Veeksha?
    • Installation
    • Quick start

User Guide

  • User Guide
    • Configuration System
    • Workload recipes
    • Output Files
    • Capacity Search
    • Configuration Sweeps
    • Server Management
    • Weights & Biases Integration
    • Microbenchmarks

Design

  • Design & Architecture
    • System Architecture
    • Sessions and Graphs
    • Content Generation
    • Traffic Scheduling
    • Evaluation and Metrics

Reference

  • Reference
    • Programmatic usage
    • Configuration Reference
      • API Reference
        • Base Session Generator
        • Lmeval Session Generator
        • Synthetic Session Generator
        • Trace Session Generator
        • Base Traffic
        • Concurrent Traffic
        • Rate Traffic
        • Sequential Launch Traffic
        • Base Client
        • OpenAI Chat Completions Client
        • OpenAI Completions Client
        • OpenAI Router Client
        • Base Server
        • SGLang Server
        • Vajra Server
        • vLLM Server
        • Base Evaluator
        • LMEval Accuracy Evaluator
        • Performance Evaluator
        • Audio Channel Performance
        • Image Channel Performance
        • Text Channel Performance
        • Video Channel Performance
        • Base Length Generator
        • Fixed Length Generator
        • Inverse Gaussian Length Generator
        • Stair Length Generator
        • Uniform Length Generator
        • Zipf Length Generator
        • Base Interval Generator
        • Fixed Interval Generator
        • Gamma Interval Generator
        • Poisson Interval Generator
        • Audio Channel Generator
        • Base Channel Generator
        • Image Channel Generator
        • Text Channel Generator
        • Video Channel Generator
        • Base Session Graph Generator
        • Branching Session Graph Generator
        • Linear Session Graph Generator
        • Single Request Session Graph Generator
        • Base Trace Flavor
        • RAG Trace Flavor
        • Request Log Trace Flavor
        • Shared Prefix Trace Flavor
        • Timed Synthetic Session Trace Flavor
        • Untimed Content Multi Turn Trace Flavor
        • Base Slo
        • Constant Slo
        • Audio Output Spec
        • Decode Window
        • Image Output Spec
        • Output Spec
        • Runtime
        • Text Output Spec
        • Trace Recorder
        • Video Output Spec
        • Wandb
Back to top
View this page

Capacity Search¶

Capacity search automatically finds the maximum sustainable session rate or concurrency that meets your latency service level objectives (SLOs). This is essential for capacity planning and performance regression testing.

How it works¶

Veeksha uses an adaptive two-phase algorithm:

Phase 1: Exponential Probing

Start at a low value and exponentially increase until SLOs are violated. This quickly finds the approximate capacity ceiling.

Phase 2: Binary Search

Perform binary search between the last passing and first failing values to converge on the precise capacity.

Example: Finding max rate

Phase 1 (Probe):
  Rate 5.0 → PASS
  Rate 10.0 → PASS
  Rate 20.0 → PASS
  Rate 40.0 → FAIL  ← ceiling found

Phase 2 (Binary):
  Rate 30.0 → PASS
  Rate 35.0 → PASS
  Rate 37.5 → FAIL
  Rate 36.25 → PASS
  → Converged at 36.25

Running capacity search¶

Create a capacity search configuration:

# capacity_search.veeksha.yml
output_dir: capacity_search_output

# Search parameters
start_value: 5.0        # Initial probe value
max_value: 100.0        # Maximum to search
expansion_factor: 2.0   # Multiply by this during probing
max_iterations: 20      # Maximum iterations
precision: 2            # Decimal places for rate

# Benchmark configuration (used for each iteration)
benchmark_config:
  seed: 42

  traffic_scheduler:
    type: rate
    interval_generator:
      type: gamma  # rate is set by capacity search
    cancel_session_on_failure: false

  session_generator:
    type: synthetic
    session_graph:
      type: linear
      inherit_history: true
    channels:
      - type: text
        body_length_generator:
          type: uniform
          min: 100
          max: 500

  client:
    type: openai_chat_completions
    api_base: http://localhost:8000/v1
    model: meta-llama/Llama-3-8B-Instruct

  runtime:
    max_sessions: -1
    benchmark_timeout: 60

  evaluators:
    - type: performance
      target_channels: ["text"]
      slos:
        - name: "P99 TTFC"
          metric: ttfc
          percentile: 0.99
          value: 0.5
          type: constant
        - name: "P90 TBC"
          metric: tbc
          percentile: 0.90
          value: 0.05
          type: constant

Run the search:

uvx veeksha capacity-search --config capacity_search.veeksha.yml

Rate-based vs concurrency-based searches¶

Rate-Based (finding max sessions/second)

Use when you want to find the maximum arrival rate:

benchmark_config:
  traffic_scheduler:
    type: rate
    interval_generator:
      type: gamma  # or poisson

Capacity search sets the arrival_rate parameter.

Concurrency-Based (finding max concurrent sessions)

Use when you want to find maximum sustainable concurrency:

benchmark_config:
  traffic_scheduler:
    type: concurrent
    # target_concurrent_sessions and rampup_seconds are set by search

Capacity search sets target_concurrent_sessions.

Configuration reference¶

output_dir: capacity_search_output    # Base output directory

start_value: 5.0       # Initial probe value
max_value: 100.0       # Maximum value to search
expansion_factor: 2.0  # Probe multiplier (default: 2.0)
max_iterations: 20     # Max total iterations
precision: 2           # Decimal precision for rate searches

benchmark_config:
  # Full benchmark configuration
  # See /config_reference/benchmark for all options
start_value

Initial value for probing. Choose a value likely to pass.

max_value

Upper bound for the search. Probing won’t exceed this.

expansion_factor

How aggressively to probe (2.0 = double each time).

precision

For rate-based searches, how many decimal places to use. Set to 0 for integer concurrency searches.

Defining SLOs¶

SLOs determine pass/fail for each iteration. Define them in the evaluator:

evaluators:
  - type: performance
    slos:
      - name: "P99 TTFC under 500ms"
        metric: ttfc
        percentile: 0.99
        value: 0.5      # 500ms in seconds
        type: constant

      - name: "P99 TBC under 50ms"
        metric: tbc
        percentile: 0.99
        value: 0.05

      - name: "P95 E2E under 10s"
        metric: e2e
        percentile: 0.95
        value: 10.0

Available metrics:

  • ttfc: Time to first chunk/token

  • tbc: Time between chunks

  • tpot: Time per output token

  • e2e: End-to-end latency

An iteration passes only if all SLOs are met.

Output structure¶

capacity_search_output/
└── 08:01:2026-17:25:35-b0fc8e1d/
    ├── config.yml                       # Search configuration
    ├── capacity_search_results.json     # Final results
    └── runs/                            # Individual benchmark runs
        ├── 08:01:2026-17:25:35-f24d1805/
        │   ├── config.yml
        │   ├── metrics/
        │   └── ...
        └── 08:01:2026-17:25:50-4ea72acb/
            └── ...

capacity_search_results.json contains the search outcome:

{
  "traffic_scheduler_type": "concurrent",
  "searched_knob": "traffic_scheduler.target_concurrent_sessions",
  "best_value": 20.0,
  "best_run_dir": "capacity_search_output/.../runs/...",
  "history": [
    {
      "value": 10.0,
      "all_slos_met": true,
      "run_dir": ".../runs/08:01:2026-17:25:35-f24d1805",
      "slo_results": { ... }
    },
    {
      "value": 20.0,
      "all_slos_met": true,
      "run_dir": ".../runs/08:01:2026-17:25:50-4ea72acb",
      "slo_results": { ... }
    }
  ]
}

WandB integration¶

Enable WandB to track all iterations and get a summary run:

benchmark_config:
  wandb:
    enabled: true
    project: veeksha
    group: cap-search-llama-8b

The summary run includes:

  • Final capacity value

  • All iterations plotted

  • Comparison table

  • “BEST_CONFIG” tag on the optimal run

Example: Production capacity planning¶

Find the maximum rate for a latency-sensitive deployment:

output_dir: capacity_search_prod

start_value: 10.0
max_value: 500.0
expansion_factor: 2.0
max_iterations: 15
precision: 1

benchmark_config:
  traffic_scheduler:
    type: rate
    interval_generator:
      type: poisson

  session_generator:
    type: trace
    trace_file: production_traffic.jsonl
    flavor:
      type: timed_synthetic_session

  client:
    type: openai_chat_completions
    api_base: http://prod-server:8000/v1
    model: meta-llama/Llama-3-70B-Instruct

  runtime:
    benchmark_timeout: 120
    max_sessions: -1

  evaluators:
    - type: performance
      slos:
        - name: "P99 TTFT < 2s"
          metric: ttfc
          percentile: 0.99
          value: 2.0
        - name: "P50 TBC < 30ms"
          metric: tbc
          percentile: 0.50
          value: 0.03

This uses real production traces and strict SLOs for accurate capacity planning.

Example: Concurrency capacity search¶

Find the maximum concurrency for throughput-oriented deployments:

# capacity_concurrent.veeksha.yml
output_dir: capacity_search_output

start_value: 4
max_value: 128
expansion_factor: 2.0
max_iterations: 15
precision: 0           # Integer concurrency values

benchmark_config:
  traffic_scheduler:
    type: concurrent
    # target_concurrent_sessions set by capacity search
    cancel_session_on_failure: false

  session_generator:
    type: synthetic
    session_graph:
      type: single_request
    channels:
      - type: text
        body_length_generator:
          type: fixed
          value: 512
    output_spec:
      text:
        output_length_generator:
          type: fixed
          value: 256

  client:
    type: openai_chat_completions
    api_base: http://localhost:8000/v1
    model: meta-llama/Llama-3-8B-Instruct

  runtime:
    benchmark_timeout: 60
    max_sessions: -1

  evaluators:
    - type: performance
      slos:
        - name: "P99 TTFC < 5s"
          metric: ttfc
          percentile: 0.99
          value: 5.0
          type: constant
        - name: "P99 TBC < 100ms"
          metric: tbc
          percentile: 0.99
          value: 0.1
          type: constant
uvx veeksha capacity-search --config capacity_concurrent.veeksha.yml

Tip

Use precision: 0 for concurrency searches (integer values) and precision: 2 for rate searches (fractional rates).

Next
Configuration Sweeps
Previous
Output Files
Copyright © 2024-onwards Systems for AI Lab, Georgia Institute of Technology
Made with Sphinx and @pradyunsg's Furo
On this page
  • Capacity Search
    • How it works
    • Running capacity search
    • Rate-based vs concurrency-based searches
    • Configuration reference
    • Defining SLOs
    • Output structure
    • WandB integration
    • Example: Production capacity planning
    • Example: Concurrency capacity search