Capacity Search
===============

Capacity search automatically finds the maximum sustainable session rate or
concurrency that meets your latency service level objectives (SLOs). This is
essential for capacity planning and performance regression testing.

How it works
------------

Veeksha uses an **adaptive two-phase algorithm**:

**Phase 1: Exponential Probing**
   Start at a low value and exponentially increase until SLOs are violated.
   This quickly finds the approximate capacity ceiling.

**Phase 2: Binary Search**
   Perform binary search between the last passing and first failing values
   to converge on the precise capacity.

.. code-block:: text

   Example: Finding max rate

   Phase 1 (Probe):
     Rate 5.0   → PASS
     Rate 10.0  → PASS
     Rate 20.0  → PASS
     Rate 40.0  → FAIL  ← ceiling found

   Phase 2 (Binary):
     Rate 30.0  → PASS
     Rate 35.0  → PASS
     Rate 37.5  → FAIL
     Rate 36.25 → PASS  → Converged at 36.25

Running capacity search
-----------------------

Create a capacity search configuration:

.. code-block:: yaml

   # capacity_search.veeksha.yml
   output_dir: capacity_search_output

   # Search parameters
   start_value: 5.0          # Initial probe value
   max_value: 100.0          # Maximum to search
   expansion_factor: 2.0     # Multiply by this during probing
   max_iterations: 20        # Maximum iterations
   precision: 2              # Decimal places for rate

   # Benchmark configuration (used for each iteration)
   benchmark_config:
     seed: 42

     traffic_scheduler:
       type: rate
       interval_generator:
         type: gamma
         # rate is set by capacity search
       cancel_session_on_failure: false

     session_generator:
       type: synthetic
       session_graph:
         type: linear
         inherit_history: true
       channels:
         - type: text
           body_length_generator:
             type: uniform
             min: 100
             max: 500

     client:
       type: openai_chat_completions
       api_base: http://localhost:8000/v1
       model: meta-llama/Llama-3-8B-Instruct

     runtime:
       max_sessions: -1
       benchmark_timeout: 60

     evaluators:
       - type: performance
         target_channels: ["text"]
         slos:
           - name: "P99 TTFC"
             metric: ttfc
             percentile: 0.99
             value: 0.5
             type: constant
           - name: "P90 TBC"
             metric: tbc
             percentile: 0.90
             value: 0.05
             type: constant

Run the search:

.. code-block:: bash

   uvx veeksha capacity-search --config capacity_search.veeksha.yml

Rate-based vs concurrency-based searches
----------------------------------------

**Rate-Based** (finding max sessions/second)
   Use when you want to find the maximum arrival rate:

   .. code-block:: yaml

      benchmark_config:
        traffic_scheduler:
          type: rate
          interval_generator:
            type: gamma  # or poisson

   Capacity search sets the ``arrival_rate`` parameter.

**Concurrency-Based** (finding max concurrent sessions)
   Use when you want to find maximum sustainable concurrency:

   .. code-block:: yaml

      benchmark_config:
        traffic_scheduler:
          type: concurrent
          # target_concurrent_sessions and rampup_seconds are set by search

   Capacity search sets ``target_concurrent_sessions``.

Configuration reference
-----------------------

.. code-block:: yaml

   output_dir: capacity_search_output  # Base output directory
   start_value: 5.0                    # Initial probe value
   max_value: 100.0                    # Maximum value to search
   expansion_factor: 2.0               # Probe multiplier (default: 2.0)
   max_iterations: 20                  # Max total iterations
   precision: 2                        # Decimal precision for rate searches

   benchmark_config:                   # Full benchmark configuration
     # See /config_reference/benchmark for all options

``start_value``
   Initial value for probing. Choose a value likely to pass.

``max_value``
   Upper bound for the search. Probing won't exceed this.

``expansion_factor``
   How aggressively to probe (2.0 = double each time).

``precision``
   For rate-based searches, how many decimal places to use.
   Set to 0 for integer concurrency searches.

Defining SLOs
-------------

SLOs determine pass/fail for each iteration. Define them in the evaluator:

.. code-block:: yaml

   evaluators:
     - type: performance
       slos:
         - name: "P99 TTFC under 500ms"
           metric: ttfc
           percentile: 0.99
           value: 0.5        # 500ms in seconds
           type: constant
         - name: "P99 TBC under 50ms"
           metric: tbc
           percentile: 0.99
           value: 0.05
         - name: "P95 E2E under 10s"
           metric: e2e
           percentile: 0.95
           value: 10.0

Available metrics:

- ``ttfc``: Time to first chunk/token
- ``tbc``: Time between chunks
- ``tpot``: Time per output token
- ``e2e``: End-to-end latency

An iteration **passes** only if **all** SLOs are met.

Output structure
----------------

.. code-block:: text

   capacity_search_output/
   └── 08:01:2026-17:25:35-b0fc8e1d/
       ├── config.yml                      # Search configuration
       ├── capacity_search_results.json    # Final results
       └── runs/                           # Individual benchmark runs
           ├── 08:01:2026-17:25:35-f24d1805/
           │   ├── config.yml
           │   ├── metrics/
           │   └── ...
           └── 08:01:2026-17:25:50-4ea72acb/
               └── ...

**capacity_search_results.json** contains the search outcome:

.. code-block:: json

   {
     "traffic_scheduler_type": "concurrent",
     "searched_knob": "traffic_scheduler.target_concurrent_sessions",
     "best_value": 20.0,
     "best_run_dir": "capacity_search_output/.../runs/...",
     "history": [
       {
         "value": 10.0,
         "all_slos_met": true,
         "run_dir": ".../runs/08:01:2026-17:25:35-f24d1805",
         "slo_results": { ... }
       },
       {
         "value": 20.0,
         "all_slos_met": true,
         "run_dir": ".../runs/08:01:2026-17:25:50-4ea72acb",
         "slo_results": { ... }
       }
     ]
   }

WandB integration
-----------------

Enable WandB to track all iterations and get a summary run:

.. code-block:: yaml

   benchmark_config:
     wandb:
       enabled: true
       project: veeksha
       group: cap-search-llama-8b

The summary run includes:

- Final capacity value
- All iterations plotted
- Comparison table
- "BEST_CONFIG" tag on the optimal run

Example: Production capacity planning
-------------------------------------

Find the maximum rate for a latency-sensitive deployment:

.. code-block:: yaml

   output_dir: capacity_search_prod
   start_value: 10.0
   max_value: 500.0
   expansion_factor: 2.0
   max_iterations: 15
   precision: 1

   benchmark_config:
     traffic_scheduler:
       type: rate
       interval_generator:
         type: poisson

     session_generator:
       type: trace
       trace_file: production_traffic.jsonl
       flavor:
         type: timed_synthetic_session

     client:
       type: openai_chat_completions
       api_base: http://prod-server:8000/v1
       model: meta-llama/Llama-3-70B-Instruct

     runtime:
       benchmark_timeout: 120
       max_sessions: -1

     evaluators:
       - type: performance
         slos:
           - name: "P99 TTFT < 2s"
             metric: ttfc
             percentile: 0.99
             value: 2.0
           - name: "P50 TBC < 30ms"
             metric: tbc
             percentile: 0.50
             value: 0.03

This uses real production traces and strict SLOs for accurate capacity planning.

Example: Concurrency capacity search
-------------------------------------

Find the maximum concurrency for throughput-oriented deployments:

.. code-block:: yaml

   # capacity_concurrent.veeksha.yml
   output_dir: capacity_search_output

   start_value: 4
   max_value: 128
   expansion_factor: 2.0
   max_iterations: 15
   precision: 0              # Integer concurrency values

   benchmark_config:
     traffic_scheduler:
       type: concurrent
       # target_concurrent_sessions set by capacity search
       cancel_session_on_failure: false

     session_generator:
       type: synthetic
       session_graph:
         type: single_request
       channels:
         - type: text
           body_length_generator:
             type: fixed
             value: 512
       output_spec:
         text:
           output_length_generator:
             type: fixed
             value: 256

     client:
       type: openai_chat_completions
       api_base: http://localhost:8000/v1
       model: meta-llama/Llama-3-8B-Instruct

     runtime:
       benchmark_timeout: 60
       max_sessions: -1

     evaluators:
       - type: performance
         slos:
           - name: "P99 TTFC < 5s"
             metric: ttfc
             percentile: 0.99
             value: 5.0
             type: constant
           - name: "P99 TBC < 100ms"
             metric: tbc
             percentile: 0.99
             value: 0.1
             type: constant

.. code-block:: bash

   uvx veeksha capacity-search --config capacity_concurrent.veeksha.yml

.. tip::

   Use ``precision: 0`` for concurrency searches (integer values) and
   ``precision: 2`` for rate searches (fractional rates).
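
To make the two-phase procedure described under *How it works* concrete, here
is a minimal, self-contained Python sketch. It is not Veeksha's implementation:
the function name ``capacity_search``, the ``meets_slos`` callback (standing in
for "run one benchmark and check every SLO"), and the stopping rule (stop when
the midpoint can no longer be distinguished from either bound at the configured
``precision``, or when ``max_iterations`` is reached) are assumptions made for
illustration. Only the parameter names mirror the configuration keys above.

.. code-block:: python

   from typing import Callable, List, Optional, Tuple


   def capacity_search(
       meets_slos: Callable[[float], bool],  # hypothetical: one benchmark run
       start_value: float = 5.0,
       max_value: float = 100.0,
       expansion_factor: float = 2.0,
       max_iterations: int = 20,
       precision: int = 2,
   ) -> Tuple[Optional[float], List[Tuple[float, bool]]]:
       """Return (best passing value or None, history of (value, passed))."""
       history: List[Tuple[float, bool]] = []
       best: Optional[float] = None  # highest value seen to pass
       fail: Optional[float] = None  # lowest value seen to fail

       # Phase 1: exponential probing until a failure or max_value is reached.
       value = round(start_value, precision)
       while len(history) < max_iterations:
           passed = meets_slos(value)
           history.append((value, passed))
           if not passed:
               fail = value
               break
           best = value
           if value >= max_value:
               break
           value = round(min(value * expansion_factor, max_value), precision)

       # Phase 2: binary search between the last pass and the first fail.
       while best is not None and fail is not None and len(history) < max_iterations:
           mid = round((best + fail) / 2, precision)
           if mid <= best or mid >= fail:
               break  # bounds are adjacent at this precision (assumed rule)
           passed = meets_slos(mid)
           history.append((mid, passed))
           if passed:
               best = mid
           else:
               fail = mid

       return best, history


   if __name__ == "__main__":
       # Pretend the server sustains up to 37 sessions/second.
       best, history = capacity_search(lambda rate: rate <= 37.0, max_iterations=8)
       for value, passed in history:
           print(f"Rate {value} -> {'PASS' if passed else 'FAIL'}")
       print("Best passing rate:", best)

With a pretend capacity of 37 sessions/second and a budget of eight
iterations, the sketch visits the same values as the trace in *How it works*
(5, 10, 20, 40, then 30, 35, 37.5, 36.25). A larger iteration budget would keep
narrowing the interval until the bounds are adjacent at the chosen precision.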
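
Once a search has finished, the ``capacity_search_results.json`` file described
under *Output structure* can be inspected with a few lines of Python. The
helper names ``load_results`` and ``summarize`` below are hypothetical, and the
sketch assumes the file has the shape shown in that section (``best_value``,
``searched_knob``, and a ``history`` list whose entries carry ``value`` and
``all_slos_met``).

.. code-block:: python

   import json
   from pathlib import Path


   def load_results(search_dir: str) -> dict:
       """Read capacity_search_results.json from one search output directory."""
       path = Path(search_dir) / "capacity_search_results.json"
       return json.loads(path.read_text())


   def summarize(results: dict) -> str:
       """Render the best value and a PASS/FAIL line for each iteration."""
       lines = [f"best_value={results['best_value']} ({results['searched_knob']})"]
       for entry in results["history"]:
           status = "PASS" if entry["all_slos_met"] else "FAIL"
           lines.append(f"  {entry['value']}: {status}")
       return "\n".join(lines)

Calling ``print(summarize(load_results(path)))`` with the path of a timestamped
search directory gives a quick overview before opening the individual run
directories listed in ``run_dir``.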