Benchmark Types
===============

This page maps common benchmark shapes to their canonical Veeksha patterns so you can quickly see how Veeksha fits the way you benchmark today. It is not exhaustive; Veeksha is composable, so the same building blocks can be combined into many other valid configurations.

Pick the benchmark
------------------

.. list-table::
   :header-rows: 1
   :widths: 45 55

   * - If you want to measure...
     - Use this in Veeksha
   * - Open-loop request-rate latency
     - ``benchmark`` with ``single_request`` sessions and ``rate`` traffic
   * - Closed-loop fixed-concurrency throughput
     - ``benchmark`` with ``single_request`` sessions and ``concurrent`` traffic
   * - TTFC vs prompt length
     - ``prefill``
   * - TBT/TPOT vs batch size
     - ``decode``
   * - Throughput vs latency curve
     - ``stress``
   * - Max sustainable rate or concurrency under SLOs
     - ``capacity-search``
   * - Replay a request log or conversation dataset
     - ``benchmark`` with ``trace`` sessions

For most request-level benchmarks, ``benchmark`` is the right command. Veeksha models traffic as sessions, but ``single_request`` sessions make it behave like a traditional request dispatcher. The examples below show canonical starting points rather than the only possible configurations. More specialized workload patterns appear later on this page.

Open-loop request-rate latency test
-----------------------------------

Use this when you would normally run a fixed-QPS or Poisson-arrival benchmark.

.. code-block:: yaml

   # rate_single_request.veeksha.yml
   client:
     type: openai_chat_completions
     api_base: http://localhost:8000/v1
     model: meta-llama/Llama-3-8B-Instruct
   session_generator:
     type: synthetic
     session_graph:
       type: single_request
     channels:
       - type: text
         body_length_generator:
           type: fixed
           value: 256
     output_spec:
       text:
         output_length_generator:
           type: fixed
           value: 128
   traffic_scheduler:
     type: rate
     interval_generator:
       type: poisson
       arrival_rate: 5.0
   runtime:
     benchmark_timeout: 60
     max_sessions: -1
   evaluators:
     - type: performance
       target_channels: ["text"]

.. code-block:: bash

   uvx -p 3.14t veeksha benchmark --config rate_single_request.veeksha.yml

Closed-loop fixed-concurrency throughput test
---------------------------------------------

Use this when you want to hold a target concurrency and push for throughput.

.. code-block:: yaml

   # concurrent_single_request.veeksha.yml
   client:
     type: openai_chat_completions
     api_base: http://localhost:8000/v1
     model: meta-llama/Llama-3-8B-Instruct
   session_generator:
     type: synthetic
     session_graph:
       type: single_request
     channels:
       - type: text
         body_length_generator:
           type: fixed
           value: 512
     output_spec:
       text:
         output_length_generator:
           type: fixed
           value: 256
   traffic_scheduler:
     type: concurrent
     target_concurrent_sessions: 16
     rampup_seconds: 10
   runtime:
     benchmark_timeout: 120
     max_sessions: -1
   evaluators:
     - type: performance
       target_channels: ["text"]

.. code-block:: bash

   uvx -p 3.14t veeksha benchmark --config concurrent_single_request.veeksha.yml

TTFC vs prompt length
---------------------

Use this when you want isolated prefill measurements.

.. code-block:: bash

   uvx -p 3.14t veeksha prefill \
     --api_base http://localhost:8000/v1 \
     --model meta-llama/Llama-3-8B-Instruct \
     --input_lengths 128 256 512 1024 2048 \
     --output_tokens 1 \
     --samples_per_length 10 \
     --output_dir microbench_output

This sweeps prompt length and keeps decode minimal so you can see how TTFC scales with prefill work.

TBT/TPOT vs batch size
----------------------

Use this when you want isolated decode measurements.

.. code-block:: bash

   uvx -p 3.14t veeksha decode \
     --api_base http://localhost:8000/v1 \
     --model meta-llama/Llama-3-8B-Instruct \
     --batch_sizes 1 2 4 8 16 \
     --input_lengths 128 512 \
     --samples_per_length 20 \
     --engine_chunk_size 512 \
     --output_dir microbench_output

This measures steady-state decode behavior as batching increases.

Throughput vs latency curve
---------------------------

Use this when you want the classic operating curve for one fixed request shape.

.. code-block:: bash

   uvx -p 3.14t veeksha stress \
     --api_base http://localhost:8000/v1 \
     --model meta-llama/Llama-3-8B-Instruct \
     --input_length 512 \
     --output_length 256 \
     --mode.type manual \
     --mode.concurrency_levels 1 2 4 8 16 32 \
     --point_duration 120 \
     --warmup_duration 10 \
     --output_dir microbench_output

This gives you throughput, end-to-end latency, TTFC, and interactivity at each concurrency level.

Max sustainable load under SLOs
-------------------------------

Use this when you want Veeksha to find the highest passing rate or concurrency automatically.

.. code-block:: yaml

   # capacity_search.veeksha.yml
   output_dir: capacity_search_output
   start_value: 5.0
   max_value: 100.0
   expansion_factor: 2.0
   precision: 1
   benchmark_config:
     client:
       type: openai_chat_completions
       api_base: http://localhost:8000/v1
       model: meta-llama/Llama-3-8B-Instruct
     session_generator:
       type: synthetic
       session_graph:
         type: single_request
       channels:
         - type: text
           body_length_generator:
             type: fixed
             value: 256
       output_spec:
         text:
           output_length_generator:
             type: fixed
             value: 128
     traffic_scheduler:
       type: rate
       interval_generator:
         type: poisson
     runtime:
       benchmark_timeout: 60
       max_sessions: -1
     evaluators:
       - type: performance
         target_channels: ["text"]
   slos:
     - name: "P99 TTFC < 500ms"
       metric: ttfc
       percentile: 0.99
       value: 0.5
       type: constant
     - name: "P99 TBC < 50ms"
       metric: tbc
       percentile: 0.99
       value: 0.05
       type: constant

.. code-block:: bash

   uvx -p 3.14t veeksha capacity-search --config capacity_search.veeksha.yml

For concurrency instead of rate, change ``benchmark_config.traffic_scheduler.type`` to ``concurrent`` and set ``precision: 0``.

Replay a request log
--------------------

Use this when you already have a CSV or JSONL file with input and output lengths.

.. code-block:: yaml

   # replay_request_log.veeksha.yml
   client:
     type: openai_chat_completions
     api_base: http://localhost:8000/v1
     model: meta-llama/Llama-3-8B-Instruct
   session_generator:
     type: trace
     trace_file: requests.csv
     wrap_mode: true
     flavor:
       type: request_log
   traffic_scheduler:
     type: rate
     interval_generator:
       type: poisson
       arrival_rate: 10.0
   runtime:
     benchmark_timeout: 120
     max_sessions: -1
   evaluators:
     - type: performance
       target_channels: ["text"]

.. code-block:: bash

   uvx -p 3.14t veeksha benchmark --config replay_request_log.veeksha.yml

Your trace file should contain ``input_length`` and ``output_length`` columns. If you need a multi-turn conversation trace, a timed session trace, a shared-prefix trace, or a RAG trace, see :doc:`/user_guide/trace_flavors` for a flavor-by-flavor comparison and minimal trace examples.

Multi-turn conversations (synthetic)
------------------------------------

Use this when you want generated multi-turn chat rather than independent requests:

.. code-block:: yaml

   session_generator:
     type: synthetic
     session_graph:
       type: linear
       inherit_history: true
       num_request_generator:
         type: uniform
         min: 2
         max: 4

Everything else stays the same. This turns a request benchmark into a real conversation benchmark.
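
For orientation, here is a sketch of what the combined file might look like when the ``linear`` session graph above is dropped into the open-loop example from earlier on this page. The file name is illustrative, and the lengths and arrival rate are simply the values reused from that example:

.. code-block:: yaml

   # multi_turn_synthetic.veeksha.yml (illustrative name)
   client:
     type: openai_chat_completions
     api_base: http://localhost:8000/v1
     model: meta-llama/Llama-3-8B-Instruct
   session_generator:
     type: synthetic
     session_graph:
       type: linear            # multi-turn chain instead of single_request
       inherit_history: true   # later turns see earlier turns' context
       num_request_generator:
         type: uniform
         min: 2
         max: 4
     channels:
       - type: text
         body_length_generator:
           type: fixed
           value: 256
     output_spec:
       text:
         output_length_generator:
           type: fixed
           value: 128
   traffic_scheduler:
     type: rate
     interval_generator:
       type: poisson
       arrival_rate: 5.0
   runtime:
     benchmark_timeout: 60
     max_sessions: -1
   evaluators:
     - type: performance
       target_channels: ["text"]

Each arriving session now issues 2 to 4 sequential requests that inherit the conversation history, rather than a single independent request.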

.. _workload-recipes:

Advanced workload patterns
--------------------------

These examples cover more specialized benchmark types. Treat them as canonical starting points, not as an exhaustive list of every supported configuration. Unless noted otherwise, run them with:

.. code-block:: bash

   uvx -p 3.14t veeksha benchmark --config <config>.veeksha.yml

For trace-based workloads beyond simple request-log replay, including conversation datasets, timed multi-turn traces, and shared-prefix traces, see :doc:`/user_guide/trace_flavors`.

Agentic workloads (branching sessions)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Simulate agentic tool-calling patterns with a fan-out/fan-in DAG structure:

.. code-block:: yaml

   # agentic.veeksha.yml
   seed: 42
   session_generator:
     type: synthetic
     session_graph:
       type: branching
       num_layers_generator:
         type: uniform
         min: 3
         max: 5
       layer_width_generator:
         type: uniform
         min: 2
         max: 6
       fan_out_generator:
         type: uniform
         min: 1
         max: 5
       fan_in_generator:
         type: uniform
         min: 1
         max: 4
       connection_dist_generator:
         type: uniform
         min: 1
         max: 2  # Allow skip connections
       single_root: true
       inherit_history: true
       request_wait_generator:
         type: poisson
         arrival_rate: 3
     channels:
       - type: text
         body_length_generator:
           type: uniform
           min: 50
           max: 200
     output_spec:
       text:
         output_length_generator:
           type: uniform
           min: 100
           max: 300
   traffic_scheduler:
     type: rate
     interval_generator:
       type: poisson
       arrival_rate: 5.0
   client:
     type: openai_chat_completions
     api_base: http://localhost:8000/v1
     model: meta-llama/Llama-3-8B-Instruct
   runtime:
     max_sessions: 100
     benchmark_timeout: 120
   evaluators:
     - type: performance
       target_channels: ["text"]

LM-Eval accuracy benchmarks
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Run standardized evaluation tasks from the `lm-evaluation-harness <https://github.com/EleutherAI/lm-evaluation-harness>`_:

.. code-block:: yaml

   # lmeval.veeksha.yml
   seed: 42
   session_generator:
     type: lmeval
     tasks: ["triviaqa", "truthfulqa_gen"]
     num_fewshot: 0
   traffic_scheduler:
     type: concurrent
     target_concurrent_sessions: 4
     rampup_seconds: 0
   cancel_session_on_failure: false
   evaluators:
     - type: performance
       target_channels: ["text"]
     - type: accuracy_lmeval
       bootstrap_iters: 200
   client:
     type: openai_completions  # Note: completions, not chat
     api_base: http://localhost:8000/v1
     model: meta-llama/Llama-3-8B-Instruct
     request_timeout: 240
     max_tokens_param: max_tokens
     additional_sampling_params: '{"temperature": 0}'
   runtime:
     max_sessions: 40
     benchmark_timeout: 1200

.. note::

   LM-Eval uses ``openai_completions`` (not ``openai_chat_completions``) for generation tasks. The ``accuracy_lmeval`` evaluator computes task-specific metrics alongside the standard performance evaluator.

See also
--------

- :doc:`quick_start` for a first end-to-end benchmark run
- :doc:`/user_guide/configuration` for the full config model
- :doc:`/user_guide/trace_flavors` for trace flavor details and input formats
- :doc:`/user_guide/microbenchmarks` for full prefill, decode, and stress details
- :doc:`/user_guide/capacity_search` for full capacity-search behavior