Benchmark Types¶
This page maps common benchmark shapes to their canonical Veeksha patterns so you can quickly see how Veeksha fits the way you benchmark today. It is not exhaustive; Veeksha is composable, so the same building blocks can be combined into many other valid configurations.
Pick the benchmark¶
| If you want to measure… | Use this in Veeksha |
|---|---|
| Open-loop request-rate latency | veeksha benchmark with a rate traffic scheduler |
| Closed-loop fixed-concurrency throughput | veeksha benchmark with a concurrent traffic scheduler |
| TTFC vs prompt length | veeksha prefill microbenchmark |
| TBT/TPOT vs batch size | veeksha decode microbenchmark |
| Throughput vs latency curve | veeksha stress microbenchmark |
| Max sustainable rate or concurrency under SLOs | veeksha capacity-search |
| Replay a request log or conversation dataset | veeksha benchmark with a trace session generator |
For most request-level benchmarks, benchmark is the right command. Veeksha
models traffic as sessions, but single_request sessions make it behave like
a traditional request dispatcher.
The examples below show canonical starting points rather than the only possible configurations. More specialized workload patterns appear later on this page.
Open-loop request-rate latency test¶
Use this when you would normally run a fixed-QPS or Poisson-arrival benchmark.
# rate_single_request.veeksha.yml
client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct
session_generator:
  type: synthetic
  session_graph:
    type: single_request
  channels:
    - type: text
      body_length_generator:
        type: fixed
        value: 256
  output_spec:
    text:
      output_length_generator:
        type: fixed
        value: 128
traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 5.0
runtime:
  benchmark_timeout: 60
  max_sessions: -1
evaluators:
  - type: performance
    target_channels: ["text"]
uvx -p 3.14t veeksha benchmark --config rate_single_request.veeksha.yml
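The config above uses Poisson arrivals. If you want strictly periodic fixed-QPS arrivals instead, the interval_generator block is the knob to change. The snippet below is only a sketch: it assumes a fixed interval generator exists by analogy with the fixed length generators used elsewhere on this page; check the Configuration System page for the actual generator names and parameters.
# Sketch only: swap Poisson arrivals for periodic arrivals.
# The generator type and parameter names here are assumptions, not confirmed config keys.
traffic_scheduler:
  type: rate
  interval_generator:
    type: fixed      # assumed name, mirroring the fixed length generators above
    value: 0.2       # one request every 0.2 s, i.e. roughly 5 QPS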
Closed-loop fixed-concurrency throughput test¶
Use this when you want to hold a target concurrency and push for throughput.
# concurrent_single_request.veeksha.yml
client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct
session_generator:
  type: synthetic
  session_graph:
    type: single_request
  channels:
    - type: text
      body_length_generator:
        type: fixed
        value: 512
  output_spec:
    text:
      output_length_generator:
        type: fixed
        value: 256
traffic_scheduler:
  type: concurrent
  target_concurrent_sessions: 16
  rampup_seconds: 10
runtime:
  benchmark_timeout: 120
  max_sessions: -1
evaluators:
  - type: performance
    target_channels: ["text"]
uvx -p 3.14t veeksha benchmark --config concurrent_single_request.veeksha.yml
TTFC vs prompt length¶
Use this when you want isolated prefill measurements.
uvx -p 3.14t veeksha prefill \
  --api_base http://localhost:8000/v1 \
  --model meta-llama/Llama-3-8B-Instruct \
  --input_lengths 128 256 512 1024 2048 \
  --output_tokens 1 \
  --samples_per_length 10 \
  --output_dir microbench_output
This sweeps prompt length and keeps decode minimal so you can see how TTFC scales with prefill work.
TBT/TPOT vs batch size¶
Use this when you want isolated decode measurements.
uvx -p 3.14t veeksha decode \
  --api_base http://localhost:8000/v1 \
  --model meta-llama/Llama-3-8B-Instruct \
  --batch_sizes 1 2 4 8 16 \
  --input_lengths 128 512 \
  --samples_per_length 20 \
  --engine_chunk_size 512 \
  --output_dir microbench_output
This measures steady-state decode behavior as batching increases.
Throughput vs latency curve¶
Use this when you want the classic operating curve for one fixed request shape.
uvx -p 3.14t veeksha stress \
  --api_base http://localhost:8000/v1 \
  --model meta-llama/Llama-3-8B-Instruct \
  --input_length 512 \
  --output_length 256 \
  --mode.type manual \
  --mode.concurrency_levels 1 2 4 8 16 32 \
  --point_duration 120 \
  --warmup_duration 10 \
  --output_dir microbench_output
This gives you throughput, end-to-end latency, TTFC, and interactivity at each concurrency level.
Max sustainable load under SLOs¶
Use this when you want Veeksha to find the highest passing rate or concurrency automatically.
# capacity_search.veeksha.yml
output_dir: capacity_search_output
start_value: 5.0
max_value: 100.0
expansion_factor: 2.0
precision: 1
benchmark_config:
  client:
    type: openai_chat_completions
    api_base: http://localhost:8000/v1
    model: meta-llama/Llama-3-8B-Instruct
  session_generator:
    type: synthetic
    session_graph:
      type: single_request
    channels:
      - type: text
        body_length_generator:
          type: fixed
          value: 256
    output_spec:
      text:
        output_length_generator:
          type: fixed
          value: 128
  traffic_scheduler:
    type: rate
    interval_generator:
      type: poisson
  runtime:
    benchmark_timeout: 60
    max_sessions: -1
  evaluators:
    - type: performance
      target_channels: ["text"]
slos:
  - name: "P99 TTFC < 500ms"
    metric: ttfc
    percentile: 0.99
    value: 0.5
    type: constant
  - name: "P99 TBC < 50ms"
    metric: tbc
    percentile: 0.99
    value: 0.05
    type: constant
uvx -p 3.14t veeksha capacity-search --config capacity_search.veeksha.yml
For concurrency instead of rate, change benchmark_config.traffic_scheduler.type
to concurrent and set precision: 0.
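Concretely, the delta looks roughly like the sketch below. It assumes the searched value then plays the role of the concurrency target, the same way arrival_rate is left unset in the rate-based config above; only the changed keys of capacity_search.veeksha.yml are shown.
# Sketch: capacity search over concurrency instead of request rate.
precision: 0                  # concurrency is an integer, so search at whole-number precision
benchmark_config:
  traffic_scheduler:
    type: concurrent          # searched value supplies the concurrency level (assumption),
                              # analogous to arrival_rate being left unset in rate mode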
Replay a request log¶
Use this when you already have a CSV or JSONL file with input and output lengths.
# replay_request_log.veeksha.yml
client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct
session_generator:
  type: trace
  trace_file: requests.csv
  wrap_mode: true
  flavor:
    type: request_log
traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 10.0
runtime:
  benchmark_timeout: 120
  max_sessions: -1
evaluators:
  - type: performance
    target_channels: ["text"]
uvx -p 3.14t veeksha benchmark --config replay_request_log.veeksha.yml
Your trace file should contain input_length and output_length columns.
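For example, a minimal requests.csv for this flavor could look like the sample below. The two column names come from the requirement above; the rows are purely illustrative, and a real trace can carry whatever lengths your workload produced.
input_length,output_length
512,128
1024,256
256,64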
If you need a multi-turn conversation trace, a timed session trace, a
shared-prefix trace, or a RAG trace, see Trace Flavors for a
flavor-by-flavor comparison and minimal trace examples.
Multi-turn conversations (synthetic)¶
Use this when you want generated multi-turn chat rather than independent requests:
session_generator:
  type: synthetic
  session_graph:
    type: linear
    inherit_history: true
    num_request_generator:
      type: uniform
      min: 2
      max: 4
Everything else stays the same. This turns a request benchmark into a real conversation benchmark.
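For example, dropping that snippet into the open-loop config from the start of this page gives a session_generator block roughly like the following (channel and output lengths carried over from that example; adjust them to your workload):
session_generator:
  type: synthetic
  session_graph:
    type: linear
    inherit_history: true
    num_request_generator:
      type: uniform
      min: 2
      max: 4
  channels:
    - type: text
      body_length_generator:
        type: fixed
        value: 256
  output_spec:
    text:
      output_length_generator:
        type: fixed
        value: 128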
Advanced workload patterns¶
These examples cover more specialized benchmark types. Treat them as canonical starting points, not as an exhaustive list of every supported configuration.
Unless noted otherwise, run them with:
uvx -p 3.14t veeksha benchmark --config <file>.veeksha.yml
For trace-based workloads beyond simple request-log replay, including conversation datasets, timed multi-turn traces, and shared-prefix traces, see Trace Flavors.
Agentic workloads (branching sessions)¶
Simulate agentic tool-calling patterns with fan-out/fan-in DAG structure:
# agentic.veeksha.yml
seed: 42
session_generator:
  type: synthetic
  session_graph:
    type: branching
    num_layers_generator:
      type: uniform
      min: 3
      max: 5
    layer_width_generator:
      type: uniform
      min: 2
      max: 6
    fan_out_generator:
      type: uniform
      min: 1
      max: 5
    fan_in_generator:
      type: uniform
      min: 1
      max: 4
    connection_dist_generator:
      type: uniform
      min: 1
      max: 2  # Allow skip connections
    single_root: true
    inherit_history: true
    request_wait_generator:
      type: poisson
      arrival_rate: 3
  channels:
    - type: text
      body_length_generator:
        type: uniform
        min: 50
        max: 200
  output_spec:
    text:
      output_length_generator:
        type: uniform
        min: 100
        max: 300
traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 5.0
client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct
runtime:
  max_sessions: 100
  benchmark_timeout: 120
evaluators:
  - type: performance
    target_channels: ["text"]
LM-Eval accuracy benchmarks¶
Run standardized evaluation tasks from the lm-evaluation-harness:
# lmeval.veeksha.yml
seed: 42
session_generator:
  type: lmeval
  tasks: ["triviaqa", "truthfulqa_gen"]
  num_fewshot: 0
traffic_scheduler:
  type: concurrent
  target_concurrent_sessions: 4
  rampup_seconds: 0
cancel_session_on_failure: false
evaluators:
  - type: performance
    target_channels: ["text"]
  - type: accuracy_lmeval
    bootstrap_iters: 200
client:
  type: openai_completions  # Note: completions, not chat
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct
  request_timeout: 240
  max_tokens_param: max_tokens
  additional_sampling_params: '{"temperature": 0}'
runtime:
  max_sessions: 40
  benchmark_timeout: 1200
Note
LM-Eval uses openai_completions (not openai_chat_completions) for
generation tasks. The accuracy_lmeval evaluator computes task-specific
metrics alongside the standard performance evaluator.
See also¶
- Quick start for a first end-to-end benchmark run
- Configuration System for the full config model
- Trace Flavors for trace flavor details and input formats
- Microbenchmarks for full prefill, decode, and stress details
- Capacity Search for full capacity-search behavior