Configuration System

Veeksha uses a flexible polymorphic configuration system that supports YAML files, CLI arguments, and programmatic access. This guide explains how the system works and how to navigate it effectively.

Configuration methods

YAML Files (recommended)

Create a .veeksha.yml file with your configuration:

seed: 42
client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: my-model
traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 10.0

CLI Arguments

Override any option using dot notation:

uvx veeksha benchmark \
    --client.api_base http://localhost:8000/v1 \
    --traffic_scheduler.interval_generator.arrival_rate 20.0

Argument names mirror the YAML hierarchy with dots.
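Conceptually, each dotted argument is a path into the nested YAML mapping. A minimal sketch of that mapping in Python (a hypothetical helper, not Veeksha's actual CLI parser):

```python
def apply_override(config: dict, dotted_key: str, value) -> None:
    """Walk the dotted path and set the leaf value in a nested dict."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})  # create intermediate mappings as needed
    node[leaf] = value

config = {"traffic_scheduler": {"interval_generator": {"arrival_rate": 10.0}}}
apply_override(config, "traffic_scheduler.interval_generator.arrival_rate", 20.0)
# config["traffic_scheduler"]["interval_generator"]["arrival_rate"] is now 20.0
```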

Combined (YAML + CLI)

CLI arguments override YAML values:

# Base config from file, override arrival rate
uvx veeksha benchmark \
    --config base.veeksha.yml \
    --traffic_scheduler.interval_generator.arrival_rate 30.0

Polymorphic options

Many options have a type field that selects a variant with its own options:

# Session generator can be: synthetic, trace, or lmeval
session_generator:
  type: synthetic        # Selects synthetic variant
  session_graph:         # Options specific to synthetic
    type: linear
  channels:
    - type: text

# Traffic scheduler can be: rate or concurrent
traffic_scheduler:
  type: rate             # Selects rate variant
  interval_generator:    # Options specific to rate
    type: poisson
    arrival_rate: 10.0

Each type exposes different options. See the Configuration Reference for the full list.
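The pattern behind this is a tagged union: the type field picks a variant class, and the remaining keys become that variant's options. A sketch with illustrative classes (not Veeksha's real ones):

```python
# Sketch of type-based variant selection; class names and fields are
# illustrative, not Veeksha's actual implementation.
from dataclasses import dataclass, field

@dataclass
class RateScheduler:
    interval_generator: dict = field(default_factory=dict)

@dataclass
class ConcurrentScheduler:
    target_concurrent_sessions: int = 1
    rampup_seconds: int = 0

VARIANTS = {"rate": RateScheduler, "concurrent": ConcurrentScheduler}

def build_traffic_scheduler(cfg: dict):
    cfg = dict(cfg)                  # copy so the caller's dict is untouched
    cls = VARIANTS[cfg.pop("type")]  # `type` selects the variant class
    return cls(**cfg)                # remaining keys are variant-specific options

sched = build_traffic_scheduler(
    {"type": "concurrent", "target_concurrent_sessions": 8, "rampup_seconds": 10}
)
```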

Exporting JSON schema

Export a JSON schema for YAML IDE autocompletion and linting:

uvx veeksha benchmark --export-json-schema veeksha-schema.json

Configure your IDE to use this schema. In VSCode and forks:

// .vscode/settings.json
{
    "yaml.schemas": {
        "./veeksha-schema.json": "*.veeksha.yml"
    },
    "yaml.customTags": [
        "!expand sequence"
    ]
}

Hint

A YAML IDE extension may be required for yaml.schemas to be recognized as a valid setting.

VSCode YAML integration example

[Screenshot: the VSCode YAML extension providing autocompletion and documentation on hover.]

Common configuration sections

client - API endpoint configuration

client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct
  # api_key: optional, falls back to OPENAI_API_KEY env var
  request_timeout: 300
  max_tokens_param: max_completion_tokens
  min_tokens_param: min_tokens

traffic_scheduler - Traffic pattern

# Rate-based
traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 10.0
  cancel_session_on_failure: true

# OR Concurrency-based
traffic_scheduler:
  type: concurrent
  target_concurrent_sessions: 8
  rampup_seconds: 10
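For intuition on the rate scheduler: a Poisson arrival process with rate λ has exponentially distributed inter-arrival gaps with mean 1/λ, so arrival_rate: 10.0 means roughly one new session every 100 ms on average. A small sketch (not Veeksha's implementation):

```python
import random

def poisson_intervals(arrival_rate: float, n: int, seed: int = 42) -> list[float]:
    """Sample n inter-arrival gaps (seconds) of a Poisson process."""
    rng = random.Random(seed)
    return [rng.expovariate(arrival_rate) for _ in range(n)]

gaps = poisson_intervals(10.0, 100_000)
mean_gap = sum(gaps) / len(gaps)  # close to 1 / 10.0 = 0.1 s
```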

session_generator - Content generation

session_generator:
  type: synthetic
  session_graph:
    type: linear
    num_request_generator:
      type: uniform
      min: 1
      max: 5
    inherit_history: true
  channels:
    - type: text
      body_length_generator:
        type: uniform
        min: 100
        max: 500
  output_spec:
    text:
      output_length_generator:
        type: uniform
        min: 50
        max: 200

runtime - Execution parameters

runtime:
  benchmark_timeout: 300      # Total benchmark duration
  max_sessions: 1000          # Maximum sessions (-1 = unlimited)
  post_timeout_grace_seconds: 10  # Wait for in-flight after timeout
  num_client_threads: 3       # Async HTTP client threads

evaluators - Metrics collection

evaluators:
  - type: performance
    target_channels: ["text"]
    stream_metrics: true
    slos:
      - name: "P99 TTFC"
        metric: ttfc
        percentile: 0.99
        value: 0.5
        type: constant

Environment variables

Veeksha automatically reads certain environment variables as fallbacks when configuration values are not explicitly set:

OPENAI_API_KEY

Used as the API key if client.api_key is not set in config.

OPENAI_API_BASE

Used as the API base URL if client.api_base is not set in config.

This allows you to set credentials once in your environment:

export OPENAI_API_KEY=your-api-key
export OPENAI_API_BASE=http://localhost:8000/v1

Then omit them from your config file:

# No need to specify api_key or api_base
client:
  type: openai_chat_completions
  model: meta-llama/Llama-3-8B-Instruct

This is especially useful for:

  • Avoiding committing secrets to version control

  • Sharing configs across environments with different servers

Veeksha also reads HF_TOKEN from the environment in order to access gated models.
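The fallback order amounts to: explicit config wins, environment fills the gaps. A sketch of that resolution (hypothetical helper, not Veeksha's actual code):

```python
import os

def resolve_client_settings(cfg: dict) -> dict:
    """Explicit config values win; environment variables only fill gaps."""
    resolved = {k: v for k, v in cfg.items() if v is not None}
    if "api_key" not in resolved and "OPENAI_API_KEY" in os.environ:
        resolved["api_key"] = os.environ["OPENAI_API_KEY"]
    if "api_base" not in resolved and "OPENAI_API_BASE" in os.environ:
        resolved["api_base"] = os.environ["OPENAI_API_BASE"]
    return resolved
```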

Stop conditions

Benchmarks stop when either condition is met:

runtime:
  benchmark_timeout: 300    # Stop after 300 seconds
  max_sessions: 1000        # OR after 1000 sessions

Use -1 for unlimited:

runtime:
  benchmark_timeout: -1     # Run indefinitely
  max_sessions: 500         # Stop only after 500 sessions

When the timeout hits, Veeksha records all in-flight requests and keeps dispatching sessions as usual, then exits once post_timeout_grace_seconds have elapsed (or earlier, if the session limit is reached first).

runtime:
  benchmark_timeout: 60
  post_timeout_grace_seconds: 10  # Wait 10s for in-flight requests
  # -1 = wait indefinitely for all in-flight
  # 0 = exit immediately (cancel in-flight)
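The two limits combine as an OR, with -1 disabling either one. A sketch of the stop check (hypothetical, not Veeksha's scheduler):

```python
def should_stop(elapsed_seconds: float, completed_sessions: int,
                benchmark_timeout: float, max_sessions: int) -> bool:
    """Stop when either limit is hit; -1 disables that limit."""
    if max_sessions != -1 and completed_sessions >= max_sessions:
        return True
    if benchmark_timeout != -1 and elapsed_seconds >= benchmark_timeout:
        return True
    return False
```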

Output directory

Control where results are saved:

output_dir: benchmark_output

Results are saved to a timestamped subdirectory:

benchmark_output/
└── 09:01:2026-10:30:00-a1b2c3d4/
    ├── config.yml
    ├── metrics/
    └── traces/

The subdirectory name includes:

  • Date and time

  • Short hash of the configuration (for uniqueness)
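A short config hash like this can be produced by hashing a canonical serialization of the config, so the same config always maps to the same suffix. One way to do it (a sketch; Veeksha's exact hashing scheme is not specified here):

```python
import hashlib
import json

def config_hash(config: dict, length: int = 8) -> str:
    """Deterministic short hash of a config dict."""
    canonical = json.dumps(config, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode()).hexdigest()[:length]

h = config_hash({"seed": 42, "client": {"model": "my-model"}})
```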

Trace recording

Control what’s recorded for debugging:

trace_recorder:
  enabled: true          # Write trace file
  include_content: false # Exclude prompt/response content (smaller files)

Set include_content: true to record full request content for debugging.

Validation

Veeksha validates configurations at startup:

  • Type checking for all fields

  • Enum validation for type fields

  • Required field checking

  • Cross-field validation (e.g., min <= max)

Invalid configurations produce clear error messages:

ConfigurationError: traffic_scheduler.interval_generator.arrival_rate
must be positive, got -5.0
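Checks in this style can be sketched as small validators that raise with the offending path and value (illustrative, not Veeksha's actual validator):

```python
class ConfigurationError(Exception):
    """Raised when a config value fails validation."""

def validate_positive(path: str, value: float) -> None:
    """Mirror the style of the error message shown above."""
    if value <= 0:
        raise ConfigurationError(f"{path} must be positive, got {value}")

def validate_uniform(path: str, cfg: dict) -> None:
    """Cross-field check: a uniform generator needs min <= max."""
    if cfg.get("type") == "uniform" and cfg["min"] > cfg["max"]:
        raise ConfigurationError(
            f"{path}: min ({cfg['min']}) must be <= max ({cfg['max']})"
        )
```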

Splitting configuration across files

For better organization and reusability, you can split your configuration across multiple YAML files using the !include tag. This is useful when you want to:

  • Reuse client configuration across different benchmarks

  • Keep environment-specific settings (e.g., API endpoints) separate

  • Share traffic patterns across experiments

Example: Separate client and traffic configs

Create client.yml with just client settings:

# client.yml
type: openai_chat_completions
api_base: http://localhost:8000/v1
model: meta-llama/Llama-3-8B-Instruct

Create traffic.yml with traffic settings:

# traffic.yml
type: rate
interval_generator:
  type: poisson
  arrival_rate: 5.0

Create a main config that includes both:

# main_config.yml
seed: 42

client: !include client.yml
traffic_scheduler: !include traffic.yml

session_generator:
  type: synthetic
  channels:
    - type: text
      body_length_generator:
        type: uniform
        min: 50
        max: 200

runtime:
  benchmark_timeout: 60

Run the benchmark with a single --config flag:

uvx veeksha benchmark --config main_config.yml
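Under the hood, an !include tag can be implemented as a custom PyYAML constructor that loads the referenced file relative to the including file. A sketch of that idea (assumes PyYAML; not necessarily Veeksha's actual loader):

```python
import os
import yaml

class IncludeLoader(yaml.SafeLoader):
    """SafeLoader subclass so the !include constructor stays scoped here."""

def _include(loader: IncludeLoader, node: yaml.Node):
    # loader.name is the including file's path when loading from a file object
    base = os.path.dirname(os.path.abspath(loader.name))
    path = os.path.join(base, loader.construct_scalar(node))
    with open(path) as f:
        return yaml.load(f, IncludeLoader)  # nested includes resolve recursively

IncludeLoader.add_constructor("!include", _include)
```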

CLI overrides still work

You can override any value from the included files using CLI arguments:

uvx veeksha benchmark \
    --config main_config.yml \
    --client.model llama-70b  # Override model from client.yml

Workload recipes

This section shows complete, runnable configurations for common benchmarking scenarios. Each recipe is a standalone YAML file — copy it, point api_base at your server, and run.

Replay a request log (CSV or JSONL)

Replay a simple trace of independent requests with just token length distributions. Works with CSV files directly — no conversion needed:

# trace_request_log.veeksha.yml
seed: 42

session_generator:
  type: trace
  trace_file: sharegpt_8k_filtered.csv   # CSV or JSONL
  wrap_mode: true
  flavor:
    type: request_log

traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 10.0

client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct

runtime:
  benchmark_timeout: 120
  max_sessions: -1

evaluators:
  - type: performance
    target_channels: ["text"]

The trace file needs input_length and output_length columns (or the common alternatives num_prefill_tokens / num_decode_tokens, which are auto-normalized). Each row becomes an independent single-request session.
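The auto-normalization mentioned above amounts to renaming alias headers before rows are consumed. A sketch of that step (hypothetical helper; Veeksha handles this internally):

```python
import csv
import io

# Alias columns normalized to the canonical names
ALIASES = {"num_prefill_tokens": "input_length",
           "num_decode_tokens": "output_length"}

def read_request_log(csv_text: str) -> list[dict]:
    """Parse a request-log CSV, renaming alias columns to canonical names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{ALIASES.get(col, col): val for col, val in row.items()}
            for row in reader]

rows = read_request_log("num_prefill_tokens,num_decode_tokens\n128,64\n")
```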

Trace flavors:

request_log

Independent requests with just token lengths. No session structure, no corpus files. Supports CSV and JSONL. Best for replaying public benchmarking datasets (ShareGPT, etc.).

timed_synthetic_session

Timed session traces with synthetic content. Supports DAG replay through session_context and context caching via page_size. Best for testing KV-cache reuse across linear and non-linear sessions.

untimed_content_multi_turn

Replay conversation datasets with actual message content (ShareGPT, LMSYS-Chat, etc.). Configurable message schema for different dataset formats.

shared_prefix

Multi-turn sessions built around a shared common prefix. Uses hash-based deterministic content generation with a configurable block_size.

rag

Single-turn retrieval-augmented generation. Includes num_documents warmup documents. Good for testing long-context prefill.

Replay conversation datasets

Replay datasets with actual conversation content (e.g. ShareGPT):

# conversation_replay.veeksha.yml
seed: 42

session_generator:
  type: trace
  trace_file: sharegpt_52k.jsonl
  wrap_mode: true
  flavor:
    type: untimed_content_multi_turn
    conversation_column: conversations
    role_key: from
    content_key: value
    user_role_value: human
    assistant_role_value: gpt

traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 10.0

client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct

runtime:
  benchmark_timeout: 120
  max_sessions: -1

evaluators:
  - type: performance
    target_channels: ["text"]

Each row in the trace file should contain a conversations column (configurable) with a list of message dicts. The schema keys (role_key, content_key, user_role_value, assistant_role_value) can be customized for different dataset formats. For LMSYS-Chat format, use role_key: role, content_key: content, user_role_value: user, assistant_role_value: assistant.
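The schema keys boil down to a per-message role mapping. A sketch of that mapping (hypothetical helper, defaults matching the ShareGPT schema above):

```python
def to_chat_messages(conversation, role_key="from", content_key="value",
                     user_role_value="human", assistant_role_value="gpt"):
    """Map dataset-specific message dicts onto standard chat roles.
    Defaults match ShareGPT; pass LMSYS-Chat keys to override."""
    role_map = {user_role_value: "user", assistant_role_value: "assistant"}
    return [{"role": role_map[m[role_key]], "content": m[content_key]}
            for m in conversation if m[role_key] in role_map]

messages = to_chat_messages([
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello!"},
])
```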

Replay timed multi-turn traces

Replay timed multi-turn coding assistant traces with context caching:

# trace_timed_synthetic_session.veeksha.yml
seed: 42

session_generator:
  type: trace
  trace_file: traces/timed_synthetic_trace.jsonl
  wrap_mode: true
  flavor:
    type: timed_synthetic_session
    corpus_file: traces/corpus.txt
    page_size: 16

traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 5.0

client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct
  request_timeout: 120
  max_tokens_param: max_completion_tokens

runtime:
  benchmark_timeout: 120
  max_sessions: -1

Replay shared-prefix traces

# trace_shared_prefix.veeksha.yml
session_generator:
  type: trace
  trace_file: traces/shared_prefix_trace.jsonl
  flavor:
    type: shared_prefix
    corpus_file: traces/corpus.txt
    block_size: 512

traffic_scheduler:
  type: concurrent
  target_concurrent_sessions: 32
  rampup_seconds: 10

client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct

runtime:
  benchmark_timeout: 300
  max_sessions: -1

Multi-turn conversations (synthetic)

Generate multi-turn sessions with history accumulation and shared prefixes:

# multi_turn.veeksha.yml
seed: 42

session_generator:
  type: synthetic
  session_graph:
    type: linear
    inherit_history: true       # Each turn includes prior turns as context
    num_request_generator:
      type: uniform
      min: 2
      max: 4                    # 2-4 turns per session
    request_wait_generator:
      type: poisson
      arrival_rate: 5           # ~200ms think time between turns
  channels:
    - type: text
      shared_prefix_ratio: 0.2           # 20% of prompt is shared
      shared_prefix_probability: 0.5     # 50% of sessions share a prefix
      body_length_generator:
        type: uniform
        min: 100
        max: 500
  output_spec:
    text:
      output_length_generator:
        type: uniform
        min: 50
        max: 200

traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 10.0

client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct

runtime:
  benchmark_timeout: 60
  max_sessions: -1

evaluators:
  - type: performance
    target_channels: ["text"]
    slos:
      - name: "P99 TTFC under 500ms"
        metric: ttfc
        percentile: 0.99
        value: 0.5
        type: constant

Agentic workloads (branching sessions)

Simulate agentic tool-calling patterns with fan-out/fan-in DAG structure:

# agentic.veeksha.yml
seed: 42

session_generator:
  type: synthetic
  session_graph:
    type: branching
    num_layers_generator:
      type: uniform
      min: 3
      max: 5
    layer_width_generator:
      type: uniform
      min: 2
      max: 6
    fan_out_generator:
      type: uniform
      min: 1
      max: 5
    fan_in_generator:
      type: uniform
      min: 1
      max: 4
    connection_dist_generator:
      type: uniform
      min: 1
      max: 2          # Allow skip connections
    single_root: true
    inherit_history: true
    request_wait_generator:
      type: poisson
      arrival_rate: 3
  channels:
    - type: text
      body_length_generator:
        type: uniform
        min: 50
        max: 200
  output_spec:
    text:
      output_length_generator:
        type: uniform
        min: 100
        max: 300

traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 5.0

client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct

runtime:
  max_sessions: 100
  benchmark_timeout: 120

evaluators:
  - type: performance
    target_channels: ["text"]

LM-Eval accuracy benchmarks

Run standardized evaluation tasks from the lm-evaluation-harness:

# lmeval.veeksha.yml
seed: 42

session_generator:
  type: lmeval
  tasks: ["triviaqa", "truthfulqa_gen"]
  num_fewshot: 0

traffic_scheduler:
  type: concurrent
  target_concurrent_sessions: 4
  rampup_seconds: 0
  cancel_session_on_failure: false

evaluators:
  - type: performance
    target_channels: ["text"]
  - type: accuracy_lmeval
    bootstrap_iters: 200

client:
  type: openai_completions            # Note: completions, not chat
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct
  request_timeout: 240
  max_tokens_param: max_tokens
  additional_sampling_params: '{"temperature": 0}'

runtime:
  max_sessions: 40
  benchmark_timeout: 1200

Note

LM-Eval uses openai_completions (not openai_chat_completions) for generation tasks. The accuracy_lmeval evaluator computes task-specific metrics alongside the standard performance evaluator.

Throughput saturation test

Push the server to maximum throughput using closed-loop concurrency:

# throughput.veeksha.yml
seed: 42

session_generator:
  type: synthetic
  session_graph:
    type: single_request
  channels:
    - type: text
      body_length_generator:
        type: fixed
        value: 512
  output_spec:
    text:
      output_length_generator:
        type: fixed
        value: 256

traffic_scheduler:
  type: concurrent
  target_concurrent_sessions: 32
  rampup_seconds: 10

client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct

runtime:
  benchmark_timeout: 120
  max_sessions: -1

evaluators:
  - type: performance
    target_channels: ["text"]

See also