Weights & Biases Integration

Veeksha integrates with Weights & Biases (WandB) for experiment tracking, metric visualization, and artifact storage. This guide covers how to enable and use the integration.

Enabling WandB

Add a wandb section to your configuration:

wandb:
  enabled: true
  project: my-llm-benchmarks

Run the benchmark as usual:

uvx veeksha benchmark --config my_benchmark.veeksha.yml

Veeksha will:

  1. Initialize a WandB run

  2. Log metrics throughout the benchmark

  3. Upload artifacts at completion

  4. Provide a link to the run dashboard
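The four steps above roughly correspond to the standard WandB run lifecycle. The following is a minimal sketch of that lifecycle, not Veeksha's actual implementation: the function name track_benchmark, its parameters, and the artifact name are hypothetical, and only wandb.init, run.log, wandb.Artifact, run.log_artifact, run.url, and run.finish are real WandB client calls.

```python
def track_benchmark(project, resolved_config, metric_batches, output_files,
                    wandb_module=None):
    """Sketch of the init -> log -> upload -> link lifecycle (hypothetical helper)."""
    if wandb_module is None:
        # Imported lazily so the sketch can be read and tested without the
        # client installed; a real run needs `pip install wandb`.
        import wandb as wandb_module

    # 1. Initialize a WandB run, attaching the configuration to it.
    run = wandb_module.init(project=project, config=resolved_config)
    try:
        # 2. Log metrics throughout the benchmark (one dict per batch).
        for batch in metric_batches:
            run.log(batch)

        # 3. Upload output files as a single artifact at completion.
        #    The artifact name and type are illustrative assumptions.
        artifact = wandb_module.Artifact(name="benchmark-output",
                                         type="benchmark-output")
        for path in output_files:
            artifact.add_file(str(path))
        run.log_artifact(artifact)

        # 4. Return the dashboard link for the run.
        return run.url
    finally:
        # Always close the run, even if logging fails part-way.
        run.finish()
```

The wandb_module parameter exists only so the flow can be exercised with a stand-in object; in normal use it is left as None and the real client is imported.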

Configuration options

wandb:
  enabled: true              # Enable WandB logging
  project: veeksha           # WandB project name
  entity: my-team            # WandB entity (team/user), optional
  group: capacity-search-1   # Group related runs together
  run_name: null             # Custom run name (default: output dir name)
  tags: ["production", "llama-8b"]  # Tags for filtering
  notes: "Testing new server config"  # Run description
  mode: null                 # "online", "offline", or "disabled"
  log_artifacts: true        # Upload output files as artifacts

Key options:

project

WandB project name. Can also be set via the WANDB_PROJECT environment variable.

entity

Team or user account. Defaults to your default WandB entity.

group

Groups related runs (e.g., all runs in a sweep or capacity search).

tags

List of tags for filtering runs in the WandB UI.
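To make the option semantics concrete, the sketch below shows one way a wandb section could be translated into keyword arguments for wandb.init. The helper wandb_init_kwargs is hypothetical and the mapping is an assumption based on the option descriptions above (in particular, run_name falling back to the output directory name); how Veeksha performs this translation internally may differ.

```python
from pathlib import Path


def wandb_init_kwargs(section, output_dir):
    """Map a `wandb:` config section to wandb.init() keyword arguments.

    Hypothetical helper: unset (null) options are dropped so that WandB
    applies its own defaults, e.g. the default entity.
    """
    kwargs = {
        "project": section.get("project"),
        "entity": section.get("entity"),
        "group": section.get("group"),
        # Documented default: the run name is the output directory name.
        "name": section.get("run_name") or Path(output_dir).name,
        "tags": list(section.get("tags") or []),
        "notes": section.get("notes"),
        "mode": section.get("mode"),
    }
    # Drop anything left unset so it is not passed explicitly.
    return {k: v for k, v in kwargs.items() if v is not None and v != []}


section = {"project": "veeksha", "run_name": None, "tags": ["production"]}
print(wandb_init_kwargs(section, "benchmark_output/run-001"))
```

The resulting dictionary could then be passed as wandb.init(**kwargs); project, entity, group, name, tags, notes, and mode are all parameters that wandb.init accepts.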

What gets logged

Scalar Metrics

Summary statistics are logged as WandB metrics:

  • Request/session counts

  • Error rates

  • Throughput (tokens/second)

  • Observed dispatch rate

SLO Results

If SLOs are configured, their pass/fail status and observed values are logged.

Configuration

The full resolved configuration is logged (with secrets redacted).
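The exact redaction rules are not described here. As an illustration only, the sketch below shows one common approach: walking the configuration recursively and masking string values whose key name looks sensitive. The marker list, the redact function, and the api_key field in the example are all hypothetical and not taken from Veeksha.

```python
# Hypothetical substrings that mark a key as sensitive.
SENSITIVE_MARKERS = ("key", "token", "secret", "password")


def redact(value, parent_key=""):
    """Return a copy of `value` with sensitive string values masked."""
    if isinstance(value, dict):
        return {k: redact(v, str(k)) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v, parent_key) for v in value]
    looks_sensitive = any(m in parent_key.lower() for m in SENSITIVE_MARKERS)
    if isinstance(value, str) and looks_sensitive:
        return "***REDACTED***"
    # Non-string values (numbers, booleans, null) are left untouched.
    return value


config = {
    "client": {"api_key": "sk-123", "model": "meta-llama/Llama-3-8B-Instruct"},
    "wandb": {"tags": ["regression"]},
}
print(redact(config))
```

Matching on key names is deliberately conservative about values: only strings are masked, so numeric settings whose names happen to contain a marker are preserved.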

Artifacts

When log_artifacts: true, these files are uploaded:

  • config.yml - Configuration

  • metrics/*.json - All JSON metrics

  • metrics/*.csv - Percentile distributions

  • metrics/*.png - Distribution plots

  • health_check_results.txt - Verification results
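To preview which files in an output directory match the list above before a run uploads them, the patterns can be expressed as globs. This is a sketch under the assumption that the listed paths are relative to the benchmark output directory; collect_artifact_files is a hypothetical helper, not part of Veeksha.

```python
from pathlib import Path

# Patterns copied from the list above, relative to the output directory.
ARTIFACT_PATTERNS = [
    "config.yml",
    "metrics/*.json",
    "metrics/*.csv",
    "metrics/*.png",
    "health_check_results.txt",
]


def collect_artifact_files(output_dir):
    """Return the files under `output_dir` that match the upload patterns."""
    root = Path(output_dir)
    files = []
    for pattern in ARTIFACT_PATTERNS:
        # Sort within each pattern for a stable, readable order.
        files.extend(sorted(p for p in root.glob(pattern) if p.is_file()))
    return files
```

For example, collect_artifact_files("benchmark_output/run-001") would return the matching Path objects for that (hypothetical) directory, and an empty list if none of the files exist yet.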

Using with advanced features

WandB logging also works with Veeksha’s parameter sweeps and capacity search, summarized below.

Parameter Sweeps

When running sweeps with the !expand tag, use group to organize all sweep runs together. See Configuration Sweeps for details.

Capacity Search

Capacity search automatically creates WandB runs for each iteration and tags the best configuration. See Capacity Search for details.

Viewing results in WandB

After a run completes, open the provided URL:

wandb: 🚀 View run at https://wandb.ai/my-team/veeksha/runs/abc123

In the WandB dashboard:

Overview Tab

Summary metrics, configuration, and run metadata.

Charts Tab

Visualizations of logged metrics over time.

Artifacts Tab

Download output files (metrics, plots, traces).

Files Tab

Browse uploaded files directly.

Filtering and comparing runs

Use tags and group names to filter runs:

  • Filter by tag: tags:production

  • Filter by group: group:capacity-search-1

  • Compare runs: Select multiple and use the comparison view

Create custom charts to compare metrics across runs:

  • TTFC p99 vs arrival rate

  • Throughput vs concurrency

  • Error rate trends
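The same filtering can be done programmatically with the WandB public API, which is useful for exporting comparisons. The sketch below assumes the MongoDB-style filter syntax accepted by wandb.Api().runs; the helper names and the summary key in the usage comment (ttfc_p99) are hypothetical, since the exact metric names Veeksha logs are not listed here.

```python
def run_filters(tags=(), group=None):
    """Build a filter dict selecting runs by tag and/or group."""
    clauses = []
    if tags:
        # Match runs carrying at least one of the given tags.
        clauses.append({"tags": {"$in": list(tags)}})
    if group:
        clauses.append({"group": group})
    if not clauses:
        return {}
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


def summarize_runs(path, filters, metric_keys):
    """Fetch matching runs and pull selected summary metrics (needs network)."""
    import wandb  # requires `pip install wandb` and a configured API key

    rows = []
    for run in wandb.Api().runs(path=path, filters=filters):
        row = {"name": run.name, "tags": run.tags}
        for key in metric_keys:
            row[key] = run.summary.get(key)  # None if the key was not logged
        rows.append(row)
    return rows


# Usage (hypothetical project path and metric key):
# rows = summarize_runs("my-team/veeksha",
#                       run_filters(tags=["production"], group="capacity-search-1"),
#                       ["ttfc_p99"])
```

Check the filter keys against the WandB API reference before relying on them, as the supported fields are defined by the WandB backend rather than by Veeksha.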

Offline mode

For environments without internet access:

wandb:
  enabled: true
  mode: offline

Runs are saved locally to a wandb/ directory under the benchmark output directory and can be synced later:

wandb sync benchmark_output/*/wandb/
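When many offline runs have accumulated, it can help to list them before syncing. The sketch below relies on the WandB client's convention of naming offline run directories offline-run-* and assumes the benchmark_output/*/wandb/ layout from the command above; both helper functions are hypothetical.

```python
import subprocess
from pathlib import Path


def find_offline_runs(output_root):
    """Return offline run directories under <output_root>/*/wandb/."""
    root = Path(output_root)
    return sorted(p for p in root.glob("*/wandb/offline-run-*") if p.is_dir())


def sync_offline_runs(output_root):
    """Sync each offline run with the `wandb sync` CLI (needs network access)."""
    for run_dir in find_offline_runs(output_root):
        # check=True raises CalledProcessError if a sync fails.
        subprocess.run(["wandb", "sync", str(run_dir)], check=True)


# Usage:
# for run_dir in find_offline_runs("benchmark_output"):
#     print(run_dir)
# sync_offline_runs("benchmark_output")
```

Syncing one directory at a time makes it easier to see which run failed if the connection drops part-way through.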

Environment variables

WandB reads its standard environment variables, so credentials and defaults such as WANDB_API_KEY and WANDB_PROJECT can be set in the environment instead of in the config:

export WANDB_API_KEY=your-api-key

See WandB Environment Variables for the full list.

Example: Complete WandB config

seed: 42

wandb:
  enabled: true
  project: llm-benchmarks
  entity: ml-team
  group: weekly-regression
  tags: ["regression", "llama-3-8b", "vllm-0.4"]
  notes: "Weekly regression test for production config"
  log_artifacts: true

client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct

traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 10.0

session_generator:
  type: synthetic
  session_graph:
    type: linear
    inherit_history: true
  channels:
    - type: text
      body_length_generator:
        type: uniform
        min: 100
        max: 500

evaluators:
  - type: performance
    target_channels: ["text"]
    slos:
      - name: "P99 TTFC"
        metric: ttfc
        percentile: 0.99
        value: 0.5
        type: constant

runtime:
  benchmark_timeout: 300
  max_sessions: -1
