Weights & Biases Integration
Veeksha integrates with Weights & Biases (WandB) for experiment tracking, metric visualization, and artifact storage. This guide covers how to enable and use the integration.
Enabling WandB
Add a wandb section to your configuration:
wandb:
  enabled: true
  project: my-llm-benchmarks
Run the benchmark as usual:
uvx veeksha benchmark --config my_benchmark.veeksha.yml
Veeksha will:
Initialize a WandB run
Log metrics throughout the benchmark
Upload artifacts at completion
Provide a link to the run dashboard
Configuration options
wandb:
  enabled: true  # Enable WandB logging
  project: veeksha  # WandB project name
  entity: my-team  # WandB entity (team/user), optional
  group: capacity-search-1  # Group related runs together
  run_name: null  # Custom run name (default: output dir name)
  tags: ["production", "llama-8b"]  # Tags for filtering
  notes: "Testing new server config"  # Run description
  mode: null  # "online", "offline", or "disabled"
  log_artifacts: true  # Upload output files as artifacts
Key options:
- project
WandB project name. Can also be set via the WANDB_PROJECT environment variable.
- entity
Team or user account. Defaults to your default WandB entity.
- group
Groups related runs (e.g., all runs in a sweep or capacity search).
- tags
List of tags for filtering runs in the WandB UI.
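As a rough guide to what these options control, most of them line up with keyword arguments of `wandb.init`. The sketch below assumes Veeksha forwards them more or less directly, which this guide does not confirm. `to_init_kwargs` and the mapping table are hypothetical and only illustrate how options left as null could fall back to WandB's own defaults.

```python
from typing import Any

# Keys of the `wandb:` config section and the wandb.init keyword each one
# would correspond to. This mapping is an assumption made for illustration;
# only `run_name` differs from the wandb.init keyword (`name`).
_OPTION_TO_KWARG = {
    "project": "project",
    "entity": "entity",
    "group": "group",
    "run_name": "name",
    "tags": "tags",
    "notes": "notes",
    "mode": "mode",
}


def to_init_kwargs(section: dict[str, Any]) -> dict[str, Any]:
    """Translate a `wandb:` config section into keyword arguments for wandb.init.

    Options that are absent or null (None) are omitted, so WandB can fall back
    to its defaults and to environment variables such as WANDB_PROJECT.
    """
    return {
        kwarg: section[option]
        for option, kwarg in _OPTION_TO_KWARG.items()
        if section.get(option) is not None
    }


section = {
    "enabled": True,
    "project": "veeksha",
    "entity": "my-team",
    "group": "capacity-search-1",
    "run_name": None,
    "tags": ["production", "llama-8b"],
    "mode": None,
}
init_kwargs = to_init_kwargs(section)
print(init_kwargs)
```

Keys such as `enabled` and `log_artifacts` are left out of the mapping because they have no `wandb.init` counterpart.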
What gets logged
- Scalar Metrics
Summary statistics are logged as WandB metrics:
Request/session counts
Error rates
Throughput (tokens/second)
Observed dispatch rate
- SLO Results
If SLOs are configured, their pass/fail status and observed values.
- Configuration
The full resolved configuration is logged (with secrets redacted).
- Artifacts
When log_artifacts: true, these files are uploaded:
config.yml - Configuration
metrics/*.json - All JSON metrics
metrics/*.csv - Percentile distributions
metrics/*.png - Distribution plots
health_check_results.txt - Verification results
Using with advanced features
WandB logging also works with Veeksha’s parameter sweeps and capacity search. For details on these workflows, see the corresponding documentation:
- Parameter Sweeps
When running sweeps with the !expand tag, use group to organize all sweep runs together. See Configuration Sweeps for details.
- Capacity Search
Capacity search automatically creates WandB runs for each iteration and tags the best configuration. See Capacity Search for details.
Viewing results in WandB
After a run completes, open the provided URL:
wandb: 🚀 View run at https://wandb.ai/my-team/veeksha/runs/abc123
In the WandB dashboard:
- Overview Tab
Summary metrics, configuration, and run metadata.
- Charts Tab
Visualizations of logged metrics over time.
- Artifacts Tab
Download output files (metrics, plots, traces).
- Files Tab
Browse uploaded files directly.
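The files shown in the Artifacts tab can also be fetched programmatically. The following is a minimal sketch using the WandB public API (`wandb.Api().run`, `run.logged_artifacts()`, `artifact.download()`). `run_path_from_url` and `download_run_artifacts` are hypothetical helper names, and the URL layout is assumed to match the example link above.

```python
from pathlib import Path
from urllib.parse import urlparse


def run_path_from_url(url: str) -> str:
    """Convert a run URL into the "entity/project/run_id" path used by wandb.Api().run."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected layout (assumed from the example link): entity/project/runs/run_id
    if len(parts) != 4 or parts[2] != "runs":
        raise ValueError(f"Unrecognized run URL: {url}")
    entity, project, _, run_id = parts
    return f"{entity}/{project}/{run_id}"


def download_run_artifacts(run_url: str, dest: str = "wandb_artifacts") -> list[Path]:
    """Download every artifact logged by one run and return the local directories."""
    import wandb  # lazy import: only needed when talking to the WandB API

    run = wandb.Api().run(run_path_from_url(run_url))
    downloaded = []
    for artifact in run.logged_artifacts():
        # Artifact names carry a version suffix such as ":v0"; replace ":" so
        # the name is safe to use as a directory name on every platform.
        target = Path(dest) / artifact.name.replace(":", "_")
        downloaded.append(Path(artifact.download(root=str(target))))
    return downloaded
```

Calling `download_run_artifacts("https://wandb.ai/my-team/veeksha/runs/abc123")` requires a valid API key and network access.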
Filtering and comparing runs
Use tags and group names to filter runs:
Filter by tag: tags:production
Filter by group: group:capacity-search-1
Compare runs: Select multiple runs and use the comparison view
Create custom charts to compare metrics across runs:
TTFC p99 vs arrival rate
Throughput vs concurrency
Error rate trends
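The same filtering and comparison can be scripted outside the UI. Below is a minimal sketch using `wandb.Api().runs` with a group and tag filter, then collecting summary metrics into rows. The function names are hypothetical, and the metric keys you pass in depend on what Veeksha actually logs, which this guide does not list.

```python
from typing import Any, Iterable


def fetch_runs(project_path: str, group: str, tag: str):
    """Fetch runs from "entity/project" that belong to a group and carry a tag."""
    import wandb  # lazy import: only needed when talking to the WandB API

    # Filters use the MongoDB-style query syntax accepted by wandb.Api().runs.
    filters = {"group": group, "tags": {"$in": [tag]}}
    return wandb.Api().runs(project_path, filters=filters)


def comparison_rows(runs: Iterable[Any], metric_keys: list[str]) -> list[dict[str, Any]]:
    """Build one row per run holding its name and the requested summary metrics.

    Each run only needs a `.name` and a dict-like `.summary`, which is what the
    objects returned by wandb.Api().runs provide. Metrics a run did not log
    are reported as None rather than raising.
    """
    rows = []
    for run in runs:
        row = {"run": run.name}
        for key in metric_keys:
            row[key] = run.summary.get(key)
        rows.append(row)
    return rows
```

For example, `comparison_rows(fetch_runs("my-team/veeksha", "capacity-search-1", "production"), ["ttfc_p99", "throughput"])` would return one dictionary per run. The two metric keys are placeholders, so check a run's Overview tab for the names that were actually logged.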
Offline mode
For environments without internet access:
wandb:
  enabled: true
  mode: offline
Runs are saved locally to a wandb/ directory and can be synced later, once network access is available:
wandb sync benchmark_output/*/wandb/
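When many benchmark outputs have accumulated, it can help to sync each offline run individually so that one failure is easy to spot. The sketch below assumes the layout implied by the command above (`benchmark_output/<run>/wandb/`) and WandB's usual `offline-run-*` directory naming. `find_offline_runs` and `sync_offline_runs` are hypothetical helpers, not part of Veeksha.

```python
import subprocess
from pathlib import Path


def find_offline_runs(output_root: str) -> list[Path]:
    """Return offline run directories under <output_root>/*/wandb/, sorted by path."""
    candidates = Path(output_root).glob("*/wandb/offline-run-*")
    return sorted(p for p in candidates if p.is_dir())


def sync_offline_runs(output_root: str) -> None:
    """Run `wandb sync` on each offline run directory, stopping at the first failure."""
    for run_dir in find_offline_runs(output_root):
        # check=True raises CalledProcessError if the wandb CLI exits non-zero.
        subprocess.run(["wandb", "sync", str(run_dir)], check=True)
```

Calling `sync_offline_runs("benchmark_output")` requires the `wandb` CLI on the PATH and a valid API key.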
Environment variables
WandB reads its standard environment variables. Set them to supply values, such as your API key, outside the config file:
export WANDB_API_KEY=your-api-key
See WandB Environment Variables for the full list.
Example: Complete WandB config
seed: 42
wandb:
  enabled: true
  project: llm-benchmarks
  entity: ml-team
  group: weekly-regression
  tags: ["regression", "llama-3-8b", "vllm-0.4"]
  notes: "Weekly regression test for production config"
  log_artifacts: true
client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3-8B-Instruct
traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 10.0
session_generator:
  type: synthetic
  session_graph:
    type: linear
    inherit_history: true
  channels:
    - type: text
      body_length_generator:
        type: uniform
        min: 100
        max: 500
evaluators:
  - type: performance
    target_channels: ["text"]
    slos:
      - name: "P99 TTFC"
        metric: ttfc
        percentile: 0.99
        value: 0.5
        type: constant
runtime:
  benchmark_timeout: 300
  max_sessions: -1
See also
Configuration Sweeps - Running parameter sweeps
Capacity Search - Capacity search with WandB