# Server Management
Veeksha can automatically launch and manage LLM inference servers, making benchmarks fully self-contained and reproducible. This is especially useful for CI pipelines and comparing different server configurations.
## Supported servers

Veeksha currently supports:

- Vajra
- SGLang
- vLLM
## Basic configuration

Add a `server` section to your benchmark config:

```yaml
server:
  type: sglang            # or vllm, vajra
  env_path: sglang_env    # Python environment with server installed
  model: meta-llama/Llama-3-8B-Instruct
  host: localhost
  port: 30000

# Client settings are automatically configured by the server
client:
  type: openai_chat_completions
  request_timeout: 120
```
When a server is configured:

1. Veeksha launches the server before the benchmark
2. Waits for the server to be healthy
3. Automatically sets `client.api_base`, `client.model`, and `client.api_key`
4. Runs the benchmark
5. Shuts down the server when complete
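For reference, the resulting client configuration is roughly equivalent to writing the connection details out by hand. The sketch below assumes the managed server exposes an OpenAI-compatible endpoint at `http://<host>:<port>/v1`; the exact base URL Veeksha derives may differ, so check the effective config or the server logs if in doubt.

```yaml
# Approximate effect of the auto-configuration for the server above
# (assumption: OpenAI-compatible endpoint under /v1)
client:
  type: openai_chat_completions
  api_base: http://localhost:30000/v1     # derived from server.host and server.port
  model: meta-llama/Llama-3-8B-Instruct   # copied from server.model
  api_key: token-abc123                   # the server's generated API key, if one is set
  request_timeout: 120                    # taken from the client section you wrote yourself
```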
## Server configuration options

All server types share these common options:

```yaml
server:
  type: sglang
  env_path: /path/to/sglang_env    # Python environment
  model: meta-llama/Llama-3-8B-Instruct

  # Network settings
  host: localhost
  port: 30000
  api_key: token-abc123            # Generated API key

  # GPU configuration
  gpu_ids: [0, 1]                  # Specific GPUs (null = auto-assign)
  tensor_parallel_size: 2          # Number of GPUs for TP
  require_contiguous_gpus: true    # Require consecutive GPU IDs

  # Model settings
  dtype: auto                      # float16, bfloat16, or auto
  max_model_len: 8192              # Maximum context length

  # Startup settings
  startup_timeout: 300             # Seconds to wait for server
  health_check_interval: 2.0       # Seconds between health checks

  # Additional server arguments
  additional_args: '{"enable_prefix_caching": true}'
```
- `env_path`: Path to a Python virtual environment or conda environment containing the server installation. Can be relative or absolute.
- `gpu_ids`: Explicit list of GPU IDs to use. If `null`, GPUs are auto-assigned based on availability and `tensor_parallel_size`.
- `additional_args`: JSON string or dict of extra arguments passed to the server command.
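Since `additional_args` accepts either a JSON string or a dict, extra server flags can also be written as an inline mapping, which is easier to read than an escaped JSON string. A sketch; the flag names below are illustrative and must match options actually supported by the chosen server:

```yaml
server:
  type: vllm
  env_path: vllm_env
  model: meta-llama/Llama-3-8B-Instruct
  # Dict form, equivalent to:
  # additional_args: '{"enable_prefix_caching": true, "gpu_memory_utilization": 0.9}'
  additional_args:
    enable_prefix_caching: true      # illustrative flag name
    gpu_memory_utilization: 0.9      # illustrative flag name
```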
## GPU resource management

Veeksha includes a resource manager for multi-GPU systems:

**Auto-assignment**

```yaml
server:
  type: vllm
  tensor_parallel_size: 4
  gpu_ids: null                   # Auto-assign 4 GPUs
  require_contiguous_gpus: true
```

The resource manager finds 4 contiguous available GPUs.
**Explicit assignment**

```yaml
server:
  type: sglang
  tensor_parallel_size: 2
  gpu_ids: [2, 3]    # Use GPUs 2 and 3 specifically
```
**Non-contiguous GPUs (when supported)**

```yaml
server:
  type: vllm
  tensor_parallel_size: 2
  gpu_ids: [0, 2]                  # Use GPUs 0 and 2
  require_contiguous_gpus: false
```
## Server logs

Server stdout/stderr are written to the benchmark output directory:

```
benchmark_output/09:01:2026-10:30:00-abc123/
├── server_logs_vajra_localhost_30003_20260109-110406.log
└── ...
```

This is useful for debugging server issues.
## Example: Full managed benchmark

```yaml
# managed_benchmark.veeksha.yml
seed: 42
output_dir: benchmark_output

server:
  type: sglang
  env_path: ~/envs/sglang
  model: meta-llama/Llama-3-8B-Instruct
  host: localhost
  port: 30000
  tensor_parallel_size: 1
  max_model_len: 8192
  startup_timeout: 300

client:
  type: openai_chat_completions
  request_timeout: 120
  max_tokens_param: max_tokens
  min_tokens_param: min_tokens

traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 10.0

session_generator:
  type: synthetic
  session_graph:
    type: linear
    inherit_history: true
  channels:
    - type: text
      body_length_generator:
        type: uniform
        min: 100
        max: 500
  output_spec:
    text:
      output_length_generator:
        type: uniform
        min: 100
        max: 300

runtime:
  benchmark_timeout: 60
  max_sessions: -1

evaluators:
  - type: performance
    target_channels: ["text"]
```
## Example: Comparing servers

Create a base config and run it with different servers:

```yaml
# base_config.yml
session_generator:
  type: synthetic
  session_graph:
    type: linear
  channels:
    - type: text
      body_length_generator:
        type: fixed
        value: 512
  output_spec:
    text:
      output_length_generator:
        type: fixed
        value: 256

traffic_scheduler:
  type: concurrent
  target_concurrent_sessions: 8
  rampup_seconds: 5

runtime:
  benchmark_timeout: 120
```
```bash
# Run with vLLM
uvx veeksha benchmark \
  --config base_config.yml \
  --server.type vllm \
  --server.env_path vllm_env \
  --server.model meta-llama/Llama-3.2-1B-Instruct \
  --output_dir results/vllm

# Run with SGLang
uvx veeksha benchmark \
  --config base_config.yml \
  --server.type sglang \
  --server.env_path sglang_env \
  --server.model meta-llama/Llama-3-8B-Instruct \
  --output_dir results/sglang
```