Server Management
=================

Veeksha can automatically launch and manage LLM inference servers, making benchmarks fully self-contained and reproducible. This is especially useful for CI pipelines and for comparing different server configurations.

Supported servers
-----------------

Veeksha currently supports:

- **Vajra**
- **SGLang**
- **vLLM**

Basic configuration
-------------------

Add a ``server`` section to your benchmark config:

.. code-block:: yaml

   server:
     type: sglang          # or vllm, vajra
     env_path: sglang_env  # Python environment with server installed
     model: meta-llama/Llama-3-8B-Instruct
     host: localhost
     port: 30000

   # Client settings are automatically configured by the server
   client:
     type: openai_chat_completions
     request_timeout: 120

When ``server`` is configured, Veeksha:

1. Launches the server before the benchmark
2. Waits for the server to become healthy
3. Automatically sets ``client.api_base``, ``client.model``, and ``client.api_key``
4. Runs the benchmark
5. Shuts down the server when the run completes
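Because the server lifecycle is handled for you, a managed benchmark run is a single command. A minimal sketch, assuming the config above is saved as ``benchmark.veeksha.yml`` (the file name is illustrative):

.. code-block:: bash

   # Veeksha launches the server, waits for it to become healthy,
   # runs the benchmark, and shuts the server down afterwards
   uvx veeksha benchmark --config benchmark.veeksha.yml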
Server configuration options
----------------------------

All server types share these common options:

.. code-block:: yaml

   server:
     type: sglang
     env_path: /path/to/sglang_env  # Python environment
     model: meta-llama/Llama-3-8B-Instruct

     # Network settings
     host: localhost
     port: 30000
     api_key: token-abc123  # Generated API key

     # GPU configuration
     gpu_ids: [0, 1]                # Specific GPUs (null = auto-assign)
     tensor_parallel_size: 2        # Number of GPUs for TP
     require_contiguous_gpus: true  # Require consecutive GPU IDs

     # Model settings
     dtype: auto          # float16, bfloat16, or auto
     max_model_len: 8192  # Maximum context length

     # Startup settings
     startup_timeout: 300        # Seconds to wait for server
     health_check_interval: 2.0  # Seconds between health checks

     # Additional server arguments
     additional_args: '{"enable_prefix_caching": true}'

``env_path``
   Path to a Python virtual environment or conda environment containing the server installation. Can be relative or absolute.

``gpu_ids``
   Explicit list of GPU IDs to use. If ``null``, GPUs are auto-assigned based on availability and ``tensor_parallel_size``.

``additional_args``
   JSON string or dict of extra arguments passed to the server command.

GPU resource management
-----------------------

Veeksha includes a resource manager for multi-GPU systems:

**Auto-assignment**

.. code-block:: yaml

   server:
     type: vllm
     tensor_parallel_size: 4
     gpu_ids: null                  # Auto-assign 4 GPUs
     require_contiguous_gpus: true

The resource manager finds 4 contiguous available GPUs.

**Explicit assignment**

.. code-block:: yaml

   server:
     type: sglang
     tensor_parallel_size: 2
     gpu_ids: [2, 3]  # Use GPUs 2 and 3 specifically

**Non-contiguous GPUs** (when supported)

.. code-block:: yaml

   server:
     type: vllm
     tensor_parallel_size: 2
     gpu_ids: [0, 2]  # Use GPUs 0 and 2
     require_contiguous_gpus: false

Server logs
-----------

Server stdout/stderr are written to the benchmark output directory:

.. code-block:: text

   benchmark_output/09:01:2026-10:30:00-abc123/
   ├── server_logs_vajra_localhost_30003_20260109-110406.log
   └── ...

This is useful for debugging server issues.

Example: Full managed benchmark
-------------------------------

.. code-block:: yaml

   # managed_benchmark.veeksha.yml
   seed: 42
   output_dir: benchmark_output

   server:
     type: sglang
     env_path: ~/envs/sglang
     model: meta-llama/Llama-3-8B-Instruct
     host: localhost
     port: 30000
     tensor_parallel_size: 1
     max_model_len: 8192
     startup_timeout: 300

   client:
     type: openai_chat_completions
     request_timeout: 120
     max_tokens_param: max_tokens
     min_tokens_param: min_tokens

   traffic_scheduler:
     type: rate
     interval_generator:
       type: poisson
       arrival_rate: 10.0

   session_generator:
     type: synthetic
     session_graph:
       type: linear
       inherit_history: true
     channels:
       - type: text
         body_length_generator:
           type: uniform
           min: 100
           max: 500
     output_spec:
       text:
         output_length_generator:
           type: uniform
           min: 100
           max: 300

   runtime:
     benchmark_timeout: 60
     max_sessions: -1

   evaluators:
     - type: performance
       target_channels: ["text"]

Example: Comparing servers
--------------------------

Create a base config and run it with different servers:

.. code-block:: yaml

   # base_config.yml
   session_generator:
     type: synthetic
     session_graph:
       type: linear
     channels:
       - type: text
         body_length_generator:
           type: fixed
           value: 512
     output_spec:
       text:
         output_length_generator:
           type: fixed
           value: 256

   traffic_scheduler:
     type: concurrent
     target_concurrent_sessions: 8
     rampup_seconds: 5

   runtime:
     benchmark_timeout: 120

.. code-block:: bash

   # Run with vLLM
   uvx veeksha benchmark \
     --config base_config.yml \
     --server.type vllm \
     --server.env_path vllm_env \
     --server.model meta-llama/Llama-3-8B-Instruct \
     --output_dir results/vllm

   # Run with SGLang
   uvx veeksha benchmark \
     --config base_config.yml \
     --server.type sglang \
     --server.env_path sglang_env \
     --server.model meta-llama/Llama-3-8B-Instruct \
     --output_dir results/sglang
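Each run writes its artifacts, including the managed server's logs, to its own ``--output_dir``, so the two configurations can be inspected side by side afterwards. A minimal sketch of one way to do that, assuming only the directory layout and ``server_logs_*`` naming shown above:

.. code-block:: bash

   # Each output_dir contains one timestamped run directory per benchmark
   ls -R results/vllm results/sglang

   # Scan both servers' logs for errors before comparing the results
   grep -ri error results/ --include='server_logs_*.log'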