Veeksha Documentation

Veeksha is a high-fidelity benchmarking framework for LLM inference systems. Whether you’re optimizing a production deployment, comparing serving backends, or running capacity planning experiments, Veeksha lets you measure what matters to you: realistic multi-turn conversations, agentic workflows, high-frequency stress tests, or targeted microbenchmarks. One tool, any workload.

From isolated requests to long-running agentic sessions, Veeksha captures the full complexity of modern LLM workloads.

👉 New here? Start with Why Veeksha? to learn what sets Veeksha apart.

Note

Veeksha (वीक्षा) means “observation” or “investigation” in Sanskrit.

Key features

Realistic workload modeling
  • DAG-based sessions: Model multi-turn conversations and complex agentic workflows as directed acyclic graphs with history inheritance, capturing how real chat context accumulates (see the sketch after this list)

  • Shared prefix testing: Generate workloads with configurable prefix sharing to benchmark KV-cache efficiency

  • Trace replay: Replay production traces (Claude Code, RAG, conversational) with preserved timing and token distributions
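
To make the DAG model concrete, here is a minimal sketch of what a branching session could look like. The schema is purely illustrative: session, nodes, parents, and the token-count fields are assumed names, not Veeksha's documented configuration keys.

# Hypothetical sketch -- every key name here is an illustrative assumption.
# Three requests forming a small DAG: two follow-ups branch off the same
# first turn and both inherit its accumulated history.
session:
  nodes:
    - id: turn_1
      prompt_tokens: 512
      output_tokens: 128
    - id: summarize
      parents: [turn_1]     # inherits turn_1's context
      prompt_tokens: 64
      output_tokens: 256
    - id: tool_call
      parents: [turn_1]     # sibling branch sharing the same history
      prompt_tokens: 96
      output_tokens: 64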

Flexible traffic generation
  • Open-loop (rate-based): Poisson, gamma, or fixed arrival rates to measure latency under realistic bursty traffic

  • Closed-loop (concurrency-based): Maintain a target number of concurrent sessions with ramp-up control for throughput testing (both modes are sketched below)
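
In YAML form, the open-loop settings below mirror the CLI flags used in the quick example further down. The closed-loop variant is a sketch with assumed key names: concurrency, target_concurrency, and ramp_up_seconds are not confirmed options.

# Open-loop: mirrors the --traffic_scheduler flags from the quick example.
traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 2.0     # mean arrival rate; unit assumed to be requests/second

# Closed-loop sketch (assumed key names): hold 32 sessions in flight.
# traffic_scheduler:
#   type: concurrency
#   target_concurrency: 32
#   ramp_up_seconds: 30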

SLO-aware evaluation
  • Per-request metrics: TTFC, TBC, TPOT, and end-to-end latency with percentile distributions

  • Automated health checks: Validates prompt/output lengths, arrival rates, and request dependencies to ensure benchmark correctness

  • Capacity search: Adaptive probe-then-binary-search algorithm that finds the maximum sustainable throughput or arrival rate meeting latency SLOs (see the sketch below)
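
A capacity search needs an SLO target and search bounds; a hypothetical configuration might look like the following, where every key name is an assumption for illustration only.

# Hypothetical sketch -- key names are illustrative assumptions.
# Probing finds a coarse upper bound; binary search then narrows in on
# the highest rate whose tail latency still meets the SLO.
capacity_search:
  metric: ttfc              # assumed: target the TTFC tail
  percentile: 99
  slo_ms: 500
  starting_rate: 1.0
  max_iterations: 8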

Production-ready tooling
  • Managed server orchestration: Launch and manage inference servers automatically with health checks and log capture

  • Configuration sweeps: Use the !expand YAML tag to run the Cartesian product of parameter combinations, with aggregated summaries (see the sketch after this list)

  • WandB integration: Automatic logging of metrics and artifacts, plus experiment tracking with sweep and capacity-search summaries
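
The !expand tag itself comes from the feature list above; how it composes inside a config is sketched here, with every surrounding key treated as an illustrative assumption.

# Sweep sketch: each !expand marks a swept axis; the Cartesian product
# (3 rates x 2 models = 6 runs here) is benchmarked and summarized.
# Keys other than the !expand tag itself are illustrative assumptions.
traffic_scheduler:
  interval_generator:
    arrival_rate: !expand [1.0, 2.0, 4.0]
client:
  model: !expand
    - meta-llama/Llama-3.2-1B-Instruct
    - meta-llama/Llama-3.2-3B-Instruct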

Quick example

Run a simple benchmark against a running OpenAI-compatible endpoint:

uvx veeksha benchmark \
    --client.type openai_chat_completions \
    --client.api_base http://localhost:8000/v1 \
    --client.model meta-llama/Llama-3.2-1B-Instruct \
    --traffic_scheduler.type rate \
    --traffic_scheduler.interval_generator.type poisson \
    --traffic_scheduler.interval_generator.arrival_rate 2.0 \
    --runtime.benchmark_timeout 30

Or use a YAML configuration file:

uvx veeksha benchmark --config my_benchmark.veeksha.yml
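
Dotted CLI flags like --client.api_base conventionally map to nested YAML keys, so a my_benchmark.veeksha.yml equivalent to the command above would plausibly look like this (the exact mapping is an assumption; check the configuration reference for the authoritative schema):

# Plausible YAML equivalent of the CLI invocation above, assuming the
# usual dotted-flag -> nested-key mapping.
client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3.2-1B-Instruct
traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 2.0
runtime:
  benchmark_timeout: 30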

Documentation