Veeksha Documentation

Veeksha is a high-fidelity benchmarking framework for LLM inference systems. Whether you’re optimizing a production deployment, comparing serving backends, or running capacity planning experiments, Veeksha lets you measure what matters to you: realistic multi-turn conversations, agentic workflows, high-frequency stress tests, or targeted microbenchmarks. One tool, any workload.

From isolated requests to complex agentic sessions, Veeksha captures the full breadth of modern LLM workloads.

👉 New here? Start with Benchmark Types if you want the shortest path to the most common benchmark recipes. Read Why Veeksha? to understand the model behind those recipes.

Note

Veeksha (वीक्षा) means “observation” or “investigation” in Sanskrit.

Key features

Realistic workload modeling
  • DAG-based sessions: Model multi-turn conversations and complex agentic workflows as directed acyclic graphs with history inheritance, capturing how chat context accumulates across turns (see the sketch after this list)

  • Shared prefix testing: Generate workloads with configurable prefix sharing to benchmark KV-cache efficiency

  • Trace replay: Replay production traces (Claude Code, RAG, conversational) with preserved timing and token distributions
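
To make the session DAG idea concrete, here is a purely illustrative sketch of a small agentic session. Every field name below (nodes, depends_on, inherit_history) is a hypothetical placeholder, not Veeksha's actual workload schema:

# Illustrative only: field names here are hypothetical, not Veeksha's schema.
session:
  nodes:
    - id: user_turn_1
      prompt_tokens: 512
      output_tokens: 128
    - id: tool_call                        # fans out from turn 1, as an agent step
      depends_on: [user_turn_1]
      inherit_history: true                # prompt includes turn 1's context
      prompt_tokens: 64
      output_tokens: 96
    - id: user_turn_2                      # joins both branches of the DAG
      depends_on: [user_turn_1, tool_call]
      inherit_history: true
      prompt_tokens: 48
      output_tokens: 256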

Flexible traffic generation
  • Open-loop (rate-based): Poisson, gamma, or fixed inter-arrival times to measure latency under realistic bursty traffic

  • Closed-loop (concurrency-based): Maintain a target number of concurrent sessions with ramp-up control for throughput testing (see the sketch after this list)
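
The rate-based mode appears in the quick example below; a closed-loop run would swap the traffic_scheduler section for something along these lines (the key names concurrency, target_concurrency, and ramp_up_seconds are illustrative assumptions, not confirmed schema):

# Closed-loop sketch: hold a target number of concurrent sessions.
# Key names below (concurrency, target_concurrency, ramp_up_seconds) are assumed.
traffic_scheduler:
  type: concurrency
  target_concurrency: 32
  ramp_up_seconds: 30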

SLO-aware evaluation
  • Per-request metrics: TTFC, TBC, TPOT, and end-to-end latency with percentile distributions

  • Automated health checks: Validate prompt/output lengths, arrival rates, and request dependencies to ensure benchmark correctness

  • Capacity search: Adaptive probe-then-binary-search algorithm to find maximum sustainable throughput or rate meeting latency SLOs

Production-ready tooling
  • Managed server orchestration: Launch and manage inference servers automatically with health checks and log capture

  • Configuration sweeps: Use the !expand YAML tag to run the Cartesian product of parameter combinations, with aggregated summaries (see the sketch after this list)

  • WandB integration: Automatic logging of metrics, artifacts, and experiment tracking with sweep/capacity-search summaries
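
As a sketch of how such a sweep might be written, assuming !expand wraps a list of candidate values (the surrounding keys follow the quick-example flags below; the exact !expand syntax is an assumption):

# Hypothetical sweep: two !expand tags would yield a 2 x 3 Cartesian product (6 runs).
# The exact !expand syntax shown here is assumed, not confirmed.
client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: !expand
    - meta-llama/Llama-3.2-1B-Instruct
    - meta-llama/Llama-3.2-3B-Instruct
traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: !expand [1.0, 2.0, 4.0]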

Quick example

Run a simple benchmark with uvx against a running OpenAI-compatible endpoint:

uvx -p 3.14t veeksha benchmark \
    --client.type openai_chat_completions \
    --client.api_base http://localhost:8000/v1 \
    --client.model meta-llama/Llama-3.2-1B-Instruct \
    --traffic_scheduler.type rate \
    --traffic_scheduler.interval_generator.type poisson \
    --traffic_scheduler.interval_generator.arrival_rate 2.0 \
    --runtime.benchmark_timeout 30

Or use a YAML configuration file:

uvx -p 3.14t veeksha benchmark --config my_benchmark.veeksha.yml
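
Assuming the dotted CLI flags map one-to-one onto nested YAML keys, my_benchmark.veeksha.yml for the command above might look roughly like this (a sketch of that mapping, not a confirmed schema):

# my_benchmark.veeksha.yml: sketch assuming dotted flags become nested keys.
client:
  type: openai_chat_completions
  api_base: http://localhost:8000/v1
  model: meta-llama/Llama-3.2-1B-Instruct
traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 2.0
runtime:
  benchmark_timeout: 30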

Veeksha requires free-threaded Python 3.14 or newer; the -p 3.14t flag in the uvx commands above selects a free-threaded interpreter.

Documentation