System Architecture

This page describes Veeksha’s internal architecture, including how components interact and how requests flow through the system.

High-level components

Veeksha is composed of several key components that work together:

┌─────────────────────────────────────────────────────────────────────────┐
│                         Benchmark Runner                                │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐       │
│  │ Session Generator│  │ Traffic Scheduler│  │     Evaluator    │       │
│  │  - synthetic     │  │  - rate-based    │  │  - performance   │       │
│  │  - trace         │  │  - concurrent    │  │  - accuracy      │       │
│  │  - lmeval        │  │                  │  │                  │       │
│  └────────┬─────────┘  └────────┬─────────┘  └────────▲─────────┘       │
│           │                     │                     │                 │
│           ▼                     ▼                     │                 │
│  ┌──────────────────────────────────────────┐         │                 │
│  │              Worker Pool                 │         │                 │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐  │         │                 │
│  │  │ Prefetch │ │ Dispatch │ │Completion│  │         │                 │
│  │  │ Workers  │→│ Workers  │→│ Workers  │──┼─────────┘                 │
│  │  └──────────┘ └────┬─────┘ └──────────┘  │                           │
│  └────────────────────┼─────────────────────┘                           │
│                       │                                                 │
│                       ▼                                                 │
│  ┌──────────────────────────────────────────┐                           │
│  │             Client Runners               │                           │
│  │  - Async HTTP clients (httpx)            │                           │
│  │  - Streaming response handling           │                           │
│  └──────────────────────────────────────────┘                           │
└─────────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
                    ┌───────────────────────┐
                    │   LLM Inference API   │
                    │  (vLLM, SGLang, etc.) │
                    └───────────────────────┘

Component descriptions

Session Generator

Creates Session objects representing user conversations or agentic flows. Each session contains a graph of requests with dependencies. Three types are available:

  • synthetic: Generates random content with configurable distributions

  • trace: Replays recorded conversation traces

  • lmeval: Generates evaluation prompts from lm-eval-harness tasks

Traffic Scheduler

Controls when sessions and their requests are dispatched. Handles:

  • Inter-session timing (arrival rate or target concurrency)

  • Intra-session dependencies (waiting for parent requests to complete)

  • History population (adding prior turns to request context)
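
Inter-session timing under an arrival rate is commonly modeled as a Poisson process, where gaps between session starts are exponentially distributed. A generic sketch (not Veeksha's scheduler implementation):

```python
import random

# Sketch of rate-based inter-session timing: with a Poisson arrival
# process at `rate` sessions/sec, inter-arrival gaps are drawn from an
# exponential distribution with mean 1/rate.

def arrival_times(rate: float, n: int, seed: int = 0) -> list[float]:
    """Return n session start offsets (seconds) for a Poisson process."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)  # exponential gap, mean 1/rate
        times.append(t)
    return times

starts = arrival_times(rate=2.0, n=100)  # ~2 sessions/sec on average
```

Target-concurrency mode works differently: instead of fixed timing, a new session is started whenever a running one finishes, keeping the in-flight count constant.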

Worker Pool

Thread-based workers that process requests through the pipeline:

  • Prefetch Workers: Pre-generate sessions to ensure work is always ready

  • Dispatch Workers: Wait for ready requests and send them to clients

  • Completion Workers: Process completed requests and trigger next steps
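
The three stages can be sketched as threads linked by thread-safe queues. This is a minimal stand-in: real workers also talk to the scheduler and HTTP clients, which are elided here, and all names are illustrative:

```python
import queue
import threading

# Minimal sketch of the prefetch -> dispatch -> completion pipeline.
ready_q: queue.Queue = queue.Queue()
done_q: queue.Queue = queue.Queue()
results = []

def prefetch(n):
    for i in range(n):
        ready_q.put(i)        # pre-generate work items
    ready_q.put(None)         # sentinel: no more work

def dispatch():
    while (item := ready_q.get()) is not None:
        done_q.put(item * 2)  # stands in for sending an HTTP request
    done_q.put(None)

def complete():
    while (item := done_q.get()) is not None:
        results.append(item)  # stands in for metric recording

threads = [threading.Thread(target=fn, args=a)
           for fn, a in [(prefetch, (3,)), (dispatch, ()), (complete, ())]]
for t in threads:
    t.start()
for t in threads:
    t.join()
```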

Client Runners

Async HTTP clients that communicate with the LLM inference API. They handle streaming responses and capture detailed timing information.
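
Per-chunk timing capture for a streaming response can be sketched like this. The `fake_stream` generator stands in for an httpx streaming response (real code would iterate something like `response.aiter_lines()`), and all names are illustrative:

```python
import asyncio
import time

async def fake_stream():
    # Stand-in for a streamed HTTP response body.
    for chunk in ["Hel", "lo", "!"]:
        await asyncio.sleep(0.01)
        yield chunk

async def timed_request():
    t0 = time.perf_counter()
    first_chunk_at, chunk_times, text = None, [], ""
    async for chunk in fake_stream():
        now = time.perf_counter()
        if first_chunk_at is None:
            first_chunk_at = now - t0  # time to first chunk
        chunk_times.append(now - t0)   # per-chunk arrival offsets
        text += chunk
    return text, first_chunk_at, chunk_times

text, ttfc, chunk_times = asyncio.run(timed_request())
```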

Evaluator

Consumes completed requests and computes metrics. Supports:

  • Performance metrics (TTFC, TBC, TPOT, throughput)

  • Accuracy evaluation (lm-eval integration)

  • SLO checking (latency percentile thresholds)
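
An SLO check on a latency percentile threshold can be sketched as follows; the function names and the nearest-rank percentile choice are assumptions for illustration, not the evaluator's actual API:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(k, 0)]

def check_slo(latencies, p, threshold):
    """SLO passes if the p-th percentile latency is within the threshold."""
    return percentile(latencies, p) <= threshold

lat = [0.10, 0.12, 0.15, 0.20, 0.90]  # seconds
```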

Request lifecycle

Every request goes through these stages with precise timestamp capture:

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ scheduler_ready │────▶│scheduler_dispatch│────▶│ client_pickup   │
│      _at        │     │       _at        │     │      _at        │
└─────────────────┘     └──────────────────┘     └─────────────────┘
       │                        │                        │
       │                        │                        │
       ▼                        ▼                        ▼
Request dependencies     Dispatcher thread      Client runner picks
satisfied; request       pops from ready        up and sends HTTP
enters ready queue       queue and marks        request to server
                         dispatched

┌─────────────────┐     ┌─────────────────┐
│client_completed │────▶│result_processed │
│      _at        │     │      _at        │
└─────────────────┘     └─────────────────┘
       │                        │
       │                        │
       ▼                        ▼
Full response received;  Completion worker
client records final     processes result,
timing                   notifies scheduler

These timestamps enable computing:

  • Dispatch delay: scheduler_dispatched_at - scheduler_ready_at

  • Queue wait: client_picked_up_at - scheduler_dispatched_at

  • Processing delay: result_processed_at - client_completed_at
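
The three derived metrics above are simple differences over one request's timestamp record. A sketch, using the field names from the list (the sample record values are made up):

```python
def derived_delays(ts: dict) -> dict:
    """Compute the three derived delays from a per-request timestamp record."""
    return {
        "dispatch_delay": ts["scheduler_dispatched_at"] - ts["scheduler_ready_at"],
        "queue_wait": ts["client_picked_up_at"] - ts["scheduler_dispatched_at"],
        "processing_delay": ts["result_processed_at"] - ts["client_completed_at"],
    }

record = {
    "scheduler_ready_at": 10.00,
    "scheduler_dispatched_at": 10.02,
    "client_picked_up_at": 10.05,
    "client_completed_at": 11.30,
    "result_processed_at": 11.31,
}
delays = derived_delays(record)
```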

Threading model

Veeksha uses a multi-threaded architecture with configurable worker counts:

runtime:
  num_dispatcher_threads: 2   # Threads for dispatching requests
  num_completion_threads: 2   # Threads for processing completions
  num_client_threads: 3       # Async worker threads for HTTP clients

Dispatcher Threads

Wait on the traffic scheduler’s ready queue and dispatch requests to client runners. More threads help when dispatch overhead is significant.

Completion Threads

Process completed requests: update session state, notify the scheduler, and feed results to the evaluator. More threads help with high throughput.

Client Threads

Each runs an async event loop with an httpx.AsyncClient for making concurrent HTTP requests. More threads increase I/O parallelism.
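
The thread-owns-a-loop pattern can be sketched like this; a sleep stands in for an HTTP request (a real client thread would await an `httpx.AsyncClient` call), and the names are illustrative:

```python
import asyncio
import threading

results = []

async def fake_http_call(i):
    await asyncio.sleep(0.01)  # stands in for awaiting an HTTP response
    results.append(i)

def client_thread(n):
    async def main():
        # Many requests in flight concurrently on this thread's loop.
        await asyncio.gather(*(fake_http_call(i) for i in range(n)))
    asyncio.run(main())  # each client thread owns its own event loop

t = threading.Thread(target=client_thread, args=(5,))
t.start()
t.join()
```

Within one thread, concurrency comes from the event loop; adding threads multiplies the number of independent loops driving I/O.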

Note

When run on free-threaded Python (3.14t), there is no GIL, so all worker threads can execute in true parallel for best performance.

Output pipeline

During and after the benchmark, several output mechanisms record data:

Trace Recorder

Writes dispatched requests to traces/trace.jsonl as they are sent. Includes session context and optionally full request content.
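
A JSONL trace entry of this shape might look like the following; the field names (`session_id`, `request_id`, `content`) are assumptions for illustration, not the actual trace schema:

```python
import io
import json

def record_dispatch(fh, session_id, request_id, content=None):
    """Append one dispatched request as a JSON line."""
    entry = {"session_id": session_id, "request_id": request_id}
    if content is not None:  # full request content is optional
        entry["content"] = content
    fh.write(json.dumps(entry) + "\n")

# Write to an in-memory buffer here; the real recorder targets
# traces/trace.jsonl.
buf = io.StringIO()
record_dispatch(buf, 0, 0, "Hello")
record_dispatch(buf, 0, 1)
lines = buf.getvalue().splitlines()
```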

Evaluator

Accumulates metrics in memory and writes final results to metrics/:

  • request_level_metrics.jsonl: Per-request detailed data

  • *.csv: Percentile distributions for each metric

  • *.png: Distribution plots

  • summary_stats.json: Aggregate statistics

  • slo_results.json: SLO compliance results

Health Checker

Post-benchmark verification that validates:

  • Session dispatch rate matches configuration

  • Request dependencies were respected

  • Prompt/output lengths match targets

  • Lifecycle timing is reasonable
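
Two of these checks — dependency ordering and lifecycle-timing sanity — can be sketched over per-request records. The record fields below mirror the lifecycle section but are illustrative, not the actual output schema:

```python
def dependencies_respected(records):
    """Each request must dispatch only after its parent completed."""
    completed_at = {r["request_id"]: r["client_completed_at"] for r in records}
    return all(
        r["parent_id"] is None
        or r["scheduler_dispatched_at"] >= completed_at[r["parent_id"]]
        for r in records
    )

def timing_monotonic(r):
    """Lifecycle timestamps must be non-decreasing."""
    order = ["scheduler_ready_at", "scheduler_dispatched_at",
             "client_picked_up_at", "client_completed_at",
             "result_processed_at"]
    vals = [r[k] for k in order]
    return all(a <= b for a, b in zip(vals, vals[1:]))

records = [
    {"request_id": 0, "parent_id": None, "scheduler_ready_at": 0.0,
     "scheduler_dispatched_at": 0.1, "client_picked_up_at": 0.2,
     "client_completed_at": 1.0, "result_processed_at": 1.1},
    {"request_id": 1, "parent_id": 0, "scheduler_ready_at": 1.0,
     "scheduler_dispatched_at": 1.1, "client_picked_up_at": 1.2,
     "client_completed_at": 2.0, "result_processed_at": 2.1},
]
```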