System Architecture¶
This page describes Veeksha’s internal architecture, including how components interact and how requests flow through the system.
High-level components¶
Veeksha is composed of several key components that work together:
┌─────────────────────────────────────────────────────────────────────────┐
│ Benchmark Runner │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Session Generator│ │ Traffic Scheduler│ │ Evaluator │ │
│ │ - synthetic │ │ - rate-based │ │ - performance │ │
│ │ - trace │ │ - concurrent │ │ - accuracy │ │
│ │ - lmeval │ │ │ │ │ │
│ └────────┬─────────┘ └────────┬─────────┘ └────────▲─────────┘ │
│ │ │ │ │
│ ▼ ▼ │ │
│ ┌──────────────────────────────────────────┐ │ │
│ │ Worker Pool │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │
│ │ │ Prefetch │ │ Dispatch │ │Completion│ │ │ │
│ │ │ Workers │→│ Workers │→│ Workers │──┼─────────┘ │
│ │ └──────────┘ └────┬─────┘ └──────────┘ │ │
│ └────────────────────┼─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Client Runners │ │
│ │ - Async HTTP clients (httpx) │ │
│ │ - Streaming response handling │ │
│ └──────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────┐
│ LLM Inference API │
│ (vLLM, SGLang, etc.) │
└───────────────────────┘
Component descriptions¶
- Session Generator
  Creates Session objects representing user conversations or agentic flows. Each session contains a graph of requests with dependencies. Three types are available:
  - synthetic: Generates random content with configurable distributions
  - trace: Replays recorded conversation traces
  - lmeval: Generates evaluation prompts from lm-eval-harness tasks
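A session's dependency graph could be sketched as below. The class and field names here are illustrative assumptions, not Veeksha's actual API:

```python
from dataclasses import dataclass, field

# Illustrative sketch of a session as a graph of dependent requests.
# Class and field names are assumptions, not Veeksha's real classes.
@dataclass
class Request:
    request_id: int
    parent_ids: list = field(default_factory=list)  # must complete first

@dataclass
class Session:
    session_id: str
    requests: list = field(default_factory=list)

    def ready_requests(self, completed: set) -> list:
        """Requests whose parents have all completed and that haven't run yet."""
        return [r for r in self.requests
                if all(p in completed for p in r.parent_ids)
                and r.request_id not in completed]

# A two-turn conversation: turn 1 must finish before turn 2 is ready.
session = Session("demo", [Request(0), Request(1, parent_ids=[0])])
print([r.request_id for r in session.ready_requests(set())])  # [0]
print([r.request_id for r in session.ready_requests({0})])    # [1]
```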
- Traffic Scheduler
  Controls when sessions and their requests are dispatched. Handles:
  - Inter-session timing (arrival rate or target concurrency)
  - Intra-session dependencies (waiting for parent requests to complete)
  - History population (adding prior turns to request context)
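Rate-based inter-session timing can be sketched by drawing exponentially distributed gaps, so arrivals form a Poisson process at the target rate. The distribution choice is an assumption for illustration, not necessarily what Veeksha uses:

```python
import random

# Sketch of rate-based arrival timing: exponential gaps between sessions
# yield a Poisson arrival process at `rate` sessions/sec on average.
def arrival_times(rate: float, n: int, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)  # mean gap = 1 / rate
        times.append(t)
    return times

times = arrival_times(rate=2.0, n=1000)
print(f"mean inter-arrival gap: {times[-1] / len(times):.3f}s")  # close to 0.5s
```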
- Worker Pool
  Thread-based workers that process requests through the pipeline:
  - Prefetch Workers: Pre-generate sessions to ensure work is always ready
  - Dispatch Workers: Wait for ready requests and send them to clients
  - Completion Workers: Process completed requests and trigger next steps
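The three-stage pipeline can be sketched as threads connected by thread-safe queues. The stage names mirror the list above; the bodies are placeholders, not Veeksha's actual workers:

```python
import queue
import threading

# Minimal sketch of the prefetch -> dispatch -> completion pipeline,
# with stages linked by FIFO queues and terminated by None sentinels.
ready_q, done_q = queue.Queue(), queue.Queue()
results = []

def prefetch(n):
    for i in range(n):
        ready_q.put(i)           # pre-generate work so dispatchers never starve
    ready_q.put(None)            # sentinel: no more work

def dispatch():
    while (item := ready_q.get()) is not None:
        done_q.put(item * item)  # stand-in for "send request to a client"
    done_q.put(None)

def complete():
    while (item := done_q.get()) is not None:
        results.append(item)     # stand-in for "notify scheduler / evaluator"

threads = [threading.Thread(target=f, args=a)
           for f, a in [(prefetch, (4,)), (dispatch, ()), (complete, ())]]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [0, 1, 4, 9]
```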
- Client Runners
Async HTTP clients that actually communicate with the LLM inference API. Handle streaming responses and capture detailed timing information.
- Evaluator
  Consumes completed requests and computes metrics. Supports:
  - Performance metrics (TTFC, TBC, TPOT, throughput)
  - Accuracy evaluation (lm-eval integration)
  - SLO checking (latency percentile thresholds)
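As a minimal sketch, per-request streaming timestamps could yield two of the metrics named above, TTFC (time to first chunk) and TBC (time between chunks); the exact definitions Veeksha uses are assumptions here:

```python
import statistics

# Sketch: derive TTFC and mean TBC from the dispatch time and the
# arrival time of each streamed chunk. Definitions are illustrative.
def streaming_metrics(dispatched_at: float, chunk_times: list[float]) -> dict:
    ttfc = chunk_times[0] - dispatched_at
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    return {"ttfc": ttfc, "mean_tbc": statistics.mean(gaps) if gaps else 0.0}

m = streaming_metrics(10.0, [10.2, 10.25, 10.3, 10.35])
print(m)  # TTFC of roughly 0.2s, mean inter-chunk gap of roughly 0.05s
```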
Request lifecycle¶
Every request goes through these stages with precise timestamp capture:
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ scheduler_ready │────▶│scheduler_dispatch│────▶│ client_pickup │
│ _at │ │ _at │ │ _at │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│ │ │
│ │ │
▼ ▼ ▼
Request dependencies Dispatcher thread Client runner picks
satisfied; request pops from ready up and sends HTTP
enters ready queue queue and marks request to server
dispatched
┌─────────────────┐ ┌─────────────────┐
│client_completed │────▶│result_processed │
│ _at │ │ _at │
└─────────────────┘ └─────────────────┘
│ │
│ │
▼ ▼
Full response received; Completion worker
client records final processes result,
timing notifies scheduler
These timestamps enable computing:
- Dispatch delay: scheduler_dispatched_at - scheduler_ready_at
- Queue wait: client_picked_up_at - scheduler_dispatched_at
- Processing delay: result_processed_at - client_completed_at
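For example, given one request's timestamp record (values illustrative, field names as above):

```python
# Compute the three derived delays from a per-request timestamp record.
record = {
    "scheduler_ready_at": 0.00,
    "scheduler_dispatched_at": 0.01,
    "client_picked_up_at": 0.03,
    "client_completed_at": 1.53,
    "result_processed_at": 1.55,
}
dispatch_delay = record["scheduler_dispatched_at"] - record["scheduler_ready_at"]
queue_wait = record["client_picked_up_at"] - record["scheduler_dispatched_at"]
processing_delay = record["result_processed_at"] - record["client_completed_at"]
print(dispatch_delay, queue_wait, processing_delay)
```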
Threading model¶
Veeksha uses a multi-threaded architecture with configurable worker counts:
runtime:
num_dispatcher_threads: 2 # Threads for dispatching requests
num_completion_threads: 2 # Threads for processing completions
num_client_threads: 3 # Async worker threads for HTTP clients
- Dispatcher Threads
Wait on the traffic scheduler’s ready queue and dispatch requests to client runners. More threads help when dispatch overhead is significant.
- Completion Threads
Process completed requests: update session state, notify the scheduler, and feed results to the evaluator. More threads help with high throughput.
- Client Threads
  Each runs an async event loop with an httpx.AsyncClient for making concurrent HTTP requests. More threads increase I/O parallelism.
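The client-thread pattern can be sketched as a thread that owns a private asyncio event loop and runs many concurrent tasks on it. In Veeksha the tasks would make HTTP calls via httpx.AsyncClient; here a sleep stands in for the awaited I/O:

```python
import asyncio
import threading

results = []

async def fake_request(i: int) -> int:
    await asyncio.sleep(0.01)  # stand-in for an awaited HTTP round-trip
    return i

def client_thread(ids: list[int]):
    # Each client thread runs its own event loop; tasks on that loop
    # overlap their I/O waits, giving concurrency within one thread.
    async def run_all():
        return await asyncio.gather(*(fake_request(i) for i in ids))
    results.extend(asyncio.run(run_all()))

t = threading.Thread(target=client_thread, args=([1, 2, 3],))
t.start()
t.join()
print(results)  # [1, 2, 3]
```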
Note
When running on free-threaded Python (3.14t), the GIL is disabled, so the worker threads achieve true parallelism rather than interleaved execution.
Output pipeline¶
During and after the benchmark, several output mechanisms record data:
- Trace Recorder
  Writes dispatched requests to traces/trace.jsonl as they are sent. Includes session context and optionally full request content.
- Evaluator
  Accumulates metrics in memory and writes final results to metrics/:
  - request_level_metrics.jsonl: Per-request detailed data
  - *.csv: Percentile distributions for each metric
  - *.png: Distribution plots
  - summary_stats.json: Aggregate statistics
  - slo_results.json: SLO compliance results
- Health Checker
  Post-benchmark verification that validates:
  - Session dispatch rate matches configuration
  - Request dependencies were respected
  - Prompt/output lengths match targets
  - Lifecycle timing is reasonable
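One such check, verifying the achieved dispatch rate against the configured rate, could look like this; the function name, tolerance, and threshold are assumptions for illustration:

```python
# Sketch of a post-benchmark health check: did the achieved session
# dispatch rate match the configured target within a relative tolerance?
def check_dispatch_rate(dispatch_times: list[float], target_rate: float,
                        tolerance: float = 0.1) -> bool:
    duration = dispatch_times[-1] - dispatch_times[0]
    achieved = (len(dispatch_times) - 1) / duration if duration > 0 else 0.0
    return abs(achieved - target_rate) / target_rate <= tolerance

# 11 dispatches over 5 seconds -> 2 sessions/sec, matching a 2.0 target.
print(check_dispatch_rate([0.5 * i for i in range(11)], target_rate=2.0))  # True
```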