System Architecture
===================

This page describes Veeksha's internal architecture, including how components
interact and how requests flow through the system.

High-level components
---------------------

Veeksha is composed of several key components that work together:

.. code-block:: text

   ┌─────────────────────────────────────────────────────────────────────────┐
   │                            Benchmark Runner                             │
   │  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐       │
   │  │ Session Generator│  │ Traffic Scheduler│  │    Evaluator     │       │
   │  │  - synthetic     │  │  - rate-based    │  │  - performance   │       │
   │  │  - trace         │  │  - concurrent    │  │  - accuracy      │       │
   │  │  - lmeval        │  │                  │  │                  │       │
   │  └────────┬─────────┘  └────────┬─────────┘  └────────▲─────────┘       │
   │           │                     │                     │                 │
   │           ▼                     ▼                     │                 │
   │  ┌──────────────────────────────────────────┐        │                 │
   │  │               Worker Pool                │        │                 │
   │  │  ┌──────────┐ ┌──────────┐ ┌──────────┐  │        │                 │
   │  │  │ Prefetch │ │ Dispatch │ │Completion│  │        │                 │
   │  │  │ Workers  │→│ Workers  │→│ Workers  │──┼────────┘                 │
   │  │  └──────────┘ └────┬─────┘ └──────────┘  │                          │
   │  └────────────────────┼─────────────────────┘                          │
   │                       │                                                 │
   │                       ▼                                                 │
   │  ┌──────────────────────────────────────────┐                          │
   │  │              Client Runners              │                          │
   │  │  - Async HTTP clients (httpx)            │                          │
   │  │  - Streaming response handling           │                          │
   │  └──────────────────────────────────────────┘                          │
   └─────────────────────────────────────────────────────────────────────────┘
                           │
                           ▼
            ┌───────────────────────┐
            │  LLM Inference API    │
            │  (vLLM, SGLang, etc.) │
            └───────────────────────┘

Component descriptions
----------------------

**Session Generator**
   Creates ``Session`` objects representing user conversations or agentic
   flows. Each session contains a graph of requests with dependencies. Three
   types are available:

   - ``synthetic``: Generates random content with configurable distributions
   - ``trace``: Replays recorded conversation traces
   - ``lmeval``: Generates evaluation prompts from lm-eval-harness tasks

**Traffic Scheduler**
   Controls when sessions and their requests are dispatched. Handles:

   - Inter-session timing (arrival rate or target concurrency)
   - Intra-session dependencies (waiting for parent requests to complete)
   - History population (adding prior turns to request context)

**Worker Pool**
   Thread-based workers that process requests through the pipeline:

   - **Prefetch Workers**: Pre-generate sessions to ensure work is always ready
   - **Dispatch Workers**: Wait for ready requests and send them to clients
   - **Completion Workers**: Process completed requests and trigger next steps

**Client Runners**
   Async HTTP clients that actually communicate with the LLM inference API.
   Handle streaming responses and capture detailed timing information.

**Evaluator**
   Consumes completed requests and computes metrics. Supports:

   - Performance metrics (TTFC, TBC, TPOT, throughput)
   - Accuracy evaluation (lm-eval integration)
   - SLO checking (latency percentile thresholds)
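To make the session graph concrete, the sketch below shows one plausible shape
for a session and its dependency-gated requests. It is a minimal illustration
only: the class and field names (``Request``, ``Session``, ``parent_ids``,
``ready_requests``) are hypothetical, not Veeksha's actual API.

.. code-block:: python

   from dataclasses import dataclass, field

   # Hypothetical sketch -- names are illustrative, not Veeksha's real API.
   @dataclass
   class Request:
       request_id: int
       prompt: str
       parent_ids: list[int] = field(default_factory=list)  # must finish first

   @dataclass
   class Session:
       session_id: int
       requests: list[Request]

       def ready_requests(self, completed: set[int]) -> list[Request]:
           """Requests whose parents have all completed (the scheduler's
           'ready' condition for intra-session dependencies)."""
           return [
               r for r in self.requests
               if r.request_id not in completed
               and all(p in completed for p in r.parent_ids)
           ]

   # A two-turn conversation: turn 1 must complete before turn 2 dispatches.
   session = Session(
       session_id=0,
       requests=[
           Request(0, "What is speculative decoding?"),
           Request(1, "Summarize that in one sentence.", parent_ids=[0]),
       ],
   )
   assert [r.request_id for r in session.ready_requests(set())] == [0]
   assert [r.request_id for r in session.ready_requests({0})] == [1]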
Request lifecycle
-----------------

Every request goes through the following stages, each captured with a precise
timestamp:

.. code-block:: text

   ┌────────────────────┐     ┌─────────────────────────┐     ┌─────────────────────┐
   │ scheduler_ready_at │────▶│ scheduler_dispatched_at │────▶│ client_picked_up_at │
   └────────────────────┘     └─────────────────────────┘     └─────────────────────┘
             │                             │                             │
             ▼                             ▼                             ▼
    Request dependencies          Dispatcher thread            Client runner picks
    satisfied; request            pops from ready              up and sends HTTP
    enters ready queue            queue and marks              request to server
                                  dispatched

   ┌─────────────────────┐     ┌─────────────────────┐
   │ client_completed_at │────▶│ result_processed_at │
   └─────────────────────┘     └─────────────────────┘
              │                           │
              ▼                           ▼
    Full response received;       Completion worker
    client records final          processes result,
    timing                        notifies scheduler

These timestamps enable computing:

- **Dispatch delay**: ``scheduler_dispatched_at - scheduler_ready_at``
- **Queue wait**: ``client_picked_up_at - scheduler_dispatched_at``
- **Processing delay**: ``result_processed_at - client_completed_at``

Threading model
---------------

Veeksha uses a multi-threaded architecture with configurable worker counts:

.. code-block:: yaml

   runtime:
     num_dispatcher_threads: 2   # Threads for dispatching requests
     num_completion_threads: 2   # Threads for processing completions
     num_client_threads: 3       # Async worker threads for HTTP clients

**Dispatcher Threads**
   Wait on the traffic scheduler's ready queue and dispatch requests to
   client runners. More threads help when dispatch overhead is significant.

**Completion Threads**
   Process completed requests: update session state, notify the scheduler,
   and feed results to the evaluator. More threads help at high throughput.

**Client Threads**
   Each runs an async event loop with an ``httpx.AsyncClient`` for making
   concurrent HTTP requests. More threads increase I/O parallelism.

.. note::

   Under free-threaded Python (3.14t) the GIL is disabled, allowing true
   parallelism across all worker threads.

Output pipeline
---------------

During and after the benchmark, several output mechanisms record data:

**Trace Recorder**
   Writes dispatched requests to ``traces/trace.jsonl`` as they are sent.
   Includes session context and, optionally, full request content.

**Evaluator**
   Accumulates metrics in memory and writes final results to ``metrics/``:

   - ``request_level_metrics.jsonl``: Per-request detailed data
   - ``*.csv``: Percentile distributions for each metric
   - ``*.png``: Distribution plots
   - ``summary_stats.json``: Aggregate statistics
   - ``slo_results.json``: SLO compliance results

**Health Checker**
   Post-benchmark verification that validates:

   - Session dispatch rate matches configuration
   - Request dependencies were respected
   - Prompt/output lengths match targets
   - Lifecycle timing is reasonable
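As a closing illustration, the derived metrics listed under *Request
lifecycle* can be recomputed offline from the per-request output. The sketch
below assumes each line of ``metrics/request_level_metrics.jsonl`` carries the
five lifecycle timestamps as numeric fields under the names shown in the
lifecycle diagram; the actual schema may differ.

.. code-block:: python

   import json

   def lifecycle_deltas(record: dict) -> dict:
       # Deltas as defined in the Request lifecycle section. Field names are
       # assumed to match the diagram; adjust if the real schema differs.
       return {
           "dispatch_delay": record["scheduler_dispatched_at"] - record["scheduler_ready_at"],
           "queue_wait": record["client_picked_up_at"] - record["scheduler_dispatched_at"],
           "processing_delay": record["result_processed_at"] - record["client_completed_at"],
       }

   with open("metrics/request_level_metrics.jsonl") as f:
       for line in f:
           if line.strip():
               print(lifecycle_deltas(json.loads(line)))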