Traffic Scheduling
==================

Traffic scheduling controls **when** sessions start and **when** requests within
sessions are dispatched. Veeksha provides two fundamentally different scheduling
modes for different benchmarking scenarios.


Scheduling modes
----------------

**Rate-Based** (``type: rate``)
    Generates new sessions at a specified arrival rate, regardless of how many
    are currently in-flight. Models open-loop traffic.

**Concurrency-Based** (``type: concurrent``)
    Maintains a target number of active sessions. When one completes, another
    starts. Models closed-loop traffic.

.. list-table:: When to Use Each Mode
   :header-rows: 1
   :widths: 30 35 35

   * - Scenario
     - Mode
     - Rationale
   * - Latency under load
     - Rate-based
     - Measure how latency degrades as rate increases
   * - Maximum throughput
     - Concurrent
     - Saturate the system to find peak capacity
   * - Production traffic modeling
     - Rate-based (Poisson)
     - Poisson arrivals model realistic bursty traffic
   * - Capacity planning
     - Rate-based
     - Find the rate where latency SLOs are met
   * - Stress testing
     - Concurrent (high)
     - Push beyond normal operating conditions


Rate-based scheduling
---------------------

Sessions arrive according to an interval generator:

.. code-block:: yaml

    traffic_scheduler:
      type: rate
      interval_generator:
        type: poisson
        arrival_rate: 10.0
      cancel_session_on_failure: true

**How it works:**

1. `RateTrafficScheduler` generates inter-arrival times from the interval generator
2. Each session's root requests are scheduled at the computed arrival time
3. Sessions are dispatched regardless of current system load

**Interval Generators:**

``poisson`` (recommended for realism)
    Exponentially-distributed intervals with given mean rate:

    .. code-block:: yaml

        interval_generator:
          type: poisson
          arrival_rate: 10.0  # 10 sessions/second average

    Captures real-world bursty arrival patterns.

``gamma``
    Gamma-distributed intervals (generalization of Poisson):

    .. code-block:: yaml

        interval_generator:
          type: gamma
          arrival_rate: 10.0
          shape: 2.0  # Higher = less variance

``fixed``
    Constant intervals for uniform traffic:

    .. code-block:: yaml

        interval_generator:
          type: fixed
          interval: 0.1  # Exactly 100ms between sessions


Concurrency-based scheduling
----------------------------

Maintains a fixed number of concurrent sessions:

.. code-block:: yaml

    traffic_scheduler:
      type: concurrent
      target_concurrent_sessions: 8
      rampup_seconds: 10
      cancel_session_on_failure: true

**How it works:**

1. `ConcurrentTrafficScheduler` tracks active session count
2. When a session completes, it activates a pending one
3. Ramp-up gradually increases concurrency from 0 to target

**Ramp-up Behavior:**

.. code-block:: text

    Concurrency
        ▲
        │                    ┌────────────────────
      8 │                   ╱
        │                  ╱
      4 │                 ╱
        │                ╱
      0 │───────────────╱
        └──────────────────────────────────────▶ Time
        0            10s (rampup)         ...

During ramp-up, target concurrency increases linearly:

.. code-block:: python

    current_target = int(target * (elapsed_time / rampup_seconds))


Intra-session scheduling
------------------------

Within a session, requests are scheduled based on the session graph:

.. code-block:: text

    Session with 3 turns:
    
    t=0.0s: Root request dispatched (session arrives)
    t=1.2s: Root request completes
    t=1.7s: Turn 1 dispatched (0.5s wait_after_ready)
    t=2.1s: Turn 1 completes
    t=2.4s: Turn 2 dispatched (0.3s wait_after_ready)
    ...

The scheduler tracks session state:

.. code-block:: python

    class ScheduledSessionState:
        session: Session
        completed_nodes: Set[int]      # Finished request nodes
        in_flight_nodes: Set[int]      # Currently executing
        pending_nodes: Set[int]        # Waiting on dependencies
        completion_times: Dict[int, float]  # When each node finished

When a request completes:

1. Node is moved from ``in_flight_nodes`` to ``completed_nodes``
2. Child nodes are checked for readiness
3. Ready nodes are scheduled after their ``wait_after_ready`` delay
4. History is recorded if this node is a history parent


Session cancellation
--------------------

The ``cancel_session_on_failure`` option controls behavior when a request fails:

.. code-block:: yaml

    traffic_scheduler:
      cancel_session_on_failure: true  # Default

When ``true``, if any request in a session fails:

- All pending requests in that session are cancelled
- The session is marked as errored
- Resources are freed for new sessions

When ``false``:

- Remaining requests in the session are still attempted
- Useful for testing partial failure scenarios


Ready queue and dispatch
------------------------

Both schedulers maintain a **ready queue** of requests eligible for dispatch:

.. code-block:: text

    Ready Queue (min-heap by ready_at time):
    ┌─────────────────────────────────────────┐
    │ (ready_at=0.0, request_1)              │ ← Pop next
    │ (ready_at=0.1, request_5)              │
    │ (ready_at=0.2, request_3)              │
    │ (ready_at=0.5, request_8)              │
    └─────────────────────────────────────────┘

Dispatch workers call ``wait_for_ready()`` which:

1. Waits until the next ready time (or timeout)
2. Pops the request and marks it dispatched
3. Records ``scheduler_dispatched_at`` timestamp

This ensures requests are dispatched at the right time (not early, not late).


History population
------------------

When ``inherit_history: true`` in the session graph, the scheduler populates
request history from parent responses:

.. code-block:: python

    def _populate_history(self, request: Request, state: ScheduledSessionState, node_id: int):
        """Populate request history from parent nodes."""
        for edge in parents(state.session.session_graph, node_id):
            if edge.is_history_parent:
                parent_history = state.histories.get(edge.src)
                if parent_history:
                    request.populate_history(parent_history)

The history includes:

- Prior request content (prompts)
- Prior response content (model outputs)
- Enables accurate multi-turn conversation simulation


Timing verification
-------------------

Veeksha's health checker verifies scheduling accuracy:

**Session Dispatch Rate Check**
    Compares actual vs configured arrival rate:

    .. code-block:: text

        Expected Rate: 10.0 sessions/sec
        Actual Rate: 10.2 sessions/sec
        Error: 2.0%
        Threshold: 15%
        Result: PASSED

**Intra-Session Request Arrival Check**
    Verifies requests weren't dispatched before dependencies completed:

    .. code-block:: text

        Requests w/ Dependencies: 445
        Mean Delay: 0.0017s
        P99 Delay: 0.0788s
        Violations (>5s late): 0
        Result: PASSED

These checks help identify issues with benchmark configuration or execution.