Traffic Scheduling

Traffic scheduling controls when sessions start and when requests within sessions are dispatched. Veeksha provides two fundamentally different scheduling modes for different benchmarking scenarios.

Scheduling modes

Rate-Based (type: rate)

Generates new sessions at a specified arrival rate, regardless of how many are currently in-flight. Models open-loop traffic.

Concurrency-Based (type: concurrent)

Maintains a target number of active sessions. When one completes, another starts. Models closed-loop traffic.

When to Use Each Mode

Scenario                     Mode                  Rationale
Latency under load           Rate-based            Measure how latency degrades as rate increases
Maximum throughput           Concurrent            Saturate the system to find peak capacity
Production traffic modeling  Rate-based (Poisson)  Poisson arrivals model realistic bursty traffic
Capacity planning            Rate-based            Find the rate where latency SLOs are met
Stress testing               Concurrent (high)     Push beyond normal operating conditions

Rate-based scheduling

Sessions arrive according to an interval generator:

traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 10.0
  cancel_session_on_failure: true

How it works:

  1. RateTrafficScheduler generates inter-arrival times from the interval generator

  2. Each session’s root requests are scheduled at the computed arrival time

  3. Sessions are dispatched regardless of current system load

Interval Generators:

poisson (recommended for realism)

Exponentially-distributed intervals with given mean rate:

interval_generator:
  type: poisson
  arrival_rate: 10.0  # 10 sessions/second average

Captures real-world bursty arrival patterns.

gamma

Gamma-distributed intervals (generalization of Poisson):

interval_generator:
  type: gamma
  arrival_rate: 10.0
  shape: 2.0  # Higher shape = less variance

fixed

Constant intervals for uniform traffic:

interval_generator:
  type: fixed
  interval: 0.1  # Exactly 100ms between sessions
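The three generator types map naturally onto Python's standard distributions. The factory below is an illustrative sketch, not Veeksha's implementation; note the gamma scale is chosen so the mean gap stays at 1 / arrival_rate regardless of shape:

```python
import random

def make_interval_generator(cfg: dict, seed: int = 0):
    """Return a zero-argument function that yields one inter-arrival
    gap in seconds, per the configured generator type."""
    rng = random.Random(seed)
    if cfg["type"] == "poisson":
        # Exponential gaps with mean 1 / arrival_rate
        return lambda: rng.expovariate(cfg["arrival_rate"])
    if cfg["type"] == "gamma":
        # Gamma gaps: shape k, scale 1 / (arrival_rate * k),
        # so mean = k * scale = 1 / arrival_rate
        k = cfg["shape"]
        return lambda: rng.gammavariate(k, 1.0 / (cfg["arrival_rate"] * k))
    if cfg["type"] == "fixed":
        return lambda: cfg["interval"]
    raise ValueError(f"unknown interval generator: {cfg['type']}")
```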

Concurrency-based scheduling

Maintains a fixed number of concurrent sessions:

traffic_scheduler:
  type: concurrent
  target_concurrent_sessions: 8
  rampup_seconds: 10
  cancel_session_on_failure: true

How it works:

  1. ConcurrentTrafficScheduler tracks active session count

  2. When a session completes, it activates a pending one

  3. Ramp-up gradually increases concurrency from 0 to target

Ramp-up Behavior:

Concurrency
    ▲
    │                    ┌────────────────────
  8 │                   ╱
    │                  ╱
  4 │                 ╱
    │                ╱
  0 │───────────────╱
    └──────────────────────────────────────▶ Time
    0            10s (rampup)         ...

During ramp-up, target concurrency increases linearly:

current_target = int(target * (elapsed_time / rampup_seconds))
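Wrapped in a small helper, the linear ramp looks like this (a sketch; clamping to the full target at and after the ramp boundary is an assumption about the real scheduler):

```python
def current_target(target: int, elapsed: float, rampup_seconds: float) -> int:
    """Linear ramp-up: allowed concurrency grows from 0 to target over
    rampup_seconds, then holds at target."""
    if rampup_seconds <= 0 or elapsed >= rampup_seconds:
        return target
    return int(target * (elapsed / rampup_seconds))

# With target=8 and a 10s ramp-up:
# current_target(8, 0, 10)  -> 0
# current_target(8, 5, 10)  -> 4
# current_target(8, 12, 10) -> 8
```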

Intra-session scheduling

Within a session, requests are scheduled based on the session graph:

Session with 3 turns:

t=0.0s: Root request dispatched (session arrives)
t=1.2s: Root request completes
t=1.7s: Turn 1 dispatched (0.5s wait_after_ready)
t=2.1s: Turn 1 completes
t=2.4s: Turn 2 dispatched (0.3s wait_after_ready)
...

The scheduler tracks session state:

from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class ScheduledSessionState:
    session: Session  # the Session being executed
    completed_nodes: Set[int] = field(default_factory=set)    # Finished request nodes
    in_flight_nodes: Set[int] = field(default_factory=set)    # Currently executing
    pending_nodes: Set[int] = field(default_factory=set)      # Waiting on dependencies
    completion_times: Dict[int, float] = field(default_factory=dict)  # When each node finished

When a request completes:

  1. Node is moved from in_flight_nodes to completed_nodes

  2. Child nodes are checked for readiness

  3. Ready nodes are scheduled after their wait_after_ready delay

  4. History is recorded if this node is a history parent
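Steps 1-3 can be sketched with plain sets and dicts standing in for Veeksha's session-graph types (the `children`, `parents`, and `wait_after_ready` mappings here are assumed shapes, not the real API):

```python
import heapq

def on_request_complete(node_id, now, completed, in_flight, pending,
                        children, parents, wait_after_ready, ready_queue):
    """Move a finished node to completed, then schedule any child whose
    dependencies are now all satisfied, after its wait_after_ready delay."""
    # 1. Node moves from in-flight to completed
    in_flight.discard(node_id)
    completed.add(node_id)
    # 2-3. A child is ready once every parent has completed
    for child in children.get(node_id, ()):
        if all(p in completed for p in parents[child]):
            pending.discard(child)
            heapq.heappush(ready_queue, (now + wait_after_ready[child], child))
```

Running this on the 3-turn example above, completing the root at t=1.2s with a 0.5s wait_after_ready schedules turn 1 at t=1.7s, matching the timeline.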

Session cancellation

The cancel_session_on_failure option controls behavior when a request fails:

traffic_scheduler:
  cancel_session_on_failure: true  # Default

When true, if any request in a session fails:

  • All pending requests in that session are cancelled

  • The session is marked as errored

  • Resources are freed for new sessions

When false:

  • Remaining requests in the session are still attempted

  • Useful for testing partial failure scenarios
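The two behaviors reduce to a small branch. This sketch only illustrates the outcome described above; whether a partially failed session is marked errored when the option is false is an assumption:

```python
def handle_failure(pending_nodes: set, cancel_session_on_failure: bool) -> tuple:
    """Return (remaining pending nodes, session-marked-errored) after
    one request in the session fails."""
    if cancel_session_on_failure:
        return set(), True            # cancel all pending; session errored
    return set(pending_nodes), False  # keep attempting remaining requests
```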

Ready queue and dispatch

Both schedulers maintain a ready queue of requests eligible for dispatch:

Ready Queue (min-heap by ready_at time):
┌─────────────────────────────────────────┐
│ (ready_at=0.0, request_1)              │ ← Pop next
│ (ready_at=0.1, request_5)              │
│ (ready_at=0.2, request_3)              │
│ (ready_at=0.5, request_8)              │
└─────────────────────────────────────────┘

Dispatch workers call wait_for_ready() which:

  1. Waits until the next ready time (or timeout)

  2. Pops the request and marks it dispatched

  3. Records scheduler_dispatched_at timestamp

This ensures requests are dispatched at the right time (not early, not late).
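The queue and dispatch path above can be sketched with a `heapq`-backed min-heap. The class and method names mirror the description but are illustrative, not Veeksha's actual implementation:

```python
import heapq
import time

class ReadyQueue:
    """Minimal min-heap ready queue ordered by ready_at time."""
    def __init__(self):
        self._heap = []  # entries: (ready_at, request_id)

    def push(self, ready_at: float, request_id: int) -> None:
        heapq.heappush(self._heap, (ready_at, request_id))

    def wait_for_ready(self, now: float):
        """Sleep until the earliest entry's ready time has passed, then
        pop and return its request id; None when the queue is empty."""
        if not self._heap:
            return None
        ready_at, request_id = self._heap[0]
        if ready_at > now:
            time.sleep(ready_at - now)  # don't dispatch early
        heapq.heappop(self._heap)
        return request_id
```

Because the heap is ordered by ready_at, the worker always sleeps until the single earliest deadline, never past a later one.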

History population

When inherit_history: true in the session graph, the scheduler populates request history from parent responses:

def _populate_history(self, request: Request, state: ScheduledSessionState, node_id: int):
    """Populate request history from parent nodes."""
    for edge in parents(state.session.session_graph, node_id):
        if edge.is_history_parent:
            parent_history = state.histories.get(edge.src)
            if parent_history:
                request.populate_history(parent_history)

The history includes:

  • Prior request content (prompts)

  • Prior response content (model outputs)

  • Enables accurate multi-turn conversation simulation

Timing verification

Veeksha’s health checker verifies scheduling accuracy:

Session Dispatch Rate Check

Compares actual vs configured arrival rate:

Expected Rate: 10.0 sessions/sec
Actual Rate: 10.2 sessions/sec
Error: 2.0%
Threshold: 15%
Result: PASSED

Intra-Session Request Arrival Check

Verifies requests weren’t dispatched before dependencies completed:

Requests w/ Dependencies: 445
Mean Delay: 0.0017s
P99 Delay: 0.0788s
Violations (>5s late): 0
Result: PASSED
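The dispatch rate check reduces to comparing the achieved rate against the configured one. A sketch under the assumption that the checker works from per-session dispatch timestamps (function and variable names are illustrative):

```python
def check_dispatch_rate(dispatch_times, expected_rate, threshold=0.15):
    """Compare achieved session dispatch rate against the configured
    rate; passes when the relative error is within the threshold."""
    duration = dispatch_times[-1] - dispatch_times[0]
    actual_rate = (len(dispatch_times) - 1) / duration
    error = abs(actual_rate - expected_rate) / expected_rate
    return error <= threshold, actual_rate, error
```

For example, 101 sessions dispatched over roughly 9.8s against a configured 10 sessions/sec yields an actual rate near 10.2 and a ~2% error, well inside the 15% threshold.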

These checks help identify issues with benchmark configuration or execution.