Traffic Scheduling

Traffic scheduling controls when sessions start and when requests within sessions are dispatched. Veeksha provides two fundamentally different scheduling modes for different benchmarking scenarios.

Scheduling modes

Rate-Based (type: rate)

Generates new sessions at a specified arrival rate, regardless of how many are currently in-flight. Models open-loop traffic.

Concurrency-Based (type: concurrent)

Maintains a target number of active sessions. When one completes, another starts. Models closed-loop traffic.

When to Use Each Mode

Scenario                     Mode                  Rationale
Latency under load           Rate-based            Measure how latency degrades as rate increases
Maximum throughput           Concurrent            Saturate the system to find peak capacity
Production traffic modeling  Rate-based (Poisson)  Poisson arrivals model realistic bursty traffic
Capacity planning            Rate-based            Find the rate where latency SLOs are met
Stress testing               Concurrent (high)     Push beyond normal operating conditions

Rate-based scheduling

Sessions arrive according to an interval generator:

traffic_scheduler:
  type: rate
  interval_generator:
    type: poisson
    arrival_rate: 10.0
  cancel_session_on_failure: true

How it works:

  1. RateTrafficScheduler generates inter-arrival times from the interval generator

  2. Each session’s root requests are scheduled at the computed arrival time

  3. Sessions are dispatched regardless of current system load

Interval Generators:

poisson (recommended for realism)

Exponentially-distributed intervals with given mean rate:

interval_generator:
  type: poisson
  arrival_rate: 10.0  # 10 sessions/second average

Captures real-world bursty arrival patterns.

gamma

Gamma-distributed intervals (generalization of Poisson):

interval_generator:
  type: gamma
  arrival_rate: 10.0
  shape: 2.0  # Higher shape = less variance

fixed

Constant intervals for uniform traffic:

interval_generator:
  type: fixed
  interval: 0.1  # Exactly 100ms between sessions
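The three generator types map naturally onto Python's standard distributions. The factory below is an illustrative sketch, not Veeksha's implementation; note the gamma scale is chosen so the mean gap stays at 1 / arrival_rate regardless of shape:

```python
import random

def make_interval_generator(cfg: dict, seed: int = 0):
    """Return a zero-argument function that yields one inter-arrival
    gap in seconds, per the configured generator type."""
    rng = random.Random(seed)
    if cfg["type"] == "poisson":
        # Exponential gaps with mean 1 / arrival_rate
        return lambda: rng.expovariate(cfg["arrival_rate"])
    if cfg["type"] == "gamma":
        # Gamma gaps: shape k, scale 1 / (arrival_rate * k),
        # so mean = k * scale = 1 / arrival_rate
        k = cfg["shape"]
        return lambda: rng.gammavariate(k, 1.0 / (cfg["arrival_rate"] * k))
    if cfg["type"] == "fixed":
        return lambda: cfg["interval"]
    raise ValueError(f"unknown interval generator: {cfg['type']}")
```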

Concurrency-based scheduling

Maintains a fixed number of concurrent sessions:

traffic_scheduler:
  type: concurrent
  target_concurrent_sessions: 8
  rampup_seconds: 10
  cancel_session_on_failure: true

How it works:

  1. ConcurrentTrafficScheduler tracks active session count

  2. When a session completes, it activates a pending one

  3. Ramp-up gradually increases concurrency from 0 to target

Ramp-up Behavior:

Concurrency
    ▲
    │                    ┌────────────────────
  8 │                   ╱
    │                  ╱
  4 │                 ╱
    │                ╱
  0 │───────────────╱
    └──────────────────────────────────────▶ Time
    0            10s (rampup)         ...

During ramp-up, target concurrency increases linearly:

current_target = int(target * (elapsed_time / rampup_seconds))
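Wrapped in a small helper, the linear ramp looks like this (a sketch; clamping to the full target at and after the ramp boundary is an assumption about the real scheduler):

```python
def current_target(target: int, elapsed: float, rampup_seconds: float) -> int:
    """Linear ramp-up: allowed concurrency grows from 0 to target over
    rampup_seconds, then holds at target."""
    if rampup_seconds <= 0 or elapsed >= rampup_seconds:
        return target
    return int(target * (elapsed / rampup_seconds))

# With target=8 and a 10s ramp-up:
# current_target(8, 0, 10)  -> 0
# current_target(8, 5, 10)  -> 4
# current_target(8, 12, 10) -> 8
```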

Intra-session scheduling

Within a session, requests are scheduled based on the session graph:

Session with 3 turns:

t=0.0s: Root request dispatched (session arrives)
t=1.2s: Root request completes
t=1.7s: Turn 1 dispatched (0.5s wait_after_ready)
t=2.1s: Turn 1 completes
t=2.4s: Turn 2 dispatched (0.3s wait_after_ready)
...

The scheduler tracks session state:

from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class ScheduledSessionState:
    session: Session  # the Session being executed
    completed_nodes: Set[int] = field(default_factory=set)    # Finished request nodes
    in_flight_nodes: Set[int] = field(default_factory=set)    # Currently executing
    pending_nodes: Set[int] = field(default_factory=set)      # Waiting on dependencies
    completion_times: Dict[int, float] = field(default_factory=dict)  # When each node finished

When a request completes:

  1. Node is moved from in_flight_nodes to completed_nodes

  2. Child nodes are checked for readiness

  3. Ready nodes are scheduled after their wait_after_ready delay

  4. History is recorded if this node is a history parent
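Steps 1-3 can be sketched with plain sets and dicts standing in for Veeksha's session-graph types (the `children`, `parents`, and `wait_after_ready` mappings here are assumed shapes, not the real API):

```python
import heapq

def on_request_complete(node_id, now, completed, in_flight, pending,
                        children, parents, wait_after_ready, ready_queue):
    """Move a finished node to completed, then schedule any child whose
    dependencies are now all satisfied, after its wait_after_ready delay."""
    # 1. Node moves from in-flight to completed
    in_flight.discard(node_id)
    completed.add(node_id)
    # 2-3. A child is ready once every parent has completed
    for child in children.get(node_id, ()):
        if all(p in completed for p in parents[child]):
            pending.discard(child)
            heapq.heappush(ready_queue, (now + wait_after_ready[child], child))
```

Running this on the 3-turn example above, completing the root at t=1.2s with a 0.5s wait_after_ready schedules turn 1 at t=1.7s, matching the timeline.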

Session cancellation

The cancel_session_on_failure option controls behavior when a request fails:

traffic_scheduler:
  cancel_session_on_failure: true  # Default

When true, if any request in a session fails:

  • All pending requests in that session are cancelled

  • The session is marked as errored

  • Resources are freed for new sessions

When false:

  • Remaining requests in the session are still attempted

  • Useful for testing partial failure scenarios
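The two behaviors reduce to a small branch. This sketch only illustrates the outcome described above; whether a partially failed session is marked errored when the option is false is an assumption:

```python
def handle_failure(pending_nodes: set, cancel_session_on_failure: bool) -> tuple:
    """Return (remaining pending nodes, session-marked-errored) after
    one request in the session fails."""
    if cancel_session_on_failure:
        return set(), True            # cancel all pending; session errored
    return set(pending_nodes), False  # keep attempting remaining requests
```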

Ready queue and dispatch

Both schedulers maintain a ready queue of requests eligible for dispatch:

Ready Queue (min-heap by ready_at time):
┌─────────────────────────────────────────┐
│ (ready_at=0.0, request_1)              │ ← Pop next
│ (ready_at=0.1, request_5)              │
│ (ready_at=0.2, request_3)              │
│ (ready_at=0.5, request_8)              │
└─────────────────────────────────────────┘

Dispatch workers call wait_for_ready() which:

  1. Waits until the next ready time (or timeout)

  2. Pops the request and marks it dispatched

  3. Records scheduler_dispatched_at timestamp

This ensures requests are dispatched at the right time (not early, not late).
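The queue and dispatch path above can be sketched with a `heapq`-backed min-heap. The class and method names mirror the description but are illustrative, not Veeksha's actual implementation:

```python
import heapq
import time

class ReadyQueue:
    """Minimal min-heap ready queue ordered by ready_at time."""
    def __init__(self):
        self._heap = []  # entries: (ready_at, request_id)

    def push(self, ready_at: float, request_id: int) -> None:
        heapq.heappush(self._heap, (ready_at, request_id))

    def wait_for_ready(self, now: float):
        """Sleep until the earliest entry's ready time has passed, then
        pop and return its request id; None when the queue is empty."""
        if not self._heap:
            return None
        ready_at, request_id = self._heap[0]
        if ready_at > now:
            time.sleep(ready_at - now)  # don't dispatch early
        heapq.heappop(self._heap)
        return request_id
```

Because the heap is ordered by ready_at, the worker always sleeps until the single earliest deadline, never past a later one.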

History population

When inherit_history: true in the session graph, the scheduler populates request history from parent responses:

def _populate_history(self, request: Request, state: ScheduledSessionState, node_id: int):
    """Populate request history from parent nodes."""
    for edge in parents(state.session.session_graph, node_id):
        if edge.is_history_parent:
            parent_history = state.histories.get(edge.src)
            if parent_history:
                request.populate_history(parent_history)

The history includes:

  • Prior request content (prompts)

  • Prior response content (model outputs)

  • Enables accurate multi-turn conversation simulation

Timing verification

Veeksha’s health checker verifies scheduling accuracy:

Session Dispatch Rate Check

Compares actual vs configured arrival rate:

Expected Rate: 10.0 sessions/sec
Actual Rate: 10.2 sessions/sec
Error: 2.0%
Threshold: 15%
Result: PASSED

Intra-Session Request Arrival Check

Verifies requests weren’t dispatched before dependencies completed:

Requests w/ Dependencies: 445
Mean Delay: 0.0017s
P99 Delay: 0.0788s
Violations (>5s late): 0
Result: PASSED
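The dispatch rate check reduces to comparing the achieved rate against the configured one. A sketch under the assumption that the checker works from per-session dispatch timestamps (function and variable names are illustrative):

```python
def check_dispatch_rate(dispatch_times, expected_rate, threshold=0.15):
    """Compare achieved session dispatch rate against the configured
    rate; passes when the relative error is within the threshold."""
    duration = dispatch_times[-1] - dispatch_times[0]
    actual_rate = (len(dispatch_times) - 1) / duration
    error = abs(actual_rate - expected_rate) / expected_rate
    return error <= threshold, actual_rate, error
```

For example, 101 sessions dispatched over roughly 9.8s against a configured 10 sessions/sec yields an actual rate near 10.2 and a ~2% error, well inside the 15% threshold.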

These checks help identify issues with benchmark configuration or execution.