Traffic Scheduling
Traffic scheduling controls when sessions start and when requests within sessions are dispatched. Veeksha provides two fundamentally different scheduling modes for different benchmarking scenarios.
Scheduling modes
- Rate-Based (type: rate): Generates new sessions at a specified arrival rate, regardless of how many are currently in flight. Models open-loop traffic.
- Concurrency-Based (type: concurrent): Maintains a target number of active sessions. When one completes, another starts. Models closed-loop traffic.
| Scenario | Mode | Rationale |
|---|---|---|
| Latency under load | Rate-based | Measure how latency degrades as rate increases |
| Maximum throughput | Concurrent | Saturate the system to find peak capacity |
| Production traffic modeling | Rate-based (Poisson) | Poisson arrivals model realistic bursty traffic |
| Capacity planning | Rate-based | Find the rate where latency SLOs are met |
| Stress testing | Concurrent (high) | Push beyond normal operating conditions |
Rate-based scheduling
Sessions arrive according to an interval generator:
    traffic_scheduler:
      type: rate
      interval_generator:
        type: poisson
        arrival_rate: 10.0
      cancel_session_on_failure: true
How it works:
1. RateTrafficScheduler generates inter-arrival times from the interval generator
2. Each session’s root requests are scheduled at the computed arrival time
3. Sessions are dispatched regardless of current system load
Interval Generators:
- poisson (recommended for realism): Exponentially distributed intervals with the given mean rate:

      interval_generator:
        type: poisson
        arrival_rate: 10.0  # 10 sessions/second average

  Captures real-world bursty arrival patterns.
- gamma: Gamma-distributed intervals (generalization of Poisson):

      interval_generator:
        type: gamma
        arrival_rate: 10.0
        shape: 2.0  # Higher = less variance

- fixed: Constant intervals for uniform traffic:

      interval_generator:
        type: fixed
        interval: 0.1  # Exactly 100ms between sessions
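The three generators can be illustrated with Python's standard `random` module. This is a minimal sketch, not Veeksha's implementation: the function names are hypothetical, and the gamma parameterization (scale chosen so the mean interval stays at 1 / arrival_rate while `shape` controls variance) is an assumption based on the "Higher = less variance" comment above.

```python
import random
from typing import Iterator, List


def intervals(kind: str, rng: random.Random, arrival_rate: float = 10.0,
              shape: float = 2.0, interval: float = 0.1) -> Iterator[float]:
    """Yield inter-arrival gaps in seconds (illustrative only)."""
    mean_gap = 1.0 / arrival_rate
    while True:
        if kind == "poisson":
            # Exponential gaps with mean 1 / arrival_rate.
            yield rng.expovariate(arrival_rate)
        elif kind == "gamma":
            # Assumed parameterization: mean = shape * scale = 1 / arrival_rate.
            yield rng.gammavariate(shape, mean_gap / shape)
        elif kind == "fixed":
            yield interval
        else:
            raise ValueError(f"unknown interval generator: {kind}")


def arrival_times(kind: str, n: int, seed: int = 0, **params: float) -> List[float]:
    """Accumulate the first n gaps into absolute session arrival times."""
    rng = random.Random(seed)
    gaps = intervals(kind, rng, **params)
    times, now = [], 0.0
    for _ in range(n):
        now += next(gaps)
        times.append(now)
    return times
```

With `kind="fixed"` and `interval=0.1`, sessions arrive 100 ms apart; with `kind="poisson"` the long-run average rate approaches `arrival_rate`, but individual gaps vary widely.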
Concurrency-based scheduling
Maintains a fixed number of concurrent sessions:
    traffic_scheduler:
      type: concurrent
      target_concurrent_sessions: 8
      rampup_seconds: 10
      cancel_session_on_failure: true
How it works:
1. ConcurrentTrafficScheduler tracks active session count
2. When a session completes, it activates a pending one
3. Ramp-up gradually increases concurrency from 0 to target
Ramp-up Behavior:
    Concurrency
        ▲
        │                    ┌────────────────────
      8 │                   ╱
        │                  ╱
      4 │                 ╱
        │                ╱
      0 │───────────────╱
        └──────────────────────────────────────▶ Time
        0                   10s (rampup)      ...
During ramp-up, target concurrency increases linearly:
    current_target = int(target * (elapsed_time / rampup_seconds))
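As a worked example of the formula, the sketch below evaluates the target at a few points in time for `target_concurrent_sessions: 8` and `rampup_seconds: 10`. The function name is hypothetical, and clamping to the full target once ramp-up ends (or when no ramp-up is configured) is an assumption, since the formula alone would keep growing past the target.

```python
def rampup_target(target: int, elapsed_time: float, rampup_seconds: float) -> int:
    """Concurrency allowed `elapsed_time` seconds into the run."""
    if rampup_seconds <= 0 or elapsed_time >= rampup_seconds:
        # Assumption: no ramp-up configured, or ramp-up already finished.
        return target
    return int(target * (max(elapsed_time, 0.0) / rampup_seconds))


# Target over time for 8 sessions with a 10-second ramp-up.
schedule = {t: rampup_target(8, t, 10) for t in (0, 2.5, 5, 9.9, 10, 30)}
# schedule == {0: 0, 2.5: 2, 5: 4, 9.9: 7, 10: 8, 30: 8}
```

Because of the `int()` truncation, the target steps up in whole sessions and only reaches 8 once the full ramp-up period has elapsed.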
Intra-session scheduling
Within a session, requests are scheduled based on the session graph:
Session with 3 turns:
    t=0.0s: Root request dispatched (session arrives)
    t=1.2s: Root request completes
    t=1.7s: Turn 1 dispatched (0.5s wait_after_ready)
    t=2.1s: Turn 1 completes
    t=2.4s: Turn 2 dispatched (0.3s wait_after_ready)
    ...
The scheduler tracks session state:
    class ScheduledSessionState:
        session: Session
        completed_nodes: Set[int]           # Finished request nodes
        in_flight_nodes: Set[int]           # Currently executing
        pending_nodes: Set[int]             # Waiting on dependencies
        completion_times: Dict[int, float]  # When each node finished
When a request completes:
1. Node is moved from in_flight_nodes to completed_nodes
2. Child nodes are checked for readiness
3. Ready nodes are scheduled after their wait_after_ready delay
4. History is recorded if this node is a history parent
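The completion steps above can be sketched as follows. `SketchState` is a hypothetical stand-in for `ScheduledSessionState`: the `parents` and `wait_after_ready` mappings are assumptions made so the example is self-contained, history recording is omitted, and a child is treated as in flight as soon as it is scheduled, which simplifies the real lifecycle.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SketchState:
    # Hypothetical stand-in for ScheduledSessionState; graph fields are assumptions.
    parents: Dict[int, Set[int]]        # node -> nodes it depends on
    wait_after_ready: Dict[int, float]  # node -> delay once dependencies are met
    completed_nodes: Set[int] = field(default_factory=set)
    in_flight_nodes: Set[int] = field(default_factory=set)
    pending_nodes: Set[int] = field(default_factory=set)
    completion_times: Dict[int, float] = field(default_factory=dict)


def on_request_complete(state: SketchState, node_id: int, now: float) -> List[Tuple[float, int]]:
    """Return (ready_at, node) pairs for children unblocked by this completion."""
    state.in_flight_nodes.discard(node_id)
    state.completed_nodes.add(node_id)
    state.completion_times[node_id] = now

    newly_ready = []
    for child in sorted(state.pending_nodes):
        deps = state.parents[child]
        if node_id in deps and deps <= state.completed_nodes:
            # Ready when the last dependency finished, plus the configured wait.
            last_done = max(state.completion_times[d] for d in deps)
            newly_ready.append((last_done + state.wait_after_ready[child], child))
    for _, child in newly_ready:
        state.pending_nodes.discard(child)
        state.in_flight_nodes.add(child)  # simplification: scheduled == in flight
    return newly_ready


# Replays the three-turn timeline shown earlier in this section.
state = SketchState(
    parents={0: set(), 1: {0}, 2: {1}},
    wait_after_ready={0: 0.0, 1: 0.5, 2: 0.3},
    in_flight_nodes={0},
    pending_nodes={1, 2},
)
first = on_request_complete(state, 0, now=1.2)   # turn 1 becomes ready at about 1.7
second = on_request_complete(state, 1, now=2.1)  # turn 2 becomes ready at about 2.4
```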
Session cancellation
The cancel_session_on_failure option controls behavior when a request fails:
    traffic_scheduler:
      cancel_session_on_failure: true  # Default

When true, if any request in a session fails:
- All pending requests in that session are cancelled
- The session is marked as errored
- Resources are freed for new sessions

When false:
- Remaining requests in the session are still attempted
- Useful for testing partial failure scenarios
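The two behaviors can be contrasted with a small sketch. `SessionSketch` and `on_request_failed` are hypothetical names used for illustration only; what Veeksha records for the session when the option is false is not specified here, so the sketch leaves the session state untouched in that branch.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SessionSketch:
    # Hypothetical minimal session record, not Veeksha's real class.
    pending_nodes: Set[int] = field(default_factory=set)
    cancelled_nodes: Set[int] = field(default_factory=set)
    errored: bool = False


def on_request_failed(session: SessionSketch, cancel_session_on_failure: bool) -> List[int]:
    """Return the nodes that should still be attempted after a failure."""
    if cancel_session_on_failure:
        # Cancel everything still pending and mark the session as errored.
        session.cancelled_nodes |= session.pending_nodes
        session.pending_nodes = set()
        session.errored = True
        return []
    # Keep going: the remaining requests are still attempted.
    return sorted(session.pending_nodes)
```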
Ready queue and dispatch
Both schedulers maintain a ready queue of requests eligible for dispatch:
    Ready Queue (min-heap by ready_at time):
    ┌─────────────────────────────────────────┐
    │ (ready_at=0.0, request_1)               │  ← Pop next
    │ (ready_at=0.1, request_5)               │
    │ (ready_at=0.2, request_3)               │
    │ (ready_at=0.5, request_8)               │
    └─────────────────────────────────────────┘
Dispatch workers call wait_for_ready(), which:
1. Waits until the next ready time (or timeout)
2. Pops the request and marks it dispatched
3. Records the scheduler_dispatched_at timestamp
This ensures requests are dispatched at the right time (not early, not late).
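A ready queue of this kind can be sketched with `heapq` and a `threading.Condition`. This is an illustrative design under stated assumptions, not Veeksha's code: the `ReadyQueue` class, its `push` method, the use of request IDs as strings, and the `dispatched_at` dictionary are all hypothetical, and `ready_at` values are assumed to be on the `time.monotonic()` clock.

```python
import heapq
import threading
import time
from typing import Dict, List, Optional, Tuple


class ReadyQueue:
    """Illustrative min-heap of (ready_at, request_id) entries."""

    def __init__(self, clock=time.monotonic) -> None:
        self._heap: List[Tuple[float, str]] = []
        self._cond = threading.Condition()
        self._clock = clock
        self.dispatched_at: Dict[str, float] = {}  # request_id -> dispatch time

    def push(self, ready_at: float, request_id: str) -> None:
        with self._cond:
            heapq.heappush(self._heap, (ready_at, request_id))
            self._cond.notify()  # wake a worker; this may be the new earliest entry

    def wait_for_ready(self, timeout: float) -> Optional[str]:
        """Block until the earliest request is due, or return None on timeout."""
        deadline = self._clock() + timeout
        with self._cond:
            while True:
                now = self._clock()
                if self._heap and self._heap[0][0] <= now:
                    _, request_id = heapq.heappop(self._heap)
                    self.dispatched_at[request_id] = now
                    return request_id
                if now >= deadline:
                    return None
                # Sleep until the earliest entry is due or the timeout expires.
                wake_at = min(self._heap[0][0], deadline) if self._heap else deadline
                self._cond.wait(wake_at - now)
```

Rechecking the heap after every wake-up is what prevents early dispatch: a worker woken by `push` goes back to sleep if the earliest entry is still in the future.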
History population
When inherit_history: true is set in the session graph, the scheduler populates
each request's history from its parents' responses:
    def _populate_history(self, request: Request, state: ScheduledSessionState, node_id: int):
        """Populate request history from parent nodes."""
        for edge in parents(state.session.session_graph, node_id):
            if edge.is_history_parent:
                parent_history = state.histories.get(edge.src)
                if parent_history:
                    request.populate_history(parent_history)
The history includes:
- Prior request content (prompts)
- Prior response content (model outputs)

Carrying both forward enables accurate multi-turn conversation simulation.
Timing verification
Veeksha’s health checker verifies scheduling accuracy:
- Session Dispatch Rate Check: Compares actual vs configured arrival rate:

      Expected Rate: 10.0 sessions/sec
      Actual Rate: 10.2 sessions/sec
      Error: 2.0%
      Threshold: 15%
      Result: PASSED

- Intra-Session Request Arrival Check: Verifies requests weren’t dispatched before dependencies completed:

      Requests w/ Dependencies: 445
      Mean Delay: 0.0017s
      P99 Delay: 0.0788s
      Violations (>5s late): 0
      Result: PASSED

These checks help identify issues with benchmark configuration or execution.
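The arithmetic behind the dispatch rate check can be sketched as below. The function name, the return shape, and the choice to estimate the rate as intervals divided by elapsed time are assumptions made for illustration; the health checker's actual computation may differ.

```python
from typing import Dict, List, Union


def dispatch_rate_check(dispatch_times: List[float], expected_rate: float,
                        threshold: float = 0.15) -> Dict[str, Union[float, bool]]:
    """Compare the observed session dispatch rate with the configured one."""
    span = max(dispatch_times) - min(dispatch_times) if dispatch_times else 0.0
    if len(dispatch_times) < 2 or span <= 0:
        raise ValueError("need at least two distinct dispatch timestamps")
    # Assumed estimator: number of inter-arrival intervals over elapsed time.
    actual_rate = (len(dispatch_times) - 1) / span
    error = abs(actual_rate - expected_rate) / expected_rate
    return {"actual_rate": actual_rate, "error": error, "passed": error <= threshold}


# 103 evenly spaced dispatches at 10.2 sessions/sec against a 10.0 target:
# relative error is about 2%, well inside the 15% threshold.
times = [i / 10.2 for i in range(103)]
result = dispatch_rate_check(times, expected_rate=10.0)
```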