Sessions and Graphs =================== Veeksha models LLM interactions as **sessions** containing **requests** organized in a directed acyclic graph (DAG). This design captures the dependency structure of multi-turn conversations. The session model ----------------- A **Session** represents a complete user conversation or agentic workflow and contains: - A unique session ID - A **SessionGraph** defining the structure of requests - A dictionary of **Request** objects keyed by node ID .. code-block:: python @dataclass class Session: id: int session_graph: SessionGraph requests: Dict[int, Request] # node_id -> Request A **Request** represents a single interaction (prompt + expected response): .. code-block:: python @dataclass class Request: id: int # Unique global request ID channels: Dict[ChannelModality, Content] # Content per modality session_context: Dict[str, Any] # Graph metadata Session graphs as DAGs ---------------------- The **SessionGraph** models request dependencies using nodes and directed edges: .. image:: /_static/assets/annotated-linear-session.png :alt: Annotated Linear Session :align: center :width: 300px Each **SessionNode** contains: - ``id``: Node identifier within the session - ``wait_after_ready``: Delay (in seconds) after dependencies are satisfied Each **SessionEdge** contains: - ``src``, ``dst``: Source and destination node IDs - ``is_history_parent``: Whether parent's output should be included in context A branching session might look like this: .. _branching-session-graph: .. image:: /_static/assets/annotated-nonlinear-session.png :alt: Annotated Non-linear Session :align: center :width: 495px That is, each node has an independent ``wait_after_ready`` value and, if enabled, inherits history from one of its parents. Only after all its parents are finished can a request be considered for dispatch. Next, we talk more about these concepts. Linear sessions --------------- The most common pattern is a **linear session** representing a typical back-and-forth conversation: .. code-block:: yaml session_generator: type: synthetic session_graph: type: linear num_request_generator: type: uniform min: 2 max: 6 request_wait_generator: type: poisson arrival_rate: 1.0 inherit_history: true Configuration options: ``num_request_generator`` Controls how many turns (requests) each session contains. Supports distributions: ``fixed``, ``uniform``, ``zipf``, ``stair``. ``request_wait_generator`` Controls the "think time" between turns-how long after one request completes before the next is dispatched. Supports: ``fixed``, ``poisson``, ``gamma``. ``inherit_history`` If ``true``, each request includes the conversation history from its parent node(s), simulating chat context accumulation. History inheritance ------------------- When ``inherit_history: true``, the traffic scheduler populates each request's history based on edges marked as ``is_history_parent``: .. code-block:: text Turn 0: "What is Python?" ↓ (history edge) Turn 1: "What is Python?" → "Python is..." + "Tell me more" ↓ (history edge) Turn 2: [full history] + "Give me an example" The history is recorded when a request completes and includes: - The request content (prompt) - The response content (model output) - Timing information This accurately models how LLM chat APIs accumulate conversation context. Single-request sessions ----------------------- For scenarios where you need independent requests without any conversation dependencies, use the ``single_request`` graph type: .. code-block:: yaml session_generator: type: synthetic session_graph: type: single_request This creates sessions with exactly one node and no edges-ideal for: - Isolated API calls - Batch processing scenarios - Simple request/response workloads without multi-turn context Note how you can still make session root requests share a percentage of prefix by adjusting the channel configuration. Branching sessions ------------------ For complex workflows with parallel paths and dependencies, use the ``branching`` graph type: .. code-block:: yaml session_generator: type: synthetic session_graph: type: branching num_layers_generator: type: fixed value: 4 layer_width_generator: type: uniform min: 2 max: 3 fan_out_generator: type: fixed value: 2 fan_in_generator: type: fixed value: 1 single_root: true inherit_history: true Configuration options: ``num_layers_generator`` Controls the depth (number of layers) in the graph. ``layer_width`` Controls how many nodes per layer. Sampled independently for each layer. ``fan_out_generator`` Number of forward connections from each node. ``fan_in_generator`` Minimum incoming edges per node (ensures connectivity). ``connection_dist_generator`` (Advanced) Forward skip distance. Default is 1 (next layer only). Set higher to allow skip connections across layers. ``request_wait_generator`` Controls the "think time" between turns. ``single_root`` If ``true``, forces layer 0 to have exactly one node. ``inherit_history`` When enabled, exactly one parent per node is selected as the history provider. This ensures a clean, linear history context even in complex graphs. This models scenarios like: - Parallel tool calls - A/B testing different conversation paths - Multi-agent interactions - Scatter-gather workflows Following are two real examples of branching session generated by the branching generator. First, a simpler diamond pattern: .. _branching-session-graph-diamond: .. image:: /_static/assets/session_graph_branching_diamond.png :alt: Branching session graph. :align: center :width: 175px Where ``H`` indicates the history parent node, and ``(n seconds)`` indicates the wait_after_ready value. It was generated with the following configuration: .. code-block:: yaml session_generator: type: synthetic session_graph: type: branching num_layers_generator: type: fixed value: 4 layer_width_generator: type: uniform min: 1 max: 3 fan_out_generator: type: fixed value: 2 fan_in_generator: type: fixed value: 2 single_root: true And a more complex example with skip connections: .. _branching-session-graph-skip-connections: .. image:: /_static/assets/session_graph_skip_narrow_5layer.png :alt: Branching session graph with skip connections. :align: center :width: 450px Where dotted edges ``(+i)`` indicate a skip connection of ``i`` layers. It was generated with the following configuration: .. code-block:: yaml session_generator: type: synthetic session_graph: type: branching num_layers_generator: type: fixed value: 5 layer_width_generator: type: fixed value: 2 fan_out_generator: type: fixed value: 2 fan_in_generator: type: fixed value: 1 connection_dist_generator: type: fixed value: 2 single_root: true In theory, the branching generator can be used to generate both single-request and linear sessions, but in practice, using the dedicated generators for these cases requires less configuration. Session generators ------------------ Three session generator types are available: **Synthetic** (``type: synthetic``) Generates sessions with random but controlled content. Combines: - A session graph generator (linear) - Channel generators for request content Best for: Load testing with configurable workload characteristics. **Trace** (``type: trace``) Replays recorded conversation traces from JSONL files: .. code-block:: yaml session_generator: type: trace trace_file: traces/timed_synthetic_trace.jsonl flavor: type: timed_synthetic_session wrap_mode: true Supported trace flavors: - ``request_log``: Independent requests with token lengths only - ``timed_synthetic_session``: Timed session traces with synthetic content and context caching - ``untimed_content_multi_turn``: Replay conversation datasets with actual message content - ``shared_prefix``: Shared-prefix conversation traces - ``rag``: RAG (Retrieval-Augmented Generation) traces Best for: Realistic workload replay, production traffic analysis. **LM-Eval** (``type: lmeval``) Generates evaluation prompts from lm-eval-harness tasks: .. code-block:: yaml session_generator: type: lmeval tasks: ["hellaswag", "truthfulqa_gen"] num_fewshot: 5 Best for: Model accuracy evaluation under load. Request scheduling within sessions ---------------------------------- When a session is scheduled, its requests don't all dispatch immediately. The traffic scheduler respects the graph structure: 1. **Root nodes** (no incoming edges) are immediately ready 2. **Dependent nodes** wait for all parent nodes to complete 3. After parents complete, ``wait_after_ready`` delay is observed 4. Only then is the request marked ready for dispatch This is handled by the ``ScheduledSessionState`` class which tracks: - Completed node IDs - Pending node IDs - Per-node completion times and history The health checker verifies this timing with the "Intra-Session Request Arrival Check" that validates requests weren't dispatched before their dependencies completed.