Content Generation
==================

Veeksha uses a channel-based content generation system that supports multiple
modalities (text, images, audio, video) and provides fine-grained control over
request characteristics.

Channel architecture
--------------------

Request content is organized by **channel modality**:

.. code-block:: python

   class ChannelModality(IntEnum):
       TEXT = 1
       IMAGE = 2
       AUDIO = 3
       VIDEO = 4

Each request contains content for one or more channels:

.. code-block:: python

   @dataclass
   class Request:
       channels: Dict[ChannelModality, ChannelContent]
       # e.g., {ChannelModality.TEXT: TextContent(...)}

Currently, the **text channel** is fully implemented; the other channels are
planned.

Text channel generator
----------------------

The text channel generator produces prompt content with configurable lengths
and optional shared prefixes:

.. code-block:: yaml

   session_generator:
     type: synthetic
     channels:
       - type: text
         body_length_generator:
           type: uniform
           min: 100
           max: 500
         shared_prefix_ratio: 0.2
         shared_prefix_probability: 0.5
     output_spec:
       text:
         output_length_generator:
           type: uniform
           min: 50
           max: 200

Key configuration options:

``body_length_generator``
   Controls the number of tokens in the prompt body (new content per turn).

``shared_prefix_ratio``
   Fraction of prompt tokens that should be identical across root requests.
   Useful for testing prefix caching.

``shared_prefix_probability``
   Probability that a root request uses the shared prefix.

The output specification is configured separately, at the session generator
level, and an output spec only takes effect if the model supports that
modality. For example:

``output_spec.text.output_length_generator``
   Controls the requested output length (``max_tokens`` / ``min_tokens``).

Length generators
-----------------

Length generators control numeric parameters such as token counts:

**Fixed** (``type: fixed``)
   Returns a constant value:

   .. code-block:: yaml

      body_length_generator:
        type: fixed
        value: 256

**Uniform** (``type: uniform``)
   Random value in a range:

   .. code-block:: yaml

      body_length_generator:
        type: uniform
        min: 100
        max: 500

**Stair** (``type: fixed_stair``)
   Cycles through explicit values in order, which is useful for
   microbenchmarking:

   .. code-block:: yaml

      body_length_generator:
        type: fixed_stair
        values: [128, 256, 512, 1024]  # Values to cycle through
        repeat_each: 10                # Repetitions per value
        wrap: true                     # Cycle back to start

   This generates 10 requests at 128 tokens, then 10 at 256, then 10 at 512,
   and so on.

**Zipf** (``type: zipf``)
   Power-law distribution modeling real-world length patterns (a sampling
   sketch appears at the end of the next section):

   .. code-block:: yaml

      body_length_generator:
        type: zipf
        min: 50
        max: 2000
        alpha: 1.5

Content generation process
--------------------------

When a synthetic session is generated:

1. The **session graph** is created with the configured number of nodes.
2. For each node, the **channels** generate content:

   .. code-block:: python

      for channel_type, channel in self.channels.items():
          channels[channel_type] = channel.generate_content(
              is_root=is_root(session_graph, node_id)
          )

3. The ``is_root`` flag enables special handling (e.g., shared prefixes apply
   only to root requests).
4. The **output specification** is generated via ``OutputSpecGenerator`` and
   attached to each request; this includes the target output token count.
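The ``zipf`` length generator above is described only by its parameters. As a
rough illustration of what a bounded power-law length distribution produces,
here is a minimal sampling sketch; the parameterization shown is an assumption
for illustration, not Veeksha's actual implementation:

.. code-block:: python

   import numpy as np

   def make_zipf_sampler(min_len: int, max_len: int, alpha: float, seed: int = 0):
       """One way to sample lengths from a truncated power law (illustrative only)."""
       lengths = np.arange(min_len, max_len + 1)
       weights = lengths.astype(float) ** -alpha   # heavier weight on shorter lengths
       probs = weights / weights.sum()             # normalize into a distribution
       rng = np.random.default_rng(seed)
       return lambda: int(rng.choice(lengths, p=probs))

   sample = make_zipf_sampler(min_len=50, max_len=2000, alpha=1.5)
   print([sample() for _ in range(5)])  # mostly short lengths, occasionally long ones

Most sampled lengths cluster near ``min`` with a long tail toward ``max``,
which is why a power law is a common stand-in for organic traffic.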
Shared prefix for prefix caching
--------------------------------

Most LLM inference engines support **prefix caching**, where repeated prompt
prefixes are cached in the KV cache. Veeksha can generate workloads that
exercise this in two ways.

The first is a shared prefix configuration:

.. code-block:: yaml

   channels:
     - type: text
       shared_prefix_ratio: 0.3
       shared_prefix_probability: 0.8

This configuration means:

- 80% of root requests will share a common prefix
- That prefix constitutes 30% of the total prompt tokens

When generating content:

1. A single shared prefix is generated once and cached
2. Root requests probabilistically use this prefix
3. The remaining tokens are generated uniquely per request

This models common scenarios such as:

- System prompts shared across users
- RAG with common document prefixes
- Function calling with shared tool definitions

The second way Veeksha helps test prefix cache capabilities is by making
session nodes inherit conversation history. This is done by setting the
``inherit_history`` flag to ``true`` in the session generator configuration:

.. code-block:: yaml

   session_generator:
     type: synthetic
     inherit_history: true

A node can only inherit history from one of its parent nodes.

Tokenizer integration
---------------------

Content generation requires tokenization to control token counts precisely.
Veeksha uses a **TokenizerProvider** pattern:

.. code-block:: python

   class TokenizerProvider:
       """Provides tokenizers for different modalities."""

       def for_modality(self, modality: ChannelModality) -> TokenizerHandle:
           ...

For text, this wraps a HuggingFace tokenizer (loaded based on the model name).
The tokenizer is used to:

1. Encode generated text to count tokens
2. Decode token IDs for prompt construction
3. Ensure prompt lengths match targets exactly

.. note::

   When running a benchmark, ensure the tokenizer matches the model being
   tested. Veeksha loads the tokenizer automatically based on ``client.model``
   or ``server.model``.

Trace-based content
-------------------

For trace-based session generation, content comes from recorded conversations
stored in JSONL files. Each trace file contains metadata matching real
production traffic.

.. code-block:: yaml

   session_generator:
     type: trace
     trace_file: traces/timed_synthetic_trace.jsonl
     flavor:
       type: timed_synthetic_session

Trace flavors
~~~~~~~~~~~~~

Different trace sources have different formats and characteristics.
**Flavors** define how to parse trace files and generate sessions from them.
Each flavor implements:

- Required column validation
- Session/request preparation from trace rows
- Wrapping behavior for looping through traces

We provide flavors for five different use cases. You can also implement your
own flavor by creating a class that inherits from ``TraceFlavorGeneratorBase``
and implements the methods ``required_columns``, ``prepare_session`` and
``wrap`` (see the sketch after the comparison table below).

Trace files can be in JSONL or CSV format. CSV columns are automatically
normalized (e.g. ``num_prefill_tokens`` → ``input_length``).

Comparison table:

.. list-table::
   :header-rows: 1
   :widths: 20 20 20 40

   * - Flavor
     - Turns
     - Prompt generation
     - Best for
   * - ``request_log``
     - Single-turn
     - Random tokens
     - Simple (input, output) length distributions (e.g. ShareGPT CSVs)
   * - ``timed_synthetic_session``
     - Multi-turn
     - Synthetic (from length)
     - Coding assistants, long-context chat, prefix caching
   * - ``untimed_content_multi_turn``
     - Multi-turn
     - Real content (from dataset)
     - Replaying conversation datasets (ShareGPT, LMSYS-Chat, etc.)
   * - ``rag``
     - Single-turn
     - From trace (text column)
     - RAG workloads, document caching, massive shared prefixes
   * - ``shared_prefix``
     - Multi-turn
     - Synthetic (from hash IDs)
     - Replaying privacy-safe conversation structures
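As noted above, a custom flavor subclasses ``TraceFlavorGeneratorBase`` and
implements ``required_columns``, ``prepare_session`` and ``wrap``. The sketch
below shows the rough shape; the stand-in base class, signatures, and return
types are assumptions for illustration, not Veeksha's actual API:

.. code-block:: python

   from abc import ABC, abstractmethod
   from typing import Iterable, List

   # Stand-in so this sketch is self-contained; in practice, import Veeksha's
   # real TraceFlavorGeneratorBase instead.
   class TraceFlavorGeneratorBase(ABC):
       @abstractmethod
       def required_columns(self) -> List[str]: ...

       @abstractmethod
       def prepare_session(self, rows: Iterable[dict]): ...

       @abstractmethod
       def wrap(self) -> None: ...

   class LengthsOnlyFlavor(TraceFlavorGeneratorBase):
       """Hypothetical flavor that replays (input_length, output_length) pairs."""

       def required_columns(self) -> List[str]:
           # Columns every trace row must provide.
           return ["input_length", "output_length"]

       def prepare_session(self, rows: Iterable[dict]):
           # Turn trace rows into single-request session descriptions.
           return [
               {"prompt_tokens": int(r["input_length"]),
                "output_tokens": int(r["output_length"])}
               for r in rows
           ]

       def wrap(self) -> None:
           # Called when the trace is exhausted (e.g., reshuffle row order).
           pass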
**request_log** (``type: request_log``)
   Independent requests with just token lengths. No session structure, no
   corpus files, no prompt materialization. Each row becomes a single-request
   session with a random-token prompt of the specified length.

   Required columns: ``input_length``, ``output_length``

   .. code-block:: yaml

      flavor:
        type: request_log

**timed_synthetic_session** (``type: timed_synthetic_session``)
   Timed session traces with context caching:

   - Replays linear or DAG sessions using ``session_context``
   - The first ``page_size`` tokens are guaranteed to be unique across history
     lineages for KV-cache diversity
   - Wait times between nodes are preserved from the trace

   Required columns: ``session_id``, ``input_length``, ``new_input_length``,
   ``output_length``

   Topology contract: ``session_context`` with ``node_id``, ``parent_nodes``,
   ``history_parent``, and ``wait_after_ready``. Legacy traces without
   ``session_context`` are interpreted as linear sessions by row order.

   .. code-block:: yaml

      flavor:
        type: timed_synthetic_session
        page_size: 16      # Token page size for prefix caching
        corpus_file: null  # Optional corpus for prompt generation

**untimed_content_multi_turn** (``type: untimed_content_multi_turn``)
   Replays datasets with actual conversation content (ShareGPT, LMSYS-Chat,
   etc.):

   - Each row contains a full conversation with real message text
   - Turns are split into individual requests with pre-populated history
   - No timestamps; history is pre-populated from the dataset
   - Configurable message schema (role/content keys, role value mappings)

   Required columns: ``conversations`` (configurable via
   ``conversation_column``)

   .. code-block:: yaml

      flavor:
        type: untimed_content_multi_turn
        conversation_column: conversations  # Column with message list
        role_key: from                      # Key for role in each message
        content_key: value                  # Key for content in each message
        user_role_value: human              # Value indicating user messages
        assistant_role_value: gpt           # Value indicating assistant messages

**rag** (``type: rag``)
   Retrieval-Augmented Generation workload traces:

   - Single-turn requests (one request per session)
   - Document-based filtering by frequency
   - Warmup sessions to pre-populate the document cache
   - Suitable for testing prefix caching with shared documents

   Required columns: ``doc_id``, ``prompt_text``, ``input_length``,
   ``output_length``

   .. code-block:: yaml

      flavor:
        type: rag
        num_documents: 10  # Use top N most frequent documents

**shared_prefix** (``type: shared_prefix``)
   Shared-prefix conversation traces:

   - Multi-turn conversations with hash-based prompt generation
   - Uses ``hash_ids`` to reconstruct shared prefixes deterministically
   - Privacy-safe (no real text in the trace)

   Required columns: ``session_id``, ``hash_ids``, ``new_input_length``,
   ``output_length``

Wrap mode
~~~~~~~~~

When ``wrap_mode: true`` (the default), the trace loops indefinitely:

.. code-block:: yaml

   session_generator:
     type: trace
     trace_file: traces/production.jsonl
     wrap_mode: true  # Loop through trace when exhausted

On wrap, the trace is reshuffled to provide a different ordering. This enables
running benchmarks longer than the trace duration while maintaining realistic
content distributions.

Output length control
---------------------

Veeksha supports flexible output length control.

**Server-side** (preferred when supported):

.. code-block:: yaml

   client:
     max_tokens_param: max_completion_tokens
     min_tokens_param: min_tokens

The generator sets both ``max_tokens`` and ``min_tokens`` to the target,
forcing exact output lengths when the server supports it.
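To make the server-side mechanism concrete, here is a hedged sketch of how a
client might translate a target output length into request parameters;
``build_sampling_params`` is a hypothetical helper, not part of Veeksha's API:

.. code-block:: python

   def build_sampling_params(target_output_tokens: int,
                             max_tokens_param: str = "max_completion_tokens",
                             min_tokens_param: str = "min_tokens") -> dict:
       # Pinning both bounds to the same target forces an exact output length
       # on servers that honor min_tokens.
       return {
           max_tokens_param: target_output_tokens,
           min_tokens_param: target_output_tokens,
       }

   print(build_sampling_params(150))
   # {'max_completion_tokens': 150, 'min_tokens': 150}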
**Prompt-based fallback**:

.. code-block:: yaml

   client:
     use_min_tokens_prompt_fallback: true

Appends instructions like "Generate exactly 150 tokens" to the prompt. Less
reliable, but works with servers lacking ``min_tokens`` support.

.. tip::

   For accurate benchmarks, use a server that supports ``min_tokens`` to
   control output lengths precisely.
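For reference, the prompt-based fallback amounts to a small prompt rewrite
along these lines (illustrative only; the exact instruction text Veeksha
appends may differ):

.. code-block:: python

   def apply_length_hint(prompt: str, target_output_tokens: int) -> str:
       # Ask the model for a specific output length in the prompt itself.
       # Models follow such hints only approximately, hence "less reliable".
       return f"{prompt}\n\nGenerate exactly {target_output_tokens} tokens."

   print(apply_length_hint("Summarize the release notes.", 150))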