Content Generation¶
Veeksha uses a channel-based content generation system that supports multiple modalities (text, images, audio, video) and provides fine-grained control over request characteristics.
Channel architecture¶
Request content is organized by channel modality:
class ChannelModality(IntEnum):
TEXT = 1
IMAGE = 2
AUDIO = 3
VIDEO = 4
Each request contains content for one or more channels:
@dataclass
class Request:
channels: Dict[ChannelModality, ChannelContent]
# e.g., {ChannelModality.TEXT: TextContent(...)}
Currently, the text channel is fully implemented with others planned.
Text channel generator¶
The text channel generator produces prompt content with configurable lengths and optional shared prefixes:
session_generator:
type: synthetic
channels:
- type: text
body_length_generator:
type: uniform
min: 100
max: 500
shared_prefix_ratio: 0.2
shared_prefix_probability: 0.5
output_spec:
text:
output_length_generator:
type: uniform
min: 50
max: 200
Key configuration options:
body_length_generatorControls the number of tokens in the prompt body (new content per turn).
shared_prefix_ratioFraction of prompt tokens that should be identical across root requests. Useful for testing prefix caching.
shared_prefix_probabilityProbability that a root request uses the shared prefix.
Output specification is configured separately at the session generator level. Specified output specs will only be relevant if the model supports the modality. For example:
output_spec.text.output_length_generatorControls the requested output length (
max_tokens/min_tokens).
Length generators¶
Length generators control numeric parameters like token counts:
- Fixed (
type: fixed) Returns a constant value:
body_length_generator: type: fixed value: 256
- Uniform (
type: uniform) Random value in a range:
body_length_generator: type: uniform min: 100 max: 500
- Stair (
type: fixed_stair) Cycles through explicit values in order, useful for microbenchmarking:
body_length_generator: type: fixed_stair values: [128, 256, 512, 1024] # Values to cycle through repeat_each: 10 # Repetitions per value wrap: true # Cycle back to start
This generates: 10 requests at 128, then 10 at 256, then 10 at 512, etc.
- Zipf (
type: zipf) Power-law distribution modeling real-world length patterns:
body_length_generator: type: zipf min: 50 max: 2000 alpha: 1.5
Content generation process¶
When a synthetic session is generated:
Session graph is created with the configured number of nodes
For each node, channels generate content:
for channel_type, channel in self.channels.items(): channels[channel_type] = channel.generate_content( is_root=is_root(session_graph, node_id) )
The
is_rootflag enables special handling (e.g., shared prefixes apply only to root requests)The output specification is generated via
OutputSpecGeneratorand attached to each request. This includes target output tokens.
Tokenizer integration¶
Content generation requires tokenization to control token counts precisely. Veeksha uses a TokenizerProvider pattern:
class TokenizerProvider:
"""Provides tokenizers for different modalities."""
def for_modality(self, modality: ChannelModality) -> TokenizerHandle:
...
For text, this wraps a HuggingFace tokenizer (loaded based on the model name). The tokenizer is used to:
Encode generated text to count tokens
Decode token IDs for prompt construction
Ensure prompt lengths match targets exactly
Note
When running a benchmark, ensure the tokenizer matches the model being
tested. Veeksha loads the tokenizer automatically based on
client.model or server.model.
Trace-based content¶
For trace-based session generation, content comes from recorded conversations stored in JSONL files. Each trace file contains metadata matching real production traffic.
session_generator:
type: trace
trace_file: traces/timed_synthetic_trace.jsonl
flavor:
type: timed_synthetic_session
Trace flavors¶
Different trace sources have different formats and characteristics. Flavors define how to parse trace files and generate sessions from them. Each flavor implements:
Required column validation
Session/request preparation from trace rows
Wrapping behavior for looping through traces
We provide flavors for five different use cases. You can also implement your own flavor by creating a class that
inherits from TraceFlavorGeneratorBase and implements the methods required_columns, prepare_session and wrap.
Trace files can be in JSONL or CSV format. CSV columns are automatically
normalized (e.g. num_prefill_tokens → input_length).
Comparison table:
Flavor |
Turns |
Prompt generation |
Best for |
|---|---|---|---|
|
Single-turn |
Random tokens |
Simple (input, output) length distributions (e.g. ShareGPT CSVs) |
|
Multi-turn |
Synthetic (from length) |
Coding assistants, long-context chat, prefix caching |
|
Multi-turn |
Real content (from dataset) |
Replaying conversation datasets (ShareGPT, LMSYS-Chat, etc.) |
|
Single-turn |
From trace (text col) |
RAG workloads, document caching, massive shared prefixes |
|
Multi-turn |
Synthetic (from Hash IDs) |
Replaying privacy-safe conversation structures |
- request_log (
type: request_log) Independent requests with just token lengths. No session structure, no corpus files, no prompt materialization. Each row becomes a single-request session with a random-token prompt of the specified length.
Required columns:
input_length,output_lengthflavor: type: request_log
- timed_synthetic_session (
type: timed_synthetic_session) Timed session traces with context caching:
Replays linear or DAG sessions using
session_contextThe first
page_sizetokens are guaranteed to be unique across history lineages for KV-cache diversityWait times between nodes preserved from trace
Required columns:
session_id,input_length,new_input_length,output_lengthTopology contract:session_contextwithnode_id,parent_nodes,history_parent, andwait_after_ready. Legacy traces withoutsession_contextare interpreted as linear sessions by row order.flavor: type: timed_synthetic_session page_size: 16 # Token page size for prefix caching corpus_file: null # Optional corpus for prompt generation
- untimed_content_multi_turn (
type: untimed_content_multi_turn) Replay datasets with actual conversation content (ShareGPT, LMSYS-Chat, etc.):
Each row contains a full conversation with real message text
Turns are split into individual requests with pre-populated history
No timestamps — history is pre-populated from the dataset
Configurable message schema (role/content keys, role value mappings)
Required columns:
conversations(configurable viaconversation_column)flavor: type: untimed_content_multi_turn conversation_column: conversations # Column with message list role_key: from # Key for role in each message content_key: value # Key for content in each message user_role_value: human # Value indicating user messages assistant_role_value: gpt # Value indicating assistant messages
- rag (
type: rag) Retrieval-Augmented Generation workload traces:
Single-turn requests (one request per session)
Document-based filtering by frequency
Warmup sessions to pre-populate document cache
Suitable for testing prefix caching with shared documents
Required columns:
doc_id,prompt_text,input_length,output_lengthflavor: type: rag num_documents: 10 # Use top N most frequent documents
- shared_prefix (
type: shared_prefix) Shared-prefix conversation traces:
Multi-turn conversations with hash-based prompt generation
Uses
hash_idsto reconstruct shared prefixes deterministicallyPrivacy-safe (no real text in trace)
Required columns:
session_id,hash_ids,new_input_length,output_length
Wrap mode¶
When wrap_mode: true (default), the trace loops indefinitely:
session_generator:
type: trace
trace_file: traces/production.jsonl
wrap_mode: true # Loop through trace when exhausted
On wrap, traces are reshuffled to provide different orderings. This enables running benchmarks longer than the trace duration while maintaining realistic content distributions.
Output length control¶
Veeksha supports flexible output length control:
Server-side (preferred when supported):
client:
max_tokens_param: max_completion_tokens
min_tokens_param: min_tokens
The generator sets both max_tokens and min_tokens to the target,
forcing exact output lengths when the server supports it.
Prompt-based fallback:
client:
use_min_tokens_prompt_fallback: true
Appends instructions like “Generate exactly 150 tokens” to the prompt.
Less reliable but works with servers lacking min_tokens support.
Tip
For accurate benchmarks, use a server that supports min_tokens to control output lengths precisely.