Evaluation and Metrics
Veeksha’s evaluation system collects detailed metrics for every request and computes aggregate statistics. This page explains the available metrics, how they’re computed, and how SLOs are evaluated.
Evaluator architecture
Evaluators consume completed requests and produce metrics:
from abc import ABC


class BaseEvaluator(ABC):
    """Lifecycle:
    1. register_request() - when request is dispatched
    2. record_request_completed() - when response is received
    3. record_session_completed() - when session finishes
    4. finalize() - compute aggregate metrics
    5. save() - write to output directory
    """
Two evaluator types are available:
Performance (type: performance): Latency, throughput, and timing metrics
Accuracy (type: accuracy_lmeval): Model evaluation using lm-eval-harness
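The lifecycle listed in the BaseEvaluator docstring above can be illustrated with a toy evaluator. This is a minimal sketch, not Veeksha's implementation: the class `CountingEvaluator`, every method argument, and the output file name `counts.json` are hypothetical. Only the five method names come from the docstring.

```python
import json
import tempfile
from pathlib import Path


class CountingEvaluator:
    """Toy evaluator following the five lifecycle steps (signatures are assumed)."""

    def __init__(self) -> None:
        self.in_flight: set[int] = set()
        self.completed_requests = 0
        self.completed_sessions = 0
        self.summary: dict[str, int] = {}

    def register_request(self, request_id: int) -> None:
        # 1. Called when a request is dispatched.
        self.in_flight.add(request_id)

    def record_request_completed(self, request_id: int) -> None:
        # 2. Called when the response is received.
        self.in_flight.discard(request_id)
        self.completed_requests += 1

    def record_session_completed(self, session_id: int) -> None:
        # 3. Called when a session finishes.
        self.completed_sessions += 1

    def finalize(self) -> None:
        # 4. Compute aggregate metrics from the recorded events.
        self.summary = {
            "completed_requests": self.completed_requests,
            "completed_sessions": self.completed_sessions,
            "unfinished_requests": len(self.in_flight),
        }

    def save(self, output_dir: Path) -> Path:
        # 5. Write results to the output directory.
        output_dir.mkdir(parents=True, exist_ok=True)
        path = output_dir / "counts.json"  # hypothetical file name
        path.write_text(json.dumps(self.summary, indent=2))
        return path


def run_demo() -> dict[str, int]:
    """Drive one session of three requests, one of which never completes."""
    evaluator = CountingEvaluator()
    for request_id in (1, 2, 3):
        evaluator.register_request(request_id)
    for request_id in (1, 2):
        evaluator.record_request_completed(request_id)
    evaluator.record_session_completed(session_id=0)
    evaluator.finalize()
    with tempfile.TemporaryDirectory() as tmp:
        saved = evaluator.save(Path(tmp))
        return json.loads(saved.read_text())
```

Calling `run_demo()` walks through all five steps in order and returns the saved summary, which makes the split between per-event recording (steps 1 to 3) and aggregation (steps 4 and 5) easy to see.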
Performance metrics
The performance evaluator computes these key metrics:
- TTFC (Time to First Chunk/Token)
Time from request dispatch to receiving the first response token. Critical for user-perceived responsiveness.
TTFC = first_token_timestamp - scheduler_dispatched_at
- TBC (Time Between Chunks/Tokens)
Average time between consecutive tokens. Affects streaming experience.
TBC = (last_token_timestamp - first_token_timestamp) / (num_tokens - 1)
- TPOT (Time Per Output Token)
Average time per output token including TTFC. Overall generation speed.
TPOT = (client_completed_at - client_picked_up_at) / num_output_tokens
- E2E (End-to-End) Latency
Total time from dispatch to completion.
E2E = client_completed_at - scheduler_dispatched_at
- Throughput
Aggregate rates computed from all completed requests:
tpot_based_throughput: Output tokens / total time
tbc_based_throughput: Tokens/sec based on average TBC
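As a worked example, the per-request formulas above can be written out directly. This is a sketch under stated assumptions: the `RequestTiming` container and its `token_timestamps` field are hypothetical, and the functions simply transcribe the formulas as given on this page rather than reproduce Veeksha's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestTiming:
    """Hypothetical container for the timestamps used in the formulas."""

    scheduler_dispatched_at: float
    client_picked_up_at: float
    client_completed_at: float
    token_timestamps: list[float]  # arrival time of each output token, in seconds


def ttfc(t: RequestTiming) -> float:
    # TTFC = first_token_timestamp - scheduler_dispatched_at
    return t.token_timestamps[0] - t.scheduler_dispatched_at


def tbc(t: RequestTiming) -> Optional[float]:
    # TBC = (last_token_timestamp - first_token_timestamp) / (num_tokens - 1)
    n = len(t.token_timestamps)
    if n < 2:
        return None  # undefined for a single token
    return (t.token_timestamps[-1] - t.token_timestamps[0]) / (n - 1)


def tpot(t: RequestTiming) -> float:
    # TPOT = (client_completed_at - client_picked_up_at) / num_output_tokens
    return (t.client_completed_at - t.client_picked_up_at) / len(t.token_timestamps)


def e2e(t: RequestTiming) -> float:
    # E2E = client_completed_at - scheduler_dispatched_at
    return t.client_completed_at - t.scheduler_dispatched_at


example = RequestTiming(
    scheduler_dispatched_at=0.0,
    client_picked_up_at=0.001,
    client_completed_at=0.101,
    token_timestamps=[0.05, 0.06, 0.07, 0.08, 0.09],
)
```

For `example`, TTFC is 0.05 s, TBC is about 0.01 s, TPOT is about 0.02 s, and E2E is 0.101 s.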
Configuring evaluators
Add evaluators to your benchmark configuration:
evaluators:
  - type: performance
    target_channels: ["text"]
    stream_metrics: true
    stream_metrics_interval: 5.0
    slos:
      - name: "P99 TTFC"
        metric: ttfc
        percentile: 0.99
        value: 0.5
        type: constant
      - name: "P90 TBC"
        metric: tbc
        percentile: 0.90
        value: 0.05
        type: constant
- target_channels
List of channel modalities to evaluate. Usually ["text"].
- stream_metrics
If true, periodically logs metrics to console during the benchmark.
- stream_metrics_interval
Seconds between streaming metric updates.
SLO definitions
Service Level Objectives (SLOs) define acceptable performance thresholds:
slos:
  - name: "P99 TTFC under 500ms"
    metric: ttfc       # Metric to check
    percentile: 0.99   # Percentile level
    value: 0.5         # Threshold in seconds
    type: constant     # SLO type
Available metrics for SLOs:
ttfc: Time to first chunk
tbc: Time between chunks
tpot: Time per output token
e2e: End-to-end latency
SLO results are saved to metrics/slo_results.json:
{
  "all_slos_met": true,
  "results": [
    {
      "met": true,
      "slo_metric_key": "ttfc_p99",
      "observed_value": 0.055,
      "threshold": 0.5,
      "percentile": 0.99,
      "metric": "ttfc",
      "name": "P99 TTFC under 500ms",
      "lower_is_better": true
    }
  ]
}
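To make the check concrete, a constant SLO can be thought of as "take the configured percentile of the observed samples and compare it with the threshold". The sketch below builds a result shaped like the JSON above. It rests on assumptions: the helper names `percentile` and `evaluate_slo` are hypothetical, the linear-interpolation percentile may differ from the method Veeksha uses, and every metric is treated as lower-is-better.

```python
def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation; q is a fraction in [0, 1]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)  # floor, since pos is non-negative
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def evaluate_slo(
    name: str, metric: str, q: float, threshold: float, samples: list[float]
) -> dict:
    """Evaluate one constant SLO, assuming lower values are better."""
    observed = percentile(samples, q)
    return {
        "met": observed <= threshold,
        "slo_metric_key": f"{metric}_p{round(q * 100)}",
        "observed_value": observed,
        "threshold": threshold,
        "percentile": q,
        "metric": metric,
        "name": name,
        "lower_is_better": True,
    }


ttfc_samples = [0.02, 0.03, 0.04, 0.05, 0.06]  # made-up TTFC values in seconds
results = [evaluate_slo("P99 TTFC under 500ms", "ttfc", 0.99, 0.5, ttfc_samples)]
report = {"all_slos_met": all(r["met"] for r in results), "results": results}
```

With these made-up samples the interpolated P99 is roughly 0.0596 s, well under the 0.5 s threshold, so the SLO is met.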
Output files
The performance evaluator writes several files to metrics/:
Per-request data:
- request_level_metrics.jsonl
JSON Lines file with detailed per-request data:
{ "request_id": 75, "session_id": 8, "session_total_requests": 8, "scheduler_ready_at": 0.53709, "scheduler_dispatched_at": 0.53709, "client_picked_up_at": 0.53723, "client_completed_at": 0.60328, "result_processed_at": 0.60346, "num_delta_prompt_tokens": 6, "num_total_prompt_tokens": 6, "target_num_delta_prompt_tokens": 6, "num_output_tokens": 7, "num_requested_output_tokens": 7, "num_total_tokens": 13, "is_stream": true, "tpot": 0.00688, "ttfc": 0.0243, "end_to_end_latency": 0.06559, "normalized_end_to_end_latency": 0.00937, "output_throughput": 106.72167, "tbc": [0.00814, 0.0067, 0.00687, 0.00611, 0.00658, 0.00689] }
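The derived fields in this sample record are related by simple arithmetic, which is a handy sanity check when post-processing the file. The sketch below verifies those relationships for this one record only. They are observations about the sample (whose values are rounded), not a statement of how Veeksha computes the fields, and the `close` helper is hypothetical.

```python
import json
from statistics import mean

# Subset of the sample record shown above; values in the file are rounded.
record = json.loads(
    '{"num_output_tokens": 7, "tpot": 0.00688, "ttfc": 0.0243, '
    '"end_to_end_latency": 0.06559, "normalized_end_to_end_latency": 0.00937, '
    '"output_throughput": 106.72167, '
    '"tbc": [0.00814, 0.0067, 0.00687, 0.00611, 0.00658, 0.00689]}'
)


def close(a: float, b: float, rel: float = 1e-3) -> bool:
    """Compare two values with a relative tolerance that absorbs rounding."""
    return abs(a - b) <= rel * max(abs(a), abs(b))


e2e = record["end_to_end_latency"]
n_out = record["num_output_tokens"]

checks = {
    # In this sample, tpot equals the mean of the per-chunk gaps.
    "tpot ~ mean(tbc)": close(record["tpot"], mean(record["tbc"])),
    # The end-to-end latency equals TTFC plus all inter-chunk gaps.
    "e2e ~ ttfc + sum(tbc)": close(e2e, record["ttfc"] + sum(record["tbc"])),
    # The normalized latency is the end-to-end latency per output token.
    "normalized ~ e2e / tokens": close(
        record["normalized_end_to_end_latency"], e2e / n_out
    ),
    # The per-request output throughput is output tokens per second.
    "throughput ~ tokens / e2e": close(record["output_throughput"], n_out / e2e),
}
```

Note that `tbc` is stored per request as a list of individual gaps (six gaps for seven output tokens), so percentiles over TBC can be computed from the raw gaps rather than from per-request averages.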
Aggregate statistics:
- summary_stats.json
High-level counts and rates:
{ "Number of Requests": 560, "Number of Completed Requests": 555, "Number of Errored Requests": 0, "Error Rate": 0.0, "Observed Session Dispatch Rate": 11.43 }
- throughput_metrics.json
Throughput measurements:
{ "tpot_based_throughput": 76.99, "tbc_based_throughput": 11.32 }
Distribution files:
For each metric (ttfc, tbc, tpot, e2e, etc.):
<metric>.csv: Percentile values (p50, p90, p95, p99, min, max, mean)
<metric>.png: Distribution histogram
Prefill statistics
For understanding prefill latency scaling, the evaluator groups TTFC by prompt length:
prefill_stats.json:
{
  "metric": "ttfc",
  "group_by": "target_num_delta_prompt_tokens",
  "groups": {
    "128": {
      "count": 50,
      "mean": 0.034,
      "p99": 0.055
    },
    "256": {
      "count": 48,
      "mean": 0.041,
      "p99": 0.062
    }
  }
}
This helps analyze how prefill time scales with prompt length.
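One way to read such a file is to estimate the marginal TTFC cost per additional prompt token between adjacent groups. The sketch below does this for the example above. The function name `marginal_cost_per_token` is hypothetical, and the linear estimate between two group means is only a rough approximation.

```python
import json

# The example prefill_stats.json content from above.
prefill_stats = json.loads(
    '{"metric": "ttfc", "group_by": "target_num_delta_prompt_tokens", '
    '"groups": {"128": {"count": 50, "mean": 0.034, "p99": 0.055}, '
    '"256": {"count": 48, "mean": 0.041, "p99": 0.062}}}'
)


def marginal_cost_per_token(groups: dict) -> list[tuple[tuple[int, int], float]]:
    """Slope of mean TTFC between consecutive prompt-length groups (s/token)."""
    # Group keys are strings in JSON, so convert them before sorting numerically.
    points = sorted((int(length), stats["mean"]) for length, stats in groups.items())
    return [
        ((n1, n2), (m2 - m1) / (n2 - n1))
        for (n1, m1), (n2, m2) in zip(points, points[1:])
    ]


slopes = marginal_cost_per_token(prefill_stats["groups"])
```

For the example data, mean TTFC rises by 0.007 s when the prompt grows from 128 to 256 tokens, or about 55 microseconds per additional prompt token.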
Accuracy evaluation
For model accuracy testing, use the lm-eval integration:
session_generator:
  type: lmeval
  tasks: ["hellaswag", "truthfulqa_gen"]
  num_fewshot: 0
evaluators:
  - type: performance
    target_channels: ["text"]
  - type: accuracy_lmeval
    bootstrap_iters: 200
This runs lm-eval-harness tasks through Veeksha’s load generation, allowing simultaneous accuracy and performance measurement.
The accuracy evaluator outputs:
Standard lm-eval metrics (accuracy, perplexity)
Results compatible with standard lm-eval result formats
Streaming metrics
To see metrics in real time while a benchmark runs, enable streaming output:
evaluators:
  - type: performance
    stream_metrics: true
    stream_metrics_interval: 5.0
This logs current statistics every 5 seconds:
[10.2s] Completed: 156 | TTFC p99: 45ms | TBC p99: 18ms | Throughput: 234 tok/s
Useful for monitoring long-running benchmarks without waiting for completion.
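If you capture the console output, lines of this shape can be parsed for plotting or alerting. The sketch below is based solely on the single example line above: the regular expression and the `parse_stream_line` helper are hypothetical, and the real log format may include other fields or vary between versions.

```python
import re
from typing import Optional

# Pattern derived from the one example line; treat it as an assumption.
LINE_RE = re.compile(
    r"\[(?P<elapsed_s>[\d.]+)s\] Completed: (?P<completed>\d+) \| "
    r"TTFC p99: (?P<ttfc_p99_ms>[\d.]+)ms \| "
    r"TBC p99: (?P<tbc_p99_ms>[\d.]+)ms \| "
    r"Throughput: (?P<tokens_per_s>[\d.]+) tok/s"
)


def parse_stream_line(line: str) -> Optional[dict]:
    """Return the numeric fields of a streaming-metrics line, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "elapsed_s": float(fields["elapsed_s"]),
        "completed": int(fields["completed"]),
        "ttfc_p99_ms": float(fields["ttfc_p99_ms"]),
        "tbc_p99_ms": float(fields["tbc_p99_ms"]),
        "tokens_per_s": float(fields["tokens_per_s"]),
    }


sample = "[10.2s] Completed: 156 | TTFC p99: 45ms | TBC p99: 18ms | Throughput: 234 tok/s"
parsed = parse_stream_line(sample)
```

Lines that do not match return `None`, so the helper can be applied to a mixed log without raising.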
Health checks
After the benchmark, Veeksha runs health checks to validate correctness:
- Session Dispatch Rate Check
Verifies actual arrival rate matches configuration.
- Prompt Length Check
Verifies generated prompt lengths match targets.
- Output Length Check
Verifies output lengths match requested tokens (when server supports it).
- Lifecycle Timing Delays Check
Reports timing overhead at each pipeline stage.
- Intra-Session Request Arrival Check
Verifies request dependencies were respected.
Results are saved to health_check_results.txt.
Health checks help identify configuration issues or system problems that could invalidate benchmark results.