Evaluation and Metrics
======================

Veeksha's evaluation system collects detailed metrics for every request and
computes aggregate statistics. This page explains the available metrics, how
they're computed, and how SLOs are evaluated.

Evaluator architecture
----------------------

Evaluators consume completed requests and produce metrics:

.. code-block:: python

   from abc import ABC

   class BaseEvaluator(ABC):
       """Lifecycle:

       1. register_request()         - when request is dispatched
       2. record_request_completed() - when response is received
       3. record_session_completed() - when session finishes
       4. finalize()                 - compute aggregate metrics
       5. save()                     - write to output directory
       """

Two evaluator types are available:

- **Performance** (``type: performance``): Latency, throughput, and timing metrics
- **Accuracy** (``type: accuracy_lmeval``): Model evaluation using lm-eval-harness

Performance metrics
-------------------

The performance evaluator computes these key metrics:

**TTFC (Time to First Chunk/Token)**

Time from request dispatch to receiving the first response token. Critical
for user-perceived responsiveness.

.. code-block:: text

   TTFC = first_token_timestamp - scheduler_dispatched_at

**TBC (Time Between Chunks/Tokens)**

Average time between consecutive tokens. Affects the smoothness of the
streaming experience.

.. code-block:: text

   TBC = (last_token_timestamp - first_token_timestamp) / (num_tokens - 1)

**TPOT (Time Per Output Token)**

Average time per output token, including TTFC. Measures overall generation
speed.

.. code-block:: text

   TPOT = (client_completed_at - client_picked_up_at) / num_output_tokens

**E2E (End-to-End) Latency**

Total time from dispatch to completion.

.. code-block:: text

   E2E = client_completed_at - scheduler_dispatched_at

**Throughput**

Aggregate rates computed from all completed requests:

- ``tpot_based_throughput``: Output tokens / total time
- ``tbc_based_throughput``: Tokens/sec based on the average TBC

Configuring evaluators
----------------------

Add evaluators to your benchmark configuration:

.. code-block:: yaml

   evaluators:
     - type: performance
       target_channels: ["text"]
       stream_metrics: true
       stream_metrics_interval: 5.0
       slos:
         - name: "P99 TTFC"
           metric: ttfc
           percentile: 0.99
           value: 0.5
           type: constant
         - name: "P90 TBC"
           metric: tbc
           percentile: 0.90
           value: 0.05
           type: constant

``target_channels``
   List of channel modalities to evaluate. Usually ``["text"]``.

``stream_metrics``
   If ``true``, periodically logs metrics to the console during the benchmark.

``stream_metrics_interval``
   Seconds between streaming metric updates.

SLO definitions
---------------

Service Level Objectives (SLOs) define acceptable performance thresholds:

.. code-block:: yaml

   slos:
     - name: "P99 TTFC under 500ms"
       metric: ttfc        # Metric to check
       percentile: 0.99    # Percentile level
       value: 0.5          # Threshold in seconds
       type: constant      # SLO type

Available metrics for SLOs:

- ``ttfc``: Time to first chunk
- ``tbc``: Time between chunks
- ``tpot``: Time per output token
- ``e2e``: End-to-end latency

SLO results are saved to ``metrics/slo_results.json``:

.. code-block:: json

   {
     "all_slos_met": true,
     "results": [
       {
         "met": true,
         "slo_metric_key": "ttfc_p99",
         "observed_value": 0.055,
         "threshold": 0.5,
         "percentile": 0.99,
         "metric": "ttfc",
         "name": "P99 TTFC under 500ms",
         "lower_is_better": true
       }
     ]
   }

Output files
------------

The performance evaluator writes several files to ``metrics/``:

**Per-request data:**

``request_level_metrics.jsonl``
   JSON Lines file with detailed per-request data:

   .. code-block:: json

      {
        "request_id": 75,
        "session_id": 8,
        "session_total_requests": 8,
        "scheduler_ready_at": 0.53709,
        "scheduler_dispatched_at": 0.53709,
        "client_picked_up_at": 0.53723,
        "client_completed_at": 0.60328,
        "result_processed_at": 0.60346,
        "num_delta_prompt_tokens": 6,
        "num_total_prompt_tokens": 6,
        "target_num_delta_prompt_tokens": 6,
        "num_output_tokens": 7,
        "num_requested_output_tokens": 7,
        "num_total_tokens": 13,
        "is_stream": true,
        "tpot": 0.00688,
        "ttfc": 0.0243,
        "end_to_end_latency": 0.06559,
        "normalized_end_to_end_latency": 0.00937,
        "output_throughput": 106.72167,
        "tbc": [0.00814, 0.0067, 0.00687, 0.00611, 0.00658, 0.00689]
      }

**Aggregate statistics:**

``summary_stats.json``
   High-level counts and rates:

   .. code-block:: json

      {
        "Number of Requests": 560,
        "Number of Completed Requests": 555,
        "Number of Errored Requests": 0,
        "Error Rate": 0.0,
        "Observed Session Dispatch Rate": 11.43
      }

``throughput_metrics.json``
   Throughput measurements:

   .. code-block:: json

      {
        "tpot_based_throughput": 76.99,
        "tbc_based_throughput": 11.32
      }

**Distribution files:**

For each metric (``ttfc``, ``tbc``, ``tpot``, ``e2e``, etc.):

- ``.csv``: Percentile values (p50, p90, p95, p99, min, max, mean)
- ``.png``: Distribution histogram

Prefill statistics
------------------

For understanding prefill latency scaling, the evaluator groups TTFC by
prompt length:

``prefill_stats.json``:

.. code-block:: json

   {
     "metric": "ttfc",
     "group_by": "target_num_delta_prompt_tokens",
     "groups": {
       "128": { "count": 50, "mean": 0.034, "p99": 0.055 },
       "256": { "count": 48, "mean": 0.041, "p99": 0.062 }
     }
   }

This helps analyze how prefill time scales with prompt length.
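Because every file under ``metrics/`` is plain JSON or JSON Lines, custom
analyses beyond the built-in reports are easy to script. As a minimal sketch
(the path and field names are taken from the ``request_level_metrics.jsonl``
example above, not from a documented API), the prefill grouping can be
reproduced directly from the per-request data:

.. code-block:: python

   import json
   import statistics
   from collections import defaultdict

   # Group per-request TTFC by target prompt length, mirroring the
   # structure of prefill_stats.json. Field names follow the
   # request_level_metrics.jsonl example above.
   groups = defaultdict(list)
   with open("metrics/request_level_metrics.jsonl") as f:
       for line in f:
           record = json.loads(line)
           groups[record["target_num_delta_prompt_tokens"]].append(record["ttfc"])

   for prompt_len in sorted(groups):
       ttfcs = groups[prompt_len]
       # statistics.quantiles needs at least two samples; index 98 of the
       # 99 cut points (n=100) is the 99th percentile.
       p99 = statistics.quantiles(ttfcs, n=100)[98] if len(ttfcs) > 1 else ttfcs[0]
       print(f"{prompt_len:>6} tokens: count={len(ttfcs):>4} "
             f"mean={statistics.mean(ttfcs):.4f}s p99={p99:.4f}s")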
Accuracy evaluation
-------------------

For model accuracy testing, use the lm-eval integration:

.. code-block:: yaml

   session_generator:
     type: lmeval
     tasks: ["hellaswag", "truthfulqa_gen"]
     num_fewshot: 0

   evaluators:
     - type: performance
       target_channels: ["text"]
     - type: accuracy_lmeval
       bootstrap_iters: 200

This runs lm-eval-harness tasks through Veeksha's load generation, allowing
simultaneous accuracy and performance measurement. The accuracy evaluator
outputs:

- Standard lm-eval metrics (accuracy, perplexity)
- Integration with standard lm-eval result formats

Streaming metrics
-----------------

During a benchmark, enable real-time metric output:

.. code-block:: yaml

   evaluators:
     - type: performance
       stream_metrics: true
       stream_metrics_interval: 5.0

This logs current statistics every 5 seconds:

.. code-block:: text

   [10.2s] Completed: 156 | TTFC p99: 45ms | TBC p99: 18ms | Throughput: 234 tok/s

This is useful for monitoring long-running benchmarks without waiting for
completion.

Health checks
-------------

After the benchmark, Veeksha runs health checks to validate correctness:

**Session Dispatch Rate Check**
   Verifies the actual arrival rate matches the configuration.

**Prompt Length Check**
   Verifies generated prompt lengths match targets.

**Output Length Check**
   Verifies output lengths match the requested tokens (when the server
   supports it).

**Lifecycle Timing Delays Check**
   Reports timing overhead at each pipeline stage.

**Intra-Session Request Arrival Check**
   Verifies request dependencies were respected.

Results are saved to ``health_check_results.txt``. Health checks help
identify configuration issues or system problems that could invalidate
benchmark results.
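The JSON outputs also make it straightforward to act on results in
automation. For example, here is a minimal sketch of a CI gate that fails
the job when any SLO was missed, assuming only the ``metrics/slo_results.json``
layout shown earlier:

.. code-block:: python

   import json
   import sys

   # Fail the process (e.g., a CI step) when any SLO was missed.
   # The keys follow the slo_results.json example above.
   with open("metrics/slo_results.json") as f:
       slo_results = json.load(f)

   for result in slo_results["results"]:
       status = "PASS" if result["met"] else "FAIL"
       print(f"[{status}] {result['name']}: observed={result['observed_value']} "
             f"threshold={result['threshold']}")

   # Nonzero exit status marks the CI step as failed.
   sys.exit(0 if slo_results["all_slos_met"] else 1)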