Evaluation and Metrics

Veeksha’s evaluation system collects detailed metrics for every request and computes aggregate statistics. This page explains the available metrics, how they’re computed, and how SLOs are evaluated.

Evaluator architecture

Evaluators consume completed requests and produce metrics:

from abc import ABC

class BaseEvaluator(ABC):
    """Lifecycle:
    1. register_request() - when request is dispatched
    2. record_request_completed() - when response is received
    3. record_session_completed() - when session finishes
    4. finalize() - compute aggregate metrics
    5. save() - write to output directory
    """

Two evaluator types are available:

  • Performance (type: performance): Latency, throughput, timing metrics

  • Accuracy (type: accuracy_lmeval): Model evaluation using lm-eval-harness
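
The lifecycle in the docstring maps naturally onto a subclass. As a minimal sketch (the method signatures here are illustrative assumptions, not Veeksha's actual API):

class RequestCounter(BaseEvaluator):
    """Toy evaluator that counts completed requests.

    All signatures below are assumed for illustration; check the
    BaseEvaluator source for the real interface.
    """

    def __init__(self):
        self.completed = 0

    def register_request(self, request):          # 1. request dispatched
        pass

    def record_request_completed(self, request):  # 2. response received
        self.completed += 1

    def record_session_completed(self, session):  # 3. session finished
        pass

    def finalize(self):                           # 4. compute aggregates
        self.metrics = {"num_completed": self.completed}

    def save(self, output_dir):                   # 5. write results
        import json, os
        with open(os.path.join(output_dir, "count.json"), "w") as f:
            json.dump(self.metrics, f)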

Performance metrics

The performance evaluator computes these key metrics:

TTFC (Time to First Chunk/Token)

Time from request dispatch to receiving the first response token. Critical for user-perceived responsiveness.

TTFC = first_token_timestamp - scheduler_dispatched_at

TBC (Time Between Chunks/Tokens)

Average time between consecutive tokens. Affects streaming experience.

TBC = (last_token_timestamp - first_token_timestamp) / (num_tokens - 1)

TPOT (Time Per Output Token)

Average time per output token including TTFC. Overall generation speed.

TPOT = (client_completed_at - client_picked_up_at) / num_output_tokens

E2E (End-to-End) Latency

Total time from dispatch to completion.

E2E = client_completed_at - scheduler_dispatched_at
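
Taken together, the four formulas are straightforward to compute from per-request timestamps. A sketch (first_token_timestamp and last_token_timestamp are the names used in the formulas above, not necessarily fields of the saved record):

def latency_metrics(r: dict) -> dict:
    """Compute the four latency metrics for one request record."""
    n = r["num_output_tokens"]
    return {
        "ttfc": r["first_token_timestamp"] - r["scheduler_dispatched_at"],
        "tbc": (r["last_token_timestamp"] - r["first_token_timestamp"]) / (n - 1),
        "tpot": (r["client_completed_at"] - r["client_picked_up_at"]) / n,
        "e2e": r["client_completed_at"] - r["scheduler_dispatched_at"],
    }
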
Throughput

Aggregate rates computed from all completed requests:

  • tpot_based_throughput: Output tokens / total time

  • tbc_based_throughput: Tokens/sec based on average TBC
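
One plausible reading of these two rates, as a sketch (the evaluator's exact aggregation may differ):

import statistics

def throughput(records: list[dict], total_time: float) -> dict:
    """Illustrative aggregation over completed request records."""
    total_output = sum(r["num_output_tokens"] for r in records)
    # Each record's "tbc" field holds per-gap inter-token times.
    gaps = [g for r in records for g in r["tbc"]]
    return {
        "tpot_based_throughput": total_output / total_time,   # tokens / total time
        "tbc_based_throughput": 1.0 / statistics.mean(gaps),  # 1 / average gap
    }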

Configuring evaluators

Add evaluators to your benchmark configuration:

evaluators:
  - type: performance
    target_channels: ["text"]
    stream_metrics: true
    stream_metrics_interval: 5.0
    slos:
      - name: "P99 TTFC"
        metric: ttfc
        percentile: 0.99
        value: 0.5
        type: constant
      - name: "P90 TBC"
        metric: tbc
        percentile: 0.90
        value: 0.05
        type: constant

target_channels

List of channel modalities to evaluate. Usually ["text"].

stream_metrics

If true, periodically logs metrics to console during the benchmark.

stream_metrics_interval

Seconds between streaming metric updates.

SLO definitions

Service Level Objectives (SLOs) define acceptable performance thresholds:

slos:
  - name: "P99 TTFC under 500ms"
    metric: ttfc          # Metric to check
    percentile: 0.99      # Percentile level
    value: 0.5            # Threshold in seconds
    type: constant        # SLO type

Available metrics for SLOs:

  • ttfc: Time to first chunk

  • tbc: Time between chunks

  • tpot: Time per output token

  • e2e: End-to-end latency
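
Conceptually, each constant SLO reduces to comparing an observed percentile against a threshold. A minimal sketch (not Veeksha's implementation):

import numpy as np

def check_slo(samples: list[float], percentile: float, value: float) -> bool:
    """True if the observed percentile is at or below the threshold
    (latency metrics are lower-is-better)."""
    observed = float(np.percentile(samples, percentile * 100))
    return observed <= value

# Example: "P99 TTFC under 500ms"
print(check_slo([0.031, 0.042, 0.055, 0.038], percentile=0.99, value=0.5))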

SLO results are saved to metrics/slo_results.json:

{
  "all_slos_met": true,
  "results": [
    {
      "met": true,
      "slo_metric_key": "ttfc_p99",
      "observed_value": 0.055,
      "threshold": 0.5,
      "percentile": 0.99,
      "metric": "ttfc",
      "name": "P99 TTFC under 500ms",
      "lower_is_better": true
    }
  ]
}
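
Because all_slos_met is a single boolean, the file works well as a pass/fail gate in scripts. For example, a hypothetical CI check:

import json, sys

with open("metrics/slo_results.json") as f:
    slo = json.load(f)

for r in slo["results"]:
    status = "PASS" if r["met"] else "FAIL"
    print(f"{status} {r['name']}: {r['observed_value']} vs {r['threshold']}")

sys.exit(0 if slo["all_slos_met"] else 1)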

Output files

The performance evaluator writes several files to metrics/:

Per-request data:

request_level_metrics.jsonl

JSON Lines file with detailed per-request data:

{
  "request_id": 75,
  "session_id": 8,
  "session_total_requests": 8,
  "scheduler_ready_at": 0.53709,
  "scheduler_dispatched_at": 0.53709,
  "client_picked_up_at": 0.53723,
  "client_completed_at": 0.60328,
  "result_processed_at": 0.60346,
  "num_delta_prompt_tokens": 6,
  "num_total_prompt_tokens": 6,
  "target_num_delta_prompt_tokens": 6,
  "num_output_tokens": 7,
  "num_requested_output_tokens": 7,
  "num_total_tokens": 13,
  "is_stream": true,
  "tpot": 0.00688,
  "ttfc": 0.0243,
  "end_to_end_latency": 0.06559,
  "normalized_end_to_end_latency": 0.00937,
  "output_throughput": 106.72167,
  "tbc": [0.00814, 0.0067, 0.00687, 0.00611, 0.00658, 0.00689]
}
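
Since this is standard JSON Lines, it loads directly into pandas for ad-hoc analysis. For example:

import pandas as pd

df = pd.read_json("metrics/request_level_metrics.jsonl", lines=True)
print(df["ttfc"].quantile([0.5, 0.9, 0.99]))                # latency percentiles
print(df.groupby("session_id")["num_output_tokens"].sum())  # tokens per session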

Aggregate statistics:

summary_stats.json

High-level counts and rates:

{
  "Number of Requests": 560,
  "Number of Completed Requests": 555,
  "Number of Errored Requests": 0,
  "Error Rate": 0.0,
  "Observed Session Dispatch Rate": 11.43
}

throughput_metrics.json

Throughput measurements:

{
  "tpot_based_throughput": 76.99,
  "tbc_based_throughput": 11.32
}

Distribution files:

For each metric (ttfc, tbc, tpot, e2e, etc.):

  • <metric>.csv: Summary statistics (p50, p90, p95, p99, min, max, mean)

  • <metric>.png: Distribution histogram
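
These files make run-to-run comparisons scriptable. For example, assuming the CSV stores one statistic per row as name,value pairs (the exact layout may differ):

import csv

with open("metrics/ttfc.csv") as f:
    stats = {name: float(value) for name, value in csv.reader(f)}
print(stats["p99"])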

Prefill statistics

For understanding prefill latency scaling, the evaluator groups TTFC by prompt length:

prefill_stats.json:

{
  "metric": "ttfc",
  "group_by": "target_num_delta_prompt_tokens",
  "groups": {
    "128": {
      "count": 50,
      "mean": 0.034,
      "p99": 0.055
    },
    "256": {
      "count": 48,
      "mean": 0.041,
      "p99": 0.062
    }
  }
}

This helps analyze how prefill time scales with prompt length.
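
The same grouping can be reproduced from the per-request file, for example with pandas:

import pandas as pd

df = pd.read_json("metrics/request_level_metrics.jsonl", lines=True)
groups = df.groupby("target_num_delta_prompt_tokens")["ttfc"]
print(groups.agg(count="count", mean="mean", p99=lambda s: s.quantile(0.99)))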

Accuracy evaluation

For model accuracy testing, use the lm-eval integration:

session_generator:
  type: lmeval
  tasks: ["hellaswag", "truthfulqa_gen"]
  num_fewshot: 0

evaluators:
  - type: performance
    target_channels: ["text"]
  - type: accuracy_lmeval
    bootstrap_iters: 200

This runs lm-eval-harness tasks through Veeksha’s load generation, allowing simultaneous accuracy and performance measurement.

The accuracy evaluator outputs:

  • Standard lm-eval metrics (accuracy, perplexity)

  • Integration with standard lm-eval result formats

Streaming metrics

During a benchmark, enable real-time metric output:

evaluators:
  - type: performance
    stream_metrics: true
    stream_metrics_interval: 5.0

This logs current statistics every 5 seconds:

[10.2s] Completed: 156 | TTFC p99: 45ms | TBC p99: 18ms | Throughput: 234 tok/s

Useful for monitoring long-running benchmarks without waiting for completion.

Health checks

After the benchmark, Veeksha runs health checks to validate correctness:

Session Dispatch Rate Check

Verifies actual arrival rate matches configuration.

Prompt Length Check

Verifies generated prompt lengths match targets.

Output Length Check

Verifies output lengths match the requested token counts (when the server supports it).

Lifecycle Timing Delays Check

Reports timing overhead at each pipeline stage.

Intra-Session Request Arrival Check

Verifies request dependencies were respected.

Results are saved to health_check_results.txt.

Health checks help identify configuration issues or system problems that could invalidate benchmark results.