Output Files

Every benchmark run creates a timestamped output directory containing configuration, metrics, traces, and verification results. This guide explains each file and how to use them.

Output directory structure

benchmark_output/
└── 09:01:2026-10:30:00-a1b2c3d4/
    ├── config.yml                    # Resolved configuration
    ├── health_check_results.txt      # Benchmark verification
    ├── wandb_run.json                # WandB run info (if enabled)
    ├── metrics/
    │   ├── request_level_metrics.jsonl
    │   ├── summary_stats.json
    │   ├── throughput_metrics.json
    │   ├── slo_results.json
    │   ├── prefill_stats.json
    │   ├── ttfc.csv / ttfc.png
    │   ├── tbc.csv / tbc.png
    │   ├── tpot.csv / tpot.png
    │   ├── end_to_end_latency.csv / .png
    │   └── ... (other metric files)
    ├── traces/
    │   └── trace.jsonl
    └── wandb/                        # WandB local files (if enabled)

The directory name format is: DD:MM:YYYY-HH:MM:SS-<config_hash>

Configuration file

config.yml

The fully resolved configuration used for the benchmark:

output_dir: benchmark_output/09:01:2026-10:30:00-a1b2c3d4
seed: 42
session_generator:
  type: synthetic
  session_graph:
    type: linear
    inherit_history: true
    num_request_generator:
      type: uniform
      min: 2
      max: 6
# ... all resolved values

This file enables exact reproduction of the benchmark.

Metrics directory

request_level_metrics.jsonl

JSON Lines file with one record per completed request:

{
  "request_id": 75,
  "session_id": 8,
  "session_total_requests": 8,
  "scheduler_ready_at": 0.53709,
  "scheduler_dispatched_at": 0.53709,
  "client_picked_up_at": 0.53723,
  "client_completed_at": 0.60328,
  "result_processed_at": 0.60346,
  "num_delta_prompt_tokens": 6,
  "num_total_prompt_tokens": 6,
  "target_num_delta_prompt_tokens": 6,
  "num_output_tokens": 7,
  "num_requested_output_tokens": 7,
  "num_total_tokens": 13,
  "is_stream": true,
  "tpot": 0.00688,
  "ttfc": 0.0243,
  "end_to_end_latency": 0.06559,
  "normalized_end_to_end_latency": 0.00937,
  "output_throughput": 106.72167,
  "tbc": [0.00814, 0.0067, 0.00687, 0.00611, 0.00658, 0.00689]
}

Key fields:

  • Timing: Timestamps at each lifecycle stage

  • Latencies: ttfc, tbc, tpot in seconds

  • Tokens: Prompt and output token counts

  • Status: success, error_code

summary_stats.json

Aggregate statistics:

{
  "Number of Requests": 560,
  "Number of Completed Requests": 555,
  "Number of Errored Requests": 0,
  "Number of Cancelled Requests": 5,
  "Error Rate": 0.0,
  "Cancellation Rate": 0.009,
  "Number of Sessions Seen": 110,
  "Successful Sessions": 100,
  "Errored Sessions": 0,
  "Cancelled Sessions": 0.0,
  "Incomplete Sessions": 10,
  "Observed Session Dispatch Rate": 11.43
}
throughput_metrics.json

Throughput measurements:

{
  "tpot_based_throughput": 76.99,
  "tbc_based_throughput": 11.32
}
  • tpot_based_throughput: Total output tokens / total time

  • tbc_based_throughput: Throughput based on average TBC

slo_results.json

SLO evaluation results (if SLOs were configured):

{
  "all_slos_met": true,
  "results": [
    {
      "met": true,
      "slo_metric_key": "ttfc_p99",
      "observed_value": 0.055,
      "threshold": 0.5,
      "percentile": 0.99,
      "metric": "ttfc",
      "name": "P99 TTFC",
      "lower_is_better": true
    }
  ]
}

Used by capacity search to determine pass/fail.

prefill_stats.json

TTFC grouped by prompt length for prefill analysis:

{
  "metric": "ttfc",
  "group_by": "target_num_delta_prompt_tokens",
  "groups": {
    "128": {"count": 50, "mean": 0.034, "p99": 0.055},
    "256": {"count": 48, "mean": 0.041, "p99": 0.062}
  }
}

Metric distribution files

For each metric, a CSV and PNG file are generated:

ttfc.csv - Time to First Chunk percentiles:

,cdf,Time to First Chunk
0,0.0,0.0169...
1,0.01,0.0182...
2,0.02,0.0187...
3,0.03,0.0191...
4,0.04,0.0195...
5,0.05,0.0203...
6,0.06,0.0209...
7,0.07,0.0213...
8,0.08,0.021...
9,0.09,0.0218...
...
100,1.0,0.176...

ttfc.png - Distribution histogram showing TTFC values across all requests.

(Similar files exist for tbc, tpot, e2e, session_duration, etc.)

Traces directory

trace.jsonl

Dispatched request traces (if trace_recorder.enabled: true):

{
  "request_id": 42,
  "session_id": 7,
  "session_size": 21
  "dispatched_at": 0.49085,
  "session_context": {
    "node_id": 0,
    "wait_after_ready": 0,
    "parent_nodes": [],
    "history_parent": null
  }
}

With include_content: true, also includes:

  • Full prompt text/tokens

  • Target token lengths

  • History from parent requests

Health check results

health_check_results.txt

Verification that the benchmark ran correctly:

============================================================
INTRA-SESSION REQUEST ARRIVAL CHECK
============================================================
Result: PASSED

Description:
  Metric                         Scheduler delay (actual_dispatch - ready_time)
  Ready Time                     Parent requests completion + wait_after_ready (if available)

Deviation Statistics (seconds):
  Requests w/ Dependencies       1616
  Min                            0.0000s
  Mean                           0.0010s
  Median                         0.0001s
  P95                            0.0002s
  P99                            0.0003s
  Max                            0.2864s
  Std Dev                        0.0140s

Violation Info:
  Late Threshold                 5.0s (dispatched >5.0s after ready)
  Violations                     0


============================================================
SESSION DISPATCH RATE CHECK
============================================================
Result: PASSED

Rate Statistics:
  Total Sessions                 344
  Measurement Duration           62.8338s
  Expected Rate                  5.0000 sessions/sec
  Actual Rate                    5.4588 sessions/sec
  Error                          9.18%
  Threshold                      15.0%

Inter-Arrival Time Statistics:
  Min                            0.0002s
  Mean                           0.1832s
  Median                         0.1346s
  P95                            0.5782s
  P99                            0.6000s
  Max                            0.6706s
  Std Dev                        0.1676s


============================================================
PROMPT LENGTH CHECK
============================================================
Result: PASSED

Description:
  Metric                         Prompt Length Deviation (Actual - Target)
  Target                         Specified target_prompt_tokens
  Threshold                      <= +/- 15.0

Statistics:
  Total Requests Checked         1960
  Exact Matches                  1960 (100.0%)
  Mismatches (All)               0 (0.0%)
  Violations (> +/-15)           0 (0.0%)
  Min Deviation                  0.0
  Mean Deviation                 0.00
  Median Deviation               0.0
  P95 Deviation                  0.0
  P99 Deviation                  0.0
  Max Deviation                  0.0
  Std Dev                        0.00


============================================================
OUTPUT LENGTH CHECK
============================================================
Result: PASSED

Description:
  Metric                         Output Length Deviation (Actual - Requested)
  Target                         num_requested_output_tokens
  Threshold                      <= +/- 15.0

Statistics:
  Total Requests Checked         1960
  Exact Matches                  1953 (99.6%)
  Mismatches (All)               7 (0.4%)
  Violations (> +/-15)           0 (0.0%)
  Min Deviation                  -1.0
  Mean Deviation                 -0.00
  Median Deviation               0.0
  P95 Deviation                  0.0
  P99 Deviation                  0.0
  Max Deviation                  1.0
  Std Dev                        0.06


============================================================
LIFECYCLE TIMING DELAYS CHECK
============================================================
Result: PASSED

Ready-to-Dispatch Delay (scheduler_dispatched_at - scheduler_ready_at):
  Count                          1960
  Min                            0.0000s
  Mean                           0.0000s
  Median                         0.0000s
  P95                            0.0000s
  P99                            0.0000s
  Max                            0.0000s
  Std Dev                        0.0000s

Dispatch-to-Pickup Delay (client_picked_up_at - scheduler_dispatched_at):
  Count                          1960
  Min                            0.0001s
  Mean                           0.0015s
  Median                         0.0004s
  P95                            0.0039s
  P99                            0.0255s
  Max                            0.1155s
  Std Dev                        0.0071s

Completion-to-Result-Processing Delay (result_processed_at - client_completed_at):
  Count                          1960
  Min                            0.0004s
  Mean                           0.0023s
  Median                         0.0012s
  P95                            0.0020s
  P99                            0.0490s
  Max                            0.1006s
  Std Dev                        0.0081s

Checks included:

  • Session Dispatch Rate: Arrival rate accuracy

  • Intra-Session Request Arrival: Dependency timing

  • Prompt Length: Target vs actual prompt tokens

  • Output Length: Target vs actual output tokens

  • Lifecycle Timing Delays: Pipeline overhead

WandB files

If WandB is enabled:

wandb_run.json

Basic run identifiers:

{
  "run_id": "abc123xyz",
  "run_name": "09:01:2026-10:30:00-a1b2c3d4",
  "run_url": "https://wandb.ai/entity/project/runs/abc123xyz"
}
wandb/

Local WandB sync directory containing logs and artifacts.

See also