Run small, reproducible GPU benchmarks. Measure joules, not just FLOPs. Set thresholds. Sweep power caps. Find the sweet spot between performance and power — automatically.
Requires an NVIDIA GPU with NVML/nvidia-smi accessible, plus Python 3.10+. The full app lives in gpu_energy_bench/.
Install dependencies
Tiny set: pynvml, torch, streamlit, plotly, pyyaml, pandas.
Launch the bench
Streamlit serves a 7-tab control panel locally.
Define tests in tests.yaml
Declarative thresholds. Edit and rerun — no code changes.
Sweep power caps & export
Pareto-optimize energy vs. time. Share results as CSV/JSON/HTML.
# 1. install
pip install -r gpu_energy_bench/requirements.txt
# 2. run the bench (opens in browser)
streamlit run gpu_energy_bench/streamlit_app.py
# 3. add a test in gpu_energy_bench/tests.yaml
- name: matmul_medium_fp32
  kernel: matmul
  params: { size: 4096, repetitions: 10, dtype: float32 }
  thresholds:
    min_gflops_per_s: 2000
    max_energy_per_gflop: 0.05
    max_temp_c: 85

Six things you'd otherwise glue together yourself.
Background NVML sampler integrates power over time (J = ∫P·dt) using trapezoidal integration around a synchronized timed region.
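The integration step can be sketched like this (a hypothetical helper, not the app's actual code): given timestamped power samples from the NVML sampler, energy is the trapezoidal sum over the timed region.

```python
def integrate_energy(samples):
    """Trapezoidal integration of (t_seconds, power_watts) samples -> joules."""
    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy_j += 0.5 * (p0 + p1) * (t1 - t0)
    return energy_j

# A constant 200 W draw sampled over 2 s integrates to 400 J.
samples = [(0.0, 200.0), (1.0, 200.0), (2.0, 200.0)]
print(integrate_energy(samples))  # -> 400.0
```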
Declare tests in tests.yaml with thresholds like min_gflops_per_s or max_energy_per_gflop. Each run renders a clear PASS/FAIL panel.
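The evaluation rule behind those thresholds can be pictured with a small sketch (a hypothetical helper; the app's real logic may differ): `min_*` thresholds require actual ≥ limit, `max_*` require actual ≤ limit.

```python
def check_thresholds(metrics, thresholds):
    """Return {threshold_name: bool}, where True means PASS."""
    results = {}
    for name, limit in thresholds.items():
        metric = name.split("_", 1)[1]   # strip the min_/max_ prefix
        actual = metrics[metric]
        if name.startswith("min_"):
            results[name] = actual >= limit
        else:                            # max_* threshold
            results[name] = actual <= limit
    return results

metrics = {"gflops_per_s": 2431, "energy_per_gflop": 0.041, "temp_c": 87.2}
thresholds = {"min_gflops_per_s": 2000,
              "max_energy_per_gflop": 0.05,
              "max_temp_c": 85}
print(check_thresholds(metrics, thresholds))  # two PASSes, one FAIL (temp over 85)
```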
Run the same kernel under multiple nvidia-smi power caps and auto-plot the time vs energy Pareto front.
Per-run power/util/temp time series, plus an SQLite history of every run for cross-config comparisons.
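Because the history lives in SQLite, cross-config comparisons are plain SQL. The schema below is an illustrative assumption, not the app's actual one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema for the run-history store.
con.execute("""CREATE TABLE runs (
    test_name TEXT, power_cap_w REAL, elapsed_s REAL, energy_j REAL)""")
con.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [("matmul_medium_fp32", 150, 2.41, 312.7),
     ("matmul_medium_fp32", 300, 1.55, 447.1)])

# Cross-config comparison: the lowest-energy run per test.
for row in con.execute("""SELECT test_name, power_cap_w, MIN(energy_j)
                          FROM runs GROUP BY test_name"""):
    print(row)  # -> ('matmul_medium_fp32', 150.0, 312.7)
```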
Download metrics CSV/JSON, telemetry CSV, and a self-contained Plotly HTML for sharing — one click per run.
Register a new kernel with one function call. Built-in matmul ships with FP32/FP16/BF16 and configurable size + reps.
The Streamlit app is structured so each step of "measure → test → optimize → share" gets its own surface.
14+ NVML metrics · sample rate up to 50 Hz · energy method: ∫ P · dt
min_gflops_per_s: actual 2,431 ≥ threshold 2,000 (PASS)
max_energy_per_gflop: actual 0.041 ≤ threshold 0.050 (PASS)
max_temp_c: actual 87.2 > threshold 85 (FAIL)
Each threshold becomes its own card so failures are unmissable.
Pick a test, a min/max wattage and a number of steps. The bench applies each cap via nvidia-smi -pl, reruns the kernel under stable conditions, and plots the time-vs-energy Pareto front. Original cap restored on exit.
| cap_w | elapsed_s | energy_j | J/GFLOP |
|---|---|---|---|
| 150 | 2.41 | 312.7 | 0.046 |
| 200 | 1.88 | 340.2 | 0.050 |
| 250 | 1.62 | 388.9 | 0.057 |
| 300 | 1.55 | 447.1 | 0.066 |
Relative to the 300 W cap, the 150 W cap traded ~55% more time for ~30% less energy.
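Selecting the Pareto front from sweep rows like the table above is a small computation (a sketch, assuming lower time and lower energy are both better):

```python
def pareto_front(points):
    """Keep the (elapsed_s, energy_j) points not dominated by any other point."""
    front = []
    for t, e in points:
        dominated = any(t2 <= t and e2 <= e and (t2, e2) != (t, e)
                        for t2, e2 in points)
        if not dominated:
            front.append((t, e))
    return sorted(front)

sweep = [(2.41, 312.7), (1.88, 340.2), (1.62, 388.9), (1.55, 447.1)]
print(pareto_front(sweep))  # all four caps are Pareto-optimal in this sweep
```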
A kernel is any function that does its own warmup and torch.cuda.synchronize() and returns a KernelResult. Register it once, and it shows up in the test registry and history automatically.
import time
import torch

def _my_conv(device, size=512, channels=64, repetitions=20):
    x = torch.randn(8, channels, size, size, device=device)
    w = torch.randn(channels, channels, 3, 3, device=device)
    # Warmup, then synchronize so the timed region starts clean.
    for _ in range(2):
        torch.nn.functional.conv2d(x, w, padding=1)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(repetitions):
        y = torch.nn.functional.conv2d(x, w, padding=1)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    # 2 * C_out * C_in * K*K FLOPs per output element, batch 8;
    # padding=1 keeps the output at size x size, matching this count.
    flops = 2 * channels * channels * 9 * size * size * 8 * repetitions
    return KernelResult(elapsed_s=elapsed, flops=flops,
                        repetitions=repetitions)

register("my_conv", _my_conv)