Run small, reproducible GPU benchmarks. Measure joules, not just FLOPs. Set thresholds. Sweep power caps. Find the sweet spot between performance and power — automatically.
Requires an NVIDIA GPU with NVML/nvidia-smi accessible, plus Python 3.10+. The full app lives in gpu_energy_bench/.
Install dependencies
Tiny set: pynvml, torch, streamlit, plotly, pyyaml, pandas.
Launch the bench
Streamlit serves a 7-tab control panel locally.
Define tests in tests.yaml
Declarative thresholds. Edit and rerun — no code changes.
Sweep power caps & export
Pareto-optimize energy vs. time. Share results as CSV/JSON/HTML.
# 1. install
pip install -r gpu_energy_bench/requirements.txt
# 2. run the bench (opens in browser)
streamlit run gpu_energy_bench/streamlit_app.py
# 3. add a test in gpu_energy_bench/tests.yaml
- name: matmul_medium_fp32
  kernel: matmul
  params: { size: 4096, repetitions: 10, dtype: float32 }
  thresholds:
    min_gflops_per_s: 2000
    max_energy_per_gflop: 0.05
    max_temp_c: 85

Six things you'd otherwise glue together yourself.
Background NVML sampler integrates power over time (J = ∫P·dt) using trapezoidal integration around a synchronized timed region.
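The integration step can be sketched like this (a hypothetical helper, not the app's actual code): given timestamped power samples from the NVML sampler, energy is the trapezoidal sum over the timed region.

```python
def integrate_energy(samples):
    """Trapezoidal integration of (t_seconds, power_watts) samples -> joules."""
    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy_j += 0.5 * (p0 + p1) * (t1 - t0)
    return energy_j

# A constant 200 W draw sampled over 2 s integrates to 400 J.
samples = [(0.0, 200.0), (1.0, 200.0), (2.0, 200.0)]
print(integrate_energy(samples))  # -> 400.0
```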
Declare tests in tests.yaml with thresholds like min_gflops_per_s or max_energy_per_gflop. Each run renders a clear PASS/FAIL panel.
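The evaluation rule behind those thresholds can be pictured with a small sketch (a hypothetical helper; the app's real logic may differ): `min_*` thresholds require actual ≥ limit, `max_*` require actual ≤ limit.

```python
def check_thresholds(metrics, thresholds):
    """Return {threshold_name: bool}, where True means PASS."""
    results = {}
    for name, limit in thresholds.items():
        metric = name.split("_", 1)[1]   # strip the min_/max_ prefix
        actual = metrics[metric]
        if name.startswith("min_"):
            results[name] = actual >= limit
        else:                            # max_* threshold
            results[name] = actual <= limit
    return results

metrics = {"gflops_per_s": 2431, "energy_per_gflop": 0.041, "temp_c": 87.2}
thresholds = {"min_gflops_per_s": 2000,
              "max_energy_per_gflop": 0.05,
              "max_temp_c": 85}
print(check_thresholds(metrics, thresholds))  # two PASSes, one FAIL (temp over 85)
```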
Run the same kernel under multiple nvidia-smi power caps and auto-plot the time vs energy Pareto front.
Per-run power/util/temp time series, plus an SQLite history of every run for cross-config comparisons.
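Because the history lives in SQLite, cross-config comparisons are plain SQL. The schema below is an illustrative assumption, not the app's actual one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema for the run-history store.
con.execute("""CREATE TABLE runs (
    test_name TEXT, power_cap_w REAL, elapsed_s REAL, energy_j REAL)""")
con.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [("matmul_medium_fp32", 150, 2.41, 312.7),
     ("matmul_medium_fp32", 300, 1.55, 447.1)])

# Cross-config comparison: the lowest-energy run per test.
for row in con.execute("""SELECT test_name, power_cap_w, MIN(energy_j)
                          FROM runs GROUP BY test_name"""):
    print(row)  # -> ('matmul_medium_fp32', 150.0, 312.7)
```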
Download metrics CSV/JSON, telemetry CSV, and a self-contained Plotly HTML for sharing — one click per run.
Register a new kernel with one function call. Built-in matmul ships with FP32/FP16/BF16 and configurable size + reps.
The Streamlit app is structured so each step of "measure → test → optimize → share" gets its own surface.
14+ NVML metrics · sample rate up to 50 Hz · energy method: ∫ P · dt
min_gflops_per_s: actual 2,431 ≥ threshold 2,000 (PASS)
max_energy_per_gflop: actual 0.041 ≤ threshold 0.050 (PASS)
max_temp_c: actual 87.2 > threshold 85 (FAIL)
Each threshold becomes its own card so failures are unmissable.
Pick a test, a min/max wattage and a number of steps. The bench applies each cap via nvidia-smi -pl, reruns the kernel under stable conditions, and plots the time-vs-energy Pareto front. Original cap restored on exit.
| cap_w | elapsed_s | energy_j | J/GFLOP |
|---|---|---|---|
| 150 | 2.41 | 312.7 | 0.046 |
| 200 | 1.88 | 340.2 | 0.050 |
| 250 | 1.62 | 388.9 | 0.057 |
| 300 | 1.55 | 447.1 | 0.066 |
Relative to the 300 W cap, the 150 W cap traded ~55% more time for ~30% less energy.
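Selecting the Pareto front from sweep rows like the table above is a small computation (a sketch, assuming lower time and lower energy are both better):

```python
def pareto_front(points):
    """Keep the (elapsed_s, energy_j) points not dominated by any other point."""
    front = []
    for t, e in points:
        dominated = any(t2 <= t and e2 <= e and (t2, e2) != (t, e)
                        for t2, e2 in points)
        if not dominated:
            front.append((t, e))
    return sorted(front)

sweep = [(2.41, 312.7), (1.88, 340.2), (1.62, 388.9), (1.55, 447.1)]
print(pareto_front(sweep))  # all four caps are Pareto-optimal in this sweep
```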
A kernel is any function that does its own warmup and torch.cuda.synchronize() and returns a KernelResult. Register it once, and it shows up in the test registry and history automatically.
import time
import torch

def _my_conv(device, size=512, channels=64, repetitions=20):
    x = torch.randn(8, channels, size, size, device=device)
    w = torch.randn(channels, channels, 3, 3, device=device)
    # Warmup, then synchronize so the timed region starts clean.
    for _ in range(2):
        torch.nn.functional.conv2d(x, w, padding=1)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(repetitions):
        y = torch.nn.functional.conv2d(x, w, padding=1)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    # 2 * C_out * C_in * K*K FLOPs per output element, batch 8;
    # padding=1 keeps the output at size x size, matching this count.
    flops = 2 * channels * channels * 9 * size * size * 8 * repetitions
    return KernelResult(elapsed_s=elapsed, flops=flops,
                        repetitions=repetitions)

register("my_conv", _my_conv)