gpu_energy_bench

Pytest, but for GPU energy.

Run small, reproducible GPU benchmarks. Measure joules, not just FLOPs. Set thresholds. Sweep power caps. Find the sweet spot between performance and power — automatically.

PyTorch · pynvml · Streamlit · SQLite · Plotly

Up and running in 30 seconds

Requires an NVIDIA GPU with NVML/nvidia-smi accessible, plus Python 3.10+. The full app lives in gpu_energy_bench/.

  1. Install dependencies
     Tiny set: pynvml, torch, streamlit, plotly, pyyaml, pandas.

  2. Launch the bench
     Streamlit serves a 7-tab control panel locally.

  3. Define tests in tests.yaml
     Declarative thresholds. Edit and rerun — no code changes.

  4. Sweep power caps & export
     Pareto-optimize energy vs. time. Share results as CSV/JSON/HTML.

terminal
# 1. install
pip install -r gpu_energy_bench/requirements.txt

# 2. run the bench (opens in browser)
streamlit run gpu_energy_bench/streamlit_app.py

# 3. add a test in gpu_energy_bench/tests.yaml
- name: matmul_medium_fp32
  kernel: matmul
  params:  { size: 4096, repetitions: 10, dtype: float32 }
  thresholds:
    min_gflops_per_s:    2000
    max_energy_per_gflop: 0.05
    max_temp_c:           85

What's in the box

Six things you'd otherwise glue together yourself.

Accurate energy measurement

Background NVML sampler integrates power over time (J = ∫P·dt) using trapezoidal integration around a synchronized timed region.
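A minimal sketch of that idea, assuming pynvml's nvmlDeviceGetPowerUsage (milliwatts) and a plain background thread; the project's actual sampler may be structured differently.

python
import threading
import time

import pynvml


def sample_energy(run_fn, interval_s=0.02, device_index=0):
    """Run run_fn() while polling NVML power; return (elapsed_s, energy_j)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []  # (timestamp_s, power_w)
    stop = threading.Event()

    def poll():
        while not stop.is_set():
            mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # milliwatts
            samples.append((time.perf_counter(), mw / 1000.0))
            time.sleep(interval_s)

    thread = threading.Thread(target=poll, daemon=True)
    thread.start()
    t0 = time.perf_counter()
    run_fn()  # the kernel is expected to synchronize before returning
    elapsed = time.perf_counter() - t0
    stop.set()
    thread.join()
    pynvml.nvmlShutdown()

    # Trapezoidal rule: J = sum of (P_i + P_{i+1}) / 2 * (t_{i+1} - t_i)
    energy = sum((p0 + p1) / 2.0 * (tb - ta)
                 for (ta, p0), (tb, p1) in zip(samples, samples[1:]))
    return elapsed, energy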

Pytest-style tests

Declare tests in tests.yaml with thresholds like min_gflops_per_s or max_energy_per_gflop. Each run renders a clear PASS/FAIL panel.
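For illustration, a hypothetical check that maps min_*/max_* threshold names onto pass/fail comparisons. The function name and metric keys are assumptions, not the project's API.

python
def check_thresholds(metrics: dict, thresholds: dict) -> dict:
    """Return {threshold_name: (passed, actual, limit)} for min_*/max_* rules."""
    results = {}
    for name, limit in thresholds.items():
        if name.startswith("min_"):
            actual = metrics[name[len("min_"):]]
            results[name] = (actual >= limit, actual, limit)
        elif name.startswith("max_"):
            actual = metrics[name[len("max_"):]]
            results[name] = (actual <= limit, actual, limit)
    return results


# Example call with made-up numbers:
check_thresholds(
    {"gflops_per_s": 2500, "energy_per_gflop": 0.04, "temp_c": 80.0},
    {"min_gflops_per_s": 2000, "max_energy_per_gflop": 0.05, "max_temp_c": 85},
)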

Power-cap sweep

Run the same kernel under multiple nvidia-smi power caps and auto-plot the time vs energy Pareto front.

Telemetry & history

Per-run power/util/temp time series, plus an SQLite history of every run for cross-config comparisons.
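A rough sketch of what a run-history table could look like; the file name, schema, and columns here are assumptions for illustration, not the app's actual layout.

python
import json
import sqlite3
import time

# Hypothetical history database and schema.
conn = sqlite3.connect("history.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS runs (
        ts REAL, kernel TEXT, params TEXT,
        elapsed_s REAL, energy_j REAL, gflops_per_s REAL, passed INTEGER
    )
""")
conn.execute(
    "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?)",
    (time.time(), "matmul", json.dumps({"size": 4096, "dtype": "float32"}),
     1.88, 340.2, 2431.0, 1),
)
conn.commit()

# Cross-config comparison: cheapest joules per GFLOP first.
for row in conn.execute(
        "SELECT kernel, params, energy_j / (gflops_per_s * elapsed_s) AS j_per_gflop "
        "FROM runs ORDER BY j_per_gflop LIMIT 5"):
    print(row)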

Export everything

Download metrics CSV/JSON, telemetry CSV, and a self-contained Plotly HTML for sharing — one click per run.
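As a sketch of the export step, assuming pandas and Plotly; the file names and the metrics dict are illustrative only.

python
import json

import pandas as pd
import plotly.express as px

# Placeholder metrics and telemetry for one run.
metrics = {"test": "matmul_medium_fp32", "elapsed_s": 1.88,
           "energy_j": 340.2, "gflops_per_s": 2431.0}
telemetry = pd.DataFrame({"t_s": [0.00, 0.02, 0.04],
                          "power_w": [118.0, 213.5, 219.2]})

pd.DataFrame([metrics]).to_csv("run_metrics.csv", index=False)
with open("run_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
telemetry.to_csv("run_telemetry.csv", index=False)

# include_plotlyjs=True embeds the plotting library, so the HTML opens anywhere.
fig = px.line(telemetry, x="t_s", y="power_w", title=metrics["test"])
fig.write_html("run_power.html", include_plotlyjs=True)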

Pluggable kernels

Register a new kernel with one function call. Built-in matmul ships with FP32/FP16/BF16 and configurable size + reps.

Seven tabs. One workflow.

The Streamlit app is structured so each step of "measure → test → optimize → share" gets its own surface.

GPU info · Matrix benchmark · Test registry · Telemetry · Power limits · History · Export

  • NVML metrics: 14+
  • Sample rate: up to 50 Hz
  • Energy method: ∫ P · dt

Test result

matmul_medium_fp32
❌ max_temp_c ABOVE THRESHOLD

  min_gflops_per_s       actual 2,431    threshold 2,000    pass
  max_energy_per_gflop   actual 0.041    threshold 0.050    pass
  max_temp_c             actual 87.2     threshold 85       fail

Each threshold becomes its own card so failures are unmissable.
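One hypothetical way to render such cards with Streamlit's st.columns and st.metric; the real app's layout is likely different.

python
import streamlit as st

# Placeholder threshold results: name -> (actual, limit, passed)
checks = {
    "min_gflops_per_s": (2431.0, 2000.0, True),
    "max_energy_per_gflop": (0.041, 0.050, True),
}
cols = st.columns(len(checks))
for col, (name, (actual, limit, passed)) in zip(cols, checks.items()):
    col.metric(name, f"{actual:g}",
               delta="PASS" if passed else "FAIL",
               delta_color="normal" if passed else "inverse")
    col.caption(f"threshold {limit:g}")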

experiment mode

Find the energy sweet spot — automatically

Pick a test, a min/max wattage and a number of steps. The bench applies each cap via nvidia-smi -pl, reruns the kernel under stable conditions, and plots the time-vs-energy Pareto front. Original cap restored on exit.

  • Cap vs. elapsed time + total energy
  • Energy per GFLOP across caps
  • Every sweep run saved to history
power_sweep.csv
cap_w   elapsed_s   energy_j   J/GFLOP
150     2.41        312.7      0.046
200     1.88        340.2      0.050
250     1.62        388.9      0.057
300     1.55        447.1      0.066

A 150 W cap traded ~55% more time for 30% less energy.
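A minimal sketch of such a sweep loop, assuming nvidia-smi is on PATH and the process has permission to set power limits; the helper names are made up for illustration.

python
import subprocess


def sweep_power_caps(run_test, caps_w, gpu_index=0):
    """Apply each cap with nvidia-smi -pl, rerun the test, restore the original cap."""
    # Read the current limit so it can be restored afterwards.
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.limit", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    original_cap = float(out.stdout.strip())

    results = []
    try:
        for cap in caps_w:
            subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(cap)],
                           check=True)          # usually requires root
            elapsed_s, energy_j = run_test()    # e.g. the sampler sketch above
            results.append({"cap_w": cap, "elapsed_s": elapsed_s, "energy_j": energy_j})
    finally:
        subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(original_cap)],
                       check=True)
    return results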

Add your own kernel

A kernel is any function that does its own warmup + torch.cuda.synchronize() and returns a KernelResult. Register it once — it shows up in the test registry and history automatically.

gpu_energy_bench/kernels.py
import time
import torch


def _my_conv(device, size=512, channels=64, repetitions=20):
    x = torch.randn(8, channels, size, size, device=device)
    w = torch.randn(channels, channels, 3, 3, device=device)

    # Warmup so the timed region measures steady-state execution.
    for _ in range(2):
        torch.nn.functional.conv2d(x, w)
    torch.cuda.synchronize()

    # Timed region: the energy sampler integrates power over this window.
    t0 = time.perf_counter()
    for _ in range(repetitions):
        y = torch.nn.functional.conv2d(x, w)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    # Approximate FLOPs: 2 * (out_ch * in_ch * 3*3) per output element,
    # roughly size*size output elements, batch of 8, times repetitions.
    flops = 2 * channels * channels * 9 * size * size * 8 * repetitions
    return KernelResult(elapsed_s=elapsed, flops=flops,
                        repetitions=repetitions)

register("my_conv", _my_conv)
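Once registered, the kernel can be referenced by name from tests.yaml; the test name and thresholds below are placeholders to tune for your GPU.

gpu_energy_bench/tests.yaml
- name: my_conv_smoke
  kernel: my_conv
  params: { size: 512, channels: 64, repetitions: 20 }
  thresholds:
    min_gflops_per_s: 1000
    max_temp_c:        85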