Profiling pipelines#
How to find where time goes in a TorchFX pipeline, and the usual fixes. For comparative measurement against a baseline see Benchmarking; this guide is about diagnosing a single pipeline.
Time it correctly first#
Three things will give you wrong numbers if you skip them:
Warm up. Filters compute their coefficients lazily on the first
forward, and the native extension’s first call pays one-time setup. Run the pipeline a few times before timing.Synchronise CUDA. Kernel launches are asynchronous, so
perf_counteraround a GPU call measures only the launch, not the work. Calltorch.cuda.synchronize()before reading the clock.Use the median, not a single sample — the OS scheduler adds multi-millisecond outliers (see the realtime tail discussion in the benchmarking guide).
import time
import torch
import torchfx as fx
wave = fx.Wave(torch.randn(2, 480_000), 48_000)
chain = fx.filter.HiButterworth(80, order=4) | fx.filter.LoButterworth(12_000, order=8)
for _ in range(3): # warm up (lazy coeffs + first-call setup)
_ = wave | chain
samples = []
for _ in range(20):
torch.cuda.synchronize() if wave.ys.is_cuda else None
t0 = time.perf_counter_ns()
_ = wave | chain
torch.cuda.synchronize() if wave.ys.is_cuda else None
samples.append(time.perf_counter_ns() - t0)
samples.sort()
print(f"median {samples[len(samples)//2] / 1e6:.3f} ms")
The kernel timeline with torch.profiler#
To see which operations dominate (and, on CUDA, the launch vs. compute split):
from torch.profiler import profile, ProfilerActivity
acts = [ProfilerActivity.CPU]
if wave.ys.is_cuda:
acts.append(ProfilerActivity.CUDA)
with profile(activities=acts, record_shapes=True) as prof:
for _ in range(10):
_ = wave | chain
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))
# CPU-only: sort_by="self_cpu_time_total"
Look for the torchfx_ext:: entries (sos_forward, biquad_forward, …) — those are the
native kernels. A long tail of many short native calls usually means the chain didn’t
fuse (see below). Export a Chrome trace with prof.export_chrome_trace("trace.json")
and open it in chrome://tracing for a visual timeline.
Counting native dispatches#
The single most useful TorchFX-specific check: how many times does the pipeline cross into the native extension? A fused IIR cascade should be one dispatch regardless of depth.
uv run python tools/count_kernel_launches.py --depths 2,5,10,20
It reports native_calls (Python→C++ dispatches) and cuda_launches for the fused vs.
unfused paths. In your own code, wrap the dispatch to count it (the mechanism
tests/test_perf_regression.py uses):
from torchfx import _ops
orig = _ops.parallel_iir_forward
calls = 0
def counted(*a, **k):
global calls; calls += 1
return orig(*a, **k)
_ops.parallel_iir_forward = counted
_ = wave | chain
print("SOS dispatches:", calls) # 1 if the cascade fused
_ops.parallel_iir_forward = orig
Common bottlenecks and fixes#
Symptom |
Likely cause |
Fix |
|---|---|---|
Many short |
IIR filters applied one at a time |
Build the chain with |
GPU slower than expected on float32 in |
Input upcast to float64 |
Pass float32 tensors — the kernels dispatch on dtype and run the FP32 path (≈3× on consumer GPUs). |
High per-chunk latency on short GPU chunks |
Per-section launch overhead dominates |
Use |
Throughput-bound over many short files |
One launch per file, low occupancy |
|
Sequential vs. parallel scan crossover |
Signal length near |
Tune the |
See also#
Benchmarking — comparative benchmarking and the suite layout.
Testing — the deterministic perf-regression gates that keep fusion from silently breaking.