# Profiling pipelines

How to find where time goes in a TorchFX pipeline, and the usual fixes. For *comparative* measurement against a baseline see {doc}`/guides/developer/benchmarking`; this guide is about diagnosing a single pipeline.

## Time it correctly first

Three things will give you wrong numbers if you skip them:

- **Warm up.** Filters compute their coefficients lazily on the *first* `forward`, and the native extension's first call pays one-time setup. Run the pipeline a few times before timing.
- **Synchronise CUDA.** Kernel launches are asynchronous, so `perf_counter` around a GPU call measures only the *launch*, not the work. Call `torch.cuda.synchronize()` before reading the clock.
- **Use the median**, not a single sample: the OS scheduler adds multi-millisecond outliers (see the realtime tail discussion in the benchmarking guide).

```python
import time

import torch
import torchfx as fx

wave = fx.Wave(torch.randn(2, 480_000), 48_000)
chain = fx.filter.HiButterworth(80, order=4) | fx.filter.LoButterworth(12_000, order=8)

for _ in range(3):  # warm up (lazy coeffs + first-call setup)
    _ = wave | chain

samples = []
for _ in range(20):
    if wave.ys.is_cuda:
        torch.cuda.synchronize()  # drain pending work before starting the clock
    t0 = time.perf_counter_ns()
    _ = wave | chain
    if wave.ys.is_cuda:
        torch.cuda.synchronize()  # wait for the kernels, not just their launch
    samples.append(time.perf_counter_ns() - t0)

samples.sort()
print(f"median {samples[len(samples)//2] / 1e6:.3f} ms")
```

## The kernel timeline with `torch.profiler`

To see *which* operations dominate (and, on CUDA, the launch vs. compute split):

```python
from torch.profiler import profile, ProfilerActivity

# Reuses `wave` and `chain` from the timing example above.
acts = [ProfilerActivity.CPU]
if wave.ys.is_cuda:
    acts.append(ProfilerActivity.CUDA)

with profile(activities=acts, record_shapes=True) as prof:
    for _ in range(10):
        _ = wave | chain

# Sort by GPU self time when profiling CUDA, otherwise by CPU self time.
sort_key = "self_cuda_time_total" if wave.ys.is_cuda else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=15))
```

Look for the `torchfx_ext::` entries (`sos_forward`, `biquad_forward`, …): those are the native kernels. A long tail of many short native calls usually means the chain **didn't fuse** (see below). Export a Chrome trace with `prof.export_chrome_trace("trace.json")` and open it in `chrome://tracing` for a visual timeline.

## Counting native dispatches

The single most useful TorchFX-specific check: how many times does the pipeline cross into the native extension? A fused IIR cascade should be **one** dispatch regardless of depth.

```bash
uv run python tools/count_kernel_launches.py --depths 2,5,10,20
```

It reports `native_calls` (Python→C++ dispatches) and `cuda_launches` for the fused vs. unfused paths. In your own code, wrap the dispatch to count it (the mechanism `tests/test_perf_regression.py` uses):

```python
from torchfx import _ops

orig = _ops.parallel_iir_forward
calls = 0

def counted(*a, **k):
    """Count each dispatch, then forward to the real implementation."""
    global calls
    calls += 1
    return orig(*a, **k)

_ops.parallel_iir_forward = counted
try:
    _ = wave | chain
    print("SOS dispatches:", calls)  # 1 if the cascade fused
finally:
    _ops.parallel_iir_forward = orig  # always restore the original
```

## Common bottlenecks and fixes

| Symptom | Likely cause | Fix |
|---|---|---|
| Many short `sos_forward` calls | IIR filters applied one at a time | Build the chain with `\|` so the deferred planner **fuses** consecutive IIR/biquads into one `FusedSOSCascade`; avoid forcing materialisation between them. |
| GPU slower than expected on float32 input | Input upcast to float64 | Pass float32 tensors: the kernels dispatch on dtype and run the FP32 path (≈3× faster on consumer GPUs). |
| High per-chunk latency on short GPU chunks | Per-section launch overhead dominates | Use `torchfx.realtime.CudaGraphRunner` to replay a fixed-shape forward as one graph launch. |
| Throughput-bound over many short files | One launch per file, low occupancy | `torchfx.batch_process(waves, effect)` runs them in a single launch. |
| Sequential vs. parallel scan crossover | Signal length near `PARALLEL_SCAN_THRESHOLD` | Tune the `threshold=` argument (dtype-aware default); see `benchmarks/bench_threshold_sweep.py`. |

## See also

- {doc}`/guides/developer/benchmarking`: comparative benchmarking and the suite layout.
- {doc}`/guides/developer/testing`: the deterministic perf-regression gates that keep fusion from silently breaking.
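
## A reusable timing helper

The three rules from "Time it correctly first" (warm up, synchronise, take the median) can be wrapped in one small function so that every measurement follows them. The sketch below is illustrative only: `time_median_ms` and its `sync` parameter are hypothetical names, not part of TorchFX. It uses only the standard library, and the caller supplies the barrier.

```python
import statistics
import time
from typing import Callable, Optional


def time_median_ms(
    fn: Callable[[], object],
    *,
    warmup: int = 3,
    repeats: int = 20,
    sync: Optional[Callable[[], None]] = None,
) -> dict:
    """Time fn() and return median, min and max wall time in milliseconds.

    Hypothetical helper: `sync` is an optional barrier (for example
    torch.cuda.synchronize) called before and after each timed run.
    """
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    barrier = sync if sync is not None else (lambda: None)

    for _ in range(warmup):  # absorb lazy setup and first-call costs
        fn()

    samples_ns = []
    for _ in range(repeats):
        barrier()  # drain pending asynchronous work before starting the clock
        t0 = time.perf_counter_ns()
        fn()
        barrier()  # wait for the work itself, not just its launch
        samples_ns.append(time.perf_counter_ns() - t0)

    return {
        "median_ms": statistics.median(samples_ns) / 1e6,
        "min_ms": min(samples_ns) / 1e6,
        "max_ms": max(samples_ns) / 1e6,
    }
```

With the `wave` and `chain` objects from the timing example, a call might look like `time_median_ms(lambda: wave | chain, sync=torch.cuda.synchronize if wave.ys.is_cuda else None)`. A large gap between `median_ms` and `max_ms` is the scheduler-outlier effect that makes single samples unreliable.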