TorchFX 0.6.0: FP32 on the GPU, CUDA Graphs, and a Hardened Realtime Path#
TorchFX 0.6.0 is a performance and realtime release. The headline is the GPU follow-up promised back in 0.5.4: the CUDA kernels now run natively in float32 instead of silently upcasting to float64, which is 3.0–3.6× faster on consumer GPUs and finally lets the GPU beat its own CPU on multichannel workloads. On top of that, a new CUDA Graph path collapses the per-chunk launch overhead for streaming — up to 4× lower latency on short chunks — and the realtime engine moved its DSP off the audio callback into a dedicated worker thread.
Everything from 0.5.x still works unchanged. If your code ran on 0.5.4, it runs on 0.6.0 — the new GPU behaviour is opt-in by the dtype you pass.
This post is the overview. Each big-ticket item has its own deep-dive with benchmark numbers:
🟢 FP32 on the GPU: 3–3.6× and the end of the consumer-GPU penalty
🟢 CUDA Graphs for streaming: one launch instead of a launch storm
🟢 A hardened realtime path: worker threads, allocation-free streaming, and dtype-aware dispatch
The big three#
FP32 CUDA execution path#
The CUDA biquad and SOS parallel-scan kernels were double-only. A float32 input, the norm for realtime and ML pipelines, was silently upcast, which doubled memory traffic and, on a consumer GPU with a 1:32 FP64:FP32 throughput ratio (RTX 3070, A40), ran at a fraction of peak throughput. 0.6.0 templates the kernels on scalar_t and dispatches on the input dtype.
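To put the memory cost of the silent upcast in numbers, here is a back-of-the-envelope sketch for a 60-second, 8-channel signal at 48 kHz. The helper `buffer_bytes` is a made-up name for illustration, not part of TorchFX, and it only counts the bytes in one copy of the signal rather than the kernels' actual traffic.

```python
# Bytes per element for the two dtypes discussed here.
ITEMSIZE = {"float32": 4, "float64": 8}

def buffer_bytes(seconds, channels, dtype, sample_rate=48_000):
    """Size in bytes of one copy of a (channels, samples) signal buffer.

    Hypothetical helper for illustration; not a TorchFX function.
    """
    samples = int(seconds * sample_rate)
    return samples * channels * ITEMSIZE[dtype]

fp32 = buffer_bytes(60, 8, "float32")  # signal as the caller passed it
fp64 = buffer_bytes(60, 8, "float64")  # same signal after an upcast to double
print(f"float32: {fp32 / 1e6:.2f} MB, float64: {fp64 / 1e6:.2f} MB")
```

Every read and write of the signal moves twice as many bytes in float64, before the lower FP64 arithmetic rate is even considered.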
| 8th-order Butterworth @ 48 kHz (RTX 3070) | GPU FP64 | GPU FP32 | Speedup |
|---|---|---|---|
| 30 s / 1 ch | 9.49 ms | 2.80 ms | 3.39× |
| 60 s / 1 ch | 18.31 ms | 6.00 ms | 3.05× |
| 60 s / 8 ch | 89.18 ms | 24.49 ms | 3.64× |
That 8-channel row is the interesting one: in FP64 the GPU (89 ms) lost to its own CPU (28 ms). In FP32 (24 ms) it wins. Read the deep-dive →
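As a sanity check, the speedup column can be recomputed from the published timings. This is plain arithmetic on the numbers in the table above; `speedup` and `realtime_factor` are illustrative helper names, not TorchFX functions.

```python
# (GPU FP64 ms, GPU FP32 ms) per workload, copied from the table above.
timings_ms = {
    "30 s / 1 ch": (9.49, 2.80),
    "60 s / 1 ch": (18.31, 6.00),
    "60 s / 8 ch": (89.18, 24.49),
}

def speedup(fp64_ms, fp32_ms):
    """How many times faster the FP32 path is than the FP64 path."""
    return fp64_ms / fp32_ms

def realtime_factor(audio_seconds, elapsed_ms):
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / (elapsed_ms / 1000.0)

for name, (fp64_ms, fp32_ms) in timings_ms.items():
    print(f"{name}: {speedup(fp64_ms, fp32_ms):.2f}x")

# The 8-channel FP32 run filters 60 s of audio in 24.49 ms.
print(f"realtime factor: {realtime_factor(60, 24.49):.0f}x")
```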
CUDA Graphs for streaming#
A K-section cascade issues roughly 4·K CUDA kernel launches per chunk. At 48 kHz with 512-sample buffers that overhead dominates — the parallel scan measures ~135 µs of pure launch/dispatch overhead regardless of chunk length. The new torchfx.realtime.CudaGraphRunner captures the fixed-shape forward once and replays it per chunk:
| Chunk size (samples; 4-section cascade, RTX 3070) | Eager | Graph replay | Speedup |
|---|---|---|---|
| 128 | 209.9 µs | 52.2 µs | 4.02× |
| 512 | 208.9 µs | 79.9 µs | 2.62× |
| 1024 | 207.9 µs | 116.7 µs | 1.78× |
The win is largest exactly where it matters — the small-chunk realtime regime. Read the deep-dive →
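To see why the small-chunk regime is the sensitive one, it helps to compare each latency with the time a chunk lasts at 48 kHz. The sketch below is plain arithmetic on the table values; the helper names are illustrative, and it assumes the chunk period is the only deadline, ignoring any other work that shares the budget.

```python
SAMPLE_RATE = 48_000

# chunk size in samples -> (eager µs, graph replay µs), from the table above
latency_us = {128: (209.9, 52.2), 512: (208.9, 79.9), 1024: (207.9, 116.7)}

def chunk_period_us(chunk_size, sample_rate=SAMPLE_RATE):
    """Duration of one chunk in microseconds: the per-chunk deadline."""
    return chunk_size / sample_rate * 1e6

def budget_share(latency, chunk_size):
    """Fraction of the chunk period consumed by the filter call."""
    return latency / chunk_period_us(chunk_size)

for size, (eager, graph) in latency_us.items():
    print(
        f"{size:>5} samples: period {chunk_period_us(size):8.1f} µs, "
        f"eager {budget_share(eager, size):.1%}, "
        f"graph {budget_share(graph, size):.1%}"
    )
```

Eager latency is nearly flat across chunk sizes, so its share of the deadline grows as chunks shrink; graph replay claws most of that back at 128 samples.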
A hardened realtime path#
RealtimeProcessor now runs DSP in a dedicated worker thread rather than inside the PortAudio callback, and adds per-callback latency logging and xrun counters. Underneath, the GPU SOS forward became allocation-free per call (O(K) → O(1) allocations), which cuts the cost of the eager GPU streaming path by ~27%. Read the deep-dive →
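The general pattern of keeping DSP out of the audio callback can be sketched with the standard library alone. This is a toy model, not RealtimeProcessor's implementation: the class `WorkerPipeline`, its methods, and the way it counts xruns are all assumptions made for illustration.

```python
import queue
import threading

class WorkerPipeline:
    """Toy model: the callback only moves chunks; a worker thread runs the DSP."""

    def __init__(self, process, depth=8):
        self._process = process
        self._inbox = queue.Queue(maxsize=depth)   # callback -> worker
        self._outbox = queue.Queue()               # worker -> callback
        self.xruns = 0                             # dropped inputs + empty outputs
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            chunk = self._inbox.get()
            if chunk is None:                      # sentinel: shut down
                return
            self._outbox.put(self._process(chunk))

    def callback(self, chunk):
        """Stand-in for the audio callback: never waits on the DSP."""
        try:
            self._inbox.put_nowait(chunk)
        except queue.Full:
            self.xruns += 1                        # worker fell behind; drop input
        try:
            return self._outbox.get_nowait()
        except queue.Empty:
            self.xruns += 1                        # nothing ready yet
            return None

    def close(self):
        """Stop the worker and return any processed chunks not yet collected."""
        self._inbox.put(None)
        self._worker.join()
        remaining = []
        while not self._outbox.empty():
            remaining.append(self._outbox.get_nowait())
        return remaining

def run_demo():
    pipe = WorkerPipeline(lambda chunk: [0.5 * s for s in chunk])
    outputs = []
    for i in range(4):
        out = pipe.callback([float(i)] * 4)
        if out is not None:
            outputs.append(out)
    outputs.extend(pipe.close())
    return outputs

print(run_demo())
```

`queue.Queue` takes a lock internally, so a production audio path would more likely use a lock-free ring buffer; the queue is used here only to keep the sketch short.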
Also in this release#
- Static-gain folding. A constant linear `Gain` between filters is now folded into the fused SOS cascade, so `wave | IIR | Gain | IIR` materialises as a single `FusedSOSCascade` rather than three stages. (A clamping `Gain` or a dynamic `Normalize` still stays its own stage.)
- Dtype-aware dispatch threshold. `PARALLEL_SCAN_THRESHOLD` is now threaded into the kernels and tunable, with a dtype-aware default (float32 → 2048, float64 → 1024) backed by a crossover sweep.
- Explicit CUDA architectures. Wheels ship native SASS for `sm_75;80;86;89`, so there is no first-call PTX→SASS JIT stall.
- Correctness hardening. Filters reused across sample rates now recompute coefficients instead of silently applying stale ones; offline `Wave` materialisation resets state so reusing a filter across waves can't leak DF1 state; half-precision inputs are rejected with a clear error rather than silently upcast.
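Static-gain folding works because a constant linear gain commutes with a linear filter: scaling one section's numerator coefficients by g gives the same output as a separate gain stage. The pure-Python sketch below demonstrates that identity on two arbitrary stable biquads. It does not show how TorchFX's fused cascade does it internally; `biquad`, `cascade`, and `fold_gain` are illustrative names and the coefficients are made up.

```python
def biquad(x, b, a):
    """Direct-form I biquad with b = (b0, b1, b2) and a = (1, a1, a2)."""
    b0, b1, b2 = b
    _, a1, a2 = a
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out

def cascade(x, sections):
    """Run the signal through each (b, a) section in order."""
    for b, a in sections:
        x = biquad(x, b, a)
    return x

def fold_gain(sections, gain, index=0):
    """Return a copy of sections with gain folded into one numerator."""
    folded = list(sections)
    b, a = folded[index]
    folded[index] = (tuple(gain * c for c in b), a)
    return folded

# Two arbitrary stable sections (made-up coefficients) and a constant gain.
s1 = ((0.2, 0.4, 0.2), (1.0, -0.5, 0.25))
s2 = ((0.5, 0.0, -0.5), (1.0, -0.3, 0.1))
gain = 0.7
x = [1.0, 0.5, -0.25, 0.0, 0.75, -1.0, 0.0, 0.0]

# Three stages: filter, gain, filter.
staged = biquad([gain * v for v in biquad(x, *s1)], *s2)
# One cascade with the gain folded into the first section.
fused = cascade(x, fold_gain([s1, s2], gain))

max_diff = max(abs(p - q) for p, q in zip(staged, fused))
print(f"max difference: {max_diff:.2e}")
```

The two outputs agree to floating-point rounding. A clamping gain is nonlinear, so the same identity does not hold for it, which is consistent with it staying a separate stage.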
How is TorchFX doing against the field?#
On CPU, against torchaudio (same machine, median over the swept axes):
| Workload | TorchFX vs torchaudio |
|---|---|
| IIR cascade | 5.4× faster (up to 10.4×) |
| Single biquad | 2.1× faster |
| FIR (FFT) | 2.4× faster |
Installation#
pip install torchfx==0.6.0
CUDA wheels from the GitHub Pages index:
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install torchfx==0.6.0 \
--index-url https://matteospanio.github.io/torchfx/wheels/cu128/ \
--extra-index-url https://pypi.org/simple
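After installing, one way to confirm which versions ended up in the environment is to ask the standard library's package metadata. `installed_version` is a small helper written for this example; it relies only on `importlib.metadata` and makes no assumption about TorchFX's own API.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for name in ("torchfx", "torch"):
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'not installed'}")
```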
What’s next#
The high-leverage GPU wins are in. The remaining kernel work — a single-kernel SOS mega-kernel that makes fusion kernel-level (not just dispatch-level), a single-pass scan, and a cache-blocked CPU SIMD path for edge devices — is tracked in the roadmap (Epic 4.6). On the library side, the focus turns to dynamics effects (compressor/limiter) and audio-quality testing.
The full list of changes is in the CHANGELOG. File issues and feature requests on GitHub.