TorchFX 0.6.0: FP32 on the GPU, CUDA Graphs, and a Hardened Realtime Path

TorchFX 0.6.0 is a performance and realtime release. The headline is the GPU follow-up promised back in 0.5.4: the CUDA kernels now run natively in float32 instead of silently upcasting to float64, which is 3.0–3.6× faster on consumer GPUs and finally lets the GPU beat its own CPU on multichannel workloads. On top of that, a new CUDA Graph path collapses the per-chunk launch overhead for streaming — up to 4× lower latency on short chunks — and the realtime engine moved its DSP off the audio callback into a dedicated worker thread.

Everything from 0.5.x still works unchanged. If your code ran on 0.5.4, it runs on 0.6.0 — the new GPU behaviour is opt-in by the dtype you pass.

This post is the overview. Each big-ticket item has its own deep-dive with benchmark numbers, linked at the end of its section below.

The big three

FP32 CUDA execution path

The CUDA biquad and SOS parallel-scan kernels were previously double-only. A float32 input, the norm for realtime and ML pipelines, was silently upcast to float64. That doubled memory traffic and, on Ampere GA10x parts such as the RTX 3070 and A40, where FP64 throughput is 1/64 of FP32, ran the arithmetic at a fraction of peak. 0.6.0 templates the kernels on scalar_t and dispatches on the input dtype.
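
To put the memory-traffic cost of upcasting in concrete terms, the sketch below computes the raw sample-buffer size for the largest benchmark shape reported for this release (60 s, 8 channels, 48 kHz) at both precisions. It uses only the Python standard library and counts a single dense buffer; real kernel traffic also includes outputs and intermediate scan state, which are not modelled here. The helper name `buffer_bytes` is illustrative and not part of TorchFX.

```python
import struct

SAMPLE_RATE_HZ = 48_000


def buffer_bytes(seconds: float, channels: int, fmt: str) -> int:
    """Size in bytes of a dense (channels, samples) buffer.

    fmt is a struct format character: "f" is float32, "d" is float64.
    """
    samples = int(seconds * SAMPLE_RATE_HZ)
    return channels * samples * struct.calcsize(fmt)


fp32 = buffer_bytes(60, 8, "f")
fp64 = buffer_bytes(60, 8, "d")

print(f"float32 buffer: {fp32 / 2**20:.1f} MiB")
print(f"float64 buffer: {fp64 / 2**20:.1f} MiB")
print(f"float64 / float32: {fp64 / fp32:.1f}x")
```

Every pass a kernel makes over the signal moves twice as many bytes once the input has been upcast, independent of how fast the GPU's FP64 units are.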

| 8th-order Butterworth @ 48 kHz (RTX 3070) | GPU FP64 | GPU FP32 | Speedup |
|---|---|---|---|
| 30 s / 1 ch | 9.49 ms | 2.80 ms | 3.39× |
| 60 s / 1 ch | 18.31 ms | 6.00 ms | 3.05× |
| 60 s / 8 ch | 89.18 ms | 24.49 ms | 3.64× |

That 8-channel row is the interesting one: in FP64 the GPU (89 ms) lost to its own CPU (28 ms). In FP32 (24 ms) it wins. Read the deep-dive →
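
The speedup column and the CPU comparison can be rechecked from the reported timings. The snippet below is a small arithmetic check using the figures quoted above; the 28 ms CPU number is the rounded value from the text, so the comparison against it is approximate.

```python
# (GPU FP64 ms, GPU FP32 ms) for each benchmark row, copied from the table.
TIMINGS_MS = {
    "30 s / 1 ch": (9.49, 2.80),
    "60 s / 1 ch": (18.31, 6.00),
    "60 s / 8 ch": (89.18, 24.49),
}
CPU_8CH_MS = 28.0  # rounded CPU time quoted in the text for the 8-channel row

speedups = {row: fp64 / fp32 for row, (fp64, fp32) in TIMINGS_MS.items()}
for row, factor in speedups.items():
    print(f"{row}: {factor:.2f}x")

gpu_fp64_8ch, gpu_fp32_8ch = TIMINGS_MS["60 s / 8 ch"]
print("FP64 GPU beats CPU:", gpu_fp64_8ch < CPU_8CH_MS)
print("FP32 GPU beats CPU:", gpu_fp32_8ch < CPU_8CH_MS)
```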

CUDA Graphs for streaming

A K-section cascade issues roughly 4·K CUDA kernel launches per chunk. At 48 kHz with 512-sample buffers that overhead dominates — the parallel scan measures ~135 µs of pure launch/dispatch overhead regardless of chunk length. The new torchfx.realtime.CudaGraphRunner captures the fixed-shape forward once and replays it per chunk:

| Chunk size (4-section cascade, RTX 3070) | Eager | Graph replay | Speedup |
|---|---|---|---|
| 128 | 209.9 µs | 52.2 µs | 4.02× |
| 512 | 208.9 µs | 79.9 µs | 2.62× |
| 1024 | 207.9 µs | 116.7 µs | 1.78× |

The win is largest exactly where it matters — the small-chunk realtime regime. Read the deep-dive →
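
To relate these timings to a realtime deadline, the sketch below computes how long one chunk lasts at 48 kHz and what share of that period each path consumes. It is plain arithmetic on the table values; the helper names are illustrative, and the calculation ignores everything else that shares the deadline (host/device copies, other effects, driver latency).

```python
SAMPLE_RATE_HZ = 48_000

# chunk size -> (eager us, graph replay us), copied from the table above.
MEASURED_US = {
    128: (209.9, 52.2),
    512: (208.9, 79.9),
    1024: (207.9, 116.7),
}


def chunk_period_us(chunk_size: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of one chunk in microseconds, i.e. the processing deadline."""
    return chunk_size / sample_rate_hz * 1e6


def budget_share(cost_us: float, chunk_size: int) -> float:
    """Fraction of the chunk period spent on a step that takes cost_us."""
    return cost_us / chunk_period_us(chunk_size)


for chunk, (eager_us, graph_us) in MEASURED_US.items():
    print(
        f"{chunk:>5} samples: deadline {chunk_period_us(chunk):8.1f} us, "
        f"eager {budget_share(eager_us, chunk):5.1%}, "
        f"graph {budget_share(graph_us, chunk):5.1%}"
    )
```

The eager column is almost flat across chunk sizes, which is what a fixed per-chunk launch cost looks like, while the graph-replay time scales with the amount of work.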

A hardened realtime path

RealtimeProcessor now runs DSP in a dedicated worker thread rather than inside the PortAudio callback, and adds per-callback latency logging and xrun counters. Underneath, the GPU SOS forward became allocation-free per call (O(K) → O(1) allocations), which cuts the per-chunk time of the eager GPU streaming path by about 27%. Read the deep-dive →
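
The callback/worker split can be sketched with the standard library alone. This is a hypothetical illustration, not RealtimeProcessor's implementation: the class name, the queue depth, and the silence-on-underrun policy are all assumptions, and a production audio callback would typically use a lock-free ring buffer instead of `queue.Queue`, which takes locks.

```python
import queue
import threading


class WorkerThreadProcessor:
    """Hypothetical sketch: DSP runs in a worker, the callback only moves buffers."""

    def __init__(self, process, depth: int = 8):
        self._process = process
        self._inbox = queue.Queue(maxsize=depth)
        self._outbox = queue.Queue(maxsize=depth)
        self.xruns = 0  # counts dropped inputs and missed outputs
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            chunk = self._inbox.get()
            try:
                if chunk is None:  # sentinel pushed by close()
                    return
                self._outbox.put(self._process(chunk))
            finally:
                self._inbox.task_done()

    def callback(self, chunk):
        """Stand-in for the audio callback: never blocks, never runs DSP."""
        try:
            self._inbox.put_nowait(chunk)
        except queue.Full:
            self.xruns += 1  # worker fell behind; this input chunk is dropped
        try:
            return self._outbox.get_nowait()
        except queue.Empty:
            self.xruns += 1  # nothing ready yet; emit silence
            return [0.0] * len(chunk)

    def close(self):
        self._inbox.put(None)
        self._worker.join()


proc = WorkerThreadProcessor(lambda chunk: [0.5 * s for s in chunk])
first = proc.callback([1.0, 1.0])  # may be silence while the pipeline primes
proc.close()
print("first output:", first, "xruns:", proc.xruns)
```

The point of the arrangement is that a slow DSP step shows up as a counted xrun instead of stalling the audio driver's thread.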

Also in this release

  • Static-gain folding. A constant linear Gain between filters is now folded into the fused SOS cascade, so wave | IIR | Gain | IIR materialises as a single FusedSOSCascade rather than three stages. (A clamping Gain or a dynamic Normalize still stays its own stage.)

  • Dtype-aware dispatch threshold. PARALLEL_SCAN_THRESHOLD is now threaded into the kernels and tunable, with a dtype-aware default (float32 → 2048, float64 → 1024) backed by a crossover sweep.

  • Explicit CUDA architectures. Wheels ship native SASS for sm_75;80;86;89 — no first-call PTX→SASS JIT stall.

  • Correctness hardening. Filters reused across sample rates now recompute coefficients instead of silently applying stale ones; offline Wave materialisation resets state so reusing a filter across waves can’t leak DF1 state; half-precision inputs are rejected with a clear error rather than silently upcast.
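
Static-gain folding relies on linearity: a constant gain between two filters can be absorbed into the numerator coefficients of a neighbouring second-order section without changing the output. The pure-Python sketch below checks this on made-up coefficients. It illustrates the principle only; how FusedSOSCascade actually folds the gain (which section, which coefficients) is not specified here, and `biquad` and `fold_gain` are hypothetical helpers.

```python
import math


def biquad(x, b, a):
    """Direct-form I biquad. b = (b0, b1, b2), a = (a1, a2), with a0 = 1."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out


def fold_gain(b, gain):
    """Absorb a constant linear gain into a section's numerator."""
    return tuple(gain * c for c in b)


# Illustrative, stable coefficients (not taken from TorchFX).
b_first, a_first = (0.2, 0.4, 0.2), (-0.5, 0.3)
b_second, a_second = (0.5, -0.3, 0.1), (0.1, -0.2)
gain = 0.5
signal = [1.0, 0.0, 0.0, 0.5, -0.25, 0.0, 0.0, 0.0]

# Three stages: filter, gain, filter.
staged = biquad([gain * v for v in biquad(signal, b_first, a_first)], b_second, a_second)

# Two stages: the gain is folded into the first section's numerator.
folded = biquad(biquad(signal, fold_gain(b_first, gain), a_first), b_second, a_second)

max_err = max(abs(p - q) for p, q in zip(staged, folded))
print("max abs difference:", max_err)
```

A clamping gain is not linear, which is consistent with it staying a separate stage.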

How is TorchFX doing against the field?

On CPU, against torchaudio (same machine, median over the swept axes):

| Workload | TorchFX vs torchaudio |
|---|---|
| IIR cascade | 5.4× faster (up to 10.4×) |
| Single biquad | 2.1× faster |
| FIR (FFT) | 2.4× faster |

Installation

pip install torchfx==0.6.0

CUDA wheels from the GitHub Pages index:

pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install torchfx==0.6.0 \
    --index-url https://matteospanio.github.io/torchfx/wheels/cu128/ \
    --extra-index-url https://pypi.org/simple
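
After installing, one way to confirm which version ended up in the environment is to query the package metadata. The sketch below uses only `importlib.metadata` from the standard library, so it does not assume anything about TorchFX's own API; the helper name `installed_version` is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("torchfx")
print("torchfx:", found or "not installed")
```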

What’s next

The high-leverage GPU wins are in. The remaining kernel work — a single-kernel SOS mega-kernel that makes fusion kernel-level (not just dispatch-level), a single-pass scan, and a cache-blocked CPU SIMD path for edge devices — is tracked in the roadmap (Epic 4.6). On the library side, the focus turns to dynamics effects (compressor/limiter) and audio-quality testing.

The full list of changes is in the CHANGELOG. File issues and feature requests on GitHub.