--- blogpost: true date: Jun 04, 2026 author: Matteo Spanio category: releases tags: release, cuda, fp32, cuda-graphs, realtime, performance --- # TorchFX 0.6.0: FP32 on the GPU, CUDA Graphs, and a Hardened Realtime Path **TorchFX 0.6.0** is a performance and realtime release. The headline is the GPU follow-up promised back in 0.5.4: the CUDA kernels now run **natively in `float32`** instead of silently upcasting to `float64`, which is **3.0–3.6× faster** on consumer GPUs and finally lets the GPU beat its own CPU on multichannel workloads. On top of that, a new **CUDA Graph** path collapses the per-chunk launch overhead for streaming — up to **4× lower latency** on short chunks — and the realtime engine moved its DSP off the audio callback into a dedicated worker thread. Everything from 0.5.x still works unchanged. If your code ran on 0.5.4, it runs on 0.6.0 — the new GPU behaviour is opt-in by the dtype you pass. This post is the overview. Each big-ticket item has its own deep-dive with benchmark numbers: - 🟢 **[FP32 on the GPU: 3–3.6× and the end of the consumer-GPU penalty](2026-06-04-fp32-cuda.md)** - 🟢 **[CUDA Graphs for streaming: one launch instead of a launch storm](2026-06-04-cuda-graphs.md)** - 🟢 **[A hardened realtime path: worker threads, allocation-free streaming, and dtype-aware dispatch](2026-06-04-realtime-streaming.md)** ## The big three ### FP32 CUDA execution path The CUDA biquad and SOS parallel-scan kernels were `double`-only. A `float32` input — the norm for realtime and ML pipelines — was silently upcast, doubling memory traffic and, on a consumer GPU with a 1:32 FP32:FP64 ratio (RTX 3070, A40), running at a fraction of peak. 0.6.0 templates the kernels on `scalar_t` and dispatches on the input dtype. | 8th-order Butterworth @ 48 kHz (RTX 3070) | GPU FP64 | GPU FP32 | Speedup | |---|---:|---:|---:| | 30 s / 1 ch | 9.49 ms | 2.80 ms | **3.39×** | | 60 s / 1 ch | 18.31 ms | 6.00 ms | **3.05×** | | 60 s / 8 ch | 89.18 ms | **24.49 ms** | **3.64×** | That 8-channel row is the interesting one: in FP64 the GPU (89 ms) *lost* to its own CPU (28 ms). In FP32 (24 ms) it wins. [Read the deep-dive →](2026-06-04-fp32-cuda.md) ### CUDA Graphs for streaming A `K`-section cascade issues roughly `4·K` CUDA kernel launches *per chunk*. At 48 kHz with 512-sample buffers that overhead dominates — the parallel scan measures ~135 µs of pure launch/dispatch overhead regardless of chunk length. The new `torchfx.realtime.CudaGraphRunner` captures the fixed-shape forward once and replays it per chunk: | Chunk size (4-section cascade, RTX 3070) | Eager | Graph replay | Speedup | |---|---:|---:|---:| | 128 | 209.9 µs | 52.2 µs | **4.02×** | | 512 | 208.9 µs | 79.9 µs | **2.62×** | | 1024 | 207.9 µs | 116.7 µs | **1.78×** | The win is largest exactly where it matters — the small-chunk realtime regime. [Read the deep-dive →](2026-06-04-cuda-graphs.md) ### A hardened realtime path `RealtimeProcessor` now runs DSP in a dedicated worker thread, not inside the PortAudio callback, with per-callback latency logging and xrun counters. Underneath, the GPU SOS forward became **allocation-free per call** (O(K) → O(1) allocations), cutting the eager GPU streaming path ~27%. [Read the deep-dive →](2026-06-04-realtime-streaming.md) ## Also in this release - **Static-gain folding.** A constant linear `Gain` between filters is now folded into the fused SOS cascade, so `wave | IIR | Gain | IIR` materialises as a *single* `FusedSOSCascade` rather than three stages. (A clamping `Gain` or a dynamic `Normalize` still stays its own stage.) - **Dtype-aware dispatch threshold.** `PARALLEL_SCAN_THRESHOLD` is now threaded into the kernels and tunable, with a dtype-aware default (float32 → 2048, float64 → 1024) backed by a crossover sweep. - **Explicit CUDA architectures.** Wheels ship native SASS for `sm_75;80;86;89` — no first-call PTX→SASS JIT stall. - **Correctness hardening.** Filters reused across sample rates now recompute coefficients instead of silently applying stale ones; offline `Wave` materialisation resets state so reusing a filter across waves can't leak DF1 state; half-precision inputs are rejected with a clear error rather than silently upcast. ## How is TorchFX doing against the field? On CPU, against `torchaudio` (same machine, median over the swept axes): | Workload | TorchFX vs torchaudio | |---|---| | IIR cascade | **5.4× faster** (up to 10.4×) | | Single biquad | 2.1× faster | | FIR (FFT) | 2.4× faster | ## Installation ```bash pip install torchfx==0.6.0 ``` CUDA wheels from the GitHub Pages index: ```bash pip install torch --index-url https://download.pytorch.org/whl/cu128 pip install torchfx==0.6.0 \ --index-url https://matteospanio.github.io/torchfx/wheels/cu128/ \ --extra-index-url https://pypi.org/simple ``` ## What's next The high-leverage GPU wins are in. The remaining kernel work — a single-kernel SOS mega-kernel that makes fusion *kernel-level* (not just dispatch-level), a single-pass scan, and a cache-blocked CPU SIMD path for edge devices — is tracked in the [roadmap](../guides/developer/roadmap.md) (Epic 4.6). On the library side, the focus turns to dynamics effects (compressor/limiter) and audio-quality testing. The full list of changes is in the [CHANGELOG](https://github.com/matteospanio/torchfx/blob/master/CHANGELOG). File issues and feature requests on [GitHub](https://github.com/matteospanio/torchfx).