---
blogpost: true
date: Jun 04, 2026
author: Matteo Spanio
category: releases
tags: release, cuda, fp32, cuda-graphs, realtime, performance
---

# TorchFX 0.6.0: FP32 on the GPU, CUDA Graphs, and a Hardened Realtime Path

**TorchFX 0.6.0** is a performance and realtime release. The headline is the GPU follow-up promised back in 0.5.4: the CUDA kernels now run **natively in `float32`** instead of silently upcasting to `float64`, which is **3.0–3.6× faster** on consumer GPUs and finally lets the GPU beat its own CPU on multichannel workloads. On top of that, a new **CUDA Graph** path collapses the per-chunk launch overhead for streaming — up to **4× lower latency** on short chunks — and the realtime engine moved its DSP off the audio callback into a dedicated worker thread.

Everything from 0.5.x still works unchanged. If your code ran on 0.5.4, it runs on 0.6.0 — the new GPU behaviour is opt-in by the dtype you pass.

This post is the overview. Each big-ticket item has its own deep-dive with benchmark numbers:

- 🟢 **[FP32 on the GPU: 3–3.6× and the end of the consumer-GPU penalty](2026-06-04-fp32-cuda.md)**
- 🟢 **[CUDA Graphs for streaming: one launch instead of a launch storm](2026-06-04-cuda-graphs.md)**
- 🟢 **[A hardened realtime path: worker threads, allocation-free streaming, and dtype-aware dispatch](2026-06-04-realtime-streaming.md)**

## The big three

### FP32 CUDA execution path

The CUDA biquad and SOS parallel-scan kernels were `double`-only. A `float32` input — the norm for realtime and ML pipelines — was silently upcast, doubling memory traffic and, on a consumer GPU with a 1:32 FP32:FP64 ratio (RTX 3070, A40), running at a fraction of peak. 0.6.0 templates the kernels on `scalar_t` and dispatches on the input dtype.

| 8th-order Butterworth @ 48 kHz (RTX 3070) | GPU FP64 | GPU FP32 | Speedup |
|---|---:|---:|---:|
| 30 s / 1 ch | 9.49 ms | 2.80 ms | **3.39×** |
| 60 s / 1 ch | 18.31 ms | 6.00 ms | **3.05×** |
| 60 s / 8 ch | 89.18 ms | **24.49 ms** | **3.64×** |

That 8-channel row is the interesting one: in FP64 the GPU (89 ms) *lost* to its own CPU (28 ms). In FP32 (24 ms) it wins. [Read the deep-dive →](2026-06-04-fp32-cuda.md)

### CUDA Graphs for streaming

A `K`-section cascade issues roughly `4·K` CUDA kernel launches *per chunk*. At 48 kHz with 512-sample buffers that overhead dominates — the parallel scan measures ~135 µs of pure launch/dispatch overhead regardless of chunk length. The new `torchfx.realtime.CudaGraphRunner` captures the fixed-shape forward once and replays it per chunk:

| Chunk size (4-section cascade, RTX 3070) | Eager | Graph replay | Speedup |
|---|---:|---:|---:|
| 128 | 209.9 µs | 52.2 µs | **4.02×** |
| 512 | 208.9 µs | 79.9 µs | **2.62×** |
| 1024 | 207.9 µs | 116.7 µs | **1.78×** |

The win is largest exactly where it matters — the small-chunk realtime regime. [Read the deep-dive →](2026-06-04-cuda-graphs.md)

### A hardened realtime path

`RealtimeProcessor` now runs DSP in a dedicated worker thread, not inside the PortAudio callback, with per-callback latency logging and xrun counters. Underneath, the GPU SOS forward became **allocation-free per call** (O(K) → O(1) allocations), cutting the eager GPU streaming path ~27%. [Read the deep-dive →](2026-06-04-realtime-streaming.md)

## Also in this release

- **Static-gain folding.** A constant linear `Gain` between filters is now folded into the fused SOS cascade, so `wave | IIR | Gain | IIR` materialises as a *single* `FusedSOSCascade` rather than three stages. (A clamping `Gain` or a dynamic `Normalize` still stays its own stage.)
- **Dtype-aware dispatch threshold.** `PARALLEL_SCAN_THRESHOLD` is now threaded into the kernels and tunable, with a dtype-aware default (float32 → 2048, float64 → 1024) backed by a crossover sweep.
- **Explicit CUDA architectures.** Wheels ship native SASS for `sm_75;80;86;89` — no first-call PTX→SASS JIT stall.
- **Correctness hardening.** Filters reused across sample rates now recompute coefficients instead of silently applying stale ones; offline `Wave` materialisation resets state so reusing a filter across waves can't leak DF1 state; half-precision inputs are rejected with a clear error rather than silently upcast.

## How is TorchFX doing against the field?

On CPU, against `torchaudio` (same machine, median over the swept axes):

| Workload | TorchFX vs torchaudio |
|---|---|
| IIR cascade | **5.4× faster** (up to 10.4×) |
| Single biquad | 2.1× faster |
| FIR (FFT) | 2.4× faster |

## Installation

```bash
pip install torchfx==0.6.0
```

CUDA wheels from the GitHub Pages index:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install torchfx==0.6.0 \
    --index-url https://matteospanio.github.io/torchfx/wheels/cu128/ \
    --extra-index-url https://pypi.org/simple
```

## What's next

The high-leverage GPU wins are in. The remaining kernel work — a single-kernel SOS mega-kernel that makes fusion *kernel-level* (not just dispatch-level), a single-pass scan, and a cache-blocked CPU SIMD path for edge devices — is tracked in the [roadmap](../guides/developer/roadmap.md) (Epic 4.6). On the library side, the focus turns to dynamics effects (compressor/limiter) and audio-quality testing.

The full list of changes is in the [CHANGELOG](https://github.com/matteospanio/torchfx/blob/master/CHANGELOG). File issues and feature requests on [GitHub](https://github.com/matteospanio/torchfx).