--- blogpost: true date: Jun 09, 2026 author: Matteo Spanio category: releases tags: release, dynamics, reverb, simd, performance, edge, tooling --- # TorchFX 0.7.0: A Dynamics Suite, a Real Reverb, and the Edge **TorchFX 0.7.0** delivers exactly what 0.6.0's "what's next" promised — the single-pass GPU scan, a cache-blocked CPU SIMD path for edge devices, and a full dynamics toolkit — and then some. The headline is **dynamics**: a `Compressor`, an `Expander`/`Gate`, and a look-ahead brick-wall `Limiter`, each a native per-channel C++/CUDA kernel. Alongside them the old toy `Reverb` is replaced by a proper **Freeverb-style** algorithmic reverb, many short signals can now be processed in a **single batched launch**, and the whole release ships itself: tagging, PyPI, and these very release notes are now automated. One breaking change: `Reverb`'s constructor changed (see below). Everything else from 0.6.x runs unchanged. ## A full dynamics toolkit Four new dynamics processors, all built on the same high-quality decoupled peak detector (release max-hold + attack one-pole; Giannoulis, Massberg & Reiss, 2012) running one channel per thread in C++/CUDA: - **`Compressor`** — threshold/ratio/attack/release, soft knee, makeup gain, peak or RMS detection, and a `ratio=inf` limiter mode. - **`Expander` / `Gate`** — the mirror of the compressor: attenuate *below* the threshold. `Gate` is the infinite-ratio convenience (a hard noise gate) with an optional `floor` (range). - **`Limiter`** — a **look-ahead brick-wall** peak limiter. A vectorised forward max-pool computes the look-ahead window, the gain is attack/release-smoothed, and a per-sample clamp **guarantees `|y| ≤ threshold`** — so unlike `Compressor(ratio=inf)`, it can never overshoot a single sample. ```python import torchfx as fx mastered = ( wave | fx.effect.Compressor(threshold=-18, ratio=3, attack=0.005, release=0.08) | fx.effect.Limiter(threshold=-1.0, lookahead=0.005) ) ``` ## A reverb worth the name The old `Reverb` was a single feedforward comb — barely a reverb. 0.7.0 replaces it with the classic **Schroeder/Moorer (Freeverb) structure**: per channel, **8 parallel low-pass-feedback comb filters** summed through **4 series all-pass diffusers**, in a fused native kernel (ring buffers in per-channel scratch, tunings scaled to the sample rate) for a dense, natural decay. ```python # Breaking change: room_size / damping / mix (was delay / decay / mix) hall = fx.effect.Reverb(room_size=0.85, damping=0.3, mix=0.25) ``` ## Performance: the GPU scan, and the edge Two kernel-level wins close out the 0.6.0 roadmap, plus a new batching path. ### Single-pass SOS scan (GPU) The fused cascade's forcing pass and the three-phase Blelloch scan are folded into **one decoupled-look-back kernel** (Merrill–Garland / CUB style, with an atomic per-section tile dispenser for deadlock-free forward progress). Per fused section drops from ~8 CUDA launches to **2**: | K=50 cascade (RTX 3070) | 3-phase | single-pass | | |---|---:|---:|---:| | CUDA launches | 600 | **102** | **5.9× fewer** | | Fused cascade time | 18.9 ms | **9.9 ms** | **~1.9×** | It's now the default GPU path (`TORCHFX_FUSED_SCAN=0` for the legacy oracle), bit-exact to it on sm_86 and sm_89. ### Cache-blocked cross-channel CPU SIMD (the edge) When channels outnumber CPU cores, the scalar kernel runs them serially. A new path packs a SIMD vector of channels using **L1-resident tile transposes** (the fix for an earlier full-transpose attempt that went memory-bound), gated to `C > cores`. Validated on a **Raspberry Pi 5** (Cortex-A76 ×4) — the edge target: | 4-section SOS, float32 | Pi 5 | 12-core x86 | |---|---:|---:| | C=8 | **1.8×** | scalar (≤ cores) | | C=16 | **2.6×** | **1.3×** | | C=32 | **2.6×** | **2.4×** | No regression below the gate — a win on both the edge device and the desktop. ### Batched multi-signal processing `torchfx.batch_process(waves, effect)` pads many signals to a common length, concatenates them on the channel dimension, and runs the effect in a **single** kernel launch — numerically identical to per-file, but filling the GPU and amortising dispatch overhead. **~2.5–7× faster on CPU and ~3–4× on GPU** for 8–512 stereo files. Ideal for the CLI batch/watch modes. ## Tooling that ships itself - **Automated releases.** A version bump on `master` now tags `vX.Y.Z`, publishes the wheels to PyPI and the CUDA index, **and** opens a GitHub Release whose notes are the CHANGELOG section for that version. (This post's release ran through it.) - **Performance-regression gates** in ordinary CI — deterministic dispatch-count invariants (a depth-K cascade must fuse to *one* native call) plus a same-machine relative-time smoke, so fusion can't silently regress. - **Codecov** project/patch status, a **profiling guide** for diagnosing pipelines, and a fix for the flaky CUDA-wheel disk-space failures. ## Installation ```bash pip install torchfx==0.7.0 ``` CUDA wheels from the GitHub Pages index: ```bash pip install torch --index-url https://download.pytorch.org/whl/cu128 pip install torchfx==0.7.0 \ --index-url https://matteospanio.github.io/torchfx/wheels/cu128/ \ --extra-index-url https://pypi.org/simple ``` ## What's next With dynamics and the headline kernel work landed, the focus turns to the **v0.8.0 realtime milestone** — PipeWire/JACK integration and a GPU realtime path — plus production effects (chorus, flanger, phaser) on the road to v1.0. The full list of changes is in the [CHANGELOG](https://github.com/matteospanio/torchfx/blob/master/CHANGELOG); issues and feature requests are welcome on [GitHub](https://github.com/matteospanio/torchfx).