TorchFX 0.7.0: A Dynamics Suite, a Real Reverb, and the Edge#
TorchFX 0.7.0 delivers exactly what 0.6.0’s “what’s next” promised — the single-pass GPU scan, a cache-blocked CPU SIMD path for edge devices, and a full dynamics toolkit — and then some. The headline is dynamics: a Compressor, an Expander/Gate, and a look-ahead brick-wall Limiter, each a native per-channel C++/CUDA kernel. Alongside them the old toy Reverb is replaced by a proper Freeverb-style algorithmic reverb, many short signals can now be processed in a single batched launch, and the whole release ships itself: tagging, PyPI, and these very release notes are now automated.
One breaking change: Reverb’s constructor changed (see below). Everything else from 0.6.x runs unchanged.
A full dynamics toolkit#
Four new dynamics processors, all built on the same high-quality decoupled peak detector (release max-hold + attack one-pole; Giannoulis, Massberg & Reiss, 2012) running one channel per thread in C++/CUDA:
- Compressor: threshold/ratio/attack/release, soft knee, makeup gain, peak or RMS detection, and a ratio=inf limiter mode.
- Expander / Gate: the mirror of the compressor, attenuating below the threshold. Gate is the infinite-ratio convenience (a hard noise gate) with an optional floor (range).
- Limiter: a look-ahead brick-wall peak limiter. A vectorised forward max-pool computes the look-ahead window, the gain is attack/release-smoothed, and a per-sample clamp guarantees |y| ≤ threshold, so unlike Compressor(ratio=inf), it can never overshoot a single sample.
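To make the detector and the compressor's static curve concrete, here is a minimal pure-Python sketch that follows the formulas in Giannoulis, Massberg & Reiss (2012). It is not the TorchFX kernel: the function names, the coefficient mapping alpha = exp(-1 / (tau * sample_rate)), and the choice of a quadratic soft knee are assumptions made for illustration.

```python
import math


def decoupled_peak_envelope(x, sample_rate, attack, release):
    """Release max-hold followed by an attack one-pole (illustrative only)."""
    # Assumed mapping from time constants (seconds) to one-pole coefficients.
    a_att = math.exp(-1.0 / (attack * sample_rate))
    a_rel = math.exp(-1.0 / (release * sample_rate))
    hold = 0.0  # release stage: holds the peak, then decays exponentially
    env = 0.0   # attack stage: one-pole smoothing of the held peak
    out = []
    for sample in x:
        level = abs(sample)
        hold = max(level, a_rel * hold)
        env = a_att * env + (1.0 - a_att) * hold
        out.append(env)
    return out


def compressor_gain_db(level_db, threshold_db, ratio, knee_db=0.0):
    """Static gain (dB, always <= 0) for a detected level in dB."""
    over = level_db - threshold_db
    slope = 1.0 / ratio - 1.0  # ratio=inf gives slope -1 (limiter mode)
    if knee_db > 0.0 and 2.0 * abs(over) <= knee_db:
        # Quadratic interpolation inside the knee region.
        return slope * (over + knee_db / 2.0) ** 2 / (2.0 * knee_db)
    if over <= 0.0:
        return 0.0  # below the threshold: unity gain
    return slope * over
```

With threshold=-18 and ratio=3, a detected level of -6 dB sits 12 dB over the threshold, so the curve asks for 8 dB of gain reduction.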
import torchfx as fx

# wave: an already-loaded fx.Wave (assumed to exist at this point)
mastered = (
    wave
    | fx.effect.Compressor(threshold=-18, ratio=3, attack=0.005, release=0.08)
    | fx.effect.Limiter(threshold=-1.0, lookahead=0.005)
)
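The three limiter stages described above (forward max over the look-ahead window, attack/release smoothing of the gain, and a final per-sample clamp) can be sketched in plain Python. This is an illustration under stated assumptions rather than the native kernel: the function name, the look-ahead expressed in samples, the fixed smoothing coefficients, and the threshold being given in dBFS are all hypothetical choices for the example.

```python
def lookahead_limit(x, threshold_db, lookahead, attack_coef=0.5, release_coef=0.999):
    """Look-ahead peak limiter sketch; `lookahead` is a sample count."""
    threshold = 10.0 ** (threshold_db / 20.0)  # assumed dBFS -> linear amplitude
    gain = 1.0
    out = []
    for i in range(len(x)):
        # Forward max over the current sample and the next `lookahead` samples.
        peak = max(abs(s) for s in x[i:i + lookahead + 1])
        target = 1.0 if peak <= threshold else threshold / peak
        # Drop quickly toward a lower target (attack), recover slowly (release).
        coef = attack_coef if target < gain else release_coef
        gain = target + coef * (gain - target)
        y = x[i] * gain
        # The clamp is what makes the bound hold for every sample, even when
        # the smoothed gain has not fully reached its target yet.
        out.append(max(-threshold, min(threshold, y)))
    return out
```

Because the window looks forward, the gain starts falling before a peak arrives; the clamp then covers whatever the smoothing did not catch.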
A reverb worth the name#
The old Reverb was a single feedforward comb — barely a reverb. 0.7.0 replaces it with the classic Schroeder/Moorer (Freeverb) structure: per channel, 8 parallel low-pass-feedback comb filters summed through 4 series all-pass diffusers, in a fused native kernel (ring buffers in per-channel scratch, tunings scaled to the sample rate) for a dense, natural decay.
# Breaking change: room_size / damping / mix (was delay / decay / mix)
hall = fx.effect.Reverb(room_size=0.85, damping=0.3, mix=0.25)
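For readers unfamiliar with the structure, here is a minimal mono sketch of a Freeverb-style reverb in pure Python: 8 parallel low-pass-feedback combs summed into 4 series all-pass diffusers. The delay lengths and scaling constants are the ones from the public-domain Freeverb reference (tuned for 44.1 kHz); how TorchFX maps room_size and damping onto them is an assumption here, and the function name is hypothetical.

```python
COMB_DELAYS = (1116, 1188, 1277, 1356, 1422, 1491, 1557, 1617)  # Freeverb, 44.1 kHz
ALLPASS_DELAYS = (556, 441, 341, 225)


def freeverb_mono(x, sample_rate=44100, room_size=0.5, damping=0.5, mix=0.3):
    scale = sample_rate / 44100.0        # tunings scaled to the sample rate
    feedback = 0.7 + 0.28 * room_size    # Freeverb's room-size mapping (assumed here)
    damp = 0.4 * damping                 # Freeverb's damping mapping (assumed here)
    combs = [[0.0] * max(1, round(d * scale)) for d in COMB_DELAYS]
    comb_pos = [0] * len(combs)
    comb_lp = [0.0] * len(combs)         # one-pole low-pass state per comb
    allpasses = [[0.0] * max(1, round(d * scale)) for d in ALLPASS_DELAYS]
    ap_pos = [0] * len(allpasses)
    out = []
    for dry in x:
        inp = 0.015 * dry                # Freeverb's fixed input gain
        wet = 0.0
        for k, buf in enumerate(combs):  # parallel low-pass-feedback combs
            delayed = buf[comb_pos[k]]
            comb_lp[k] = delayed * (1.0 - damp) + comb_lp[k] * damp
            buf[comb_pos[k]] = inp + comb_lp[k] * feedback
            comb_pos[k] = (comb_pos[k] + 1) % len(buf)
            wet += delayed
        for k, buf in enumerate(allpasses):  # series all-pass diffusers
            delayed = buf[ap_pos[k]]
            buf[ap_pos[k]] = wet + delayed * 0.5
            ap_pos[k] = (ap_pos[k] + 1) % len(buf)
            wet = delayed - wet
        out.append((1.0 - mix) * dry + mix * wet)
    return out
```

Feeding it a single impulse with mix=1.0 shows the behaviour: silence until the shortest comb delay has elapsed, then an increasingly dense train of echoes.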
Performance: the GPU scan, and the edge#
Two kernel-level wins close out the 0.6.0 roadmap, plus a new batching path.
Single-pass SOS scan (GPU)#
The fused cascade’s forcing pass and the three-phase Blelloch scan are folded into one decoupled-look-back kernel (Merrill–Garland / CUB style, with an atomic per-section tile dispenser for deadlock-free forward progress). The launch count per fused section drops from ~8 CUDA launches to 2:
| K=50 cascade (RTX 3070) | 3-phase | single-pass | Improvement |
|---|---|---|---|
| CUDA launches | 600 | 102 | 5.9× fewer |
| Fused cascade time | 18.9 ms | 9.9 ms | ~1.9× |
It’s now the default GPU path (set TORCHFX_FUSED_SCAN=0 to fall back to the legacy three-phase path, kept as a reference oracle), and it is bit-exact to that path on sm_86 and sm_89.
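The idea behind a tiled scan of a recurrence can be illustrated on a much simpler case. The sketch below is a sequential, pure-Python stand-in, not the CUDA kernel: it uses a first-order recurrence y[n] = a*y[n-1] + f[n] instead of second-order sections, and the function names are hypothetical. Each tile first reduces its samples to one affine map; a running prefix over the preceding tiles (the role look-back plays on the GPU) then supplies the state entering each tile.

```python
def combine(left, right):
    """Compose two affine maps y -> a*y + b, applying `left` first."""
    a_l, b_l = left
    a_r, b_r = right
    return (a_r * a_l, a_r * b_l + b_r)


def tiled_recurrence(a, forcing, tile):
    """Compute y[n] = a*y[n-1] + forcing[n] with y[-1] = 0, tile by tile."""
    tiles = [forcing[i:i + tile] for i in range(0, len(forcing), tile)]
    # Step 1: each tile independently reduces its samples to one aggregate map.
    aggregates = []
    for t in tiles:
        agg = (1.0, 0.0)  # identity map
        for f in t:
            agg = combine(agg, (a, f))
        aggregates.append(agg)
    # Step 2: the prefix over all earlier tiles gives the incoming state,
    # after which each tile can finish its own samples locally.
    out = []
    prefix = (1.0, 0.0)
    for t, agg in zip(tiles, aggregates):
        y = prefix[1]  # state at the end of the previous tiles (initial state 0)
        for f in t:
            y = a * y + f
            out.append(y)
        prefix = combine(prefix, agg)
    return out
```

Because the combine step is associative, the per-tile reductions do not depend on each other, which is what lets a GPU run them concurrently.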
Cache-blocked cross-channel CPU SIMD (the edge)#
When channels outnumber CPU cores, the scalar kernel runs them serially. A new path packs a SIMD vector of channels using L1-resident tile transposes (the fix for an earlier full-transpose attempt that went memory-bound), gated to C > cores. Validated on a Raspberry Pi 5 (Cortex-A76 ×4) — the edge target:
| 4-section SOS, float32 | Pi 5 | 12-core x86 |
|---|---|---|
| C=8 | 1.8× | scalar (≤ cores) |
| C=16 | 2.6× | 1.3× |
| C=32 | 2.6× | 2.4× |
No regression below the gate — a win on both the edge device and the desktop.
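The blocking scheme can be pictured with a small pure-Python model. This is only a sketch of the data movement, assuming a single biquad section in transposed direct form II; the names, lane count, and tile size are hypothetical, and the inner loop over lanes stands in for what would be one SIMD instruction per operation in native code.

```python
def biquad_scalar(x, b0, b1, b2, a1, a2):
    """Reference: one channel, transposed direct form II."""
    z1 = z2 = 0.0
    y = []
    for s in x:
        out = b0 * s + z1
        z1 = b1 * s - a1 * out + z2
        z2 = b2 * s - a2 * out
        y.append(out)
    return y


def biquad_blocked(channels, coeffs, lanes=4, tile=64):
    """Process `lanes` channels in lock-step, one small time tile at a time."""
    b0, b1, b2, a1, a2 = coeffs
    n = len(channels[0])
    out = [[0.0] * n for _ in channels]
    for c0 in range(0, len(channels), lanes):
        group = range(c0, min(c0 + lanes, len(channels)))
        z1 = [0.0] * len(group)
        z2 = [0.0] * len(group)
        for t0 in range(0, n, tile):
            t1 = min(t0 + tile, n)
            # Transpose a small tile to time-major: one row of channels per sample.
            block = [[channels[c][t] for c in group] for t in range(t0, t1)]
            for row in block:
                for lane, s in enumerate(row):  # stands in for one SIMD vector op
                    y = b0 * s + z1[lane]
                    z1[lane] = b1 * s - a1 * y + z2[lane]
                    z2[lane] = b2 * s - a2 * y
                    row[lane] = y
            # Transpose the tile back into channel-major output.
            for lane, c in enumerate(group):
                for i, t in enumerate(range(t0, t1)):
                    out[c][t] = block[i][lane]
    return out
```

Keeping the tile small is the point: only a tile-sized slab is transposed at a time, so the working set can stay cache-resident instead of streaming a full transposed copy of the signal through memory.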
Batched multi-signal processing#
torchfx.batch_process(waves, effect) pads many signals to a common length, concatenates them on the channel dimension, and runs the effect in a single kernel launch. The result is numerically identical to per-file processing, but the batch fills the GPU and amortises dispatch overhead: it is ~2.5–7× faster on CPU and ~3–4× faster on GPU for 8–512 stereo files. Ideal for the CLI batch/watch modes.
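The pad, concatenate, run, and split steps can be sketched with plain Python lists. This is a conceptual model rather than the torchfx.batch_process implementation: the function name is hypothetical, signals are represented as lists of channels, and the effect is assumed to be causal and to treat each channel independently, which is why trailing zero padding cannot change the samples that are kept.

```python
def batch_process_sketch(signals, effect):
    """signals: list of signals, each a list of channels (lists of samples)."""
    longest = max(len(ch) for sig in signals for ch in sig)
    stacked = []  # every channel of every signal, padded to a common length
    layout = []   # (channel count, original length) per signal
    for sig in signals:
        layout.append((len(sig), len(sig[0])))
        for ch in sig:
            stacked.append(list(ch) + [0.0] * (longest - len(ch)))
    processed = effect(stacked)  # one call for the whole batch
    results = []
    start = 0
    for n_channels, n_samples in layout:
        chunk = processed[start:start + n_channels]
        results.append([ch[:n_samples] for ch in chunk])  # trim the padding
        start += n_channels
    return results
```

The per-signal layout is recorded before stacking so that each output can be cut back to its own channel count and length afterwards.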
Tooling that ships itself#
- Automated releases. A version bump on master now tags vX.Y.Z, publishes the wheels to PyPI and the CUDA index, and opens a GitHub Release whose notes are the CHANGELOG section for that version. (This post’s release ran through it.)
- Performance-regression gates in ordinary CI: deterministic dispatch-count invariants (a depth-K cascade must fuse to one native call) plus a same-machine relative-time smoke, so fusion can’t silently regress.
- Codecov project/patch status, a profiling guide for diagnosing pipelines, and a fix for the flaky CUDA-wheel disk-space failures.
Installation#
pip install torchfx==0.7.0
CUDA wheels from the GitHub Pages index:
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install torchfx==0.7.0 \
--index-url https://matteospanio.github.io/torchfx/wheels/cu128/ \
--extra-index-url https://pypi.org/simple
What’s next#
With dynamics and the headline kernel work landed, the focus turns to the v0.8.0 realtime milestone — PipeWire/JACK integration and a GPU realtime path — plus production effects (chorus, flanger, phaser) on the road to v1.0. The full list of changes is in the CHANGELOG; issues and feature requests are welcome on GitHub.