TorchFX 0.7.0: A Dynamics Suite, a Real Reverb, and the Edge#

TorchFX 0.7.0 delivers exactly what 0.6.0’s “what’s next” promised — the single-pass GPU scan, a cache-blocked CPU SIMD path for edge devices, and a full dynamics toolkit — and then some. The headline is dynamics: a Compressor, an Expander/Gate, and a look-ahead brick-wall Limiter, each a native per-channel C++/CUDA kernel. Alongside them the old toy Reverb is replaced by a proper Freeverb-style algorithmic reverb, many short signals can now be processed in a single batched launch, and the whole release ships itself: tagging, PyPI, and these very release notes are now automated.

One breaking change: Reverb’s constructor changed (see below). Everything else from 0.6.x runs unchanged.

A full dynamics toolkit#

Four new dynamics processors (Compressor, Expander, Gate, and Limiter), all built on the same high-quality decoupled peak detector (a release max-hold stage followed by an attack one-pole; Giannoulis, Massberg & Reiss, 2012) and running one channel per thread in C++/CUDA:

  • Compressor — threshold/ratio/attack/release, soft knee, makeup gain, peak or RMS detection, and a ratio=inf limiter mode.

  • Expander / Gate — the mirror of the compressor: attenuate below the threshold. Gate is the infinite-ratio convenience (a hard noise gate) with an optional floor (range).

  • Limiter — a look-ahead brick-wall peak limiter. A vectorised forward max-pool computes the look-ahead window, the gain is attack/release-smoothed, and a per-sample clamp guarantees |y| ≤ threshold, so unlike Compressor(ratio=inf) it can never overshoot, not even for a single sample.
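To make the shared building blocks concrete, here is a minimal pure-Python sketch of a decoupled peak detector and a soft-knee compressor curve in the form given by Giannoulis, Massberg & Reiss. It is an illustration only, not TorchFX's kernel: the function names, the `knee_db` parameter, and the choice of feeding non-negative levels to the detector are assumptions made for this example.

```python
import math

def time_coeff(seconds, sample_rate):
    """One-pole smoothing coefficient for a time constant in seconds."""
    return math.exp(-1.0 / (seconds * sample_rate))

def decoupled_peak_detector(levels, attack, release, sample_rate):
    """Release max-hold stage followed by an attack one-pole stage.

    `levels` is assumed non-negative (e.g. |x| or a gain reduction in dB).
    """
    a_att = time_coeff(attack, sample_rate)
    a_rel = time_coeff(release, sample_rate)
    held = smoothed = 0.0
    out = []
    for x in levels:
        held = max(x, a_rel * held)                         # release: decaying max-hold
        smoothed = a_att * smoothed + (1.0 - a_att) * held  # attack: one-pole low-pass
        out.append(smoothed)
    return out

def compressor_curve_db(level_db, threshold_db, ratio, knee_db=0.0):
    """Static compression curve: input level (dB) -> output level (dB)."""
    over = level_db - threshold_db
    if 2.0 * over < -knee_db:
        return level_db                                     # below the knee: unchanged
    if knee_db > 0.0 and 2.0 * abs(over) <= knee_db:
        # Inside the knee: quadratic blend between the two straight segments.
        slope = 1.0 / ratio - 1.0                           # ratio=inf gives -1
        return level_db + slope * (over + knee_db / 2.0) ** 2 / (2.0 * knee_db)
    return threshold_db + over / ratio                      # above the knee

# Hard knee, threshold -18 dB, ratio 3: a -6 dB input comes out at -14 dB.
print(compressor_curve_db(-6.0, -18.0, 3.0))
```

With `ratio=float("inf")` the curve flattens at the threshold, which is the limiter mode mentioned in the Compressor bullet; the smoothing in the detector is why that mode alone can still let a fast transient through.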

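import torchfx as fx

# `wave` is an fx.Wave that has already been loaded.
# Compress first, then catch any remaining peaks with the look-ahead limiter.
mastered = (
    wave
    | fx.effect.Compressor(threshold=-18, ratio=3, attack=0.005, release=0.08)
    | fx.effect.Limiter(threshold=-1.0, lookahead=0.005)
)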
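For readers curious how a look-ahead limiter can be brick-wall, the three stages named in the Limiter bullet (forward max over the look-ahead window, attack/release gain smoothing, final per-sample clamp) can be sketched in plain Python. This is a simplified offline illustration under assumed parameters: the look-ahead is given in samples, `attack` and `release` are raw one-pole coefficients, and the threshold is linear rather than in dB. None of these reflect TorchFX's actual signature.

```python
def lookahead_limiter(x, threshold=0.9, lookahead=8, attack=0.5, release=0.999):
    """Offline look-ahead limiter sketch (hypothetical parameters, linear units)."""
    gain = 1.0
    y = []
    for i in range(len(x)):
        # Forward max over the look-ahead window x[i .. i + lookahead].
        peak = max(abs(v) for v in x[i:i + lookahead + 1])
        target = min(1.0, threshold / peak) if peak > 0.0 else 1.0
        # Fast coefficient while the gain has to fall, slow one while it recovers.
        coeff = attack if target < gain else release
        gain = coeff * gain + (1.0 - coeff) * target
        # The smoothed gain can lag behind the target, so clamp every sample.
        out = x[i] * gain
        y.append(max(-threshold, min(threshold, out)))
    return y

signal = [0.1] * 20 + [2.0] + [0.1] * 20   # one large peak in a quiet signal
limited = lookahead_limiter(signal)
print(max(abs(v) for v in limited))        # never above the threshold
```

Because the window looks forward, the gain starts falling a few samples before the peak arrives; the clamp only has to absorb whatever the smoothing has not yet caught.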

A reverb worth the name#

The old Reverb was a single feedforward comb — barely a reverb. 0.7.0 replaces it with the classic Schroeder/Moorer (Freeverb) structure: per channel, 8 parallel low-pass-feedback comb filters summed through 4 series all-pass diffusers, in a fused native kernel (ring buffers in per-channel scratch, tunings scaled to the sample rate) for a dense, natural decay.

# Breaking change: room_size / damping / mix (was delay / decay / mix)
hall = fx.effect.Reverb(room_size=0.85, damping=0.3, mix=0.25)
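The structure described above (8 parallel low-pass-feedback combs into 4 series all-passes) can be sketched for a single channel in pure Python. The delay lengths, the 0.015 input gain, the 0.5 all-pass feedback, and the room/damping scalings below are the constants of the public-domain Freeverb reference; whether TorchFX maps `room_size` and `damping` the same way is an assumption, and `freeverb_mono` is a name invented for this example.

```python
COMB_DELAYS = (1116, 1188, 1277, 1356, 1422, 1491, 1557, 1617)  # samples at 44.1 kHz
ALLPASS_DELAYS = (556, 441, 341, 225)

def freeverb_mono(x, sample_rate=44100, room_size=0.85, damping=0.3, mix=0.25):
    """Single-channel Freeverb-style reverb sketch (reference constants, assumed mapping)."""
    scale = sample_rate / 44100.0           # tunings scaled to the sample rate
    feedback = 0.7 + 0.28 * room_size       # Freeverb reference mapping
    damp = 0.4 * damping
    combs = [[0.0] * max(1, round(d * scale)) for d in COMB_DELAYS]
    comb_lp = [0.0] * len(combs)            # low-pass state inside each comb loop
    comb_pos = [0] * len(combs)
    allpasses = [[0.0] * max(1, round(d * scale)) for d in ALLPASS_DELAYS]
    ap_pos = [0] * len(allpasses)
    out = []
    for sample in x:
        feed = 0.015 * sample
        wet = 0.0
        # Parallel low-pass-feedback comb filters, summed.
        for k, buf in enumerate(combs):
            delayed = buf[comb_pos[k]]
            comb_lp[k] = delayed * (1.0 - damp) + comb_lp[k] * damp
            buf[comb_pos[k]] = feed + comb_lp[k] * feedback
            comb_pos[k] = (comb_pos[k] + 1) % len(buf)
            wet += delayed
        # Series all-pass diffusers (Freeverb's approximation, feedback 0.5).
        for k, buf in enumerate(allpasses):
            delayed = buf[ap_pos[k]]
            buf[ap_pos[k]] = wet + 0.5 * delayed
            ap_pos[k] = (ap_pos[k] + 1) % len(buf)
            wet = delayed - wet
        out.append((1.0 - mix) * sample + mix * wet)
    return out

impulse = [1.0] + [0.0] * 2999
tail = freeverb_mono(impulse, mix=1.0)      # wet signal only
```

Feeding an impulse shows the behaviour: the wet output stays silent until the shortest comb delay has elapsed, then echoes from the eight combs begin to overlap and the all-passes smear them into a dense tail.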

Performance: the GPU scan, and the edge#

Two kernel-level wins close out the 0.6.0 roadmap, plus a new batching path.

Single-pass SOS scan (GPU)#

The fused cascade’s forcing pass and the three-phase Blelloch scan are folded into one decoupled-look-back kernel (Merrill–Garland / CUB style, with an atomic per-section tile dispenser for deadlock-free forward progress). Each fused section now takes 2 CUDA launches instead of ~8:

K=50 cascade (RTX 3070)   3-phase    single-pass   Change
CUDA launches             600        102           5.9× fewer
Fused cascade time        18.9 ms    9.9 ms        ~1.9×

The single-pass scan is now the default GPU path and is bit-exact with the legacy three-phase path on sm_86 and sm_89. Set TORCHFX_FUSED_SCAN=0 to select the legacy path, which is kept as the reference oracle.
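Why can a recursive filter be scanned in parallel at all? Each step of a linear recurrence is an affine map, and composing affine maps is associative, so tiles can be reduced independently and stitched together with the aggregate of the tiles before them. The sketch below shows only that algebra, for a first-order recurrence y[n] = a·y[n−1] + x[n] in sequential Python. It does not model the decoupled look-back protocol, the tile dispenser, or second-order sections, and all names are invented for the example.

```python
def combine(f, g):
    """Compose two affine maps y -> a*y + b: apply f first, then g."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)

def sequential(x, a):
    """Reference: run the recurrence y[n] = a*y[n-1] + x[n] directly."""
    y, out = 0.0, []
    for v in x:
        y = a * y + v
        out.append(y)
    return out

def tiled_scan(x, a, tile=4):
    """Same result, computed tile by tile from composed affine maps."""
    prefix = (1.0, 0.0)                     # identity: aggregate of all earlier tiles
    out = []
    for start in range(0, len(x), tile):
        local = (1.0, 0.0)                  # inclusive scan within this tile
        for v in x[start:start + tile]:
            local = combine(local, (a, v))
            # Prefix of earlier tiles + local scan gives y[n] (initial state 0).
            out.append(combine(prefix, local)[1])
        prefix = combine(prefix, local)     # publish this tile's aggregate
    return out

x = [1.0, 0.5, -0.25, 2.0, 0.0, 1.5, -1.0, 0.75, 0.1]
print(tiled_scan(x, 0.5, tile=4))
print(sequential(x, 0.5))
```

On a GPU each tile's local scan runs concurrently; a single-pass scheme lets a tile pick up the prefix of its predecessors as soon as it is published instead of waiting for separate kernel launches.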

Cache-blocked cross-channel CPU SIMD (the edge)#

When channels outnumber CPU cores, the scalar kernel runs them serially. A new path packs several channels into each SIMD vector using L1-resident tile transposes (the fix for an earlier full-transpose attempt that went memory-bound), and is enabled only when the channel count C exceeds the core count. Validated on a Raspberry Pi 5 (Cortex-A76 ×4), the edge target:

4-section SOS, float32   Pi 5    12-core x86
C=8                      1.8×    scalar (≤ cores)
C=16                     2.6×    1.3×
C=32                     2.6×    2.4×

Below the gate (C ≤ cores) the scalar kernel still runs, so nothing regresses there; above it, the new path is a win on both the edge device and the desktop.
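The loop ordering behind the cross-channel path can be illustrated without any SIMD intrinsics. In the sketch below, a group of `lanes` channels advances together one time step at a time, after a small tile of the input has been transposed from channel-major to time-major. Python lists stand in for vector registers, a single biquad stands in for the 4-section cascade, and the function names and the `lanes`/`tile` defaults are assumptions for the example.

```python
def biquad_scalar(x, b, a):
    """One channel, transposed direct form II. b = (b0, b1, b2), a = (a1, a2)."""
    z1 = z2 = 0.0
    y = []
    for v in x:
        out = b[0] * v + z1
        z1 = b[1] * v - a[0] * out + z2
        z2 = b[2] * v - a[1] * out
        y.append(out)
    return y

def biquad_lanes(channels, b, a, lanes=4, tile=64):
    """Same filter, but `lanes` channels step through time together."""
    n_ch, n_t = len(channels), len(channels[0])
    out = [[0.0] * n_t for _ in range(n_ch)]
    for c0 in range(0, n_ch, lanes):
        group = range(c0, min(c0 + lanes, n_ch))
        z1 = [0.0] * len(group)             # per-lane filter state
        z2 = [0.0] * len(group)
        for t0 in range(0, n_t, tile):
            t1 = min(t0 + tile, n_t)
            # Small tile transpose: [lane][time] -> [time][lane], cache-sized.
            block = [[channels[c][t] for c in group] for t in range(t0, t1)]
            for k, vec in enumerate(block):
                # Each comprehension plays the role of one vector instruction.
                o = [b[0] * v + s1 for v, s1 in zip(vec, z1)]
                z1 = [b[1] * v - a[0] * y + s2 for v, y, s2 in zip(vec, o, z2)]
                z2 = [b[2] * v - a[1] * y for v, y in zip(vec, o)]
                for j, c in enumerate(group):
                    out[c][t0 + k] = o[j]
    return out

channels = [[float((t * (c + 1)) % 7) - 3.0 for t in range(200)] for c in range(6)]
same = biquad_lanes(channels, (0.2, 0.4, 0.2), (-0.5, 0.25)) == [
    biquad_scalar(ch, (0.2, 0.4, 0.2), (-0.5, 0.25)) for ch in channels
]
print(same)
```

Transposing only a tile at a time keeps the working set small, which is the contrast with the earlier full-transpose attempt that went memory-bound.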

Batched multi-signal processing#

torchfx.batch_process(waves, effect) pads many signals to a common length, concatenates them on the channel dimension, and runs the effect in a single kernel launch. The result is numerically identical to processing each file separately, but the batch fills the GPU and amortises dispatch overhead: ~2.5–7× faster on CPU and ~3–4× on GPU for 8–512 stereo files. Ideal for the CLI batch/watch modes.
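The pad, stack, process, split sequence is easy to picture with plain lists. The sketch below is not the library's implementation: `batch_process_sketch` and the `one_pole` stand-in effect are hypothetical, signals are lists of channels, and the equality with per-file processing relies on the effect being causal and per-channel, so that trailing zero padding cannot influence earlier samples.

```python
def one_pole(channels, a=0.5):
    """Stand-in effect: an independent one-pole low-pass on every channel."""
    out = []
    for ch in channels:
        y, res = 0.0, []
        for v in ch:
            y = a * y + (1.0 - a) * v
            res.append(y)
        out.append(res)
    return out

def batch_process_sketch(signals, effect):
    """Pad to a common length, stack on the channel axis, run once, split back."""
    longest = max(len(ch) for sig in signals for ch in sig)
    stacked, layout = [], []
    for sig in signals:
        layout.append((len(sig), len(sig[0])))           # (channels, samples)
        for ch in sig:
            stacked.append(list(ch) + [0.0] * (longest - len(ch)))
    processed = effect(stacked)                          # the single "launch"
    out, i = [], 0
    for n_ch, n_samples in layout:
        out.append([ch[:n_samples] for ch in processed[i:i + n_ch]])
        i += n_ch
    return out

signals = [
    [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]],   # stereo, 3 samples
    [[4.0], [5.0]],                       # stereo, 1 sample
]
batched = batch_process_sketch(signals, one_pole)
print(batched == [one_pole(sig) for sig in signals])
```

The effect is invoked once regardless of how many files are in the batch, which is where the dispatch overhead is saved.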

Tooling that ships itself#

  • Automated releases. A version bump on master now tags vX.Y.Z, publishes the wheels to PyPI and the CUDA index, and opens a GitHub Release whose notes are the CHANGELOG section for that version. (This post’s release ran through it.)

  • Performance-regression gates in ordinary CI: deterministic dispatch-count invariants (a depth-K cascade must fuse to one native call) plus a same-machine relative-time smoke test, so fusion can’t silently regress.

  • Codecov project/patch status, a profiling guide for diagnosing pipelines, and a fix for the flaky CUDA-wheel disk-space failures.

Installation#

pip install torchfx==0.7.0

CUDA wheels are available from the GitHub Pages index:

pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install torchfx==0.7.0 \
    --index-url https://matteospanio.github.io/torchfx/wheels/cu128/ \
    --extra-index-url https://pypi.org/simple

What’s next#

With dynamics and the headline kernel work landed, the focus turns to the v0.8.0 realtime milestone — PipeWire/JACK integration and a GPU realtime path — plus production effects (chorus, flanger, phaser) on the road to v1.0. The full list of changes is in the CHANGELOG; issues and feature requests are welcome on GitHub.