---
blogpost: true
date: Jun 09, 2026
author: Matteo Spanio
category: releases
tags: release, dynamics, reverb, simd, performance, edge, tooling
---

# TorchFX 0.7.0: A Dynamics Suite, a Real Reverb, and the Edge

**TorchFX 0.7.0** delivers exactly what 0.6.0's "what's next" promised — the single-pass GPU scan, a cache-blocked CPU SIMD path for edge devices, and a full dynamics toolkit — and then some. The headline is **dynamics**: a `Compressor`, an `Expander`/`Gate`, and a look-ahead brick-wall `Limiter`, each a native per-channel C++/CUDA kernel. Alongside them the old toy `Reverb` is replaced by a proper **Freeverb-style** algorithmic reverb, many short signals can now be processed in a **single batched launch**, and the whole release ships itself: tagging, PyPI, and these very release notes are now automated.

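One breaking change: `Reverb`'s constructor now takes `room_size`, `damping`, and `mix` instead of `delay`, `decay`, and `mix` (see "A reverb worth the name" below). Everything else from 0.6.x runs unchanged.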

## A full dynamics toolkit

Four new dynamics processors, all built on the same high-quality decoupled peak detector (release max-hold + attack one-pole; Giannoulis, Massberg & Reiss, 2012) running one channel per thread in C++/CUDA:

- **`Compressor`** — threshold/ratio/attack/release, soft knee, makeup gain, peak or RMS detection, and a `ratio=inf` limiter mode.
- **`Expander` / `Gate`** — the mirror of the compressor: attenuate *below* the threshold. `Gate` is the infinite-ratio convenience (a hard noise gate) with an optional `floor` (range).
- **`Limiter`** — a **look-ahead brick-wall** peak limiter. A vectorised forward max-pool computes the look-ahead window, the gain is attack/release-smoothed, and a per-sample clamp **guarantees `|y| ≤ threshold`** — so unlike `Compressor(ratio=inf)`, it can never overshoot a single sample.
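
To make the shared detector concrete, here is a minimal pure-Python sketch of a soft-knee gain computer feeding a decoupled peak detector, following the textbook formulation in Giannoulis, Massberg & Reiss (2012). It is an illustration only, not TorchFX's C++/CUDA kernel: the function names `gain_computer_db`, `decoupled_peak_detector`, and `compress` are hypothetical, and applying the detector to the gain reduction in dB is an assumption taken from the paper's log-domain design.

```python
import math


def gain_computer_db(level_db, threshold_db, ratio, knee_db=0.0):
    """Static compressor curve: input level (dB) -> output level (dB)."""
    over = level_db - threshold_db
    if knee_db > 0.0 and 2.0 * abs(over) <= knee_db:
        # Inside the knee: quadratic blend between the 1:1 and 1:ratio slopes.
        return level_db + (1.0 / ratio - 1.0) * (over + knee_db / 2.0) ** 2 / (2.0 * knee_db)
    if over > 0.0:
        # Above the knee: ratio=math.inf pins the output at the threshold.
        return threshold_db + over / ratio
    return level_db  # below the knee: unchanged


def decoupled_peak_detector(levels, sample_rate, attack, release):
    """Release max-hold stage followed by a one-pole attack smoother."""
    alpha_a = math.exp(-1.0 / (attack * sample_rate))
    alpha_r = math.exp(-1.0 / (release * sample_rate))
    hold, env, out = 0.0, 0.0, []
    for x in levels:
        hold = max(x, alpha_r * hold)                    # release: decaying max-hold
        env = alpha_a * env + (1.0 - alpha_a) * hold     # attack: one-pole low-pass
        out.append(env)
    return out


def compress(samples, sample_rate, threshold_db=-18.0, ratio=3.0,
             attack=0.005, release=0.08, knee_db=6.0):
    """Hypothetical single-channel compressor built from the two pieces above."""
    reduction_db = []
    for s in samples:
        level_db = 20.0 * math.log10(max(abs(s), 1e-9))
        reduction_db.append(level_db - gain_computer_db(level_db, threshold_db, ratio, knee_db))
    smoothed = decoupled_peak_detector(reduction_db, sample_rate, attack, release)
    return [s * 10.0 ** (-g / 20.0) for s, g in zip(samples, smoothed)]
```

With a hard knee, a -6 dB input against a -18 dB threshold at 3:1 comes out at -14 dB, and with `ratio=math.inf` the curve flattens at the threshold. The smoothing still takes time to react, which is why an infinite-ratio compressor can overshoot while a look-ahead limiter does not.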

```python
import torchfx as fx

mastered = (
    wave
    | fx.effect.Compressor(threshold=-18, ratio=3, attack=0.005, release=0.08)
    | fx.effect.Limiter(threshold=-1.0, lookahead=0.005)
)
```
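
The three limiter stages described above (forward max over the look-ahead window, attack/release gain smoothing, final clamp) can be sketched in a few lines of plain Python. This is a rough single-channel illustration under stated assumptions, not the library's kernel: `lookahead_limit` is a hypothetical name, the `attack` and `release` defaults are invented for the example, and the naive windowed `max` stands in for the vectorised max-pool.

```python
import math


def lookahead_limit(samples, sample_rate, threshold_db=-1.0, lookahead=0.005,
                    attack=0.001, release=0.05):
    """Hypothetical look-ahead brick-wall limiter for one channel."""
    ceiling = 10.0 ** (threshold_db / 20.0)
    window = max(1, int(round(lookahead * sample_rate)))
    alpha_a = math.exp(-1.0 / (attack * sample_rate))
    alpha_r = math.exp(-1.0 / (release * sample_rate))
    gain, out = 1.0, []
    for i in range(len(samples)):
        # 1. Forward max: the loudest sample in the upcoming look-ahead window.
        peak = max(abs(s) for s in samples[i:i + window + 1])
        target = 1.0 if peak <= ceiling else ceiling / peak
        # 2. Smooth the gain: fast when it must drop, slow when it recovers.
        alpha = alpha_a if target < gain else alpha_r
        gain = alpha * gain + (1.0 - alpha) * target
        # 3. Per-sample clamp: whatever the smoother did, never exceed the ceiling.
        y = samples[i] * gain
        out.append(max(-ceiling, min(ceiling, y)))
    return out
```

Because the gain starts falling before the peak arrives, the clamp in step 3 rarely has much left to do; it is there as the hard guarantee rather than as the main gain control.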

## A reverb worth the name

The old `Reverb` was a single feedforward comb — barely a reverb. 0.7.0 replaces it with the classic **Schroeder/Moorer (Freeverb) structure**: per channel, **8 parallel low-pass-feedback comb filters** summed through **4 series all-pass diffusers**, in a fused native kernel (ring buffers in per-channel scratch, tunings scaled to the sample rate) for a dense, natural decay.

```python
# Breaking change: room_size / damping / mix (was delay / decay / mix)
hall = fx.effect.Reverb(room_size=0.85, damping=0.3, mix=0.25)
```
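
For readers who have not met the structure before, here is a small mono sketch of the comb-plus-all-pass topology in pure Python. The delay lengths and scaling constants are the ones from the original public-domain Freeverb (tuned for 44.1 kHz); whether TorchFX maps `room_size` and `damping` the same way is an assumption, and `freeverb_mono` is a hypothetical name used only for this example.

```python
# Classic Freeverb delay lengths in samples at 44.1 kHz.
COMB_TUNINGS = (1116, 1188, 1277, 1356, 1422, 1491, 1557, 1617)
ALLPASS_TUNINGS = (556, 441, 341, 225)


class LowpassFeedbackComb:
    """Comb filter with a one-pole low-pass in its feedback path."""

    def __init__(self, delay, feedback, damping):
        self.buf = [0.0] * delay   # ring buffer
        self.idx = 0
        self.feedback = feedback
        self.damping = damping
        self.lowpass = 0.0

    def tick(self, x):
        y = self.buf[self.idx]
        self.lowpass = y * (1.0 - self.damping) + self.lowpass * self.damping
        self.buf[self.idx] = x + self.lowpass * self.feedback
        self.idx = (self.idx + 1) % len(self.buf)
        return y


class AllpassDiffuser:
    """Freeverb-style all-pass section used to smear echoes together."""

    def __init__(self, delay, feedback=0.5):
        self.buf = [0.0] * delay
        self.idx = 0
        self.feedback = feedback

    def tick(self, x):
        delayed = self.buf[self.idx]
        self.buf[self.idx] = x + delayed * self.feedback
        self.idx = (self.idx + 1) % len(self.buf)
        return delayed - x


def freeverb_mono(samples, sample_rate, room_size=0.85, damping=0.3, mix=0.25):
    scale = sample_rate / 44100.0            # rescale tunings to the sample rate
    feedback = 0.7 + 0.28 * room_size        # Freeverb's room-size mapping (assumed)
    combs = [LowpassFeedbackComb(max(1, round(t * scale)), feedback, damping * 0.4)
             for t in COMB_TUNINGS]
    allpasses = [AllpassDiffuser(max(1, round(t * scale))) for t in ALLPASS_TUNINGS]
    out = []
    for x in samples:
        wet = sum(c.tick(x * 0.015) for c in combs)   # 8 combs in parallel
        for ap in allpasses:                          # 4 all-passes in series
            wet = ap.tick(wet)
        out.append((1.0 - mix) * x + mix * wet)
    return out
```

Feeding this an impulse shows the behaviour directly: the dry hit passes through at `1 - mix`, nothing happens until the shortest comb delay elapses, and from then on the eight mutually detuned combs build up the dense tail.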

## Performance: the GPU scan, and the edge

Two kernel-level wins close out the 0.6.0 roadmap, plus a new batching path.

### Single-pass SOS scan (GPU)

The fused cascade's forcing pass and the three-phase Blelloch scan are folded into **one decoupled-look-back kernel** (Merrill–Garland / CUB style, with an atomic per-section tile dispenser for deadlock-free forward progress). The launch count per fused section drops from ~8 CUDA launches to **2**:

| K=50 cascade (RTX 3070) | 3-phase | single-pass | Improvement |
|---|---:|---:|---:|
| CUDA launches | 600 | **102** | **5.9× fewer** |
| Fused cascade time | 18.9 ms | **9.9 ms** | **~1.9×** |

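The single-pass scan is now the default GPU path and is bit-exact with the legacy three-phase path on sm_86 and sm_89. Set `TORCHFX_FUSED_SCAN=0` to fall back to the legacy path, which remains available as the reference oracle.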
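
The idea behind a look-back scan is easier to see on a toy problem. The sketch below, in plain Python, runs a first-order recurrence `s[n] = a * s[n-1] + x[n]` as a scan over affine maps: each tile publishes an aggregate, then resolves its incoming state by walking back over its predecessors until it finds one whose inclusive prefix is already known. It is a sequential simulation under simplifying assumptions (scalar state instead of a second-order section, no real concurrency), and `combine`, `sequential`, and `tiled_scan` are hypothetical names.

```python
def combine(first, second):
    """Compose two affine maps s -> a*s + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def sequential(a, forcing):
    """Reference: the recurrence evaluated one sample at a time."""
    s, out = 0.0, []
    for x in forcing:
        s = a * s + x
        out.append(s)
    return out


def tiled_scan(a, forcing, tile=4, order=None):
    n_tiles = (len(forcing) + tile - 1) // tile
    # Each tile reduces its own samples to a single aggregate map.
    aggregates = []
    for k in range(n_tiles):
        agg = (1.0, 0.0)  # identity map
        for x in forcing[k * tile:(k + 1) * tile]:
            agg = combine(agg, (a, x))
        aggregates.append(agg)

    inclusive = [None] * n_tiles   # prefixes published so far
    out = [0.0] * len(forcing)
    for k in (order if order is not None else range(n_tiles)):
        # Look back: stop at the first predecessor with a finished prefix,
        # otherwise fold in its aggregate and keep walking.
        prefix, j = (1.0, 0.0), k - 1
        while j >= 0:
            if inclusive[j] is not None:
                prefix = combine(inclusive[j], prefix)
                break
            prefix = combine(aggregates[j], prefix)
            j -= 1
        inclusive[k] = combine(prefix, aggregates[k])
        s = prefix[1]              # incoming state, given a zero initial state
        for i in range(k * tile, min((k + 1) * tile, len(forcing))):
            s = a * s + forcing[i]
            out[i] = s
    return out
```

The `order` argument lets you process tiles out of order to mimic predecessors that have not finished yet. The result matches the sequential loop up to floating-point rounding in either case, which is the property that lets the scan and the per-sample pass share one kernel.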

### Cache-blocked cross-channel CPU SIMD (the edge)

When channels outnumber CPU cores, the scalar kernel runs them serially. A new path packs a SIMD vector of channels using **L1-resident tile transposes** (the fix for an earlier full-transpose attempt that went memory-bound), gated to `C > cores`. Validated on a **Raspberry Pi 5** (Cortex-A76 ×4) — the edge target:

| 4-section SOS, float32 (speedup vs. scalar) | Pi 5 | 12-core x86 |
|---|---:|---:|
| C=8 | **1.8×** | scalar (≤ cores) |
| C=16 | **2.6×** | **1.3×** |
| C=32 | **2.6×** | **2.4×** |

No regression below the gate — a win on both the edge device and the desktop.
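
The data movement behind the cross-channel path can be sketched in pure Python. The example below filters a group of channels through one biquad section by transposing a small tile from channel-major to sample-major layout, stepping through it with one "vector" operation per sample, and transposing back. It is only a model of the access pattern: Python lists stand in for SIMD registers, the `lanes` and `tile` defaults are arbitrary, a single section replaces the 4-section cascade, and both function names are hypothetical.

```python
def biquad_scalar(x, b, a):
    """One channel through one biquad (transposed direct form II)."""
    b0, b1, b2 = b
    a1, a2 = a
    z1 = z2 = 0.0
    y = []
    for s in x:
        out = b0 * s + z1
        z1 = b1 * s - a1 * out + z2
        z2 = b2 * s - a2 * out
        y.append(out)
    return y


def biquad_lane_packed(channels, b, a, lanes=4, tile=64):
    """Same filter, but `lanes` channels advance together, tile by tile."""
    b0, b1, b2 = b
    a1, a2 = a
    n = len(channels[0])
    out = [[0.0] * n for _ in channels]
    for c0 in range(0, len(channels), lanes):
        group = range(c0, min(c0 + lanes, len(channels)))
        z1 = [0.0] * len(group)   # one filter state per lane
        z2 = [0.0] * len(group)
        for t0 in range(0, n, tile):
            t1 = min(t0 + tile, n)
            # Transpose a small tile: block[t][lane], sized to stay cache-resident.
            block = [[channels[c][t] for c in group] for t in range(t0, t1)]
            for row in block:
                y = [b0 * s + p for s, p in zip(row, z1)]
                z1 = [b1 * s - a1 * o + q for s, o, q in zip(row, y, z2)]
                z2 = [b2 * s - a2 * o for s, o in zip(row, y)]
                row[:] = y
            # Transpose the finished tile back to channel-major output.
            for lane, c in enumerate(group):
                for k, t in enumerate(range(t0, t1)):
                    out[c][t] = block[k][lane]
    return out
```

Each lane performs exactly the arithmetic the scalar loop would, so the outputs are identical; the only thing that changes is how many channels advance per step and how much data is touched between transposes.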

### Batched multi-signal processing

`torchfx.batch_process(waves, effect)` pads many signals to a common length, concatenates them on the channel dimension, and runs the effect in a **single** kernel launch — numerically identical to per-file, but filling the GPU and amortising dispatch overhead. **~2.5–7× faster on CPU and ~3–4× on GPU** for 8–512 stereo files. Ideal for the CLI batch/watch modes.
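
The pad, concatenate, run once, split pattern is simple enough to sketch without the library. The pure-Python version below represents each signal as a list of channels and takes the effect as a plain callable. `batch_apply` and `one_pole` are hypothetical names for illustration and are not the `torchfx.batch_process` implementation; trimming the padding afterwards matches per-file output here because the toy effect is causal and per-channel.

```python
def batch_apply(signals, effect):
    """Run `effect` once over many signals stacked along the channel axis."""
    longest = max(len(sig[0]) for sig in signals)
    stacked, layout = [], []
    for sig in signals:
        layout.append((len(sig), len(sig[0])))            # (channels, samples)
        for ch in sig:
            stacked.append(list(ch) + [0.0] * (longest - len(ch)))  # zero-pad the tail
    processed = effect(stacked)                            # the single call
    results, row = [], 0
    for n_channels, n_samples in layout:
        # Slice each signal's channels back out and drop the padding.
        results.append([processed[row + c][:n_samples] for c in range(n_channels)])
        row += n_channels
    return results


def one_pole(channels):
    """Toy causal per-channel effect used to exercise the helper."""
    out = []
    for ch in channels:
        s, y = 0.0, []
        for x in ch:
            s = 0.5 * s + 0.5 * x
            y.append(s)
        out.append(y)
    return out
```

Calling `batch_apply([stereo_a, stereo_b], one_pole)` invokes the effect once on four stacked channels and returns two stereo results at their original lengths.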

## Tooling that ships itself

- **Automated releases.** A version bump on `master` now tags `vX.Y.Z`, publishes the wheels to PyPI and the CUDA index, **and** opens a GitHub Release whose notes are the CHANGELOG section for that version. (This post's release ran through it.)
- **Performance-regression gates** in ordinary CI — deterministic dispatch-count invariants (a depth-K cascade must fuse to *one* native call) plus a same-machine relative-time smoke, so fusion can't silently regress.
- **Codecov** project/patch status, a **profiling guide** for diagnosing pipelines, and a fix for the flaky CUDA-wheel disk-space failures.

## Installation

```bash
pip install torchfx==0.7.0
```

CUDA wheels from the GitHub Pages index:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install torchfx==0.7.0 \
    --index-url https://matteospanio.github.io/torchfx/wheels/cu128/ \
    --extra-index-url https://pypi.org/simple
```

## What's next

With dynamics and the headline kernel work landed, the focus turns to the **v0.8.0 realtime milestone** — PipeWire/JACK integration and a GPU realtime path — plus production effects (chorus, flanger, phaser) on the road to v1.0. The full list of changes is in the [CHANGELOG](https://github.com/matteospanio/torchfx/blob/master/CHANGELOG); issues and feature requests are welcome on [GitHub](https://github.com/matteospanio/torchfx).