---
blogpost: true
date: May 26, 2026
author: Matteo Spanio
category: releases
tags: release, scipy, filter-design, performance, validation
---

# TorchFX 0.5.4: Native Filter Design & Goodbye scipy

**TorchFX 0.5.4** drops `scipy` as a runtime dependency. Every filter-design call that used to go through `scipy.signal` (Butterworth, Chebyshev I/II, Elliptic, Linkwitz-Riley, and `DesignableFIR`) is now handled by a native design module written in pure Python and PyTorch. The library is leaner, the dependency tree is shorter, and the design step itself is **14--50× faster** than scipy for Butterworth and Chebyshev I/II, and **7--20× faster** for Elliptic, on the parameter ranges we ship.

If your code worked on 0.5.3, it works on 0.5.4 unchanged. The only thing you might notice is that `pip install torchfx` no longer pulls down ~30 MB of scientific Python you weren't using.

## `torchfx.filter._design`

The new [`torchfx.filter._design`](https://github.com/matteospanio/torchfx) module is a faithful, drop-in replacement for the handful of scipy entry points TorchFX actually used:

- `design_butterworth_sos`
- `design_cheby1_sos`
- `design_cheby2_sos`
- `design_ellip_sos`
- `design_firwin` (multi-band capable)

All five return canonical CPU `float64` SOS (or FIR) tensors that plug straight into the existing C++/CUDA kernels --- no changes to the dispatch layer, no changes to user code, no changes to the SOS cascade. The implementation follows the standard analog-prototype + bilinear-transform pipeline. The elliptic design is a port of scipy's algorithm, cross-referenced against Orfanidis's *Lecture Notes on Elliptic Filter Design*.
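To make the analog-prototype + bilinear-transform pipeline concrete, here is a minimal sketch of a Butterworth lowpass design on plain Python `complex` scalars. It is an illustration of the textbook technique, not the code in `torchfx.filter._design`: the names `butter_lowpass_zpk` and `freq_response` are hypothetical, and the sketch stops at zeros/poles/gain rather than pairing poles into SOS sections.

```python
import cmath
import math


def butter_lowpass_zpk(order, cutoff_hz, fs):
    """Digital Butterworth lowpass as (zeros, poles, gain). Hypothetical helper."""
    if order < 1:
        raise ValueError("order must be >= 1")
    if not 0.0 < cutoff_hz < fs / 2.0:
        raise ValueError("cutoff_hz must lie strictly between 0 and fs/2")

    # 1. Analog prototype: poles evenly spaced on the left half of the unit circle.
    prototype = [
        cmath.exp(1j * math.pi * (2 * k + order + 1) / (2 * order))
        for k in range(order)
    ]

    # 2. Pre-warp the cutoff so it lands exactly on cutoff_hz after the transform.
    fs2 = 2.0 * fs
    warped = fs2 * math.tan(math.pi * cutoff_hz / fs)
    analog_poles = [warped * p for p in prototype]

    # 3. Bilinear transform: each analog pole s maps to z = (2fs + s) / (2fs - s).
    poles = [(fs2 + p) / (fs2 - p) for p in analog_poles]
    zeros = [-1.0] * order  # the analog zeros at infinity map to z = -1

    # Gain: analog gain (warped**order, unity at DC) rescaled by the transform.
    denom = complex(1.0)
    for p in analog_poles:
        denom *= fs2 - p
    gain = (warped**order / denom).real
    return zeros, poles, gain


def freq_response(zeros, poles, gain, freq_hz, fs):
    """Evaluate H(z) on the unit circle at freq_hz."""
    z = cmath.exp(2j * math.pi * freq_hz / fs)
    h = complex(gain)
    for zero in zeros:
        h *= z - zero
    for pole in poles:
        h /= z - pole
    return h
```

A quick sanity check on any design produced this way: the magnitude should be 1 at DC and 1/√2 at the cutoff, and every pole should sit inside the unit circle.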

Numerical equivalence to scipy is verified by [`tests/test_native_design.py`](https://github.com/matteospanio/torchfx) --- 509 tests sweeping orders 1--16, lowpass and highpass, the major FIR windows, and the full Rp/Rs grid for Chebyshev and Elliptic. Tolerances are tight: `1e-12` for Butterworth and FIR, `1e-9` for Chebyshev, `1e-7` for Elliptic up to N=8, and `1e-5` for Elliptic at N=12--16. scipy remains a dev-only dependency precisely so we can keep running these equivalence checks; downstream installs never see it.

## 14--50× faster than scipy

The numbers below come from [`benchmarks/test_design_benchmarks.py`](https://github.com/matteospanio/torchfx): paired side-by-side timings on a single core under Python 3.12.

| Filter type | Order | scipy | TorchFX native | Speedup |
|-------------|------:|------:|---------------:|--------:|
| Butterworth | 16    | 470 µs | 26 µs         | **18×** |
| Chebyshev I | 16    | 480 µs | 32 µs         | **15×** |
| Chebyshev II| 16    | 490 µs | 37 µs         | **13×** |
| Elliptic    | 16    | 670 µs | 92 µs         | **7×**  |

Across the parameter sweep, native design is **14--50× faster** for Butterworth and Chebyshev I/II, and **7--20× faster** for Elliptic. Run the benchmarks yourself:

```bash
uv run pytest benchmarks/test_design_benchmarks.py --benchmark-enable
```

The interesting implementation note: the IIR design pipeline runs on Python `complex` scalars rather than `torch` tensors. For small problems (N≤16, which covers every realistic order), per-op tensor dispatch overhead dominates the actual arithmetic, so a vectorized tensor implementation is *slower*, not faster. `firwin` stays in torch because FIR windows span roughly 31 to 1024 taps, where vectorization wins. Knowing which path to take here is what unlocked the speedup.

A small but symbolic side effect: `LinkwitzRiley.compute_coefficients` replaced its last `np.vstack` call with `torch.cat`. After this release, `numpy` is used only at the I/O boundary (`soundfile`, `sounddevice`) --- never for signal-processing math in `src/torchfx/` itself.

## CPU float32 Fast Path

CPU IIR kernels in [`_csrc/cpu/iir_cpu.cpp`](https://github.com/matteospanio/torchfx) previously upcast every input to `float64` before running the cascade. Many realtime and ML pipelines pass `float32` audio --- forcing them through `float64` doubled the memory traffic for no accuracy gain at the orders TorchFX targets.

0.5.4 makes the CPU kernels **dtype-aware**: `float32` inputs run in `float32`, `float64` inputs run in `float64`, no conversion either way. The `_ops` dispatch and the shared SOS forward path were updated to match. CUDA stays on `float64` for now --- the kernel generation hasn't been retuned for mixed precision yet, and that's a follow-up.

The net effect on a typical `float32` IIR chain is a meaningful drop in CPU time and memory bandwidth, with output that matches the previous `float64` path to within standard `float32` rounding.

## Code Hygiene: Six Biquad Subclasses Slimmed Down

Six biquad subclasses (`BiquadLPF`, `BiquadHPF`, `BiquadNotch`, `BiquadBPF`, `BiquadBPFPeak`, `BiquadAllPass`) plus `Shelving`, `ParametricEQ`, `Notch`, and `AllPass` in `iir.py` all expressed RBJ cookbook coefficients with an explicit `a0_inv` divide-and-assign block. That added up to about 50 lines of duplicated normalization plumbing.

0.5.4 replaces that with a single `_finalize_coeffs(b0, b1, b2, a0, a1, a2)` helper that performs the `a0` normalization centrally. Subclasses now read like the RBJ cookbook itself --- compute the six raw coefficients, hand them off, done. The old per-class `_set_coefficients` helper is gone, along with `_gain_db` and `Delay._extend_waveform` (each was called from exactly one site).

This is purely refactoring --- behavior is identical, but the filter implementations are noticeably easier to read.

## Wave & Realtime Hardening

A wave of validation and contract-tightening changes landed alongside the design work:

- **`Wave`** now enforces internal `(channels, samples)` shape invariants on every input. 1D mono inputs are normalized to `(1, T)`. `Wave.get_channel()` no longer returns flattened 1D tensors that broke downstream `len()` semantics. `Wave.merge` is explicit about behavior: split-channel mode zero-pads to the longest input length, while mix mode requires matching channel counts and raises a clear error when they don't match.
- **`Delay`** and **`Gain`** consistently raise `ValueError` for invalid user parameters. BPM-synced `Delay` now recomputes `delay_samples` when `fs` changes after initialization --- so reusing a delay across files with different sample rates stays sample-accurate.
- **`_ops.delay_line_forward`** validates rank and dtype, supports tensors with arbitrary leading batch dimensions by flattening and restoring `(..., T)`, and round-trips non-native float dtypes through the extension's supported `float32`/`float64` execution path.
- **`DesignableFIR`** initializes its `nn.Module` state even when `fs=None`, then updates kernel coefficients in `compute_coefficients()` once the sample rate is available. This brings it in line with the lazy-`fs` pattern that every other filter follows.
- **`StreamProcessor`** now resets stateful effects at file and chunk boundaries, and rejects effects that change chunk length up front instead of producing garbled output mid-stream.
- **CLI processing** surfaces config/effect parse failures (including invalid constructor parameters) as user-facing messages instead of raw tracebacks.

None of these are flashy on their own, but together they close the regression gaps that turned up while validating the design module against the rest of the pipeline. New tests in `tests/test_wave.py`, `tests/test_effects.py`, `tests/test_fir.py`, `tests/test_realtime.py`, and `tests/test_cli.py` lock the new contracts in place.
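To illustrate the BPM-synced `Delay` item from the list above, here is a minimal sketch of recomputing the delay length whenever the sample rate changes. The class `BpmDelaySketch`, its `beats` parameter, and the rounding rule are all hypothetical and chosen for illustration; TorchFX's actual `Delay` constructor and attributes may differ.

```python
class BpmDelaySketch:
    """Hypothetical tempo-synced delay that tracks changes to `fs`."""

    def __init__(self, bpm, beats=1.0, fs=None):
        # Invalid user parameters are reported as ValueError.
        if bpm <= 0:
            raise ValueError(f"bpm must be positive, got {bpm}")
        if beats <= 0:
            raise ValueError(f"beats must be positive, got {beats}")
        self.bpm = bpm
        self.beats = beats
        self._fs = None
        self.delay_samples = None
        self.fs = fs  # goes through the setter below

    @property
    def fs(self):
        return self._fs

    @fs.setter
    def fs(self, value):
        if value is not None and value <= 0:
            raise ValueError(f"fs must be positive, got {value}")
        self._fs = value
        # Recompute on every assignment so a reused delay stays sample-accurate.
        if value is None:
            self.delay_samples = None
        else:
            seconds = 60.0 / self.bpm * self.beats
            self.delay_samples = round(seconds * value)
```

The point of routing `fs` through a setter is that the delay length can never go stale: a one-beat delay at 120 BPM is half a second, so it should be half the sample rate in samples no matter which file it is applied to.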

## What Hasn't Changed

Everything from 0.5.3 still applies: prebuilt wheels, scikit-build-core build, no pure-PyTorch fallback, native CPU/CUDA kernels. The user-facing API is identical. Filter coefficients computed by 0.5.4's native design match scipy to within the tolerances above, so any code that depended on specific coefficient values continues to work.

## Installation

```bash
pip install torchfx==0.5.4
```

CUDA wheels remain available from the GitHub Pages index:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install torchfx==0.5.4 \
    --index-url https://matteospanio.github.io/torchfx/wheels/cu124/ \
    --extra-index-url https://pypi.org/simple
```

`scipy` is no longer in your transitive dependency closure.

## What's Next

With scipy out of the runtime and the design path running on native code, the next round of optimization work returns to the GPU side: retuning the CUDA SOS kernel for mixed precision so `float32` gets the same fast path on GPU that it now has on CPU, and folding the design step into the deferred pipeline so coefficient computation happens once per chain rather than once per filter.

As always, file issues and feature requests on [GitHub](https://github.com/matteospanio/torchfx).