TorchFX 0.5.4: Native Filter Design & Goodbye scipy#

TorchFX 0.5.4 drops scipy as a runtime dependency. Every filter-design call that used to go through scipy.signal (Butterworth, Chebyshev I/II, Elliptic, Linkwitz-Riley, and DesignableFIR) is now performed by a native design module that ships inside TorchFX. The library is leaner, the dependency tree is shorter, and the design step itself is roughly 14–50× faster than scipy for Butterworth and Chebyshev I/II, and 7–20× faster for Elliptic, on the parameter ranges we ship.

If your code worked on 0.5.3, it works on 0.5.4 unchanged. The only thing you might notice is that pip install torchfx no longer pulls down ~30 MB of scientific Python you weren’t using.

`torchfx.filter._design`#

The new torchfx.filter._design module is a faithful, drop-in replacement for the handful of scipy entry points TorchFX actually used:

design_butterworth_sos
design_cheby1_sos
design_cheby2_sos
design_ellip_sos
design_firwin (multi-band capable)

All five return canonical CPU float64 SOS (or FIR) tensors that plug straight into the existing C++/CUDA kernels — no changes to the dispatch layer, no changes to user code, no changes to the SOS cascade. The implementation follows the standard analog-prototype + bilinear-transform pipeline. The elliptic design is a port of scipy’s algorithm, cross-referenced against Orfanidis’s Lecture Notes on Elliptic Filter Design.
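To make the analog-prototype + bilinear-transform pipeline concrete, here is a minimal sketch for the Butterworth lowpass case, written with plain Python complex scalars. It is an illustration under stated assumptions, not the code in torchfx.filter._design: the function names `butter_lowpass_zpk` and `magnitude` are hypothetical, and the sketch stops at zeros/poles/gain rather than pairing poles into SOS sections.

```python
import cmath
import math


def butter_lowpass_zpk(order: int, cutoff_hz: float, fs: float):
    """Digital Butterworth lowpass as (zeros, poles, gain). Hypothetical helper."""
    # Pre-warp the cutoff so the digital -3 dB point lands exactly on cutoff_hz.
    warped = 2.0 * fs * math.tan(math.pi * cutoff_hz / fs)

    # Analog prototype: poles evenly spaced on the left half of a circle
    # of radius `warped` in the s-plane.
    analog_poles = [
        warped * cmath.exp(1j * math.pi * (2 * k + order + 1) / (2 * order))
        for k in range(order)
    ]

    # Bilinear transform: z = (2*fs + s) / (2*fs - s).
    fs2 = 2.0 * fs
    poles = [(fs2 + p) / (fs2 - p) for p in analog_poles]
    # The prototype's zeros all sit at infinity, which maps to z = -1.
    zeros = [-1.0 + 0j] * order

    # Pick the gain so that the response at DC (z = 1) is exactly 1.
    num = 1 + 0j
    den = 1 + 0j
    for z0, p0 in zip(zeros, poles):
        num *= 1 - z0
        den *= 1 - p0
    gain = (den / num).real
    return zeros, poles, gain


def magnitude(zeros, poles, gain, freq_hz: float, fs: float) -> float:
    """Evaluate |H(z)| on the unit circle at freq_hz."""
    z = cmath.exp(2j * math.pi * freq_hz / fs)
    h = gain + 0j
    for z0, p0 in zip(zeros, poles):
        h *= (z - z0) / (z - p0)
    return abs(h)


zeros, poles, gain = butter_lowpass_zpk(order=4, cutoff_hz=1000.0, fs=48000.0)
print(magnitude(zeros, poles, gain, 1000.0, 48000.0))  # close to 1/sqrt(2)
```

Because of the pre-warping step, the magnitude at the cutoff is the Butterworth -3 dB point regardless of order, which makes it a convenient sanity check.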

Numerical equivalence to scipy is verified by tests/test_native_design.py — 509 tests sweeping orders 1–16, lowpass and highpass, the major FIR windows, and the full Rp/Rs grid for Chebyshev and Elliptic. Tolerances are tight: 1e-12 for Butterworth and FIR, 1e-9 for Chebyshev, 1e-7 for Elliptic up to N=8, and 1e-5 for Elliptic at N=12–16. scipy remains a dev-only dependency precisely so we can keep running these equivalence checks; downstream installs never see it.
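The tolerance tiers above can be written down as a small lookup. This is only a sketch of how such a check might be organized: the family labels, the helper names, and the use of an absolute (rather than relative) tolerance are assumptions, and tests/test_native_design.py may be structured differently. Elliptic orders 9–11 are left unassigned because no tolerance is stated for them.

```python
def equivalence_tolerance(family: str, order: int) -> float:
    """Tolerance tier for a filter family (labels are hypothetical)."""
    if family in ("butterworth", "fir"):
        return 1e-12
    if family in ("cheby1", "cheby2"):
        return 1e-9
    if family == "ellip":
        if order <= 8:
            return 1e-7
        if 12 <= order <= 16:
            return 1e-5
        raise ValueError(f"no tolerance stated for elliptic order {order}")
    raise ValueError(f"unknown filter family: {family!r}")


def assert_sos_close(native, reference, family: str, order: int) -> float:
    """Compare two SOS arrays given as lists of 6-coefficient rows.

    Returns the worst absolute difference; raises AssertionError if it
    exceeds the tier for this family and order.
    """
    tol = equivalence_tolerance(family, order)
    if len(native) != len(reference):
        raise AssertionError("section count differs")
    worst = max(
        (abs(x - y)
         for row_n, row_r in zip(native, reference)
         for x, y in zip(row_n, row_r)),
        default=0.0,
    )
    if worst > tol:
        raise AssertionError(f"max abs diff {worst:.3e} exceeds {tol:.0e}")
    return worst


# Toy data: one section that differs from the reference by 1e-13 in b0.
reference = [[0.25, 0.5, 0.25, 1.0, -0.3, 0.1]]
native = [[0.25 + 1e-13, 0.5, 0.25, 1.0, -0.3, 0.1]]
assert_sos_close(native, reference, "butterworth", order=2)
```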

14–50× faster than scipy#

The numbers below come from benchmarks/test_design_benchmarks.py. They are paired side-by-side timings on a single core with Python 3.12:

| Filter type | Order | scipy | TorchFX native | Speedup |
|---|---|---|---|---|
| Butterworth | 16 | 470 µs | 26 µs | 18× |
| Chebyshev I | 16 | 480 µs | 32 µs | 15× |
| Chebyshev II | 16 | 490 µs | 37 µs | 13× |
| Elliptic | 16 | 670 µs | 92 µs | 7× |

Across the parameter sweep, native design is roughly 14–50× faster for Butterworth and Chebyshev I/II, and 7–20× faster for Elliptic. Run the benchmarks yourself:

uv run pytest benchmarks/test_design_benchmarks.py --benchmark-enable

The interesting implementation note: the IIR design pipeline runs on Python complex scalars rather than torch tensors. For small problems (N≤16, which covers every realistic order), per-op tensor dispatch dominates the actual arithmetic, so a vectorized tensor implementation is slower, not faster. firwin stays in torch because windows are length-N (~31…1024) where vectorization wins. Knowing which path to take here is what unlocked the speedup.

A small but symbolic side effect: LinkwitzRiley.compute_coefficients replaced its last np.vstack call with torch.cat. After this release, numpy is used only at the I/O boundary (soundfile, sounddevice) — never for signal-processing math in src/torchfx/ itself.

CPU float32 Fast Path#

CPU IIR kernels in _csrc/cpu/iir_cpu.cpp previously upcast every input to float64 before running the cascade. Many realtime and ML pipelines pass float32 audio — forcing them through float64 doubled the memory traffic for no accuracy gain at the orders TorchFX targets.

0.5.4 makes the CPU kernels dtype-aware: float32 inputs run in float32, float64 inputs run in float64, no conversion either way. The _ops dispatch and the shared SOS forward path were updated to match. CUDA stays on float64 for now — the kernel generation hasn’t been retuned for mixed precision yet, and that’s a follow-up.

The net effect on a typical float32 IIR chain is lower CPU time and memory bandwidth, with output that agrees with the previous float64 path to within standard float32 rounding.
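To get a feel for what "within float32 rounding" means for a low-order section, the sketch below runs one RBJ lowpass biquad twice in pure Python: once in float64, and once with every intermediate result rounded to float32. It only emulates single precision and is not the C++ kernel in _csrc/cpu/iir_cpu.cpp; the helper names, the direct-form I structure, and the test signal are all assumptions made for illustration.

```python
import math
import struct


def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def biquad_df1(signal, b, a, rnd=float):
    """Direct-form I biquad; `rnd` is applied after every arithmetic step."""
    b0, b1, b2 = (rnd(c) for c in b)
    a1, a2 = (rnd(c) for c in a)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        x = rnd(x)
        acc = rnd(b0 * x)
        acc = rnd(acc + rnd(b1 * x1))
        acc = rnd(acc + rnd(b2 * x2))
        acc = rnd(acc - rnd(a1 * y1))
        acc = rnd(acc - rnd(a2 * y2))
        x2, x1 = x1, x
        y2, y1 = y1, acc
        out.append(acc)
    return out


# RBJ lowpass at 1 kHz, fs = 48 kHz, Q = 1/sqrt(2), already divided by a0.
fs, fc, q_factor = 48000.0, 1000.0, 2 ** -0.5
w0 = 2 * math.pi * fc / fs
alpha = math.sin(w0) / (2 * q_factor)
a0 = 1 + alpha
b = [(1 - math.cos(w0)) / 2 / a0, (1 - math.cos(w0)) / a0, (1 - math.cos(w0)) / 2 / a0]
a = [-2 * math.cos(w0) / a0, (1 - alpha) / a0]

# 2000 samples of a 440 Hz sine at half scale.
signal = [0.5 * math.sin(2 * math.pi * 440.0 * n / fs) for n in range(2000)]

ref = biquad_df1(signal, b, a)             # float64 throughout
low = biquad_df1(signal, b, a, rnd=f32)    # emulated float32
max_err = max(abs(p - r) for p, r in zip(ref, low))
print(f"max abs difference: {max_err:.3e}")
```

The difference stays small for this section, but it is not zero, which is why the two paths should be compared with a tolerance rather than for exact equality.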

Code Hygiene: Six Biquad Subclasses Slimmed Down#

Six biquad subclasses (BiquadLPF, BiquadHPF, BiquadNotch, BiquadBPF, BiquadBPFPeak, BiquadAllPass) plus Shelving, ParametricEQ, Notch, and AllPass in iir.py all expressed RBJ cookbook coefficients with an explicit a0_inv divide-and-assign block, which added up to about 50 lines of duplicated normalization plumbing.

0.5.4 replaces that with a single _finalize_coeffs(b0, b1, b2, a0, a1, a2) helper that performs the a0 normalization centrally. Subclasses now read like the RBJ cookbook itself — compute the six raw coefficients, hand them off, done. The old per-class _set_coefficients helper is gone, along with _gain_db and Delay._extend_waveform (each was called from exactly one site).
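The shape of that refactor can be sketched as follows. The coefficient formulas are the standard RBJ cookbook lowpass and highpass; the standalone `finalize_coeffs` function and its tuple return value are assumptions for illustration, since the real `_finalize_coeffs` lives on the filter classes and its exact return type is not shown here.

```python
import math


def finalize_coeffs(b0, b1, b2, a0, a1, a2):
    """Normalize raw biquad coefficients so that a0 becomes 1 (hypothetical)."""
    if a0 == 0:
        raise ValueError("a0 must be non-zero")
    a0_inv = 1.0 / a0
    b = (b0 * a0_inv, b1 * a0_inv, b2 * a0_inv)
    a = (1.0, a1 * a0_inv, a2 * a0_inv)
    return b, a


def rbj_lowpass(cutoff_hz: float, fs: float, q: float):
    """RBJ cookbook lowpass: compute six raw coefficients, hand them off."""
    w0 = 2 * math.pi * cutoff_hz / fs
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    return finalize_coeffs(
        (1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2,
        1 + alpha, -2 * cos_w0, 1 - alpha,
    )


def rbj_highpass(cutoff_hz: float, fs: float, q: float):
    """RBJ cookbook highpass, using the same central normalization."""
    w0 = 2 * math.pi * cutoff_hz / fs
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    return finalize_coeffs(
        (1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2,
        1 + alpha, -2 * cos_w0, 1 - alpha,
    )


b, a = rbj_lowpass(1000.0, 48000.0, q=0.707)
print(sum(b) / sum(a))  # DC gain of the lowpass, close to 1
```

With the divide centralized, each filter function is reduced to the cookbook formulas themselves, and a change to the normalization (for example, a guard against a0 == 0) only has to be made once.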

This is purely refactoring — behavior is identical, but the filter implementations are noticeably easier to read.

Wave & Realtime Hardening#

A wave of validation and contract-tightening changes landed alongside the design work:

Wave now enforces internal (channels, samples) shape invariants on every input. 1D mono inputs are normalized to (1, T). Wave.get_channel() no longer returns flattened 1D tensors that broke downstream len() semantics. Wave.merge is explicit about behaviour: split-channel mode zero-pads to the longest input length, while mix mode requires matching channel counts and raises clearly when they don’t.
Delay and Gain consistently raise ValueError for invalid user parameters. BPM-synced Delay now recomputes delay_samples when fs changes after initialization — so reusing a delay across files with different sample rates stays sample-accurate.
_ops.delay_line_forward validates rank and dtype, supports tensors with arbitrary leading batch dimensions by flattening and restoring (..., T), and round-trips non-native float dtypes through the extension’s supported float32/float64 execution path.
DesignableFIR initializes its nn.Module state even when fs=None, then updates kernel coefficients in compute_coefficients() once the sample rate is available. This brings it in line with the lazy-fs pattern that every other filter follows.
StreamProcessor now resets stateful effects at file and chunk boundaries, and rejects effects that change chunk length up front instead of producing garbled output mid-stream.
CLI processing surfaces config/effect parse failures (including invalid constructor parameters) as user-facing messages instead of raw tracebacks.
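As an illustration of the BPM-synced Delay behaviour described in the list above, here is a minimal sketch of recomputing the delay length whenever the sample rate changes. The `BpmDelay` class, its `beats` parameter, and the rounding rule are hypothetical; TorchFX's actual Delay has a different constructor and operates on tensors.

```python
class BpmDelay:
    """Tempo-synced delay length that follows the sample rate (hypothetical)."""

    def __init__(self, bpm: float, beats: float = 1.0, fs=None):
        # Invalid user parameters raise ValueError up front.
        if bpm <= 0:
            raise ValueError("bpm must be positive")
        if beats <= 0:
            raise ValueError("beats must be positive")
        self.bpm = bpm
        self.beats = beats
        self._fs = None
        self.delay_samples = None  # unknown until a sample rate is set
        if fs is not None:
            self.fs = fs

    @property
    def fs(self):
        return self._fs

    @fs.setter
    def fs(self, value):
        if value <= 0:
            raise ValueError("fs must be positive")
        self._fs = value
        # One beat lasts 60 / bpm seconds; recompute on every fs change.
        self.delay_samples = round(60.0 / self.bpm * self.beats * value)


delay = BpmDelay(bpm=120, beats=0.5)  # an eighth note at 120 BPM = 0.25 s
delay.fs = 44100
print(delay.delay_samples)  # 11025
delay.fs = 48000
print(delay.delay_samples)  # 12000
```

Routing every change of fs through one setter is what keeps the delay sample-accurate when the same effect instance is reused across files with different sample rates.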

None of these are flashy on their own, but together they close the regression gaps that turned up while validating the design module against the rest of the pipeline. New tests in tests/test_wave.py, tests/test_effects.py, tests/test_fir.py, tests/test_realtime.py, and tests/test_cli.py lock the new contracts in place.

What Hasn’t Changed#

Everything from 0.5.3 still applies: prebuilt wheels, scikit-build-core build, no pure-PyTorch fallback, native CPU/CUDA kernels. The user-facing API is identical. Filter coefficients computed by 0.5.4’s native design match scipy to within the tolerances above, so any code that depended on specific coefficient values continues to work.

Installation#

pip install torchfx==0.5.4

CUDA wheels remain available from the GitHub Pages index:

pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install torchfx==0.5.4 \
    --index-url https://matteospanio.github.io/torchfx/wheels/cu124/ \
    --extra-index-url https://pypi.org/simple

scipy is no longer in your transitive dependency closure.

What’s Next#

With scipy out of the runtime and the design path running on native code, the next round of optimization work returns to the GPU side: retuning the CUDA SOS kernel for mixed precision so float32 gets the same fast path on GPU that it now has on CPU, and folding the design step into the deferred pipeline so coefficient computation happens once per chain rather than once per filter.

As always, file issues and feature requests on GitHub.

26 May 2026
