TorchFX 0.5.4: Native Filter Design & Goodbye scipy
TorchFX 0.5.4 drops scipy as a runtime dependency. Every filter-design call that used to go through scipy.signal (Butterworth, Chebyshev I/II, Elliptic, Linkwitz-Riley, and DesignableFIR) is now performed by a native pure-PyTorch design module. The library is leaner, the dependency tree is shorter, and the design step itself is 14–50× faster than scipy for Butterworth and Chebyshev I/II, and 7–20× faster for Elliptic, on the parameter ranges we ship.
If your code worked on 0.5.3, it works on 0.5.4 unchanged. The only thing you might notice is that pip install torchfx no longer pulls down ~30 MB of scientific Python you weren’t using.
torchfx.filter._design
The new torchfx.filter._design module is a faithful, drop-in replacement for the handful of scipy entry points TorchFX actually used:
- design_butterworth_sos
- design_cheby1_sos
- design_cheby2_sos
- design_ellip_sos
- design_firwin (multi-band capable)
All five return canonical CPU float64 SOS (or FIR) tensors that plug straight into the existing C++/CUDA kernels — no changes to the dispatch layer, no changes to user code, no changes to the SOS cascade. The implementation follows the standard analog-prototype + bilinear-transform pipeline. The elliptic design is a port of scipy’s algorithm, cross-referenced against Orfanidis’s Lecture Notes on Elliptic Filter Design.
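To make the analog-prototype + bilinear-transform pipeline concrete, here is a minimal sketch of a Butterworth lowpass design written with plain Python complex scalars. It is not the code in torchfx.filter._design: the function names `butter_lowpass_sos` and `sos_response` are hypothetical, and the per-section gain placement is an assumption chosen for readability (scipy, for instance, orders sections and places the gain differently).

```python
import cmath
import math


def butter_lowpass_sos(order, wn):
    """Design a digital Butterworth lowpass as second-order sections.

    wn is the cutoff as a fraction of Nyquist (0 < wn < 1). Each row is
    [b0, b1, b2, a0, a1, a2], scaled so every section has unity gain at DC.
    """
    if order < 1 or not 0.0 < wn < 1.0:
        raise ValueError("need order >= 1 and 0 < wn < 1")

    # Pre-warp the cutoff so the bilinear transform lands it exactly on wn.
    fs = 2.0                      # normalized sample rate (Nyquist = 1)
    fs2 = 2.0 * fs
    warped = fs2 * math.tan(math.pi * wn / fs)

    sos = []
    # Upper-half-plane poles of the analog prototype. Each one stands for a
    # conjugate pair, so one complex scalar yields one biquad.
    for k in range(order // 2):
        theta = math.pi * (2 * k + order + 1) / (2 * order)
        s = warped * cmath.exp(1j * theta)   # analog pole scaled to the cutoff
        z = (fs2 + s) / (fs2 - s)            # bilinear transform of the pole
        gain = warped ** 2 / abs(fs2 - s) ** 2
        # Both zeros of the section sit at z = -1, hence b = gain * [1, 2, 1].
        sos.append([gain, 2.0 * gain, gain, 1.0, -2.0 * z.real, abs(z) ** 2])

    if order % 2:
        # Odd orders add one real pole at s = -warped (a first-order section).
        z = (fs2 - warped) / (fs2 + warped)
        gain = warped / (fs2 + warped)
        sos.append([gain, gain, 0.0, 1.0, -z, 0.0])
    return sos


def sos_response(sos, omega):
    """Evaluate the cascade's frequency response at omega (radians/sample)."""
    z1 = cmath.exp(-1j * omega)  # z^-1 on the unit circle
    h = 1.0 + 0.0j
    for b0, b1, b2, a0, a1, a2 in sos:
        h *= (b0 + b1 * z1 + b2 * z1 * z1) / (a0 + a1 * z1 + a2 * z1 * z1)
    return h


sections = butter_lowpass_sos(order=4, wn=0.25)
print(abs(sos_response(sections, 0.0)))              # DC gain, close to 1
print(abs(sos_response(sections, math.pi * 0.25)))   # cutoff, close to 0.7071
```

Because the cutoff is pre-warped, the magnitude at the requested cutoff is the Butterworth -3 dB point regardless of order, which makes it a convenient sanity check for any implementation of this pipeline.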
Numerical equivalence to scipy is verified by tests/test_native_design.py — 509 tests sweeping orders 1–16, lowpass and highpass, the major FIR windows, and the full Rp/Rs grid for Chebyshev and Elliptic. Tolerances are tight: 1e-12 for Butterworth and FIR, 1e-9 for Chebyshev, 1e-7 for Elliptic up to N=8, and 1e-5 for Elliptic at N=12–16. scipy remains a dev-only dependency precisely so we can keep running these equivalence checks; downstream installs never see it.
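The tolerance scheme described above can be sketched as a small lookup plus a max-absolute-difference check. This is an illustration only, not the contents of tests/test_native_design.py: `tolerance_for`, `max_abs_diff`, and `assert_sos_close` are hypothetical names, and orders for which no tolerance is stated above deliberately raise instead of guessing.

```python
# Tolerances quoted in the text, keyed by filter family.
FIXED_TOLERANCES = {"butterworth": 1e-12, "fir": 1e-12, "chebyshev": 1e-9}


def tolerance_for(family, order):
    """Return the absolute tolerance quoted for a family/order combination."""
    if family == "elliptic":
        if 1 <= order <= 8:
            return 1e-7
        if 12 <= order <= 16:
            return 1e-5
        raise ValueError(f"no elliptic tolerance stated for order {order}")
    return FIXED_TOLERANCES[family]


def max_abs_diff(rows_a, rows_b):
    """Largest element-wise absolute difference between two coefficient tables."""
    if len(rows_a) != len(rows_b):
        raise ValueError("tables have different numbers of rows")
    worst = 0.0
    for row_a, row_b in zip(rows_a, rows_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows have different lengths")
        for x, y in zip(row_a, row_b):
            worst = max(worst, abs(x - y))
    return worst


def assert_sos_close(native, reference, family, order):
    """Fail if the native coefficients drift past the family's tolerance."""
    diff = max_abs_diff(native, reference)
    tol = tolerance_for(family, order)
    assert diff <= tol, f"{family} N={order}: diff {diff:.3e} > tol {tol:.0e}"


# Usage with made-up coefficient rows standing in for real design output.
reference = [[0.25, 0.5, 0.25, 1.0, -0.1, 0.2]]
native = [[0.25, 0.5, 0.25 + 1e-13, 1.0, -0.1, 0.2]]
assert_sos_close(native, reference, "butterworth", 2)
```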
14–50× faster than scipy
The numbers below come from benchmarks/test_design_benchmarks.py: paired side-by-side timings, single core, Python 3.12.
| Filter type | Order | scipy | TorchFX native | Speedup |
|---|---|---|---|---|
| Butterworth | 16 | 470 µs | 26 µs | 18× |
| Chebyshev I | 16 | 480 µs | 32 µs | 15× |
| Chebyshev II | 16 | 490 µs | 37 µs | 13× |
| Elliptic | 16 | 670 µs | 92 µs | 7× |
Across the parameter sweep, native design is 14–50× faster for Butterworth and Chebyshev I/II, and 7–20× faster for Elliptic. Run the benchmarks yourself:
```shell
uv run pytest benchmarks/test_design_benchmarks.py --benchmark-enable
```
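For a quick paired timing outside the benchmark suite, something like the following works with only the standard library. It is a rough sketch, not the project's benchmark code: `best_time` and `paired_speedup` are hypothetical helpers, and the two stand-in functions are placeholders for a scipy design call and its native counterpart.

```python
import timeit


def best_time(fn, number=1000, repeat=5):
    """Best per-call time in seconds over `repeat` runs of `number` calls."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number


def paired_speedup(reference, candidate, number=1000, repeat=5):
    """How many times faster `candidate` runs than `reference`."""
    return best_time(reference, number, repeat) / best_time(candidate, number, repeat)


# Stand-ins only: swap in the two design calls you want to compare.
def slow_stand_in():
    return sum(i * i for i in range(500))


def fast_stand_in():
    return 499 * 500 * 999 // 6  # closed form of the same sum


print(f"speedup: {paired_speedup(slow_stand_in, fast_stand_in):.1f}x")
```

Taking the minimum over several repeats filters out scheduler noise, which matters when the calls being compared take only tens of microseconds.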
The interesting implementation note: the IIR design pipeline runs on Python complex scalars rather than torch tensors. For small problems (N≤16, which covers every realistic order), per-op tensor dispatch dominates the actual arithmetic, so a vectorized tensor implementation is slower, not faster. firwin stays in torch because windows are length-N (~31…1024) where vectorization wins. Knowing which path to take here is what unlocked the speedup.
A small but symbolic side effect: LinkwitzRiley.compute_coefficients replaced its last np.vstack call with torch.cat. After this release, numpy is used only at the I/O boundary (soundfile, sounddevice) — never for signal-processing math in src/torchfx/ itself.
CPU float32 Fast Path
CPU IIR kernels in _csrc/cpu/iir_cpu.cpp previously upcast every input to float64 before running the cascade. Many realtime and ML pipelines pass float32 audio — forcing them through float64 doubled the memory traffic for no accuracy gain at the orders TorchFX targets.
0.5.4 makes the CPU kernels dtype-aware: float32 inputs run in float32, float64 inputs run in float64, no conversion either way. The _ops dispatch and the shared SOS forward path were updated to match. CUDA stays on float64 for now — the kernel generation hasn’t been retuned for mixed precision yet, and that’s a follow-up.
The net effect on a typical float32 IIR chain is a meaningful drop in CPU time and memory bandwidth, with output that agrees with the previous float64-upcast path to within standard float32 rounding.
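To get a feel for how far a float32 cascade drifts from a float64 one, the sketch below runs the same biquad twice in pure Python, once in native float64 and once with every intermediate value rounded to float32. It only emulates the arithmetic and says nothing about how iir_cpu.cpp is written; `to_f32`, `biquad_df1`, the direct-form I structure, and the filter settings (1 kHz lowpass at 48 kHz, 440 Hz test tone) are all assumptions made for the illustration.

```python
import math
import struct


def to_f32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def biquad_df1(signal, b, a, rnd=float):
    """Direct-form I biquad with a[0] == 1; rnd is applied after every step."""
    b0, b1, b2 = (rnd(c) for c in b)
    a1, a2 = rnd(a[1]), rnd(a[2])
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for sample in signal:
        x0 = rnd(sample)
        acc = rnd(b0 * x0)
        acc = rnd(acc + rnd(b1 * x1))
        acc = rnd(acc + rnd(b2 * x2))
        acc = rnd(acc - rnd(a1 * y1))
        acc = rnd(acc - rnd(a2 * y2))
        x2, x1 = x1, x0
        y2, y1 = y1, acc
        out.append(acc)
    return out


# RBJ cookbook lowpass, normalized so that a[0] == 1.
fs, cutoff, q = 48000.0, 1000.0, 1.0 / math.sqrt(2.0)
w0 = 2.0 * math.pi * cutoff / fs
alpha = math.sin(w0) / (2.0 * q)
a0 = 1.0 + alpha
b = ((1.0 - math.cos(w0)) / 2.0 / a0, (1.0 - math.cos(w0)) / a0,
     (1.0 - math.cos(w0)) / 2.0 / a0)
a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)

tone = [math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(2000)]
y64 = biquad_df1(tone, b, a)
y32 = biquad_df1(tone, b, a, rnd=to_f32)
max_err = max(abs(u - v) for u, v in zip(y64, y32))
print(f"max |float64 - float32| = {max_err:.3e}")
```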
Code Hygiene: Six Biquad Subclasses Slimmed Down
Six biquad subclasses (BiquadLPF, BiquadHPF, BiquadNotch, BiquadBPF, BiquadBPFPeak, BiquadAllPass) plus Shelving, ParametricEQ, Notch, and AllPass in iir.py all expressed RBJ cookbook coefficients with an explicit a0_inv divide-and-assign block, which added up to about 50 lines of duplicated normalization plumbing.
0.5.4 replaces that with a single _finalize_coeffs(b0, b1, b2, a0, a1, a2) helper that performs the a0 normalization centrally. Subclasses now read like the RBJ cookbook itself — compute the six raw coefficients, hand them off, done. The old per-class _set_coefficients helper is gone, along with _gain_db and Delay._extend_waveform (each was called from exactly one site).
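The pattern looks roughly like this. The argument order follows the `_finalize_coeffs(b0, b1, b2, a0, a1, a2)` signature quoted above, but the return shape, the standalone `finalize_coeffs` name, and the `rbj_lowpass` caller are assumptions for the sake of a self-contained sketch; the real helper presumably stores tensors on the module rather than returning tuples.

```python
import math


def finalize_coeffs(b0, b1, b2, a0, a1, a2):
    """Normalize raw cookbook coefficients so that a0 becomes 1."""
    if a0 == 0.0:
        raise ValueError("a0 must be non-zero")
    a0_inv = 1.0 / a0
    b = (b0 * a0_inv, b1 * a0_inv, b2 * a0_inv)
    a = (1.0, a1 * a0_inv, a2 * a0_inv)
    return b, a


def rbj_lowpass(cutoff_hz, q, fs):
    """RBJ cookbook lowpass: compute the six raw coefficients, hand them off."""
    w0 = 2.0 * math.pi * cutoff_hz / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2.0 * q)
    return finalize_coeffs(
        (1.0 - cos_w0) / 2.0,  # b0
        1.0 - cos_w0,          # b1
        (1.0 - cos_w0) / 2.0,  # b2
        1.0 + alpha,           # a0
        -2.0 * cos_w0,         # a1
        1.0 - alpha,           # a2
    )


b, a = rbj_lowpass(cutoff_hz=1000.0, q=0.7071, fs=48000.0)
print(b, a)
```

With the normalization in one place, each filter body is just the cookbook formulas, and a change to how coefficients are stored only has to be made once.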
This is purely refactoring — behavior is identical, but the filter implementations are noticeably easier to read.
Wave & Realtime Hardening
A wave of validation and contract-tightening changes landed alongside the design work:
- Wave now enforces internal (channels, samples) shape invariants on every input. 1D mono inputs are normalized to (1, T).
- Wave.get_channel() no longer returns flattened 1D tensors that broke downstream len() semantics.
- Wave.merge is explicit about behaviour: split-channel mode zero-pads to the longest input length, while mix mode requires matching channel counts and raises clearly when they don't.
- Delay and Gain consistently raise ValueError for invalid user parameters. BPM-synced Delay now recomputes delay_samples when fs changes after initialization, so reusing a delay across files with different sample rates stays sample-accurate.
- _ops.delay_line_forward validates rank and dtype, supports tensors with arbitrary leading batch dimensions by flattening and restoring (..., T), and round-trips non-native float dtypes through the extension's supported float32/float64 execution path.
- DesignableFIR initializes its nn.Module state even when fs=None, then updates kernel coefficients in compute_coefficients() once the sample rate is available. This brings it in line with the lazy-fs pattern that every other filter follows.
- StreamProcessor now resets stateful effects at file and chunk boundaries, and rejects effects that change chunk length up front instead of producing garbled output mid-stream.
- CLI processing surfaces config/effect parse failures (including invalid constructor parameters) as user-facing messages instead of raw tracebacks.
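The BPM-synced delay behaviour can be pictured with a small stand-alone class. This is a sketch of the idea only: `BpmDelaySketch`, its `beats` parameter, and the rounding rule are hypothetical and are not taken from TorchFX's Delay implementation.

```python
class BpmDelaySketch:
    """Toy tempo-synced delay: delay_samples tracks fs whenever fs changes."""

    def __init__(self, bpm, beats=1.0, fs=None):
        if bpm <= 0:
            raise ValueError(f"bpm must be positive, got {bpm}")
        if beats <= 0:
            raise ValueError(f"beats must be positive, got {beats}")
        self.bpm = bpm
        self.beats = beats
        self._fs = None
        self.delay_samples = None  # unknown until a sample rate is set
        if fs is not None:
            self.fs = fs  # goes through the setter below

    @property
    def fs(self):
        return self._fs

    @fs.setter
    def fs(self, value):
        if value <= 0:
            raise ValueError(f"fs must be positive, got {value}")
        self._fs = value
        # One beat lasts 60 / bpm seconds; convert to samples at the new rate.
        self.delay_samples = round(60.0 / self.bpm * self.beats * value)


delay = BpmDelaySketch(bpm=120, beats=1.0, fs=44100)
print(delay.delay_samples)  # 22050
delay.fs = 48000            # same object reused on a 48 kHz file
print(delay.delay_samples)  # 24000
```

Deriving delay_samples inside the setter means there is no stale cached value to forget about when the same effect instance is reused at a different sample rate.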
None of these are flashy on their own, but together they close the regression gaps that turned up while validating the design module against the rest of the pipeline. New tests in tests/test_wave.py, tests/test_effects.py, tests/test_fir.py, tests/test_realtime.py, and tests/test_cli.py lock the new contracts in place.
What Hasn’t Changed
Everything from 0.5.3 still applies: prebuilt wheels, scikit-build-core build, no pure-PyTorch fallback, native CPU/CUDA kernels. The user-facing API is identical. Filter coefficients computed by 0.5.4’s native design match scipy to within the tolerances above, so any code that depended on specific coefficient values continues to work.
Installation
```shell
pip install torchfx==0.5.4
```
CUDA wheels remain available from the GitHub Pages index:
```shell
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install torchfx==0.5.4 \
    --index-url https://matteospanio.github.io/torchfx/wheels/cu124/ \
    --extra-index-url https://pypi.org/simple
```
scipy is no longer in your transitive dependency closure.
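One way to check this in your own environment is to read the installed package metadata with the standard library. The sketch below inspects only the direct requirements that a distribution declares, not the full transitive closure, and it skips requirements gated behind an optional extra; the helper names are hypothetical.

```python
import re
from importlib import metadata


def requirement_name(requirement):
    """Extract the normalized project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"unparseable requirement: {requirement!r}")
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def is_extra_only(requirement):
    """True when the requirement only applies to an optional extra."""
    _, _, marker = requirement.partition(";")
    return re.search(r"\bextra\s*==", marker) is not None


def runtime_requirements(dist_name):
    """Names of direct, non-extra requirements, or None if not installed."""
    try:
        declared = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return None
    return sorted({requirement_name(r) for r in declared if not is_extra_only(r)})


reqs = runtime_requirements("torchfx")
if reqs is None:
    print("torchfx is not installed in this environment")
else:
    print("scipy in direct runtime requirements:", "scipy" in reqs)
```

To cover transitive dependencies as well, the same function can be applied recursively to each returned name.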
What’s Next
With scipy out of the runtime and the design path running on native code, the next round of optimization work returns to the GPU side: retuning the CUDA SOS kernel for mixed precision so float32 gets the same fast path on GPU that it now has on CPU, and folding the design step into the deferred pipeline so coefficient computation happens once per chain rather than once per filter.
As always, file issues and feature requests on GitHub.