TorchFX 0.5.2: Transparent Filter Fusion & Unified Forward Paths#

TorchFX 0.5.2 focuses on two things: making filter chains faster without changing your code, and cleaning up internal duplication so the library is easier to maintain and extend.

The headline feature is deferred pipeline fusion: when you chain IIR and biquad filters together, TorchFX now automatically merges them into a single fused kernel call, eliminating per-filter Python dispatch overhead.

Deferred Pipeline with Auto-Fusion#

Previously, wave | f1 | f2 | f3 executed each filter immediately, one at a time. Now, Wave.__or__ accumulates filters in a lazy pipeline that only materializes when you access .ys. At materialization time, consecutive IIR and biquad filters are automatically fused into a single FusedSOSCascade: their SOS matrices are concatenated into one [K_total, 6] tensor and processed in a single native kernel call.

from torch import nn

from torchfx import Wave
from torchfx.filter.iir import LoButterworth, HiButterworth

wave = Wave.from_file("input.wav")

# All three syntaxes benefit from auto-fusion:
result = wave | LoButterworth(4000, order=2) | HiButterworth(200, order=2)
result = wave | (LoButterworth(4000, order=2) | HiButterworth(200, order=2))
result = wave | nn.Sequential(LoButterworth(4000, order=2), HiButterworth(200, order=2))

ys = result.ys  # the lazy pipeline materializes here, as one fused kernel call

Fusion is transparent: the numerical result is identical to applying filters sequentially. Non-fusible effects (like Gain) break the chain naturally: wave | f1 | f2 | Gain(0.5) | f3 | f4 fuses f1+f2 and f3+f4 independently, with Gain applied between them.
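
A concrete sketch of that boundary behavior (the filter parameters here are arbitrary):

from torchfx.effect import Gain
from torchfx.filter.iir import LoButterworth, HiButterworth

# The Gain splits the chain into two fusible runs: the first two filters
# fuse into one cascade, the last two into another, with the gain applied
# in between.
result = (
    wave
    | HiButterworth(100, order=2)
    | LoButterworth(8000, order=2)
    | Gain(0.5)
    | HiButterworth(200, order=2)
    | LoButterworth(4000, order=2)
)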

FilterChain and the Pipe Operator#

The pipe operator | now works directly between filters and effects, not just with Wave:

from torchfx.filter.iir import LoButterworth, HiButterworth
from torchfx.effect import Gain

# Build reusable chains
bandpass = HiButterworth(200, order=2) | LoButterworth(4000, order=2)
chain = bandpass | Gain(0.8)

# Apply to audio
result = wave | chain

FilterChain is an auto-flattening nn.Sequential subclass: (f1 | f2) | f3 produces a flat FilterChain(f1, f2, f3), not nested containers. It's exported from the top-level torchfx package.
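
A quick check of the flattening behavior, relying only on the nn.Sequential semantics described above:

from torchfx import FilterChain
from torchfx.effect import Gain
from torchfx.filter.iir import LoButterworth, HiButterworth

f1 = HiButterworth(200, order=2)
f2 = LoButterworth(4000, order=2)
f3 = Gain(0.8)

chain = (f1 | f2) | f3
assert isinstance(chain, FilterChain)
assert len(chain) == 3  # flat FilterChain(f1, f2, f3), no nested containers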

Unified Biquad/IIR Forward Path#

A biquad filter is mathematically a single second-order IIR section (K=1). Its SOS matrix is simply [[b0, b1, b2, 1.0, a1, a2]], a [1, 6] tensor. Yet previously, Biquad and IIR had completely separate forward-path implementations: different state management, different native dispatch, different fallback paths – about 200 lines of duplicated logic.
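
To make the correspondence concrete, here is an illustrative SOS row; the coefficient values are just an example (a second-order Butterworth lowpass with cutoff at a quarter of the sample rate):

import torch

# Normalized biquad coefficients (a0 = 1.0); numerically these match
# scipy.signal.butter(2, 0.5).
b0, b1, b2 = 0.2929, 0.5858, 0.2929
a1, a2 = 0.0, 0.1716

# One second-order section in SOS layout [b0, b1, b2, a0, a1, a2]:
sos = torch.tensor([[b0, b1, b2, 1.0, a1, a2]])
print(sos.shape)  # torch.Size([1, 6])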

In 0.5.2, Biquad stores its coefficients as a [1, 6] SOS tensor and delegates its forward() to the same _sos_cascade_forward helper used by IIR and FusedSOSCascade. This:

  • Removes ~150 lines of duplicated forward-path code

  • Enables mixed fusion – wave | BiquadLPF(...) | LoButterworth(...) now auto-fuses (see the sketch after this list)

  • Retains the specialized CUDA kernel – when num_sections == 1 on CUDA, _sos_cascade_forward dispatches to the optimized biquad_forward kernel (128-channel batching per thread block), so single-biquad performance is preserved

  • Maintains backward compatibility – Biquad.b and Biquad.a are still accessible as read-only properties
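
A minimal sketch of the mixed fusion mentioned above; the BiquadLPF import path and constructor signature are assumptions here, not confirmed API:

from torchfx import Wave
from torchfx.filter.biquad import BiquadLPF  # import path assumed
from torchfx.filter.iir import LoButterworth

wave = Wave.from_file("input.wav")

# The biquad's [1, 6] SOS row is concatenated with the Butterworth's
# sections, so the whole chain runs as one fused cascade.
result = wave | BiquadLPF(8000) | LoButterworth(4000, order=2)
ys = result.ys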

Performance Improvements#

Beyond fusion, several targeted optimizations reduce per-call overhead:

  • SOS coefficient caching: device-matched SOS tensors are cached between forward calls, eliminating per-call .to() device transfers that accounted for 21% of Self CPU time in batch profiles

  • In-place state updates: replaced per-section torch.stack() allocations with copy_() into pre-existing state buffers

  • Reverb op fusion: simplified the PyTorch fallback from 5 tensor ops to 2 (clone + add_ with alpha)

  • Delay wet/dry mix: replaced (1-mix)*x + mix*y (3 ops) with torch.lerp (1 fused op); see the sketch after this list

  • Biquad coefficient caching: feedback coefficients a1, a2 are pre-extracted as Python floats at coefficient-computation time, eliminating per-forward GPU-to-CPU sync
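
The wet/dry change is a drop-in algebraic rewrite; a self-contained equivalence check (shapes arbitrary):

import torch

x = torch.randn(2, 44100)  # dry signal
y = torch.randn(2, 44100)  # wet (delayed) signal
mix = 0.35

old = (1.0 - mix) * x + mix * y  # three tensor ops
new = torch.lerp(x, y, mix)      # one fused op: x + mix * (y - x)
assert torch.allclose(old, new, atol=1e-6)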

Benchmark Results (Quadro RTX 6000)#

Measured on a Quadro RTX 6000 (24 GB), CUDA 12.8, PyTorch 2.10.0:

Full Pipeline (bandpass + gain + lowpass chain, 44.1 kHz, 2ch, 5s)#

| Device | Mean    | Relative    |
|--------|---------|-------------|
| CUDA   | 3.57 ms | 1.0x        |
| CPU    | 5.98 ms | 1.7x slower |

SOS Cascade (44.1 kHz, 2ch, 5s)#

| Sections | CPU     | CUDA    |
|----------|---------|---------|
| 4        | 1.66 ms | 1.28 ms |
| 8        | 2.63 ms | 2.43 ms |

Biquad Stateful (44.1 kHz, 1ch)#

| Duration | CPU     | CUDA    |
|----------|---------|---------|
| 0.1 s    | 93 µs   | 239 µs  |
| 5 s      | 1.17 ms | 487 µs  |
| 30 s     | 7.0 ms  | 2.37 ms |

IIR 6th-order Butterworth (44.1 kHz)#

| Backend         | 60 s / 1 ch | 60 s / 8 ch |
|-----------------|-------------|-------------|
| TorchFX GPU     | 15.2 ms     | 75.3 ms     |
| TorchFX CPU     | 53.2 ms     | 129.9 ms    |
| SciPy           | 61.5 ms     | 857.9 ms    |
| Numba @njit CPU | 71.6 ms     | 95.4 ms     |
| Numba @cuda.jit | 1,584 ms    | 1,652 ms    |

TorchFX GPU is 4x faster than SciPy for single-channel and 11x faster for 8-channel workloads. The Numba CUDA baseline confirms that the custom parallel scan kernel significantly outperforms a naive GPU implementation.

Test Coverage#

Test coverage increased from 74% to 88%, with a fail_under = 87 gate now enforced in CI. New test files (per-module coverage before → after in parentheses):

  • test_fused.py – FusedSOSCascade edge cases (9% → 75%)

  • test_filter_base.py – AbstractFilter and ParallelFilterCombination (44% → 89%)

  • test_filter_utils.py – filter utilities (0% → 91%)

  • test_ops_dispatch.py – native dispatch and fallback paths (64% → 84%)

  • test_filterbank.py – LogFilterBank (73% → 100%)

  • test_iir_gaps.py – IIR edge cases (71% → 86%)

  • test_chain_fusion.py – deferred pipeline, fusion, and FilterChain

Bug Fixes#

  • AbstractFilter._has_computed_coeff incorrectly returned True for IIR subclasses whose _sos is None, silently claiming coefficients were ready before compute_coefficients() had run

  • ParallelFilterCombination.__init__ assigned self.fs before self.filters, crashing when fs was passed at construction time

  • iir_cpu.cpp used a hardcoded double sec_sx0[16] stack array, silently overflowing for filters with >16 SOS sections (order >32). Now falls back to heap allocation for high orders.
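
As an illustration of the last fix, a filter with more than 16 sections now runs through the heap-allocation path; this sketch assumes LoButterworth accepts an order this high:

from torchfx import Wave
from torchfx.filter.iir import LoButterworth

wave = Wave.from_file("input.wav")

# order=40 -> 20 second-order sections, past the old 16-section stack
# buffer in iir_cpu.cpp; the heap fallback now handles it on the CPU path.
ys = (wave | LoButterworth(4000, order=40)).ys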

Installation#

pip install torchfx==0.5.2

Benchmarks were run on a Quadro RTX 6000 (24 GB) with CUDA 12.8 and PyTorch 2.10.0. Performance varies by hardware and software configuration – always benchmark on your target system.