TorchFX 0.5.2: Transparent Filter Fusion & Unified Forward Paths#

TorchFX 0.5.2 focuses on two things: making filter chains faster without changing your code, and cleaning up internal duplication so the library is easier to maintain and extend.

The headline feature is deferred pipeline fusion: when you chain IIR and biquad filters together, TorchFX now automatically merges them into a single fused kernel call, eliminating per-filter Python dispatch overhead.

Deferred Pipeline with Auto-Fusion#

Previously, wave | f1 | f2 | f3 executed each filter immediately, one at a time. Now, Wave.__or__ accumulates filters in a lazy pipeline that only materializes when you access .ys. At materialization time, consecutive IIR and biquad filters are automatically fused into a single FusedSOSCascade: their SOS matrices are concatenated into one [K_total, 6] tensor and processed in a single native kernel call.

from torch import nn

from torchfx import Wave
from torchfx.filter.iir import LoButterworth, HiButterworth

wave = Wave.from_file("input.wav")

# All three syntaxes benefit from auto-fusion:
result = wave | LoButterworth(4000, order=2) | HiButterworth(200, order=2)
result = wave | (LoButterworth(4000, order=2) | HiButterworth(200, order=2))
result = wave | nn.Sequential(LoButterworth(4000, order=2), HiButterworth(200, order=2))

ys = result.ys  # the lazy pipeline materializes here, as one fused kernel call

Fusion is transparent: the numerical result is identical to applying filters sequentially. Non-fusible effects (like Gain) break the chain naturally: wave | f1 | f2 | Gain(0.5) | f3 | f4 fuses f1+f2 and f3+f4 independently, with Gain applied between them.
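
A concrete sketch of that boundary behavior (the filter parameters here are arbitrary):

from torchfx.effect import Gain
from torchfx.filter.iir import LoButterworth, HiButterworth

# The Gain splits the chain into two fusible runs: the first two filters
# fuse into one cascade, the last two into another, with the gain applied
# in between.
result = (
    wave
    | HiButterworth(100, order=2)
    | LoButterworth(8000, order=2)
    | Gain(0.5)
    | HiButterworth(200, order=2)
    | LoButterworth(4000, order=2)
)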

FilterChain and the Pipe Operator#

The pipe operator | now works directly between filters and effects, not just with Wave:

from torchfx.filter.iir import LoButterworth, HiButterworth
from torchfx.effect import Gain

# Build reusable chains
bandpass = HiButterworth(200, order=2) | LoButterworth(4000, order=2)
chain = bandpass | Gain(0.8)

# Apply to audio
result = wave | chain

FilterChain is an auto-flattening nn.Sequential subclass: (f1 | f2) | f3 produces a flat FilterChain(f1, f2, f3), not nested containers. It's exported from the top-level torchfx package.
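
A quick check of the flattening behavior, relying only on the nn.Sequential semantics described above:

from torchfx import FilterChain
from torchfx.effect import Gain
from torchfx.filter.iir import LoButterworth, HiButterworth

f1 = HiButterworth(200, order=2)
f2 = LoButterworth(4000, order=2)
f3 = Gain(0.8)

chain = (f1 | f2) | f3
assert isinstance(chain, FilterChain)
assert len(chain) == 3  # flat FilterChain(f1, f2, f3), no nested containers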

Unified Biquad/IIR Forward Path#

A biquad filter is mathematically a single second-order IIR section (K=1). Its SOS matrix is simply [[b0, b1, b2, 1.0, a1, a2]], a [1, 6] tensor. Yet previously, Biquad and IIR had completely separate forward-path implementations: different state management, different native dispatch, different fallback paths – about 200 lines of duplicated logic.
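
To make the correspondence concrete, here is an illustrative SOS row; the coefficient values are just an example (a second-order Butterworth lowpass with cutoff at a quarter of the sample rate):

import torch

# Normalized biquad coefficients (a0 = 1.0); numerically these match
# scipy.signal.butter(2, 0.5).
b0, b1, b2 = 0.2929, 0.5858, 0.2929
a1, a2 = 0.0, 0.1716

# One second-order section in SOS layout [b0, b1, b2, a0, a1, a2]:
sos = torch.tensor([[b0, b1, b2, 1.0, a1, a2]])
print(sos.shape)  # torch.Size([1, 6])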

In 0.5.2, Biquad stores its coefficients as a [1, 6] SOS tensor and delegates its forward() to the same _sos_cascade_forward helper used by IIR and FusedSOSCascade. This:

  • Removes ~150 lines of duplicated forward-path code

  • Enables mixed fusion – wave | BiquadLPF(...) | LoButterworth(...) now auto-fuses (see the sketch after this list)

  • Retains the specialized CUDA kernel – when num_sections == 1 on CUDA, _sos_cascade_forward dispatches to the optimized biquad_forward kernel (128-channel batching per thread block), so single-biquad performance is preserved

  • Maintains backward compatibility – Biquad.b and Biquad.a are still accessible as read-only properties
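
A minimal sketch of the mixed fusion mentioned above; the BiquadLPF import path and constructor signature are assumptions here, not confirmed API:

from torchfx import Wave
from torchfx.filter.biquad import BiquadLPF  # import path assumed
from torchfx.filter.iir import LoButterworth

wave = Wave.from_file("input.wav")

# The biquad's [1, 6] SOS row is concatenated with the Butterworth's
# sections, so the whole chain runs as one fused cascade.
result = wave | BiquadLPF(8000) | LoButterworth(4000, order=2)
ys = result.ys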

Performance Improvements#

Beyond fusion, several targeted optimizations reduce per-call overhead:

  • SOS coefficient caching: device-matched SOS tensors are cached between forward calls, eliminating per-call .to() device transfers that accounted for 21% of Self CPU time in batch profiles

  • In-place state updates: replaced per-section torch.stack() allocations with copy_() into pre-existing state buffers

  • Reverb op fusion: simplified the PyTorch fallback from 5 tensor ops to 2 (clone + add_ with alpha)

  • Delay wet/dry mix: replaced (1-mix)*x + mix*y (3 ops) with torch.lerp (1 fused op); see the sketch after this list

  • Biquad coefficient caching: feedback coefficients a1, a2 are pre-extracted as Python floats at coefficient-computation time, eliminating per-forward GPU-to-CPU sync
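
The wet/dry change is a drop-in algebraic rewrite; a self-contained equivalence check (shapes arbitrary):

import torch

x = torch.randn(2, 44100)  # dry signal
y = torch.randn(2, 44100)  # wet (delayed) signal
mix = 0.35

old = (1.0 - mix) * x + mix * y  # three tensor ops
new = torch.lerp(x, y, mix)      # one fused op: x + mix * (y - x)
assert torch.allclose(old, new, atol=1e-6)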

Benchmark Results (Quadro RTX 6000)#

Measured on a Quadro RTX 6000 (24 GB), CUDA 12.8, PyTorch 2.10.0:

Full Pipeline (bandpass + gain + lowpass chain, 44.1 kHz, 2ch, 5s)#

| Device | Mean    | Relative    |
|--------|---------|-------------|
| CUDA   | 3.57 ms | 1.0x        |
| CPU    | 5.98 ms | 1.7x slower |

SOS Cascade (44.1 kHz, 2ch, 5s)#

| Sections | CPU     | CUDA    |
|----------|---------|---------|
| 4        | 1.66 ms | 1.28 ms |
| 8        | 2.63 ms | 2.43 ms |

Biquad Stateful (44.1 kHz, 1ch)#

| Duration | CPU     | CUDA    |
|----------|---------|---------|
| 0.1 s    | 93 µs   | 239 µs  |
| 5 s      | 1.17 ms | 487 µs  |
| 30 s     | 7.0 ms  | 2.37 ms |

IIR 6th-order Butterworth (44.1 kHz)#

| Backend         | 60 s / 1 ch | 60 s / 8 ch |
|-----------------|-------------|-------------|
| TorchFX GPU     | 15.2 ms     | 75.3 ms     |
| TorchFX CPU     | 53.2 ms     | 129.9 ms    |
| SciPy           | 61.5 ms     | 857.9 ms    |
| Numba @njit CPU | 71.6 ms     | 95.4 ms     |
| Numba @cuda.jit | 1,584 ms    | 1,652 ms    |

TorchFX GPU is 4x faster than SciPy for single-channel and 11x faster for 8-channel workloads. The Numba CUDA baseline confirms that the custom parallel scan kernel significantly outperforms a naive GPU implementation.

Test Coverage#

Test coverage increased from 74% to 88%, with a fail_under = 87 gate now enforced in CI. New test files (per-module coverage before → after in parentheses):

  • test_fused.py – FusedSOSCascade edge cases (9% → 75%)

  • test_filter_base.py – AbstractFilter and ParallelFilterCombination (44% → 89%)

  • test_filter_utils.py – filter utilities (0% → 91%)

  • test_ops_dispatch.py – native dispatch and fallback paths (64% → 84%)

  • test_filterbank.py – LogFilterBank (73% → 100%)

  • test_iir_gaps.py – IIR edge cases (71% → 86%)

  • test_chain_fusion.py – deferred pipeline, fusion, and FilterChain

Bug Fixes#

  • AbstractFilter._has_computed_coeff incorrectly returned True for IIR subclasses whose _sos is None, silently claiming coefficients were ready before compute_coefficients() had run

  • ParallelFilterCombination.__init__ assigned self.fs before self.filters, crashing when fs was passed at construction time

  • iir_cpu.cpp used a hardcoded double sec_sx0[16] stack array, silently overflowing for filters with >16 SOS sections (order >32). Now falls back to heap allocation for high orders.
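
As an illustration of the last fix, a filter with more than 16 sections now runs through the heap-allocation path; this sketch assumes LoButterworth accepts an order this high:

from torchfx import Wave
from torchfx.filter.iir import LoButterworth

wave = Wave.from_file("input.wav")

# order=40 -> 20 second-order sections, past the old 16-section stack
# buffer in iir_cpu.cpp; the heap fallback now handles it on the CPU path.
ys = (wave | LoButterworth(4000, order=40)).ys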

Installation#

pip install torchfx==0.5.2

Benchmarks were run on a Quadro RTX 6000 (24 GB) with CUDA 12.8 and PyTorch 2.10.0. Performance varies by hardware and software configuration – always benchmark on your target system.