---
blogpost: true
date: Apr 13, 2026
author: Matteo Spanio
category: releases
tags: release, performance, fusion, pipeline
---

# TorchFX 0.5.2: Transparent Filter Fusion & Unified Forward Paths

**TorchFX 0.5.2** focuses on two things: making filter chains faster without changing your code, and removing internal duplication so the library is easier to maintain and extend. The headline feature is **deferred pipeline fusion**: when you chain IIR and biquad filters together, TorchFX now automatically merges them into a single fused kernel call, eliminating per-filter Python dispatch overhead.

## Deferred Pipeline with Auto-Fusion

Previously, `wave | f1 | f2 | f3` executed each filter immediately, one at a time. Now, `Wave.__or__` accumulates filters in a lazy pipeline that only materializes when you access `.ys`. At materialization time, consecutive IIR and biquad filters are automatically fused into a single `FusedSOSCascade`: their SOS matrices are concatenated into one `[K_total, 6]` tensor and processed in a single native kernel call.

```python
import torch.nn as nn

from torchfx import Wave
from torchfx.filter.iir import LoButterworth, HiButterworth

wave = Wave.from_file("input.wav")

# All three syntaxes benefit from auto-fusion:
result = wave | LoButterworth(4000, order=2) | HiButterworth(200, order=2)
result = wave | (LoButterworth(4000, order=2) | HiButterworth(200, order=2))
result = wave | nn.Sequential(LoButterworth(4000, order=2), HiButterworth(200, order=2))
```

Fusion is transparent: the numerical result is identical to applying the filters sequentially. Non-fusible effects (like `Gain`) break the chain naturally, so `wave | f1 | f2 | Gain(0.5) | f3 | f4` fuses `f1+f2` and `f3+f4` independently, with `Gain` applied between them.
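To make the laziness concrete, here is a toy model of a deferred pipeline. It illustrates only the accumulate-then-materialize idea; `LazyWave` and `_pending` are made-up names, and TorchFX's real implementation differs (for one, it merges fusible runs before executing):

```python
import torch
import torch.nn as nn

class LazyWave:
    """Toy model: `|` only accumulates work, `.ys` materializes it."""

    def __init__(self, x: torch.Tensor) -> None:
        self._x = x
        self._pending: list[nn.Module] = []

    def __or__(self, f: nn.Module) -> "LazyWave":
        self._pending.append(f)  # defer: nothing is computed yet
        return self

    @property
    def ys(self) -> torch.Tensor:
        # Materialization point. In TorchFX, runs of fusible IIR/biquad
        # filters would be merged into one kernel call before this loop.
        y = self._x
        for f in self._pending:
            y = f(y)
        return y
```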
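The transparency claim can be sanity-checked outside the library: a concatenated SOS cascade performs exactly the same per-section recurrences, in the same order, as running the individual cascades back to back. A minimal check with SciPy (illustrative only; no TorchFX internals involved):

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
x = rng.standard_normal(44_100)

# Two 2nd-order filters, mirroring the lowpass/highpass chain above.
sos_lo = signal.butter(2, 4000, btype="low", fs=44_100, output="sos")
sos_hi = signal.butter(2, 200, btype="high", fs=44_100, output="sos")

# Sequential: one pass per filter.
y_seq = signal.sosfilt(sos_hi, signal.sosfilt(sos_lo, x))

# Fused: concatenate the [K, 6] rows and run a single cascade.
y_fused = signal.sosfilt(np.vstack([sos_lo, sos_hi]), x)

assert np.allclose(y_seq, y_fused)
```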
## `FilterChain` and the Pipe Operator

The pipe operator `|` now works directly between filters and effects, not just with `Wave`:

```python
from torchfx.filter.iir import LoButterworth, HiButterworth
from torchfx.effect import Gain

# Build reusable chains
bandpass = HiButterworth(200, order=2) | LoButterworth(4000, order=2)
chain = bandpass | Gain(0.8)

# Apply to audio
result = wave | chain
```

`FilterChain` is an auto-flattening `nn.Sequential` subclass: `(f1 | f2) | f3` produces a flat `FilterChain(f1, f2, f3)`, not nested containers. It is exported from the top-level `torchfx` package.
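The flattening behavior is easy to picture. Here is a minimal sketch of the idea; `FlatChain` is a made-up name for illustration, and TorchFX's actual `FilterChain` implementation may differ:

```python
import torch.nn as nn

class FlatChain(nn.Sequential):
    """Illustrative auto-flattening chain: nested chains are spliced flat."""

    def __init__(self, *modules: nn.Module) -> None:
        flat: list[nn.Module] = []
        for m in modules:
            if isinstance(m, FlatChain):
                flat.extend(m.children())  # splice a nested chain's contents
            else:
                flat.append(m)
        super().__init__(*flat)

    def __or__(self, other: nn.Module) -> "FlatChain":
        return FlatChain(self, other)

f1, f2, f3 = nn.Identity(), nn.Identity(), nn.Identity()
assert len(FlatChain(f1, f2) | f3) == 3  # flat, not nested
```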
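As a usage note, a chain built once is an ordinary module and can be applied to any number of inputs. The snippet below uses only the API already shown in this post; accessing `.ys` forces materialization, as described in the deferred-pipeline section:

```python
from torchfx import Wave
from torchfx.effect import Gain
from torchfx.filter.iir import HiButterworth, LoButterworth

# Build the chain once, reuse it per file.
master = HiButterworth(200, order=2) | LoButterworth(4000, order=2) | Gain(0.8)

for path in ["vocals.wav", "drums.wav"]:
    ys = (Wave.from_file(path) | master).ys  # .ys materializes the pipeline
```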
## Unified Biquad/IIR Forward Path

A biquad filter is mathematically a single second-order IIR section (K = 1). Its SOS matrix is simply `[[b0, b1, b2, 1.0, a1, a2]]`, a `[1, 6]` tensor. Yet previously, `Biquad` and `IIR` had completely separate forward-path implementations: different state management, different native dispatch, different fallback paths, about 200 lines of duplicated logic in total.

In 0.5.2, `Biquad` stores its coefficients as a `[1, 6]` SOS tensor and delegates its `forward()` to the same `_sos_cascade_forward` helper used by `IIR` and `FusedSOSCascade`. This:

- **Removes ~150 lines** of duplicated forward-path code
- **Enables mixed fusion**: `wave | BiquadLPF(...) | LoButterworth(...)` now auto-fuses
- **Retains the specialized CUDA kernel**: when `num_sections == 1` on CUDA, `_sos_cascade_forward` dispatches to the optimized `biquad_forward` kernel (128-channel batching per thread block), so single-biquad performance is preserved
- **Maintains backward compatibility**: `Biquad.b` and `Biquad.a` are still accessible as read-only properties

## Performance Improvements

Beyond fusion, several targeted optimizations reduce per-call overhead:

- **SOS coefficient caching**: device-matched SOS tensors are cached between forward calls, eliminating the per-call `.to()` device transfers that accounted for 21% of Self CPU time in batch profiles
- **In-place state updates**: per-section `torch.stack()` allocations are replaced with `copy_()` into pre-existing state buffers
- **Reverb op fusion**: the PyTorch fallback is simplified from 5 tensor ops to 2 (`clone` + `add_` with alpha)
- **Delay wet/dry mix**: `(1 - mix) * x + mix * y` (3 ops) is replaced with a single fused `torch.lerp`
- **Biquad coefficient caching**: the feedback coefficients `a1` and `a2` are pre-extracted as Python floats at coefficient-computation time, eliminating a per-forward GPU-to-CPU sync

## Benchmark Results (Quadro RTX 6000)

Measured on a Quadro RTX 6000 (24 GB), CUDA 12.8, PyTorch 2.10.0.

### Full Pipeline (bandpass + gain + lowpass chain, 44.1 kHz, 2 ch, 5 s)

| Device | Mean | Relative |
|--------|-----:|----------|
| **CUDA** | **3.57 ms** | 1.0x |
| CPU | 5.98 ms | 1.7x slower |

### SOS Cascade (44.1 kHz, 2 ch, 5 s)

| Sections | CPU | CUDA |
|---------:|----:|-----:|
| 4 | 1.66 ms | 1.28 ms |
| 8 | 2.63 ms | 2.43 ms |

### Biquad Stateful (44.1 kHz, 1 ch)

| Duration | CPU | CUDA |
|---------:|----:|-----:|
| 0.1 s | 93 µs | 239 µs |
| 5 s | 1.17 ms | 487 µs |
| 30 s | 7.0 ms | 2.37 ms |

### IIR 6th-Order Butterworth (44.1 kHz)

| Backend | 60 s / 1 ch | 60 s / 8 ch |
|---------|------------:|------------:|
| **TorchFX GPU** | **15.2 ms** | **75.3 ms** |
| TorchFX CPU | 53.2 ms | 129.9 ms |
| SciPy | 61.5 ms | 857.9 ms |
| Numba `@njit` CPU | 71.6 ms | 95.4 ms |
| Numba `@cuda.jit` | 1,584 ms | 1,652 ms |

TorchFX GPU is **4x faster than SciPy** for single-channel and **11x faster** for 8-channel workloads. The Numba CUDA baseline confirms that the custom parallel scan kernel significantly outperforms a naive GPU implementation.

## Test Coverage

Test coverage increased from **74% to 88%**, with a `fail_under = 87` gate enforced in CI. New test files (with the per-module coverage change each one drives):

- `test_fused.py`: FusedSOSCascade edge cases (9% to 75%)
- `test_filter_base.py`: AbstractFilter and ParallelFilterCombination (44% to 89%)
- `test_filter_utils.py`: filter utilities (0% to 91%)
- `test_ops_dispatch.py`: native dispatch and fallback paths (64% to 84%)
- `test_filterbank.py`: LogFilterBank (73% to 100%)
- `test_iir_gaps.py`: IIR edge cases (71% to 86%)
- `test_chain_fusion.py`: deferred pipeline, fusion, and FilterChain

## Bug Fixes

- `AbstractFilter._has_computed_coeff` incorrectly returned `True` for IIR subclasses whose `_sos` was `None`, silently claiming coefficients were ready before `compute_coefficients()` had run
- `ParallelFilterCombination.__init__` assigned `self.fs` before `self.filters`, crashing when `fs` was passed at construction time
- `iir_cpu.cpp` used a hardcoded `double sec_sx0[16]` stack array, silently overflowing for filters with more than 16 SOS sections (order > 32); it now falls back to heap allocation for high orders

## Installation

```bash
pip install torchfx==0.5.2
```

> Benchmarks were run on a Quadro RTX 6000 (24 GB) with CUDA 12.8 and PyTorch 2.10.0. Performance varies by hardware and software configuration; always benchmark on your target system.
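If you want a rough number on your own machine, a minimal timing sketch for the pipeline shape used in the benchmark might look like the following. Two assumptions to flag: it presumes `Wave` can be constructed directly from a tensor with a sampling rate (only `Wave.from_file` appears in this post, so check the API docs), and wall-clock timing of CUDA work is only meaningful after synchronization:

```python
import time

import torch
from torchfx import Wave
from torchfx.effect import Gain
from torchfx.filter.iir import HiButterworth, LoButterworth

# 5 s of stereo noise at 44.1 kHz, matching the pipeline benchmark above.
# ASSUMPTION: a tensor-based Wave constructor; adjust to the actual API.
wave = Wave(torch.randn(2, 5 * 44_100), fs=44_100)

start = time.perf_counter()
ys = (wave | HiButterworth(200, order=2) | Gain(0.8) | LoButterworth(4000, order=2)).ys
if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for any queued CUDA work before stopping the clock
print(f"pipeline: {(time.perf_counter() - start) * 1e3:.2f} ms")
```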