---
blogpost: true
date: Mar 27, 2026
author: Matteo Spanio
category: releases
tags: release, cuda, performance, native-extension, fft
---

# TorchFX 0.5.0: Custom CUDA Kernels & Native C++ Extension

I'm excited to announce **TorchFX 0.5.0**, a performance-focused release that introduces custom CUDA kernels, a JIT-compiled C++ native extension, and major algorithmic improvements across the entire filter pipeline. This release delivers on the Phase 3 optimization goals outlined in the 0.4.0 roadmap.

## The Native Extension (`torchfx._ops`)

At the core of 0.5.0 is a new JIT-compiled C++/CUDA extension that loads automatically when you import TorchFX. The extension is compiled on first use via `torch.utils.cpp_extension` and cached for subsequent imports.

```python
import torchfx
# [torchfx] native extension: YES
# [torchfx] CUDA available: True (NVIDIA RTX 6000)
```

**Key design decisions:**

- **Automatic fallback**: If compilation fails (no compiler, no CUDA toolkit), TorchFX falls back to pure-PyTorch paths transparently. Your code doesn't change.
- **CPU-only support**: The C++ extension compiles and loads without the CUDA toolkit, so you get native-speed IIR filtering on CPU even without a GPU.
- **Environment control**: Set `TORCHFX_NO_CUDA=1` to force CPU-only compilation if you want to skip CUDA entirely.

### Compiler Requirements

To compile the native extension, you need **GCC 9 or newer** (or an equivalent C++17-compatible compiler). The CPU extension compiles with `-O3 -ffast-math -march=native` and OpenMP parallelization for multi-channel workloads. On most Linux systems with a recent toolchain, this works out of the box.

## CUDA Parallel Scan for IIR Filters

IIR (Infinite Impulse Response) filters pose a fundamental challenge on GPUs: each output sample depends on previous outputs, creating a sequential dependency chain. The naive approach (one thread per channel, looping over samples) leaves 99% of the GPU idle.
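To make the dependency chain concrete, here is a small NumPy sketch, illustrative only and not the TorchFX kernel, showing how the first-order recurrence `y[n] = a*y[n-1] + x[n]` can be recast as a prefix scan over affine maps. For readability it uses the simpler Hillis-Steele scheme; the function names are mine, not the library's:

```python
import numpy as np

def iir1_loop(a, x):
    """Reference: the inherently sequential recurrence
    y[n] = a * y[n-1] + x[n], with zero initial state."""
    y = np.empty_like(x)
    acc = 0.0
    for n, xn in enumerate(x):
        acc = a * acc + xn
        y[n] = acc
    return y

def iir1_scan(a, x):
    """Same recurrence as an inclusive prefix scan over affine maps.

    Each step y -> a*y + x[n] is the pair (a, x[n]).  Composing two
    steps, (m2, b2) o (m1, b1) = (m2*m1, m2*b1 + b2), is associative,
    so the chain can be evaluated in O(log N) data-parallel passes.
    """
    m = np.full(len(x), a, dtype=float)   # multipliers
    b = np.asarray(x, dtype=float).copy() # offsets
    shift = 1
    while shift < len(x):
        new_m = m.copy()
        new_b = b.copy()
        # element n absorbs the composed map ending at n - shift
        new_m[shift:] = m[shift:] * m[:-shift]
        new_b[shift:] = m[shift:] * b[:-shift] + b[shift:]
        m, b = new_m, new_b
        shift *= 2
    return b  # applying the composed map to y[-1] = 0 leaves just b
```

Because the combine rule is associative, every pass of the `while` loop is one fully parallel vector operation; a Blelloch scan applies the same combine rule in a work-efficient up-sweep/down-sweep pattern.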
TorchFX 0.5.0 solves this with a **parallel prefix scan** (Blelloch scan) that decomposes the IIR recurrence into parallel-friendly operations:

- **O(N) total work** instead of the O(N log N) of the previous Hillis-Steele approach
- **24 KB shared memory per block**, down from 48 KB, allowing higher occupancy
- **128 channels batched per thread block** for the sequential biquad kernel, improving GPU utilization on short signals

The result: a 300-second, 12-channel IIR cascade completes in **550 ms on GPU**, compared to 5.4 seconds with SciPy and 1.1 seconds on CPU.

## FFT-Based FIR Convolution

FIR filters now default to **FFT convolution** via the overlap-save method, adapted from [Julius](https://github.com/adefossez/julius) (MIT License). For kernel sizes >= 64 taps, this is up to **10x faster** than direct convolution.

You can control the convolution mode per filter:

```python
from torchfx.filter import DesignableFIR

# FFT convolution (default, fast for large kernels)
fir = DesignableFIR(num_taps=512, cutoff=4000, fs=44100, conv_mode="fft")

# Direct convolution (better for very small kernels)
fir = DesignableFIR(num_taps=16, cutoff=4000, fs=44100, conv_mode="direct")

# Automatic selection based on kernel size
fir = DesignableFIR(num_taps=128, cutoff=4000, fs=44100, conv_mode="auto")
```

## LogFilterBank

A new `LogFilterBank` class provides logarithmically spaced frequency-band decomposition, useful for spectral analysis, multiband processing, and feature extraction:

```python
from torchfx.filter import LogFilterBank

bank = LogFilterBank(n_bands=32, f_low=20, f_high=20000, fs=44100)
bands = bank(wave.ys)  # [n_bands, channels, samples]
```

## Performance-Optimized Fallback Paths

Even without the native extension, 0.5.0 is dramatically faster than 0.4.0.
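To illustrate where that speedup comes from, here is a hedged sketch, not TorchFX's actual code, comparing a sample-by-sample Python biquad loop (the style of the old fallback) with a single vectorized `scipy.signal.lfilter` call that computes the same recurrence in compiled code. The coefficients are arbitrary stable values chosen for the example:

```python
import numpy as np
from scipy.signal import lfilter

def biquad_loop(b, a, x):
    """Direct form II transposed biquad, one Python iteration per sample."""
    y = np.empty_like(x)
    z1 = z2 = 0.0
    for n, xn in enumerate(x):
        yn = b[0] * xn + z1
        z1 = b[1] * xn - a[1] * yn + z2
        z2 = b[2] * xn - a[2] * yn
        y[n] = yn
    return y

# example coefficients, normalized so a[0] == 1
b, a = [0.2, 0.4, 0.2], [1.0, -0.3, 0.1]
x = np.random.default_rng(0).standard_normal(4096)

# the vectorized call produces identical samples without per-sample
# Python overhead
assert np.allclose(biquad_loop(b, a, x), lfilter(b, a, x))
```

Both paths yield the same output; the difference is that the interpreted loop pays Python dispatch cost on every sample, which is exactly the overhead the rewritten fallbacks eliminate.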
The pure-PyTorch fallback paths have been completely rewritten:

- **Stateful biquad and IIR SOS**: Replaced sample-by-sample Python loops with a vectorized zero-state/zero-input decomposition using `lfilter`. This gives a **100-500x speedup** when the C++ extension is unavailable.
- **Eager SOS computation**: The SOS matrix is now computed immediately after `compute_coefficients()` instead of lazily during `forward()`, avoiding repeated work.
- **Pre-computed constant tensors**: SOS convolution kernels and delta tensors are cached to avoid per-call allocation.
- **Eliminated redundant device transfers**: State-tensor `.to(device)` calls are now guarded to skip the copy when the tensor is already on the correct device.

## GPU Kernel Improvements

Beyond the parallel scan, several targeted GPU optimizations landed:

- **Removed synchronous CUDA calls** from native kernels, improving throughput by avoiding unnecessary CPU-GPU synchronization points.
- **Scalar coefficient passing**: Biquad coefficients (`b0`, `b1`, `b2`, `a1`, `a2`) are now passed as scalar arguments to CUDA kernels instead of being extracted from device tensors, eliminating a GPU-to-CPU synchronization that was causing a segfault on some configurations.

## Benchmark Infrastructure

The benchmark suite has been migrated from standalone scripts to a **pytest-benchmark** suite under `benchmarks/`:

```bash
# Run all benchmarks
uv run pytest --benchmark-enable

# Run only IIR benchmarks
uv run pytest benchmarks/test_iir_bench.py --benchmark-enable

# Run only FIR benchmarks
uv run pytest benchmarks/test_fir_bench.py --benchmark-enable
```

Each benchmark compares **five backends**: TorchFX GPU (CUDA), TorchFX CPU, SciPy, Numba `@njit` (CPU), and Numba `@cuda.jit` (GPU), across signal durations from 1 to 300 seconds and varying channel counts / filter orders.

## Bug Fixes

- Fixed the native extension being unreachable on CPU-only machines due to an overly strict `torch.cuda.is_available()` gate in `_ops.py`.
- Fixed a segfault in the CUDA biquad kernel caused by dereferencing a device pointer on the host.

## Benchmark Results (RTX 6000)

Here's a snapshot from our CI benchmarks on a Quadro RTX 6000 (24 GB):

| Backend | 300s / order-12 IIR | Relative |
|---------|--------------------:|----------|
| **TorchFX GPU** | **550 ms** | 1.0x |
| TorchFX CPU | 1,086 ms | 2.0x slower |
| Numba `@njit` CPU | TBD | -- |
| SciPy | 5,428 ms | 9.9x slower |
| Numba `@cuda.jit` | 12,957 ms | 23.6x slower |

The GPU kernel maintains sub-millisecond standard deviation across runs, making it suitable for latency-sensitive workloads.

## Installation

```bash
pip install torchfx
```

The native extension compiles automatically on first import. Ensure you have:

- **GCC >= 9** (or an equivalent C++17 compiler)
- **PyTorch >= 2.0** with a matching CUDA toolkit (for GPU kernels)
- `setuptools` (now a runtime dependency, required by `torch.utils.cpp_extension`)

For CPU-only builds:

```bash
TORCHFX_NO_CUDA=1 pip install torchfx
```

## What's Next

With the performance foundation in place, we're turning our attention to:

- Additional effects: compressor, phaser, pitch shift
- Batch processing optimizations for the CLI pipeline
- A v1.0.0 release candidate with API stability guarantees

> Benchmarks were run on a Quadro RTX 6000 (24 GB) with CUDA 12.1 and PyTorch 2.10.0. Performance may vary with hardware and software configuration; always benchmark on your target system.