Posts tagged performance

TorchFX 0.7.0: A Dynamics Suite, a Real Reverb, and the Edge

09 June 2026

TorchFX 0.7.0 delivers exactly what 0.6.0’s “what’s next” promised (the single-pass GPU scan, a cache-blocked CPU SIMD path for edge devices, and a full dynamics toolkit) and then some. The headline is dynamics: a Compressor, an Expander/Gate, and a look-ahead brick-wall Limiter, each a native per-channel C++/CUDA kernel. Alongside them, the old toy Reverb is replaced by a proper Freeverb-style algorithmic reverb, many short signals can now be processed in a single batched launch, and the whole release ships itself: tagging, PyPI publishing, and these very release notes are now automated.

TorchFX 0.6.0: FP32 on the GPU, CUDA Graphs, and a Hardened Realtime Path

04 June 2026

TorchFX 0.6.0 is a performance and realtime release. The headline is the GPU follow-up promised back in 0.5.4: the CUDA kernels now run natively in float32 instead of silently upcasting to float64, which is 3.0–3.6× faster on consumer GPUs and finally lets the GPU beat its own CPU on multichannel workloads. On top of that, a new CUDA Graph path collapses the per-chunk launch overhead for streaming, giving up to 4× lower latency on short chunks, and the realtime engine now runs its DSP in a dedicated worker thread instead of on the audio callback.

FP32 on the GPU: 3–3.6× and the End of the Consumer-GPU Penalty

04 June 2026

This is the GPU half of the promise we made in 0.5.4: “retuning the CUDA SOS kernel for mixed precision so float32 gets the same fast path on GPU that it now has on CPU.” TorchFX 0.6.0 delivers it.

A Hardened Realtime Path: Worker Threads, Allocation-Free Streaming, and Dtype-Aware Dispatch

04 June 2026

The flashy 0.6.0 numbers are FP32 on the GPU and CUDA Graphs. This post covers the quieter work that makes the streaming path actually dependable: the realtime architecture, the per-call allocations, the dispatch heuristic, and a handful of silent-correctness bugs.

TorchFX 0.5.4: Native Filter Design & Goodbye scipy

26 May 2026

TorchFX 0.5.4 drops scipy as a runtime dependency. Every filter-design call that used to go through scipy.signal — Butterworth, Chebyshev I/II, Elliptic, Linkwitz-Riley, and DesignableFIR — is now performed by a native pure-PyTorch design module. The library is leaner, the dependency tree is shorter, and the design step itself is 14–50× faster than scipy on the parameter ranges we ship.

TorchFX 0.5.2: Transparent Filter Fusion & Unified Forward Paths

13 April 2026

TorchFX 0.5.2 focuses on two things: making filter chains faster without changing your code, and cleaning up internal duplication so the library is easier to maintain and extend.

TorchFX 0.5.0: Custom CUDA Kernels & Native C++ Extension

27 March 2026

I’m excited to announce TorchFX 0.5.0, a performance-focused release that introduces custom CUDA kernels, a JIT-compiled C++ native extension, and major algorithmic improvements across the entire filter pipeline.
