Posts tagged performance
TorchFX 0.6.0: FP32 on the GPU, CUDA Graphs, and a Hardened Realtime Path
- 04 June 2026
TorchFX 0.6.0 is a performance and realtime release. The headline is the GPU follow-up promised back in 0.5.4: the CUDA kernels now run natively in float32 instead of silently upcasting to float64, which is 3.0–3.6× faster on consumer GPUs and finally lets the GPU beat the host CPU on multichannel workloads. On top of that, a new CUDA Graph path collapses the per-chunk launch overhead for streaming, giving up to 4× lower latency on short chunks, and the realtime engine now runs its DSP in a dedicated worker thread instead of on the audio callback.
FP32 on the GPU: 3–3.6× and the End of the Consumer-GPU Penalty
- 04 June 2026
This is the GPU half of the promise we made in 0.5.4: “retuning the CUDA SOS kernel for mixed precision so float32 gets the same fast path on GPU that it now has on CPU.” TorchFX 0.6.0 delivers it.
A Hardened Realtime Path: Worker Threads, Allocation-Free Streaming, and Dtype-Aware Dispatch
- 04 June 2026
The flashy 0.6.0 numbers are FP32 on the GPU and CUDA Graphs. This post covers the quieter work that makes the streaming path actually dependable: the realtime architecture, the per-call allocations, the dispatch heuristic, and a handful of silent-correctness bugs.
TorchFX 0.5.4: Native Filter Design & Goodbye scipy
- 26 May 2026
TorchFX 0.5.4 drops scipy as a runtime dependency. Every filter-design call that used to go through scipy.signal — Butterworth, Chebyshev I/II, Elliptic, Linkwitz-Riley, and DesignableFIR — is now performed by a native pure-PyTorch design module. The library is leaner, the dependency tree is shorter, and the design step itself is 14–50× faster than scipy on the parameter ranges we ship.
TorchFX 0.5.2: Transparent Filter Fusion & Unified Forward Paths
- 13 April 2026
TorchFX 0.5.2 focuses on two things: making filter chains faster without changing your code, and cleaning up internal duplication so the library is easier to maintain and extend.
TorchFX 0.5.0: Custom CUDA Kernels & Native C++ Extension
- 27 March 2026
I’m excited to announce TorchFX 0.5.0, a performance-focused release that introduces custom CUDA kernels, a JIT-compiled C++ native extension, and major algorithmic improvements across the entire filter pipeline.