TorchFX 0.5.3: Build System Overhaul & Prebuilt Wheels

TorchFX 0.5.3 is a packaging-focused release. The headline change is invisible if you only read the API docs but very visible the first time you pip install the library: TorchFX has migrated from runtime JIT compilation to scikit-build-core + CMake, and the project now ships prebuilt CPU wheels for Linux x86_64, macOS (Intel and Apple Silicon), and Windows x86_64 across Python 3.10–3.14.

No more 10–30 second freeze on first import. No more “do you have a C++ compiler?” support tickets. Just pip install torchfx and go.

From JIT to scikit-build-core

From 0.5.0 through 0.5.2, the native extension was JIT-compiled via torch.utils.cpp_extension.load() the first time you imported TorchFX. That was a deliberate choice at the time: it kept the build configuration trivial and let users get CUDA kernels without setting up a separate build pipeline. But it had three sharp edges:

First-import latency. Every fresh environment paid a 10–30 second compilation tax on first import torchfx. Cached afterward, but painful in containers, CI runs, and notebook kernels.
Compiler requirement on the user’s machine. Every end user needed GCC 9+, the CUDA toolkit (for GPU users), and a working build environment. This was a non-trivial barrier for data scientists and audio engineers who just wanted to filter some audio.
setuptools as a runtime dependency. Required only because torch.utils.cpp_extension pulled it in — a 1.5 MB dependency carried purely for JIT compilation.

0.5.3 replaces all of this with scikit-build-core as the build backend and CMake as the build system. The C++/CUDA extension is now compiled at install time, baked into the wheel, and loaded as a normal precompiled module.

# 0.5.2 and earlier:
pip install torchfx        # fast install
import torchfx             # ~20s compile on first import

# 0.5.3:
pip install torchfx        # downloads prebuilt wheel (no compile needed)
import torchfx             # instant

Build-time configuration still respects TORCHFX_NO_CUDA=1 for CPU-only builds, and the CMake configuration is the same one introduced in 0.5.0 — just invoked at the right point in the lifecycle.
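To check the before/after difference on your own machine, a minimal sketch like the following times a single import. The helper name timed_import is hypothetical and not part of TorchFX. The number is only meaningful in a fresh interpreter, because an already-imported module returns from the cache almost instantly.

```python
import importlib
import time


def timed_import(module_name: str) -> float:
    """Import a module by name and return the wall-clock seconds it took.

    Run this in a fresh interpreter: if the module is already in
    sys.modules, the call only measures a cache lookup.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# Example with a stdlib module; pass "torchfx" in an environment
# where the library is installed.
elapsed = timed_import("json")
print(f"import took {elapsed:.3f}s")
```

With a JIT-compiled build, the first call in a clean environment would include the compile step; with a prebuilt wheel it should only reflect normal module loading.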

Prebuilt Wheels via GitHub Actions

The new .github/workflows/wheels.yml pipeline drives cibuildwheel across four runners — ubuntu-latest (manylinux x86_64), macos-13 (Intel), macos-14 (Apple Silicon), and windows-latest — and produces CPU wheels for Python 3.10–3.14 on every tagged release. PyTorch’s own shared libraries are explicitly excluded from the repaired wheels (via auditwheel/delocate/delvewheel) so the wheel stays small and links against whichever PyTorch build the user has installed.
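Whether pip can use a prebuilt CPU wheel depends on the platform and Python version listed above. The sketch below encodes that list as a lookup; the function name has_prebuilt_cpu_wheel is hypothetical, and the machine strings are the values platform.machine() typically reports on each system, which is an assumption rather than something the release guarantees. It does not distinguish glibc from musl on Linux, so a manylinux wheel may still not apply on Alpine-style systems.

```python
import platform
import sys

# (system, machine) pairs for the CPU wheels described in this post.
SUPPORTED_PLATFORMS = {
    ("Linux", "x86_64"),
    ("Darwin", "x86_64"),  # macOS on Intel
    ("Darwin", "arm64"),   # macOS on Apple Silicon
    ("Windows", "AMD64"),  # Windows x86_64
}
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 14)


def has_prebuilt_cpu_wheel(system: str, machine: str, py_version) -> bool:
    """Return True if the platform and Python version are in the listed matrix."""
    major_minor = tuple(py_version[:2])
    return (
        (system, machine) in SUPPORTED_PLATFORMS
        and MIN_PYTHON <= major_minor <= MAX_PYTHON
    )


# Check the interpreter that is currently running.
covered = has_prebuilt_cpu_wheel(
    platform.system(), platform.machine(), sys.version_info
)
print("prebuilt CPU wheel expected:", covered)
```

If the check returns False, pip would fall back to the source distribution, which needs a compiler and CMake.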

A second pipeline, wheels-cuda.yml, builds CUDA wheels for Linux x86_64 against CUDA 12.4 and 12.8 using NVIDIA’s CUDA toolkit on top of the manylinux container, and publishes them to GitHub Pages as a PEP 503 simple-repository index:

pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install torchfx \
    --index-url https://matteospanio.github.io/torchfx/wheels/cu124/ \
    --extra-index-url https://pypi.org/simple

The CUDA wheels carry a +cu124 (or +cu128) PEP 440 local-version suffix, so they cannot collide with the CPU wheels on PyPI: pip only resolves them when explicitly pointed at the CUDA index. macOS CUDA wheels are not feasible (NVIDIA dropped Mac support in 2019); Windows CUDA users build from source.
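The local-version suffix is just the part of the version string after the plus sign, so it is easy to inspect in scripts. The sketch below splits a version string and builds the index URL for a CUDA tag. Both helper names are hypothetical. The cu124 URL is the one shown above; the cu128 URL is assumed to follow the same pattern.

```python
CUDA_INDEX_BASE = "https://matteospanio.github.io/torchfx/wheels"
KNOWN_CUDA_TAGS = ("cu124", "cu128")


def split_local_version(version: str):
    """Split 'X.Y.Z+local' into (public, local); local is None if absent."""
    public, _, local = version.partition("+")
    return public, (local or None)


def cuda_index_url(cuda_tag: str) -> str:
    """Return the wheel index URL for a CUDA tag such as 'cu124'.

    Only the cu124 URL appears in this post; cu128 is assumed to
    live at the analogous path.
    """
    if cuda_tag not in KNOWN_CUDA_TAGS:
        raise ValueError(f"no CUDA wheels described for tag {cuda_tag!r}")
    return f"{CUDA_INDEX_BASE}/{cuda_tag}/"


print(split_local_version("0.5.3+cu124"))
print(cuda_index_url("cu124"))
```

A string returned by importlib.metadata.version("torchfx") could be passed to split_local_version to tell a CPU build from a CUDA build at runtime.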

What this means in practice:

A fresh pip install torchfx now pulls a prebuilt wheel matching your Python version and platform. No compiler needed.
CI environments (GitHub Actions, GitLab CI, Docker builds) are dramatically faster — no more burning a minute on extension compilation in every job.
Reproducibility improves: the wheel published to PyPI is the exact binary built and tested by the CI pipeline.

Source distributions remain available for users who need a custom build (CUDA, alternative architectures, debug builds).

The Pure-PyTorch Fallback is Gone

In 0.5.0 and 0.5.1, the C++ extension was optional — if compilation failed, TorchFX silently fell back to pure-PyTorch implementations of the IIR DF1 loop and the delay line. That fallback was 100–500x slower than the native path even with vectorization tricks, and it complicated every dispatch site with a “did the extension load?” branch.
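To see why an interpreted fallback is so slow, it helps to look at what a Direct Form I biquad computes. The sketch below is the textbook recurrence in plain Python, not TorchFX's removed _biquad_df1_fallback. The function name and argument order are chosen for illustration, and coefficients are assumed to be normalized so that a0 = 1.

```python
def biquad_df1(x, b0, b1, b2, a1, a2):
    """Filter a sequence with a Direct Form I biquad (a0 assumed to be 1).

    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    """
    x1 = x2 = 0.0  # previous two inputs
    y1 = y2 = 0.0  # previous two outputs
    y = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        y.append(yn)
        # Shift the delay registers by one sample.
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y


# Impulse response of a one-pole filter written as a biquad (a1 = -0.5).
print(biquad_df1([1.0, 0.0, 0.0, 0.0], 1.0, 0.0, 0.0, -0.5, 0.0))
```

Each output sample depends on the two previous outputs, so the loop cannot be replaced by a single elementwise tensor operation. That sequential dependency is what makes a compiled kernel pay off.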

With prebuilt wheels, the failure mode disappears: the extension is always present. So 0.5.3 removes the pure-PyTorch fallback entirely.

_biquad_df1_fallback and the SOS DF1 fallback in _ops.py are gone.
The _load_extension and torch.utils.cpp_extension.load machinery in _ops.py is gone.
Every dispatch site assumes torchfx_ext is importable.

If you build TorchFX from source without the extension (e.g., a hostile build environment), the import will now fail loudly rather than silently degrading to a 500x-slower path. This is the right tradeoff: a clear error message beats a silent performance cliff.
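If you want a deployment script to surface that failure early, a small preflight check is one option. The sketch below assumes the extension can be located under the top-level name torchfx_ext, which is a guess based on the name used above; the real import path may be nested inside the package. Both helper names are hypothetical.

```python
import importlib.util


def native_extension_available(module_name: str = "torchfx_ext") -> bool:
    """Return True if the named module can be located without importing it.

    The default name is an assumption; adjust it to the real import path.
    """
    return importlib.util.find_spec(module_name) is not None


def require_native_extension(module_name: str = "torchfx_ext") -> None:
    """Raise ImportError with a readable message if the module is missing."""
    if not native_extension_available(module_name):
        raise ImportError(
            f"{module_name} was not found. Install a prebuilt wheel or "
            "rebuild from source with a working compiler toolchain."
        )


print("extension found:", native_extension_available())
```

This only checks that the module can be found. A broken binary (for example one built against a different PyTorch ABI) would still fail at actual import time.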

CPU Delay Kernel

Previously, delay_line_forward had a CUDA kernel and a Python fallback for CPU. With the fallback architecture going away, that gap had to be filled — so 0.5.3 adds delay_cpu.cpp, a native C++ implementation of the delay line for CPU.

The dispatch in _ops.py is now symmetric with the IIR path: native C++ on CPU, native CUDA on GPU, no fallback in between. Multi-channel delay processing on CPU sees the expected speedup over the previous Python loop.
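For readers unfamiliar with the operation, the sketch below shows a plain integer-sample delay applied to each channel, written as the kind of Python loop a native kernel replaces. It is an illustration only: the function name is hypothetical, and the actual delay_line_forward may compute something richer (feedback, wet/dry mix, or other parameters) that this post does not specify.

```python
def delay_channels(channels, delay_samples):
    """Delay each channel by a whole number of samples, padding with zeros.

    Output length equals input length, so samples shifted past the end
    are dropped.
    """
    if delay_samples < 0:
        raise ValueError("delay_samples must be non-negative")
    out = []
    for channel in channels:
        n = len(channel)
        d = min(delay_samples, n)  # a delay longer than the signal yields silence
        out.append([0.0] * d + list(channel[: n - d]))
    return out


# Two channels, delayed by two samples.
print(delay_channels([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]], 2))
```

The per-channel outer loop is independent work, which is why a compiled implementation can process channels far more efficiently than interpreted Python.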

What Hasn’t Changed

This release is deliberately a packaging release. The user-facing API, the deferred pipeline fusion from 0.5.2, the SOS coefficient caching, the CUDA parallel scan kernels — all of it is unchanged. If your code worked on 0.5.2, it works on 0.5.3 with no modifications.

The only behavioural difference you might notice is the absence of the first-import compilation pause. Everything else is identical, just faster to get started.

Installation

pip install torchfx==0.5.3

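For CUDA support on Linux x86_64, install from the CUDA wheel index described above. On other platforms, or for a different CUDA version, build from source against your target CUDA version: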

pip install torchfx==0.5.3 --no-binary torchfx

Set TORCHFX_NO_CUDA=1 to force a CPU-only build from source.
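When the source build is driven from a script (for example in a Dockerfile helper or a CI step), the command and environment can be assembled explicitly. The sketch below only builds the argument list and environment mapping; the function name source_build_command is hypothetical, while the pip flags and the TORCHFX_NO_CUDA variable are the ones shown above.

```python
import os
import sys


def source_build_command(version="0.5.3", cpu_only=False, base_env=None):
    """Return (cmd, env) for a from-source install of torchfx.

    cmd is an argument list for the current interpreter's pip;
    env is a copy of the environment, with TORCHFX_NO_CUDA=1 added
    when cpu_only is True.
    """
    env = dict(os.environ if base_env is None else base_env)
    if cpu_only:
        env["TORCHFX_NO_CUDA"] = "1"
    cmd = [
        sys.executable, "-m", "pip", "install",
        f"torchfx=={version}",
        "--no-binary", "torchfx",
    ]
    return cmd, env


cmd, env = source_build_command(cpu_only=True)
print(" ".join(cmd[1:]))
```

To actually run the build, pass the pair to subprocess.run(cmd, env=env, check=True). A compiler and CMake need to be available for that step.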

What’s Next

With the build system stabilized, upcoming work will return to the optimization roadmap: CUDA wheels in CI, expanded benchmark coverage, and the next round of pipeline-level fusion opportunities surfaced by the Phase 0 baseline.

As always, file issues and feature requests on GitHub.


03 May 2026
