---
blogpost: true
date: May 3, 2026
author: Matteo Spanio
category: releases
tags: release, build-system, packaging, wheels, cmake
---

# TorchFX 0.5.3: Build System Overhaul & Prebuilt Wheels

**TorchFX 0.5.3** is a packaging-focused release. The headline change is invisible if you only read the API docs but very visible the first time you `pip install` the library: TorchFX has migrated from runtime JIT compilation to **scikit-build-core + CMake**, and the project now ships **prebuilt CPU wheels** for Linux x86_64, macOS (Intel and Apple Silicon), and Windows x86_64 across Python 3.10–3.14.

No more 10–30 second freeze on first import. No more "do you have a C++ compiler?" support tickets. Just `pip install torchfx` and go.

## From JIT to scikit-build-core

In 0.5.0, the native extension was JIT-compiled via `torch.utils.cpp_extension.load()` the first time you imported TorchFX. That was a deliberate choice at the time --- it kept the build configuration trivial and let users get CUDA kernels without setting up a separate build pipeline. But it had three sharp edges:

1. **First-import latency.** Every fresh environment paid a 10–30 second compilation tax on first `import torchfx`. Cached afterward, but painful in containers, CI runs, and notebook kernels.
2. **Compiler requirement on the user's machine.** Every end user needed GCC 9+, the CUDA toolkit (for GPU users), and a working build environment. This was a non-trivial barrier for data scientists and audio engineers who just wanted to filter some audio.
3. **`setuptools` as a runtime dependency.** Required only because `torch.utils.cpp_extension` pulled it in --- a 1.5 MB dependency carried purely for JIT compilation.

0.5.3 replaces all of this with **scikit-build-core** as the build backend and **CMake** as the build system. The C++/CUDA extension is now compiled at install time, baked into the wheel, and loaded as a normal precompiled module.
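To see the first-import cost on your own machine, a minimal sketch like the one below can help. It is not part of TorchFX: the helper name `cold_import_seconds` is hypothetical, and it simply times how long a fresh Python interpreter takes to import a module by name.

```python
import subprocess
import sys
import time


def cold_import_seconds(module: str) -> float:
    """Time the import of `module` in a fresh interpreter (hypothetical helper).

    A new process is used so that modules already loaded in the current
    interpreter do not hide the real cost. The result includes interpreter
    startup time.
    """
    # The module name is passed as an argument rather than interpolated
    # into the code string.
    code = "import importlib, sys; importlib.import_module(sys.argv[1])"
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code, module], check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # "json" is a stand-in from the standard library; replace it with
    # "torchfx" in an environment where TorchFX is installed.
    print(f"cold import took {cold_import_seconds('json'):.2f}s")
```

Because the measurement includes interpreter startup, timing a trivial standard-library module first gives a baseline to compare against.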
```bash
# 0.5.2 and earlier:
pip install torchfx            # fast install
python -c "import torchfx"     # ~20s compile on first import

# 0.5.3:
pip install torchfx            # downloads prebuilt wheel (no compile needed)
python -c "import torchfx"     # instant
```

Build-time configuration still respects `TORCHFX_NO_CUDA=1` for CPU-only builds, and the CMake configuration is the same one introduced in 0.5.0 --- just invoked at the right point in the lifecycle.

## Prebuilt Wheels via GitHub Actions

The new [`.github/workflows/wheels.yml`](https://github.com/matteospanio/torchfx) pipeline drives [`cibuildwheel`](https://cibuildwheel.pypa.io/) across four runners --- `ubuntu-latest` (manylinux x86_64), `macos-13` (Intel), `macos-14` (Apple Silicon), and `windows-latest` --- and produces CPU wheels for Python 3.10–3.14 on every tagged release. PyTorch's own shared libraries are explicitly excluded from the repaired wheels (via `auditwheel`/`delocate`/`delvewheel`) so the wheel stays small and links against whichever PyTorch build the user has installed.

A second pipeline, [`wheels-cuda.yml`](https://github.com/matteospanio/torchfx), builds CUDA wheels for Linux x86_64 against CUDA 12.4 and 12.8 using NVIDIA's CUDA toolkit on top of the manylinux container, and publishes them to GitHub Pages as a PEP 503 simple-repository index:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install torchfx \
    --index-url https://matteospanio.github.io/torchfx/wheels/cu124/ \
    --extra-index-url https://pypi.org/simple
```

The CUDA wheels carry a `+cu124` (or `+cu128`) PEP 440 local-version suffix, so they cannot collide with the CPU wheels on PyPI: pip only resolves them when explicitly pointed at the CUDA index. macOS CUDA wheels are not feasible (NVIDIA dropped Mac support in 2019); Windows CUDA users build from source.

What this means in practice:

- **Fresh `pip install torchfx`** now pulls a prebuilt wheel matching your Python version. No compiler needed.
- **CI environments** (GitHub Actions, GitLab CI, Docker builds) are dramatically faster --- no more burning a minute on extension compilation in every job.
- **Reproducibility** improves: the wheel published to PyPI is the exact binary built and tested by the CI pipeline.

Source distributions remain available for users who need a custom build (CUDA, alternative architectures, debug builds).

## The Pure-PyTorch Fallback is Gone

In 0.5.0 and 0.5.1, the C++ extension was *optional* --- if compilation failed, TorchFX silently fell back to pure-PyTorch implementations of the IIR DF1 loop and the delay line. That fallback was 100–500x slower than the native path even with vectorization tricks, and it complicated every dispatch site with a "did the extension load?" branch.

With prebuilt wheels, the failure mode disappears: the extension is always present. So 0.5.3 **removes the pure-PyTorch fallback entirely**.

- `_biquad_df1_fallback` and the SOS DF1 fallback in `_ops.py` are gone.
- The `_load_extension` and `torch.utils.cpp_extension.load` machinery in `_ops.py` is gone.
- Every dispatch site assumes `torchfx_ext` is importable.

If you build TorchFX from source without the extension (e.g., a hostile build environment), the import will now fail loudly rather than silently degrading to a 500x-slower path. This is the right tradeoff: a clear error message beats a silent performance cliff.

## CPU Delay Kernel

Previously, `delay_line_forward` had a CUDA kernel and a Python fallback for CPU. With the fallback architecture going away, that gap had to be filled --- so 0.5.3 adds [`delay_cpu.cpp`](https://github.com/matteospanio/torchfx/blob/master/src/torchfx/_csrc/delay_cpu.cpp), a native C++ implementation of the delay line for CPU.

The dispatch in `_ops.py` is now symmetric with the IIR path: native C++ on CPU, native CUDA on GPU, no fallback in between. Multi-channel delay processing on CPU sees the expected speedup over the previous Python loop.
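To illustrate the kind of per-sample loop that a native delay kernel replaces, here is a minimal sketch of a textbook feedback delay in plain Python. It is not the code in `delay_cpu.cpp` or the removed fallback. The recurrence `y[n] = x[n] + feedback * y[n - delay]`, the function names, and the parameters are all assumptions chosen for illustration.

```python
def feedback_delay(x, delay, feedback=0.5):
    """Naive feedback delay line (illustrative, not TorchFX's implementation).

    Computes y[n] = x[n] + feedback * y[n - delay], treating samples
    before the start of the signal as zero.
    """
    if delay < 1:
        raise ValueError("delay must be at least 1 sample")
    y = [0.0] * len(x)
    for n, sample in enumerate(x):
        # Each output depends on an earlier output, so the samples of one
        # channel have to be produced in order.
        delayed = y[n - delay] if n >= delay else 0.0
        y[n] = sample + feedback * delayed
    return y


def feedback_delay_multichannel(channels, delay, feedback=0.5):
    """Apply the same delay to each channel independently."""
    return [feedback_delay(ch, delay, feedback) for ch in channels]


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(feedback_delay(impulse, delay=2, feedback=0.5))
    # prints [1.0, 0.0, 0.5, 0.0, 0.25, 0.0, 0.125]
```

The channels are independent of one another, but within a channel the recurrence is sequential, which is why an interpreted loop like this one is a poor fit for long audio buffers.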
## What Hasn't Changed

This release is deliberately a packaging release. The user-facing API, the deferred pipeline fusion from 0.5.2, the SOS coefficient caching, the CUDA parallel scan kernels --- all of it is unchanged. If your code worked on 0.5.2, it works on 0.5.3 with no modifications.

The only behavioural difference you might notice is the absence of the first-import compilation pause. Everything else is identical, just faster to get started.

## Installation

```bash
pip install torchfx==0.5.3
```

For CUDA support on Linux x86_64, install from the CUDA wheel index shown earlier. On other platforms, or for a different CUDA version, build from source against your target CUDA version:

```bash
pip install torchfx==0.5.3 --no-binary torchfx
```

Set `TORCHFX_NO_CUDA=1` to force a CPU-only build from source.

## What's Next

With the build system stabilized, upcoming work will return to the optimization roadmap: CUDA wheels in CI, expanded benchmark coverage, and the next round of pipeline-level fusion opportunities surfaced by the Phase 0 baseline.

As always, file issues and feature requests on [GitHub](https://github.com/matteospanio/torchfx).