--- blogpost: true date: Jun 04, 2026 author: Matteo Spanio category: features tags: cuda, fp32, performance, precision, kernels --- # FP32 on the GPU: 3–3.6× and the End of the Consumer-GPU Penalty This is the GPU half of the promise we made in [0.5.4](2026-05-26-release-054.md): *"retuning the CUDA SOS kernel for mixed precision so float32 gets the same fast path on GPU that it now has on CPU."* TorchFX 0.6.0 delivers it. ## The problem: a double-only kernel on a float32 world The CUDA biquad and SOS parallel-scan kernels were written entirely in `double`. The 3×3 state-transition matrices, the forcing function, every phase of the Blelloch scan — all `double`. The Python dispatch layer enforced it: any CUDA input was upcast to `float64` before the kernel ran. That is the wrong default for two reasons: 1. **Most audio is `float32`.** Realtime callbacks, ML feature pipelines, and game audio all carry `float32`. Forcing them through `float64` doubles the memory bandwidth for no accuracy gain at the filter orders TorchFX targets. 2. **Consumer GPUs hate FP64.** An RTX 3070 (and the A40) has a **1:32 FP32:FP64 throughput ratio** — FP64 runs at one-thirty-second of peak. The kernel was leaving most of the card on the floor. The symptom was stark: on a 60-second, 8-channel cascade the GPU was *slower than its own CPU*. ## The fix: template on `scalar_t`, dispatch on the input The kernels are now templated on `scalar_t` (the same pattern the CPU kernels already used). The `Mat3x3` state matrix, the forcing kernel, all three Blelloch phases, and the sequential fallback are instantiated for both `float` and `double`. The two host entry points dispatch on the input tensor's dtype via `AT_DISPATCH_FLOATING_TYPES`, and the block-aggregate scratch is allocated in the input dtype so the FP32 path stays FP32 end to end. The scalar coefficients arrive as `double` and are cast to `scalar_t` at launch. The dispatch rule is now simple and symmetric with the CPU: > **The native execution dtype follows the input.** `float32` in → FP32 kernels. `float64` in → FP64 kernels. Pass the dtype you want. No silent conversions in either direction. Half precision (`float16` / `bfloat16`) is *rejected* with a clear error — the IIR feedback recurrence is not numerically safe there. ## The numbers 8th-order Butterworth (4 SOS sections) @ 48 kHz, RTX 3070, median over 30 iterations (`benchmarks/bench_fp32_speedup.py`): | Workload | GPU FP64 | GPU FP32 | Speedup | |---|---:|---:|---:| | 30 s / 1 ch | 9.49 ms | 2.80 ms | **3.39×** | | 60 s / 1 ch | 18.31 ms | 6.00 ms | **3.05×** | | 60 s / 2 ch | 29.03 ms | 9.32 ms | **3.11×** | | 60 s / 4 ch | 49.22 ms | 14.41 ms | **3.42×** | | 60 s / 8 ch | 89.18 ms | **24.49 ms** | **3.64×** | A consistent **3.0–3.6×**. Note that this is *not* the theoretical 32× FP32:FP64 ratio — the parallel scan is partly bandwidth- and launch-overhead-bound, not pure-FLOP-bound, so 3–4× is the honest, measured win. (It would be larger again on a datacenter card with a 1:2 ratio, where FP64 was never the bottleneck.) ### The inversion, resolved The most satisfying row is 8 channels. Here is the full picture for that workload: | 60 s / 8 ch | Time | Verdict | |---|---:|---| | CPU FP64 | 34.0 ms | — | | CPU FP32 | 27.6 ms | — | | **GPU FP64** | **89.2 ms** | *loses to its own CPU* | | **GPU FP32** | **24.5 ms** | **beats the CPU** | In FP64 the consumer GPU genuinely lost to the OpenMP CPU kernel once the per-step working set widened across channels — an embarrassing inversion. FP32 erases it: the GPU is now the fastest backend everywhere on this card. ## Is FP32 safe? Lower precision is only a win if it's still correct. 0.6.0 ships `tests/test_fp32_precision.py`, which validates both paths against `scipy.signal.sosfilt`: - The **FP64 path** matches scipy to ~double precision. - The **FP32 path** matches the reference within a documented float32 bound (max-abs + RMS-relative), swept across Butterworth and Chebyshev I at orders 2/4/8/16, on both CPU and CUDA. The harness records the per-design error so the FP32-safe-vs-needs-FP64 boundary is tracked over time. For the well-conditioned audio designs TorchFX ships, FP32 tracks the FP64 reference to float32 precision with no surprises. If you have a pathological high-order, poles-near-the-unit-circle design, pass `float64` and you get the precise path — your choice, per call. ## Try it ```python import torch from torchfx import Wave from torchfx.filter import LoButterworth # float32 in -> FP32 GPU kernels (fast) wave = Wave(torch.randn(8, 48000 * 60, dtype=torch.float32), fs=48000).to("cuda") out = wave | LoButterworth(4000, order=8) # float64 in -> FP64 GPU kernels (precise), same code wave64 = Wave(torch.randn(8, 48000 * 60, dtype=torch.float64), fs=48000).to("cuda") out64 = wave64 | LoButterworth(4000, order=8) ``` ```bash python benchmarks/bench_fp32_speedup.py ``` Back to the [0.6.0 release notes](2026-06-04-release-060.md).