Posts tagged cuda-graphs

TorchFX 0.6.0: FP32 on the GPU, CUDA Graphs, and a Hardened Realtime Path

TorchFX 0.6.0 is a performance and realtime release. The headline is the GPU follow-up promised back in 0.5.4: the CUDA kernels now run natively in float32 instead of silently upcasting to float64, which is 3.0–3.6× faster on consumer GPUs and finally lets the GPU beat its own CPU on multichannel workloads. On top of that, a new CUDA Graph path collapses the per-chunk launch overhead for streaming — up to 4× lower latency on short chunks — and the realtime engine moved its DSP off the audio callback into a dedicated worker thread.

Read more ...


CUDA Graphs for Streaming: One Launch Instead of a Launch Storm

For offline batch processing, GPU kernel-launch overhead disappears into the noise. For realtime streaming, it is the cost. TorchFX 0.6.0 adds torchfx.realtime.CudaGraphRunner, which captures a fixed-shape filter forward into a CUDA Graph and replays it per chunk — up to 4× lower per-chunk latency.

Read more ...