Posts tagged streaming
CUDA Graphs for Streaming: One Launch Instead of a Launch Storm
- 04 June 2026
For offline batch processing, GPU kernel-launch overhead disappears into the noise. For realtime streaming, it is the dominant cost. TorchFX 0.6.0 adds torchfx.realtime.CudaGraphRunner, which captures a fixed-shape filter forward pass into a CUDA Graph and replays it once per chunk, cutting per-chunk latency by up to 4×.
A Hardened Realtime Path: Worker Threads, Allocation-Free Streaming, and Dtype-Aware Dispatch
- 04 June 2026
The flashy 0.6.0 numbers come from FP32 on the GPU and CUDA Graphs. This post covers the quieter work that makes the streaming path actually dependable: the realtime architecture, the per-call allocations, the dispatch heuristic, and a handful of silent-correctness bugs.