Performance Optimization and Benchmarking#
Learn how to measure and optimize the performance of your TorchFX audio processing pipelines. This guide covers benchmarking methodology, GPU vs CPU performance comparisons, filter type trade-offs, and best practices for building high-throughput audio processing systems.
Prerequisites#
Before starting this guide, you should be familiar with:
Wave - Digital Audio Representation - Wave class fundamentals
Pipeline Operator - Functional Composition - Pipeline operator basics
GPU Acceleration - GPU device management
IIR Filters - IIR filter characteristics
FIR Filters - FIR filter characteristics
Overview#
Performance optimization in TorchFX involves understanding three key trade-offs:
| Dimension | Trade-off | Optimization Strategy |
|---|---|---|
| Execution Backend | GPU vs CPU vs SciPy | Choose based on duration, channels, batch size |
| Filter Type | FIR vs IIR | Balance computational cost vs phase response |
| API Pattern | Classes vs Sequential vs Pipe | Select based on ergonomics vs performance needs |
graph TB
subgraph "Performance Optimization Dimensions"
AudioTask["Audio Processing Task"]
subgraph Backend["Backend Selection"]
GPU["GPU (CUDA)<br/>High parallelism"]
CPU["CPU (PyTorch)<br/>Moderate performance"]
SciPy["SciPy<br/>Baseline reference"]
end
subgraph FilterType["Filter Type"]
FIR["FIR Filters<br/>Linear phase<br/>Higher compute"]
IIR["IIR Filters<br/>Non-linear phase<br/>Lower compute"]
end
subgraph API["API Pattern"]
Pipe["Pipeline Operator<br/>Auto sample rate"]
Sequential["nn.Sequential<br/>Standard PyTorch"]
Custom["Custom Module<br/>Maximum control"]
end
AudioTask --> Backend
AudioTask --> FilterType
AudioTask --> API
Backend --> Optimize["Optimized Pipeline"]
FilterType --> Optimize
API --> Optimize
end
style AudioTask fill:#e1f5ff
style Optimize fill:#e1ffe1
style GPU fill:#fff5e1
style FIR fill:#f5e1ff
style Pipe fill:#ffe1e1
Performance Optimization Framework - Three key dimensions for optimizing TorchFX pipelines.
See also
For detailed GPU acceleration patterns, see GPU Acceleration. For comprehensive benchmarking infrastructure, see the benchmark suite in the benchmark/ directory.
Benchmark Methodology#
TorchFX includes a comprehensive benchmarking suite that evaluates performance across different dimensions. The suite consists of three benchmark scripts, each targeting a specific aspect of performance.
Benchmark Suite Architecture#
graph TB
subgraph "TorchFX Benchmark Suite"
API["api_bench.py<br/>API Pattern Comparison"]
FIR["fir_bench.py<br/>FIR Filter Performance"]
IIR["iir_bench.py<br/>IIR Filter Performance"]
end
subgraph "Common Test Parameters"
SR["Sample Rate: 44.1 kHz"]
REP["Repetitions: 50"]
Signal["Signal: Random noise<br/>Float32, normalized"]
end
subgraph "Variable Parameters"
Duration["Duration: 1s - 10 min"]
Channels["Channels: 1, 2, 4, 8, 12"]
Backends["Backends: GPU, CPU, SciPy"]
end
subgraph "Output"
CSV["CSV Results<br/>.out files"]
end
API --> SR
FIR --> SR
IIR --> SR
API --> REP
FIR --> REP
IIR --> REP
FIR --> Duration
FIR --> Channels
IIR --> Duration
IIR --> Channels
API --> Backends
FIR --> Backends
IIR --> Backends
API --> CSV
FIR --> CSV
IIR --> CSV
style API fill:#e1f5ff
style FIR fill:#e8f5e1
style IIR fill:#fff5e1
Benchmark Suite Organization - Three complementary benchmarks measuring different performance aspects.
Test Signal Generation#
All benchmarks use consistent signal generation to ensure comparable results:
import numpy as np
def create_audio(sample_rate, duration, num_channels):
"""Generate multi-channel random noise for benchmarking.
Parameters
----------
sample_rate : int
Sample rate in Hz (typically 44100)
duration : float
Duration in seconds
num_channels : int
Number of audio channels
Returns
-------
signal : np.ndarray
Shape (num_channels, num_samples), float32, normalized to [-1, 1]
"""
signal = np.random.randn(num_channels, int(sample_rate * duration))
signal = signal.astype(np.float32)
# Normalize each channel independently
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
return signal
Key Characteristics:
Distribution: Gaussian random noise (`np.random.randn`)
Data Type: Float32 for GPU compatibility
Normalization: Per-channel normalization to the [-1, 1] range
Deterministic: The same random seed produces identical signals (when seeded)
Tip
Using random noise ensures benchmarks test worst-case performance without special structure or patterns that could be optimized by the hardware.
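For reproducible runs, seed NumPy's random generator before calling the `create_audio` helper above. A minimal sketch:

```python
import numpy as np

# Seed NumPy's global RNG so repeated runs benchmark the identical signal
np.random.seed(42)
signal_a = create_audio(44100, 5.0, 2)

np.random.seed(42)
signal_b = create_audio(44100, 5.0, 2)

assert np.array_equal(signal_a, signal_b)  # same seed, same signal
```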
Timing Methodology#
All benchmarks use Python’s timeit.timeit() for accurate timing measurements:
import timeit
# Standard pattern across all benchmarks
REP = 50 # Number of repetitions
# Time the operation
elapsed = timeit.timeit(
lambda: process_audio(wave, filter_chain),
number=REP
)
# Report average time per iteration
avg_time = elapsed / REP
print(f"Average time: {avg_time:.6f}s")
Timing Best Practices:
Multiple Repetitions: 50 repetitions minimize variance and warm-up effects
Lambda Wrapper: The closure captures pre-built objects, so setup is not re-run on each iteration
Average Reporting: Reports per-iteration time for easy comparison
Excluded Overhead: Setup (loading, coefficient computation) is excluded from timing
Important
Timing only measures the core processing operation. Setup costs like loading audio files, computing filter coefficients, and device transfers are performed before timing begins.
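One caveat when timing GPU code: CUDA kernels launch asynchronously, so a plain `timeit` call can return before the queued work has finished. A sketch of a synchronized timing wrapper (the `torch.cuda.synchronize()` call is standard PyTorch, not part of the benchmark scripts):

```python
import timeit
import torch

def timed_gpu(fn, rep=50):
    """Time a GPU operation, forcing completion before the clock stops."""
    def run():
        fn()
        torch.cuda.synchronize()  # wait for all queued CUDA kernels to finish
    torch.cuda.synchronize()      # drain any pending work before timing starts
    return timeit.timeit(run, number=rep) / rep
```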
API Performance Comparison#
The api_bench.py benchmark compares different API patterns for applying the same filter chain to audio. This helps users understand the performance and ergonomic trade-offs of different coding styles.
Test Configuration#
Filter Chain: 6 cascaded IIR filters
3 × High-pass Chebyshev Type I filters (20 Hz, 60 Hz, 65 Hz)
3 × Low-pass Butterworth filters (5000 Hz, 4900 Hz, 4850 Hz)
Test Signal: 8-channel audio, 2 minutes duration, 44.1 kHz sample rate
graph LR
Input["Input Audio<br/>8 channels, 2 min"] --> F1["HiChebyshev1<br/>20 Hz"]
F1 --> F2["HiChebyshev1<br/>60 Hz"]
F2 --> F3["HiChebyshev1<br/>65 Hz"]
F3 --> F4["LoButterworth<br/>5000 Hz"]
F4 --> F5["LoButterworth<br/>4900 Hz"]
F5 --> F6["LoButterworth<br/>4850 Hz"]
F6 --> Output["Filtered Output<br/>8 channels, 2 min"]
style Input fill:#e1f5ff
style Output fill:#e1ffe1
API Benchmark Filter Chain - Six cascaded IIR filters applied to 8-channel audio.
API Pattern 1: Custom nn.Module Class#
A traditional PyTorch approach using a custom torch.nn.Module:
from torch import nn
from torchfx.filter import HiChebyshev1, LoButterworth
class FilterChain(nn.Module):
"""Custom module for filter chain."""
def __init__(self, fs):
super().__init__()
self.f1 = HiChebyshev1(20, fs=fs)
self.f2 = HiChebyshev1(60, fs=fs)
self.f3 = HiChebyshev1(65, fs=fs)
self.f4 = LoButterworth(5000, fs=fs)
self.f5 = LoButterworth(4900, fs=fs)
self.f6 = LoButterworth(4850, fs=fs)
def forward(self, x):
x = self.f1(x)
x = self.f2(x)
x = self.f3(x)
x = self.f4(x)
x = self.f5(x)
x = self.f6(x)
return x
# Usage
fchain = FilterChain(signal.fs)
result = fchain(signal.ys)
Characteristics:
Sample Rate: Must be passed explicitly to `__init__`
Reusability: Can be instantiated once and reused
Flexibility: Full control over forward pass logic
Overhead: Standard PyTorch `nn.Module` call overhead
API Pattern 2: nn.Sequential#
Using PyTorch’s built-in torch.nn.Sequential container:
from torch.nn import Sequential
from torchfx.filter import HiChebyshev1, LoButterworth
# Create filter chain inline
fchain = Sequential(
HiChebyshev1(20, fs=signal.fs),
HiChebyshev1(60, fs=signal.fs),
HiChebyshev1(65, fs=signal.fs),
LoButterworth(5000, fs=signal.fs),
LoButterworth(4900, fs=signal.fs),
LoButterworth(4850, fs=signal.fs),
)
# Apply to audio tensor
result = fchain(signal.ys)
Characteristics:
Sample Rate: Must be passed explicitly to each filter
Simplicity: No custom class needed
Performance: Identical to a custom `nn.Module` (same underlying mechanism)
Flexibility: Can add/remove filters easily
Note
nn.Sequential and custom nn.Module classes have identical performance characteristics. The choice between them is purely about code organization and readability.
API Pattern 3: Pipeline Operator (Pipe)#
TorchFX’s idiomatic pipeline operator pattern:
from torchfx import Wave
from torchfx.filter import HiChebyshev1, LoButterworth
# Apply filters using pipe operator
result = (
signal
| HiChebyshev1(20)
| HiChebyshev1(60)
| HiChebyshev1(65)
| LoButterworth(5000)
| LoButterworth(4900)
| LoButterworth(4850)
)
Characteristics:
Sample Rate: Automatically configured from the `Wave` object
Ergonomics: Most readable and concise
Safety: Eliminates sample rate mismatch errors
Performance: Minimal overhead for automatic configuration
Tip
The pipeline operator is the recommended pattern for TorchFX. It provides the best balance of readability, safety (automatic sample rate configuration), and performance.
API Pattern 4: SciPy Baseline#
Pure NumPy/SciPy implementation for baseline comparison:
from scipy.signal import butter, cheby1, lfilter
# Pre-compute filter coefficients
b1, a1 = cheby1(2, 0.5, 20, btype='high', fs=SAMPLE_RATE)
b2, a2 = cheby1(2, 0.5, 60, btype='high', fs=SAMPLE_RATE)
b3, a3 = cheby1(2, 0.5, 65, btype='high', fs=SAMPLE_RATE)
b4, a4 = butter(2, 5000, btype='low', fs=SAMPLE_RATE)
b5, a5 = butter(2, 4900, btype='low', fs=SAMPLE_RATE)
b6, a6 = butter(2, 4850, btype='low', fs=SAMPLE_RATE)
# Apply filters sequentially
filtered = lfilter(b1, a1, signal)
filtered = lfilter(b2, a2, filtered)
filtered = lfilter(b3, a3, filtered)
filtered = lfilter(b4, a4, filtered)
filtered = lfilter(b5, a5, filtered)
filtered = lfilter(b6, a6, filtered)
Characteristics:
Performance: Optimized C implementation
GPU: No GPU acceleration available
Integration: Requires NumPy arrays (no PyTorch tensors)
Baseline: Reference for CPU performance comparison
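To make a fair comparison against the SciPy baseline, convert explicitly between PyTorch tensors and NumPy arrays. A minimal sketch, assuming `wave.ys` holds the sample tensor as used elsewhere in this guide:

```python
import numpy as np
import torch
from scipy.signal import butter, lfilter

SAMPLE_RATE = 44100
b, a = butter(2, 5000, btype='low', fs=SAMPLE_RATE)

# PyTorch tensor -> NumPy array (must be on CPU first)
signal_np = wave.ys.cpu().numpy()

# Filter along the sample axis for (channels, samples) data
filtered_np = lfilter(b, a, signal_np, axis=-1)

# NumPy array -> PyTorch tensor (lfilter promotes to float64, so cast back)
filtered = torch.from_numpy(filtered_np.astype(np.float32))
```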
API Performance Summary#
| API Pattern | Sample Rate Config | Ergonomics | Performance | Use Case |
|---|---|---|---|---|
| Custom `nn.Module` | Manual (`fs` in `__init__`) | Good | Fast | Complex custom logic |
| `nn.Sequential` | Manual (`fs` per filter) | Very Good | Fast | Standard PyTorch integration |
| Pipeline Operator | Automatic | Excellent | Fast | Recommended for TorchFX |
| SciPy | Manual (`fs` in design functions) | Fair | Fast (CPU only) | Baseline comparison |
Key Insight: The pipeline operator provides automatic sample rate configuration with negligible performance overhead, making it the most ergonomic choice without sacrificing speed.
See also
Pipeline Operator - Functional Composition - Detailed documentation on the pipeline operator pattern
FIR Filter Performance#
The fir_bench.py benchmark evaluates FIR filter performance across varying signal durations, channel counts, and execution backends (GPU, CPU, SciPy).
FIR Test Configuration#
Filter Chain: 5 cascaded FIR filters with varying tap counts
DesignableFIR: 101 taps, 1000 Hz cutoff
DesignableFIR: 102 taps, 5000 Hz cutoff
DesignableFIR: 103 taps, 1500 Hz cutoff
DesignableFIR: 104 taps, 1800 Hz cutoff
DesignableFIR: 105 taps, 1850 Hz cutoff
Test Parameters:
Durations: 5s, 60s, 180s, 300s, 600s (10 minutes)
Channels: 1, 2, 4, 8, 12
Backends: GPU (CUDA), CPU (PyTorch), SciPy (NumPy)
Repetitions: 50 per configuration
graph TB
subgraph "FIR Benchmark Test Matrix"
Input["Input Signal<br/>Variable duration & channels"]
subgraph Filters["FIR Filter Chain"]
F1["FIR 101 taps<br/>1000 Hz"]
F2["FIR 102 taps<br/>5000 Hz"]
F3["FIR 103 taps<br/>1500 Hz"]
F4["FIR 104 taps<br/>1800 Hz"]
F5["FIR 105 taps<br/>1850 Hz"]
end
subgraph Durations["Duration Sweep"]
D1["5 seconds"]
D2["60 seconds"]
D3["180 seconds"]
D4["300 seconds"]
D5["600 seconds"]
end
subgraph Channels["Channel Sweep"]
C1["1 channel"]
C2["2 channels"]
C3["4 channels"]
C4["8 channels"]
C5["12 channels"]
end
Input --> F1
F1 --> F2
F2 --> F3
F3 --> F4
F4 --> F5
Input --> Durations
Input --> Channels
end
style Input fill:#e1f5ff
style Filters fill:#fff5e1
FIR Benchmark Configuration - Tests performance across duration and channel count dimensions.
FIR Coefficient Pre-Computation#
FIR filters require coefficient computation before filtering. The benchmark explicitly pre-computes coefficients to exclude design time from performance measurements:
import torch.nn as nn
from torchfx import Wave
from torchfx.filter import DesignableFIR
SAMPLE_RATE = 44100
# Create filter chain
fchain = nn.Sequential(
DesignableFIR(num_taps=101, cutoff=1000, fs=SAMPLE_RATE),
DesignableFIR(num_taps=102, cutoff=5000, fs=SAMPLE_RATE),
DesignableFIR(num_taps=103, cutoff=1500, fs=SAMPLE_RATE),
DesignableFIR(num_taps=104, cutoff=1800, fs=SAMPLE_RATE),
DesignableFIR(num_taps=105, cutoff=1850, fs=SAMPLE_RATE),
)
# Pre-compute coefficients before timing
for f in fchain:
f.compute_coefficients()
# Now ready for benchmarking (coefficient design excluded)
Why Pre-Compute?
Separation of Concerns: Design time vs filtering time are separate
Realistic Use Case: Coefficients are typically computed once and reused
Fair Comparison: SciPy baseline also pre-computes coefficients
Important
In production code, call compute_coefficients() once during initialization, then reuse the filter for processing multiple audio files.
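A minimal sketch of that pattern (the file names are placeholders):

```python
from torchfx import Wave
from torchfx.filter import DesignableFIR

# Initialization: design coefficients once
fir = DesignableFIR(num_taps=101, cutoff=1000, fs=44100)
fir.compute_coefficients()

# Processing: reuse the designed filter across files
for path in ["take1.wav", "take2.wav", "take3.wav"]:
    wave = Wave.from_file(path)
    result = wave | fir  # no per-file coefficient redesign
```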
FIR Device Transfer Pattern#
The benchmark demonstrates proper device management for GPU acceleration:
import timeit
# GPU benchmarking
wave.to("cuda")
fchain.to("cuda")
gpu_time = timeit.timeit(lambda: wave | fchain, number=REP)
# CPU benchmarking (transfer back)
wave.to("cpu")
fchain.to("cpu")
cpu_time = timeit.timeit(lambda: wave | fchain, number=REP)
# Calculate speedup
speedup = cpu_time / gpu_time
print(f"GPU speedup: {speedup:.2f}x")
Device Transfer Rules:
Move both the `Wave` and the filter chain to the same device
Pre-compute coefficients before moving to GPU
Time only the filtering operation (exclude transfers)
Move back to CPU for saving results
See also
GPU Acceleration - Comprehensive guide to GPU device management
FIR Performance Characteristics#
FIR filters have distinct performance characteristics compared to IIR filters:
| Characteristic | FIR Filters | Reason |
|---|---|---|
| Computational Cost | Higher | Convolution with many taps |
| GPU Advantage | Excellent | High parallelism in convolution |
| Memory Footprint | Larger | Must store all tap coefficients |
| Scaling with Taps | Linear O(N) | More taps = more multiply-accumulate operations |
| Scaling with Channels | Excellent | Independent per-channel convolution |
When FIR Filters Excel:
Long signals (>60s): Amortizes setup overhead
Many channels (≥4): Parallel convolution across channels
GPU available: Convolution is highly parallel
Linear phase required: Only FIR can provide linear phase
When to Avoid FIR:
Real-time processing: IIR filters have lower latency
Limited memory: FIR coefficients consume more memory
CPU-only, short signals: IIR may be faster
Tip
For steep frequency responses, FIR filters require many taps (100+). Consider IIR filters if phase linearity is not critical.
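To estimate how many taps a given specification needs, SciPy's Kaiser-window design rule is a useful back-of-the-envelope tool (a sketch, not part of TorchFX itself):

```python
from scipy.signal import kaiserord

fs = 44100
transition_width = 1000   # Hz, desired transition band width
ripple_db = 60            # stopband attenuation in dB

# kaiserord takes the transition width as a fraction of the Nyquist frequency
# and returns (number of taps, Kaiser window beta)
num_taps, beta = kaiserord(ripple_db, transition_width / (0.5 * fs))
print(f"~{num_taps} taps needed")  # narrower transitions need many more taps
```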
SciPy FIR Baseline Implementation#
The benchmark includes a SciPy baseline for CPU performance comparison:
from scipy.signal import firwin, lfilter
# Design FIR coefficients using scipy
b1 = firwin(101, 1000, fs=SAMPLE_RATE)
b2 = firwin(102, 5000, fs=SAMPLE_RATE)
b3 = firwin(103, 1500, fs=SAMPLE_RATE)
b4 = firwin(104, 1800, fs=SAMPLE_RATE)
b5 = firwin(105, 1850, fs=SAMPLE_RATE)
# Apply filters sequentially
a = [1] # FIR filters have a = [1]
filtered = lfilter(b1, a, signal)
filtered = lfilter(b2, a, filtered)
filtered = lfilter(b3, a, filtered)
filtered = lfilter(b4, a, filtered)
filtered = lfilter(b5, a, filtered)
SciPy Characteristics:
CPU-only: No GPU acceleration
Optimized: Uses NumPy’s optimized C/Fortran backend
Reference: Establishes baseline CPU performance
Compatibility: Requires NumPy arrays (not PyTorch tensors)
IIR Filter Performance#
The iir_bench.py benchmark evaluates IIR (Infinite Impulse Response) filter performance using the same test matrix as FIR benchmarks but with recursive filters.
IIR Test Configuration#
Filter Chain: 4 cascaded IIR filters
HiButterworth: 1000 Hz cutoff, order 2
LoButterworth: 5000 Hz cutoff, order 2
HiChebyshev1: 1500 Hz cutoff, order 2
LoChebyshev1: 1800 Hz cutoff, order 2
Test Parameters:
Durations: 1s, 5s, 180s, 300s, 600s (10 minutes)
Channels: 1, 2, 4, 8, 12
Backends: GPU (CUDA), CPU (PyTorch), SciPy (NumPy)
Repetitions: 50 per configuration
graph TB
subgraph "IIR Benchmark Architecture"
Input["Input Signal<br/>Variable duration & channels"]
subgraph Chain["IIR Filter Chain"]
F1["HiButterworth<br/>1000 Hz, order 2"]
F2["LoButterworth<br/>5000 Hz, order 2"]
F3["HiChebyshev1<br/>1500 Hz, order 2"]
F4["LoChebyshev1<br/>1800 Hz, order 2"]
end
subgraph Setup["IIR-Specific Setup"]
Compute["compute_coefficients()<br/>Design b, a coefficients"]
MoveCoeff["move_coeff('cuda'/'cpu')<br/>Transfer to device"]
end
subgraph Backends["Execution Backends"]
GPU["GPU: fchain(wave.ys)"]
CPU["CPU: fchain(wave.ys)"]
SciPy["SciPy: lfilter(b, a, signal)"]
end
Input --> F1
F1 --> F2
F2 --> F3
F3 --> F4
F1 --> Compute
Compute --> MoveCoeff
MoveCoeff --> GPU
MoveCoeff --> CPU
F1 --> SciPy
end
style Input fill:#e1f5ff
style Chain fill:#fff5e1
style Setup fill:#e8f5e1
IIR Benchmark Structure - Shows coefficient management and execution backends.
IIR Coefficient Management#
Unlike FIR filters, IIR filters have both numerator (b) and denominator (a) coefficients that must be explicitly managed:
import torch.nn as nn
from torchfx import Wave
from torchfx.filter import HiButterworth, LoButterworth
SAMPLE_RATE = 44100
# Create IIR filter chain
fchain = nn.Sequential(
HiButterworth(cutoff=1000, order=2, fs=SAMPLE_RATE),
LoButterworth(cutoff=5000, order=2, fs=SAMPLE_RATE),
)
# Move wave and module to GPU
wave.to("cuda")
fchain.to("cuda")
# IIR-specific: compute and move coefficients
for f in fchain:
f.compute_coefficients() # Design b, a coefficients
f.move_coeff("cuda") # Move coefficients to GPU
# Now ready for GPU processing
result = fchain(wave.ys)
Two-Step Device Transfer:
Module transfer: `fchain.to("cuda")` moves module parameters
Coefficient transfer: `f.move_coeff("cuda")` moves filter coefficients
Warning
For IIR filters, you must both move the module to the device and call move_coeff(). Forgetting the second step will cause runtime errors.
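A small helper can keep the two steps together. A sketch, relying on the `move_coeff()` method shown above and assuming the chain is iterable like `nn.Sequential`:

```python
def move_chain(wave, fchain, device):
    """Move a Wave and a filter chain (including IIR coefficients) to one device."""
    wave.to(device)                   # Wave.to is used statement-style in this guide
    fchain.to(device)                 # step 1: module parameters
    for f in fchain:
        if hasattr(f, "move_coeff"):  # step 2: IIR coefficients, where present
            f.move_coeff(device)

# Usage
move_chain(wave, fchain, "cuda")
```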
IIR vs FIR Performance Trade-offs#
IIR and FIR filters have fundamentally different performance characteristics:
| Aspect | IIR Filters | FIR Filters |
|---|---|---|
| Computational Cost | Lower (fewer operations per sample) | Higher (convolution with many taps) |
| Memory Footprint | Small (few coefficients: b, a) | Large (many tap coefficients) |
| GPU Advantage | Moderate (less parallelism) | High (highly parallel convolution) |
| Phase Response | Non-linear | Can be linear (symmetric taps) |
| Stability | Can be unstable if poorly designed | Always stable |
| Filter Order | Achieves sharp cutoff with low order | Requires many taps for sharp cutoff |
Performance Comparison Example:
# IIR: Order 8 Butterworth (18 coefficients total)
iir_filter = LoButterworth(cutoff=1000, order=8, fs=44100)
# Coefficients: b (9 values) + a (9 values) = 18 total
# Equivalent FIR: ~150+ taps for similar frequency response
fir_filter = DesignableFIR(num_taps=151, cutoff=1000, fs=44100)
# Coefficients: 151 tap values
# IIR is ~8x more memory-efficient and faster on CPU
# FIR has better GPU parallelism and linear phase
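The memory figures above can be checked with quick arithmetic (a direct-form coefficient count; actual implementations may vary):

```python
# Direct-form coefficient counts
iir_order = 8
iir_coeffs = 2 * (iir_order + 1)  # b (9 values) + a (9 values) = 18 total

fir_taps = 151
fir_coeffs = fir_taps             # one value per tap

print(f"FIR stores {fir_coeffs / iir_coeffs:.1f}x more coefficients")  # ~8.4x
```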
Choosing Between IIR and FIR:
flowchart TD
Start["Choose Filter Type"]
LinearPhase{"Linear phase<br/>required?"}
Stability{"Stability<br/>critical?"}
GPU{"GPU<br/>available?"}
Memory{"Memory<br/>constrained?"}
UseFIR["Use FIR Filters<br/>✓ Linear phase<br/>✓ Always stable<br/>✓ GPU-friendly"]
UseIIR["Use IIR Filters<br/>✓ Low memory<br/>✓ Low latency<br/>✓ Efficient CPU"]
Start --> LinearPhase
LinearPhase -->|Yes| UseFIR
LinearPhase -->|No| Stability
Stability -->|Critical| UseFIR
Stability -->|Not critical| GPU
GPU -->|Yes, long signals| UseFIR
GPU -->|No or short signals| Memory
Memory -->|Yes| UseIIR
Memory -->|No| UseFIR
style UseFIR fill:#e1ffe1
style UseIIR fill:#e1f5ff
Filter Type Selection Decision Tree - Choose based on phase, stability, and resource constraints.
IIR SciPy Baseline#
The IIR benchmark includes SciPy baseline for comparison:
from scipy.signal import butter, cheby1, lfilter
# Design IIR coefficients
b1, a1 = butter(2, 1000, btype='high', fs=SAMPLE_RATE)
b2, a2 = butter(2, 5000, btype='low', fs=SAMPLE_RATE)
b3, a3 = cheby1(2, 0.5, 1500, btype='high', fs=SAMPLE_RATE)
b4, a4 = cheby1(2, 0.5, 1800, btype='low', fs=SAMPLE_RATE)
# Apply filters sequentially
filtered = lfilter(b1, a1, signal)
filtered = lfilter(b2, a2, filtered)
filtered = lfilter(b3, a3, filtered)
filtered = lfilter(b4, a4, filtered)
SciPy IIR Performance:
CPU-optimized: Highly optimized C implementation
No GPU: SciPy doesn’t support CUDA
Baseline: Reference for CPU performance
Filter Design: Uses standard signal processing algorithms
Performance Optimization Guidelines#
Based on the benchmarking results, follow these guidelines to optimize your TorchFX pipelines.
When to Use GPU Acceleration#
GPU acceleration provides the greatest benefit under specific conditions:
flowchart TD
Start["Audio Processing Task"]
CheckDuration{"Signal duration<br/>> 60 seconds?"}
CheckChannels{"Channels ≥ 4?"}
CheckBatch{"Batch processing<br/>multiple files?"}
CheckFIR{"Using FIR filters<br/>with >100 taps?"}
CheckRealtime{"Real-time<br/>low-latency requirement?"}
UseGPU["✓ Use GPU<br/>wave.to('cuda')<br/>fchain.to('cuda')"]
UseCPU["✓ Use CPU<br/>Default or wave.to('cpu')"]
Start --> CheckDuration
CheckDuration -->|Yes| UseGPU
CheckDuration -->|No| CheckChannels
CheckChannels -->|Yes| UseGPU
CheckChannels -->|No| CheckBatch
CheckBatch -->|Yes| UseGPU
CheckBatch -->|No| CheckFIR
CheckFIR -->|Yes| UseGPU
CheckFIR -->|No| CheckRealtime
CheckRealtime -->|Yes| UseCPU
CheckRealtime -->|No| UseGPU
style UseGPU fill:#e1ffe1
style UseCPU fill:#e1f5ff
GPU Decision Tree - Follow this flowchart to determine optimal execution backend.
GPU Performance Sweet Spot:
| Factor | Threshold | Reasoning |
|---|---|---|
| Duration | > 60 seconds | Amortizes data transfer overhead |
| Channels | ≥ 4 channels | Exploits parallel processing |
| Batch Size | > 5 files | Transfer overhead amortized across batch |
| FIR Taps | > 100 taps | Convolution highly parallelizable |
| IIR Chain | ≥ 3 filters | Accumulated compute benefits |
CPU Preferred Cases:
Real-time processing: More predictable latency
Short signals (<30s): Transfer overhead dominates
Single channel: Insufficient parallelism
IIR filters only: Less GPU benefit than FIR
Tip
When in doubt, benchmark your specific workload. Use the patterns from the benchmark suite as templates.
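The decision tree above can also be expressed as a simple heuristic. A sketch, with thresholds taken from the table (treat them as starting points, not hard rules):

```python
import torch

def prefer_gpu(duration_s, channels, batch_files=1, fir_taps=0, realtime=False):
    """Heuristic backend choice mirroring the GPU decision tree above."""
    if realtime:
        return False           # CPU: more predictable latency
    if duration_s > 60:
        return True            # long signals amortize transfer cost
    if channels >= 4:
        return True            # parallelism across channels
    if batch_files > 5:
        return True            # transfer cost amortized over the batch
    if fir_taps > 100:
        return True            # convolution parallelizes well
    return False

device = "cuda" if prefer_gpu(120, 2) and torch.cuda.is_available() else "cpu"
```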
Filter Chain Optimization#
Optimize filter chains by pre-computing coefficients and reusing filters:
import torch
import torch.nn as nn
from torchfx import Wave
from torchfx.filter import DesignableFIR, HiButterworth
SAMPLE_RATE = 44100
# Create filter chain
fchain = nn.Sequential(
DesignableFIR(num_taps=101, cutoff=1000, fs=SAMPLE_RATE),
HiButterworth(cutoff=500, order=2, fs=SAMPLE_RATE),
)
# Pre-compute coefficients once during initialization
for f in fchain:
f.compute_coefficients()
# For IIR filters, also move coefficients to device
device = "cuda" if torch.cuda.is_available() else "cpu"
fchain.to(device)
for f in fchain:
if hasattr(f, 'move_coeff'):
f.move_coeff(device)
# Process multiple files without re-computing coefficients
audio_files = ["song1.wav", "song2.wav", "song3.wav"]
for audio_file in audio_files:
wave = Wave.from_file(audio_file).to(device)
result = wave | fchain # Uses cached coefficients
result.to("cpu").save(f"processed_{audio_file}")
Optimization Benefits:
Coefficient caching: Compute once, reuse for all files
Device pinning: Keep filters on GPU across iterations
Batch amortization: Setup cost amortized over multiple files
Device Placement Strategy#
Minimize device transfers by keeping processing on a single device:
import torch
import torchfx as fx
# Strategy 1: Single device throughout (RECOMMENDED)
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load and move to device once
wave = fx.Wave.from_file("audio.wav").to(device)
# Create filter chain on same device
fchain = torch.nn.Sequential(
fx.filter.HiButterworth(cutoff=80, order=2),
fx.filter.LoButterworth(cutoff=12000, order=4),
).to(device)
# All operations on same device
result = wave | fchain
# Move to CPU only for final I/O
result.to("cpu").save("output.wav")
Avoid Inefficient Transfers (Anti-pattern):
# ❌ WRONG: Unnecessary device transfers
wave = fx.Wave.from_file("audio.wav").to("cuda")
result1 = wave.to("cpu") | cpu_filter # Transfer 1
result2 = result1.to("cuda") | gpu_filter # Transfer 2
result3 = result2.to("cpu") | cpu_filter2 # Transfer 3
# ✅ CORRECT: Single device
device = "cuda" if torch.cuda.is_available() else "cpu"
wave = fx.Wave.from_file("audio.wav").to(device)
cpu_filter.to(device)
gpu_filter.to(device)
cpu_filter2.to(device)
result = wave | cpu_filter | gpu_filter | cpu_filter2
Device Transfer Costs:
CPU → GPU: O(n) where n = number of samples
GPU → CPU: O(n) where n = number of samples
Impact: Can dominate for short signals
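Transfer overhead is easy to measure directly. A sketch (synchronize so the copy actually completes before the clock stops):

```python
import time
import torch

x = torch.randn(8, 44100 * 60)   # 8 channels, 60 s at 44.1 kHz (~85 MB)
torch.cuda.synchronize()
t0 = time.perf_counter()
x_gpu = x.to("cuda")
torch.cuda.synchronize()          # wait until the copy finishes
print(f"CPU->GPU: {time.perf_counter() - t0:.4f}s for {x.nbytes / 1e6:.0f} MB")
```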
See also
GPU Acceleration - Comprehensive device management patterns
Memory Management Best Practices#
Optimize memory usage for large-scale processing:
| Optimization | Implementation | Impact |
|---|---|---|
| In-place operations | Use effects that modify tensors in-place where possible | Reduces memory allocations |
| Chunked processing | Process long audio in chunks | Prevents GPU OOM errors |
| Coefficient caching | Pre-compute and reuse filter coefficients | Eliminates redundant computation |
| Device pinning | Keep frequently-used filters on device | Reduces transfer overhead |
| Batch size tuning | Adjust batch size to fit GPU memory | Maximizes throughput |
Memory-Efficient Chunked Processing Example:
import torch
import torchfx as fx
def process_long_audio_chunked(wave, fchain, chunk_duration=60):
"""Process very long audio in chunks to manage GPU memory.
Parameters
----------
wave : Wave
Input audio (can be on CPU or GPU)
fchain : nn.Module
Filter chain (moved to the processing device inside this function)
chunk_duration : float
Chunk duration in seconds
Returns
-------
Wave
Processed audio
"""
chunk_samples = int(chunk_duration * wave.fs)
num_chunks = (wave.ys.size(-1) + chunk_samples - 1) // chunk_samples
device = "cuda" if torch.cuda.is_available() else "cpu"
fchain.to(device)
results = []
for i in range(num_chunks):
start = i * chunk_samples
end = min((i + 1) * chunk_samples, wave.ys.size(-1))
# Extract chunk and move to GPU
chunk = fx.Wave(wave.ys[..., start:end], wave.fs).to(device)
# Process chunk on GPU
processed_chunk = chunk | fchain
# Move back to CPU and store
results.append(processed_chunk.ys.cpu())
# Clear GPU cache
if device == "cuda":
torch.cuda.empty_cache()
# Concatenate results
return fx.Wave(torch.cat(results, dim=-1), wave.fs)
# Usage
wave = fx.Wave.from_file("10_hour_recording.wav")
fchain = fx.filter.LoButterworth(cutoff=1000, order=4)
result = process_long_audio_chunked(wave, fchain, chunk_duration=60)
result.save("processed.wav")
Chunked Processing Benefits:
Processes arbitrarily long audio without OOM errors
Keeps GPU utilization high
Balances memory usage with throughput
Note that chunks are filtered independently, so filters with memory (IIR, or FIR with long impulse responses) can introduce small transients at chunk boundaries; longer chunks make these less frequent.
Benchmarking Your Own Pipelines#
Use these patterns to benchmark your custom TorchFX pipelines.
Basic Benchmarking Template#
import timeit
import torch
import torchfx as fx
import numpy as np
# Configuration
SAMPLE_RATE = 44100
DURATION = 60 # seconds
NUM_CHANNELS = 4
REP = 50 # repetitions for timing
# Generate test signal
signal = np.random.randn(NUM_CHANNELS, int(SAMPLE_RATE * DURATION))
signal = signal.astype(np.float32)
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
wave = fx.Wave(signal, SAMPLE_RATE)
# Create your processing pipeline
pipeline = torch.nn.Sequential(
fx.filter.HiButterworth(cutoff=100, order=2),
fx.filter.LoButterworth(cutoff=10000, order=4),
fx.effect.Normalize(peak=0.9),
)
# Pre-compute coefficients
for module in pipeline:
if hasattr(module, 'compute_coefficients'):
module.compute_coefficients()
# Benchmark CPU
wave.to("cpu")
pipeline.to("cpu")
cpu_time = timeit.timeit(lambda: wave | pipeline, number=REP)
# Benchmark GPU (if available)
if torch.cuda.is_available():
wave.to("cuda")
pipeline.to("cuda")
# Move IIR coefficients if needed
for module in pipeline:
if hasattr(module, 'move_coeff'):
module.move_coeff("cuda")
gpu_time = timeit.timeit(lambda: wave | pipeline, number=REP)
print(f"CPU time: {cpu_time/REP:.6f}s")
print(f"GPU time: {gpu_time/REP:.6f}s")
print(f"Speedup: {cpu_time/gpu_time:.2f}x")
else:
print(f"CPU time: {cpu_time/REP:.6f}s")
print("GPU not available")
Multi-Configuration Benchmark#
Test performance across multiple configurations:
import timeit
import torch
import torchfx as fx
import numpy as np
import pandas as pd
SAMPLE_RATE = 44100
REP = 50
# Test configurations
durations = [5, 30, 60, 120, 300] # seconds
channel_counts = [1, 2, 4, 8]
# Create filter chain
pipeline = torch.nn.Sequential(
fx.filter.LoButterworth(cutoff=1000, order=4),
fx.filter.HiButterworth(cutoff=100, order=2),
)
# Pre-compute coefficients
for module in pipeline:
if hasattr(module, 'compute_coefficients'):
module.compute_coefficients()
# Benchmark grid
results = []
for duration in durations:
for channels in channel_counts:
# Generate test signal
signal = np.random.randn(channels, int(SAMPLE_RATE * duration))
signal = signal.astype(np.float32)
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
wave = fx.Wave(signal, SAMPLE_RATE)
# CPU benchmark
wave.to("cpu")
pipeline.to("cpu")
cpu_time = timeit.timeit(lambda: wave | pipeline, number=REP) / REP
# GPU benchmark
if torch.cuda.is_available():
wave.to("cuda")
pipeline.to("cuda")
gpu_time = timeit.timeit(lambda: wave | pipeline, number=REP) / REP
speedup = cpu_time / gpu_time
else:
gpu_time = None
speedup = None
results.append({
'duration': duration,
'channels': channels,
'cpu_time': cpu_time,
'gpu_time': gpu_time,
'speedup': speedup
})
# Convert to DataFrame for analysis
df = pd.DataFrame(results)
print(df.to_string(index=False))
# Save to CSV
df.to_csv("benchmark_results.csv", index=False)
Profiling with PyTorch Profiler#
For detailed performance analysis, use PyTorch’s built-in profiler:
import torch
import torchfx as fx
# Create pipeline
wave = fx.Wave.from_file("audio.wav").to("cuda")
pipeline = torch.nn.Sequential(
fx.filter.LoButterworth(cutoff=1000, order=4),
fx.filter.HiButterworth(cutoff=100, order=2),
).to("cuda")
# Profile the pipeline
with torch.profiler.profile(
activities=[
torch.profiler.ProfilerActivity.CPU,
torch.profiler.ProfilerActivity.CUDA,
],
record_shapes=True,
profile_memory=True,
with_stack=True
) as prof:
result = wave | pipeline
# Print profiling results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Export for visualization
prof.export_chrome_trace("trace.json")
# Open trace.json in chrome://tracing for detailed visualization
See also
PyTorch Profiler Documentation - Official guide to PyTorch profiling tools
Complete Benchmarking Examples#
These complete examples demonstrate how to run comprehensive benchmarks for your specific use cases.
Example 1: API Pattern Comparison#
Compare different API patterns for your filter chain:
import timeit
import numpy as np
from torch import nn
from torchfx import Wave
from torchfx.filter import HiChebyshev1, LoButterworth
SAMPLE_RATE = 44100
DURATION = 120 # 2 minutes
NUM_CHANNELS = 8
REP = 50
# Generate test signal
signal_data = np.random.randn(NUM_CHANNELS, int(SAMPLE_RATE * DURATION))
signal_data = signal_data.astype(np.float32)
signal_data /= np.max(np.abs(signal_data), axis=1, keepdims=True)
wave = Wave(signal_data, SAMPLE_RATE)
# Pattern 1: Custom nn.Module class
class FilterChain(nn.Module):
def __init__(self, fs):
super().__init__()
self.f1 = HiChebyshev1(20, fs=fs)
self.f2 = LoButterworth(5000, fs=fs)
def forward(self, x):
x = self.f1(x)
x = self.f2(x)
return x
def test_class():
fchain = FilterChain(wave.fs)
return fchain(wave.ys)
# Pattern 2: nn.Sequential
def test_sequential():
fchain = nn.Sequential(
HiChebyshev1(20, fs=wave.fs),
LoButterworth(5000, fs=wave.fs),
)
return fchain(wave.ys)
# Pattern 3: Pipe operator
def test_pipe():
return wave | HiChebyshev1(20) | LoButterworth(5000)
# Benchmark each pattern
class_time = timeit.timeit(test_class, number=REP)
seq_time = timeit.timeit(test_sequential, number=REP)
pipe_time = timeit.timeit(test_pipe, number=REP)
print(f"Custom class: {class_time/REP:.6f}s")
print(f"nn.Sequential: {seq_time/REP:.6f}s")
print(f"Pipe operator: {pipe_time/REP:.6f}s")
Example 2: FIR Filter Performance Analysis#
Comprehensive FIR filter benchmarking across durations and channel counts:
import timeit
import numpy as np
import torch.nn as nn
import pandas as pd
from torchfx import Wave
from torchfx.filter import DesignableFIR
SAMPLE_RATE = 44100
REP = 50
# Test matrix
durations = [5, 60, 180, 300, 600]
channel_counts = [1, 2, 4, 8, 12]
results = []
for duration in durations:
for channels in channel_counts:
# Generate test signal
signal = np.random.randn(channels, int(SAMPLE_RATE * duration))
signal = signal.astype(np.float32)
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
wave = Wave(signal, SAMPLE_RATE)
# Create FIR filter chain
fchain = nn.Sequential(
DesignableFIR(num_taps=101, cutoff=1000, fs=SAMPLE_RATE),
DesignableFIR(num_taps=102, cutoff=5000, fs=SAMPLE_RATE),
DesignableFIR(num_taps=103, cutoff=1500, fs=SAMPLE_RATE),
)
# Pre-compute coefficients
for f in fchain:
f.compute_coefficients()
# GPU benchmark
wave.to("cuda")
fchain.to("cuda")
gpu_time = timeit.timeit(lambda: wave | fchain, number=REP) / REP
# CPU benchmark
wave.to("cpu")
fchain.to("cpu")
cpu_time = timeit.timeit(lambda: wave | fchain, number=REP) / REP
results.append({
'duration_sec': duration,
'channels': channels,
'gpu_time_sec': gpu_time,
'cpu_time_sec': cpu_time,
'speedup': cpu_time / gpu_time
})
print(f"Duration: {duration}s, Channels: {channels}, "
f"GPU: {gpu_time:.6f}s, CPU: {cpu_time:.6f}s, "
f"Speedup: {cpu_time/gpu_time:.2f}x")
# Save results
df = pd.DataFrame(results)
df.to_csv("fir_benchmark.csv", index=False)
print("\nResults saved to fir_benchmark.csv")
Example 3: IIR Filter Performance Analysis#
Complete IIR filter benchmarking with coefficient management:
import timeit
import numpy as np
import torch.nn as nn
import pandas as pd
from torchfx import Wave
from torchfx.filter import HiButterworth, LoButterworth, HiChebyshev1, LoChebyshev1
SAMPLE_RATE = 44100
REP = 50
# Test matrix
durations = [1, 5, 180, 300, 600]
channel_counts = [1, 2, 4, 8, 12]
results = []
for duration in durations:
for channels in channel_counts:
# Generate test signal
signal = np.random.randn(channels, int(SAMPLE_RATE * duration))
signal = signal.astype(np.float32)
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
wave = Wave(signal, SAMPLE_RATE)
# Create IIR filter chain
fchain = nn.Sequential(
HiButterworth(cutoff=1000, order=2, fs=SAMPLE_RATE),
LoButterworth(cutoff=5000, order=2, fs=SAMPLE_RATE),
HiChebyshev1(cutoff=1500, order=2, fs=SAMPLE_RATE),
LoChebyshev1(cutoff=1800, order=2, fs=SAMPLE_RATE),
)
# GPU benchmark
wave.to("cuda")
fchain.to("cuda")
# Compute and move coefficients
for f in fchain:
f.compute_coefficients()
f.move_coeff("cuda")
gpu_time = timeit.timeit(
lambda: fchain(wave.ys),
number=REP
) / REP
# CPU benchmark
wave.to("cpu")
fchain.to("cpu")
for f in fchain:
f.move_coeff("cpu")
cpu_time = timeit.timeit(
lambda: fchain(wave.ys),
number=REP
) / REP
results.append({
'duration_sec': duration,
'channels': channels,
'gpu_time_sec': gpu_time,
'cpu_time_sec': cpu_time,
'speedup': cpu_time / gpu_time
})
print(f"Duration: {duration}s, Channels: {channels}, "
f"GPU: {gpu_time:.6f}s, CPU: {cpu_time:.6f}s, "
f"Speedup: {cpu_time/gpu_time:.2f}x")
# Save results
df = pd.DataFrame(results)
df.to_csv("iir_benchmark.csv", index=False)
print("\nResults saved to iir_benchmark.csv")
Summary#
Key takeaways for optimizing TorchFX performance:
GPU Acceleration: Use GPU for long signals (>60s), multi-channel audio (≥4 channels), and batch processing
Filter Choice: FIR filters excel on GPU with parallel convolution; IIR filters are more CPU-efficient
API Pattern: Pipeline operator provides best ergonomics with automatic sample rate configuration and minimal overhead
Coefficient Caching: Pre-compute filter coefficients once and reuse for multiple files
Device Management: Minimize transfers by keeping all processing on one device
Memory: Use chunked processing for very long audio files to prevent OOM errors
Benchmarking: Use the provided templates to measure performance of your specific pipelines
GPU acceleration can provide 5-20x speedups for appropriate workloads. Follow the decision trees and best practices in this guide to maximize throughput in your audio processing pipelines.
External Resources#
PyTorch Profiler - Profiling PyTorch code
CUDA Best Practices Guide - NVIDIA optimization guide
SciPy Signal Processing - SciPy signal processing reference
PyTorch Performance Tuning - PyTorch optimization guide