(performance)=
# Performance Optimization and Benchmarking
Learn how to measure and maximize the performance of your TorchFX audio processing pipelines. This guide covers benchmarking methodology, GPU vs CPU performance comparisons, filter type trade-offs, and best practices for building high-throughput audio processing systems.
## Prerequisites
Before starting this guide, you should be familiar with:
- {doc}`../core-concepts/wave` - Wave class fundamentals
- {doc}`../core-concepts/pipeline-operator` - Pipeline operator basics
- {doc}`gpu-acceleration` - GPU device management
- {doc}`../filters/iir-filters` - IIR filter characteristics
- {doc}`../filters/fir-filters` - FIR filter characteristics
## Overview
Performance optimization in TorchFX involves understanding three key trade-offs:
| Dimension | Trade-off | Optimization Strategy |
|-----------|-----------|----------------------|
| **Execution Backend** | GPU vs CPU vs SciPy | Choose based on duration, channels, batch size |
| **Filter Type** | FIR vs IIR | Balance computational cost vs phase response |
| **API Pattern** | Classes vs Sequential vs Pipe | Select based on ergonomics vs performance needs |
```{mermaid}
graph TB
subgraph "Performance Optimization Dimensions"
AudioTask["Audio Processing Task"]
subgraph Backend["Backend Selection"]
GPU["GPU (CUDA)
High parallelism"]
CPU["CPU (PyTorch)
Moderate performance"]
SciPy["SciPy
Baseline reference"]
end
subgraph FilterType["Filter Type"]
FIR["FIR Filters
Linear phase
Higher compute"]
IIR["IIR Filters
Non-linear phase
Lower compute"]
end
subgraph API["API Pattern"]
Pipe["Pipeline Operator
Auto sample rate"]
Sequential["nn.Sequential
Standard PyTorch"]
Custom["Custom Module
Maximum control"]
end
AudioTask --> Backend
AudioTask --> FilterType
AudioTask --> API
Backend --> Optimize["Optimized Pipeline"]
FilterType --> Optimize
API --> Optimize
end
style AudioTask fill:#e1f5ff
style Optimize fill:#e1ffe1
style GPU fill:#fff5e1
style FIR fill:#f5e1ff
style Pipe fill:#ffe1e1
```
**Performance Optimization Framework** - Three key dimensions for optimizing TorchFX pipelines.
```{seealso}
For detailed GPU acceleration patterns, see {doc}`gpu-acceleration`. For comprehensive benchmarking infrastructure, see the benchmark suite in the `benchmark/` directory.
```
## Benchmark Methodology
TorchFX includes a comprehensive benchmarking suite that evaluates performance across different dimensions. The suite consists of three benchmark scripts, each targeting a specific aspect of performance.
### Benchmark Suite Architecture
```{mermaid}
graph TB
subgraph "TorchFX Benchmark Suite"
API["api_bench.py
API Pattern Comparison"]
FIR["fir_bench.py
FIR Filter Performance"]
IIR["iir_bench.py
IIR Filter Performance"]
end
subgraph "Common Test Parameters"
SR["Sample Rate: 44.1 kHz"]
REP["Repetitions: 50"]
Signal["Signal: Random noise
Float32, normalized"]
end
subgraph "Variable Parameters"
Duration["Duration: 1s - 10 min"]
Channels["Channels: 1, 2, 4, 8, 12"]
Backends["Backends: GPU, CPU, SciPy"]
end
subgraph "Output"
CSV["CSV Results
.out files"]
end
API --> SR
FIR --> SR
IIR --> SR
API --> REP
FIR --> REP
IIR --> REP
FIR --> Duration
FIR --> Channels
IIR --> Duration
IIR --> Channels
API --> Backends
FIR --> Backends
IIR --> Backends
API --> CSV
FIR --> CSV
IIR --> CSV
style API fill:#e1f5ff
style FIR fill:#e8f5e1
style IIR fill:#fff5e1
```
**Benchmark Suite Organization** - Three complementary benchmarks measuring different performance aspects.
### Test Signal Generation
All benchmarks use consistent signal generation to ensure comparable results:
```python
import numpy as np
def create_audio(sample_rate, duration, num_channels):
"""Generate multi-channel random noise for benchmarking.
Parameters
----------
sample_rate : int
Sample rate in Hz (typically 44100)
duration : float
Duration in seconds
num_channels : int
Number of audio channels
Returns
-------
signal : np.ndarray
Shape (num_channels, num_samples), float32, normalized to [-1, 1]
"""
signal = np.random.randn(num_channels, int(sample_rate * duration))
signal = signal.astype(np.float32)
# Normalize each channel independently
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
return signal
```
**Key Characteristics**:
- **Distribution**: Gaussian random noise (`np.random.randn`)
- **Data Type**: Float32 for GPU compatibility
- **Normalization**: Per-channel normalization to [-1, 1] range
- **Reproducible**: Seeding the random generator yields identical signals across runs (see the seeded sketch below)
```{tip}
Using random noise ensures benchmarks test worst-case performance without special structure or patterns that could be optimized by the hardware.
```
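For reproducible comparisons across runs or machines, seed the generator explicitly. A seeded variant of `create_audio` might look like the following (a sketch using NumPy's `default_rng`; the benchmark scripts themselves use the unseeded generator shown above):
```python
import numpy as np

# Seed a dedicated generator so every run sees the same "random" signal
rng = np.random.default_rng(42)

def create_audio_seeded(sample_rate, duration, num_channels):
    """Seeded counterpart of create_audio for reproducible benchmarks."""
    signal = rng.standard_normal((num_channels, int(sample_rate * duration)))
    signal = signal.astype(np.float32)
    signal /= np.max(np.abs(signal), axis=1, keepdims=True)
    return signal
```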
### Timing Methodology
All benchmarks use Python's {func}`timeit.timeit` for accurate timing measurements:
```python
import timeit
# Standard pattern across all benchmarks
REP = 50 # Number of repetitions
# Time the operation
elapsed = timeit.timeit(
lambda: process_audio(wave, filter_chain),
number=REP
)
# Report average time per iteration
avg_time = elapsed / REP
print(f"Average time: {avg_time:.6f}s")
```
**Timing Best Practices**:
1. **Multiple Repetitions**: 50 repetitions minimize variance and warm-up effects
2. **Lambda Wrapper**: Closes over pre-built objects so setup work is not re-executed inside the timed call
3. **Average Reporting**: Reports per-iteration time for easy comparison
4. **Excluded Overhead**: Setup (loading, coefficient computation) excluded from timing
```{important}
Timing only measures the core processing operation. Setup costs like loading audio files, computing filter coefficients, and device transfers are performed **before** timing begins.
```
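One caveat when timing GPU code yourself: CUDA kernels launch asynchronously, so a plain `timeit` call can return before the work has actually finished. Synchronizing inside the timed callable avoids under-reporting. A minimal sketch (the `time_op` helper and the `run_pipeline` callable are illustrative, not part of the benchmark suite):
```python
import timeit
import torch

def time_op(fn, rep=50, use_cuda=True):
    """Average runtime of fn(), forcing CUDA kernels to finish before stopping the clock."""
    if use_cuda and torch.cuda.is_available():
        torch.cuda.synchronize()  # drain any pending work before timing starts
        elapsed = timeit.timeit(
            lambda: (fn(), torch.cuda.synchronize()),
            number=rep,
        )
    else:
        elapsed = timeit.timeit(fn, number=rep)
    return elapsed / rep

# Hypothetical usage:
# avg = time_op(lambda: run_pipeline(wave, fchain), rep=50)
```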
## API Performance Comparison
The `api_bench.py` benchmark compares different API patterns for applying the same filter chain to audio. This helps users understand the performance and ergonomic trade-offs of different coding styles.
### Test Configuration
**Filter Chain**: 6 cascaded IIR filters
- 3 × High-pass Chebyshev Type I filters (20 Hz, 60 Hz, 65 Hz)
- 3 × Low-pass Butterworth filters (5000 Hz, 4900 Hz, 4850 Hz)
**Test Signal**: 8-channel audio, 2 minutes duration, 44.1 kHz sample rate
```{mermaid}
graph LR
Input["Input Audio
8 channels, 2 min"] --> F1["HiChebyshev1
20 Hz"]
F1 --> F2["HiChebyshev1
60 Hz"]
F2 --> F3["HiChebyshev1
65 Hz"]
F3 --> F4["LoButterworth
5000 Hz"]
F4 --> F5["LoButterworth
4900 Hz"]
F5 --> F6["LoButterworth
4850 Hz"]
F6 --> Output["Filtered Output
8 channels, 2 min"]
style Input fill:#e1f5ff
style Output fill:#e1ffe1
```
**API Benchmark Filter Chain** - Six cascaded IIR filters applied to 8-channel audio.
### API Pattern 1: Custom nn.Module Class
A traditional PyTorch approach using a custom {class}`torch.nn.Module`:
```python
from torch import nn
from torchfx.filter import HiChebyshev1, LoButterworth
class FilterChain(nn.Module):
"""Custom module for filter chain."""
def __init__(self, fs):
super().__init__()
self.f1 = HiChebyshev1(20, fs=fs)
self.f2 = HiChebyshev1(60, fs=fs)
self.f3 = HiChebyshev1(65, fs=fs)
self.f4 = LoButterworth(5000, fs=fs)
self.f5 = LoButterworth(4900, fs=fs)
self.f6 = LoButterworth(4850, fs=fs)
def forward(self, x):
x = self.f1(x)
x = self.f2(x)
x = self.f3(x)
x = self.f4(x)
x = self.f5(x)
x = self.f6(x)
return x
# Usage
fchain = FilterChain(signal.fs)
result = fchain(signal.ys)
```
**Characteristics**:
- **Sample Rate**: Must be passed explicitly to `__init__`
- **Reusability**: Can be instantiated once and reused
- **Flexibility**: Full control over forward pass logic
- **Overhead**: Standard PyTorch `nn.Module` call overhead
### API Pattern 2: nn.Sequential
Using PyTorch's built-in {class}`torch.nn.Sequential` container:
```python
from torch.nn import Sequential
from torchfx.filter import HiChebyshev1, LoButterworth
# Create filter chain inline
fchain = Sequential(
HiChebyshev1(20, fs=signal.fs),
HiChebyshev1(60, fs=signal.fs),
HiChebyshev1(65, fs=signal.fs),
LoButterworth(5000, fs=signal.fs),
LoButterworth(4900, fs=signal.fs),
LoButterworth(4850, fs=signal.fs),
)
# Apply to audio tensor
result = fchain(signal.ys)
```
**Characteristics**:
- **Sample Rate**: Must be passed explicitly to each filter
- **Simplicity**: No custom class needed
- **Performance**: Identical to custom `nn.Module` (same underlying mechanism)
- **Flexibility**: Can add/remove filters easily
```{note}
`nn.Sequential` and custom `nn.Module` classes have **identical performance** characteristics. The choice between them is purely about code organization and readability.
```
### API Pattern 3: Pipeline Operator (Pipe)
TorchFX's idiomatic {term}`pipeline operator` pattern:
```python
from torchfx import Wave
from torchfx.filter import HiChebyshev1, LoButterworth
# Apply filters using pipe operator
result = (
signal
| HiChebyshev1(20)
| HiChebyshev1(60)
| HiChebyshev1(65)
| LoButterworth(5000)
| LoButterworth(4900)
| LoButterworth(4850)
)
```
**Characteristics**:
- **Sample Rate**: **Automatically configured** from {class}`~torchfx.Wave` object
- **Ergonomics**: Most readable and concise
- **Safety**: Eliminates sample rate mismatch errors
- **Performance**: Minimal overhead for automatic configuration
```{tip}
The pipeline operator is the **recommended pattern** for TorchFX. It provides the best balance of readability, safety (automatic sample rate configuration), and performance.
```
### API Pattern 4: SciPy Baseline
Pure NumPy/SciPy implementation for baseline comparison:
```python
from scipy.signal import butter, cheby1, lfilter
# Pre-compute filter coefficients
b1, a1 = cheby1(2, 0.5, 20, btype='high', fs=SAMPLE_RATE)
b2, a2 = cheby1(2, 0.5, 60, btype='high', fs=SAMPLE_RATE)
b3, a3 = cheby1(2, 0.5, 65, btype='high', fs=SAMPLE_RATE)
b4, a4 = butter(2, 5000, btype='low', fs=SAMPLE_RATE)
b5, a5 = butter(2, 4900, btype='low', fs=SAMPLE_RATE)
b6, a6 = butter(2, 4850, btype='low', fs=SAMPLE_RATE)
# Apply filters sequentially
filtered = lfilter(b1, a1, signal)
filtered = lfilter(b2, a2, filtered)
filtered = lfilter(b3, a3, filtered)
filtered = lfilter(b4, a4, filtered)
filtered = lfilter(b5, a5, filtered)
filtered = lfilter(b6, a6, filtered)
```
**Characteristics**:
- **Performance**: Optimized C implementation
- **GPU**: No GPU acceleration available
- **Integration**: Requires NumPy arrays (no PyTorch tensors)
- **Baseline**: Reference for CPU performance comparison
### API Performance Summary
| API Pattern | Sample Rate Config | Ergonomics | Performance | Use Case |
|-------------|-------------------|------------|-------------|----------|
| Custom `nn.Module` | Manual (`fs=`) | Good | Fast | Complex custom logic |
| `nn.Sequential` | Manual (`fs=`) | Very Good | Fast | Standard PyTorch integration |
| **Pipeline Operator** | **Automatic** | **Excellent** | **Fast** | **Recommended for TorchFX** |
| SciPy `lfilter` | Manual (`fs=`) | Fair | Fast (CPU only) | Baseline comparison |
**Key Insight**: The pipeline operator provides automatic sample rate configuration with negligible performance overhead, making it the most ergonomic choice without sacrificing speed.
```{seealso}
{doc}`../core-concepts/pipeline-operator` - Detailed documentation on the pipeline operator pattern
```
## FIR Filter Performance
The `fir_bench.py` benchmark evaluates FIR filter performance across varying signal durations, channel counts, and execution backends (GPU, CPU, SciPy).
### FIR Test Configuration
**Filter Chain**: 5 cascaded FIR filters with varying tap counts
- DesignableFIR: 101 taps, 1000 Hz cutoff
- DesignableFIR: 102 taps, 5000 Hz cutoff
- DesignableFIR: 103 taps, 1500 Hz cutoff
- DesignableFIR: 104 taps, 1800 Hz cutoff
- DesignableFIR: 105 taps, 1850 Hz cutoff
**Test Parameters**:
- **Durations**: 5s, 60s, 180s, 300s, 600s (10 minutes)
- **Channels**: 1, 2, 4, 8, 12
- **Backends**: GPU (CUDA), CPU (PyTorch), SciPy (NumPy)
- **Repetitions**: 50 per configuration
```{mermaid}
graph TB
subgraph "FIR Benchmark Test Matrix"
Input["Input Signal
Variable duration & channels"]
subgraph Filters["FIR Filter Chain"]
F1["FIR 101 taps
1000 Hz"]
F2["FIR 102 taps
5000 Hz"]
F3["FIR 103 taps
1500 Hz"]
F4["FIR 104 taps
1800 Hz"]
F5["FIR 105 taps
1850 Hz"]
end
subgraph Durations["Duration Sweep"]
D1["5 seconds"]
D2["60 seconds"]
D3["180 seconds"]
D4["300 seconds"]
D5["600 seconds"]
end
subgraph Channels["Channel Sweep"]
C1["1 channel"]
C2["2 channels"]
C3["4 channels"]
C4["8 channels"]
C5["12 channels"]
end
Input --> F1
F1 --> F2
F2 --> F3
F3 --> F4
F4 --> F5
Input --> Durations
Input --> Channels
end
style Input fill:#e1f5ff
style Filters fill:#fff5e1
```
**FIR Benchmark Configuration** - Tests performance across duration and channel count dimensions.
### FIR Coefficient Pre-Computation
FIR filters require coefficient computation before filtering. The benchmark explicitly pre-computes coefficients to exclude design time from performance measurements:
```python
import torch.nn as nn
from torchfx import Wave
from torchfx.filter import DesignableFIR
SAMPLE_RATE = 44100
# Create filter chain
fchain = nn.Sequential(
DesignableFIR(num_taps=101, cutoff=1000, fs=SAMPLE_RATE),
DesignableFIR(num_taps=102, cutoff=5000, fs=SAMPLE_RATE),
DesignableFIR(num_taps=103, cutoff=1500, fs=SAMPLE_RATE),
DesignableFIR(num_taps=104, cutoff=1800, fs=SAMPLE_RATE),
DesignableFIR(num_taps=105, cutoff=1850, fs=SAMPLE_RATE),
)
# Pre-compute coefficients before timing
for f in fchain:
f.compute_coefficients()
# Now ready for benchmarking (coefficient design excluded)
```
**Why Pre-Compute?**
1. **Separation of Concerns**: Design time and filtering time are measured separately
2. **Realistic Use Case**: Coefficients are typically computed once and reused
3. **Fair Comparison**: SciPy baseline also pre-computes coefficients
```{important}
In production code, call {meth}`compute_coefficients()` once during initialization, then reuse the filter for processing multiple audio files.
```
### FIR Device Transfer Pattern
The benchmark demonstrates proper device management for GPU acceleration:
```python
import timeit
# GPU benchmarking
wave.to("cuda")
fchain.to("cuda")
gpu_time = timeit.timeit(lambda: wave | fchain, number=REP)
# CPU benchmarking (transfer back)
wave.to("cpu")
fchain.to("cpu")
cpu_time = timeit.timeit(lambda: wave | fchain, number=REP)
# Calculate speedup
speedup = cpu_time / gpu_time
print(f"GPU speedup: {speedup:.2f}x")
```
**Device Transfer Rules**:
1. Move both `Wave` and filter chain to the same device
2. Pre-compute coefficients **before** moving to GPU
3. Time only the filtering operation (exclude transfers)
4. Move back to CPU for result saving
```{seealso}
{doc}`gpu-acceleration` - Comprehensive guide to GPU device management
```
### FIR Performance Characteristics
FIR filters have distinct performance characteristics compared to IIR filters:
| Characteristic | FIR Filters | Reason |
|---------------|-------------|---------|
| **Computational Cost** | Higher | Convolution with many taps |
| **GPU Advantage** | Excellent | High parallelism in convolution |
| **Memory Footprint** | Larger | Must store all tap coefficients |
| **Scaling with Taps** | Linear O(N) | More taps = more multiply-accumulate operations |
| **Scaling with Channels** | Excellent | Independent per-channel convolution |
**When FIR Filters Excel**:
- **Long signals** (>60s): Amortizes setup overhead
- **Many channels** (≥4): Parallel convolution across channels
- **GPU available**: Convolution is highly parallel
- **Linear phase required**: Only FIR can provide linear phase
**When to Avoid FIR**:
- **Real-time processing**: IIR filters have lower latency
- **Limited memory**: FIR coefficients consume more memory
- **CPU-only, short signals**: IIR may be faster
```{tip}
For steep frequency responses, FIR filters require many taps (100+). Consider IIR filters if phase linearity is not critical. The sketch below shows how to measure the tap-count cost on your own hardware.
```
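The linear scaling with tap count is easy to confirm with a short sweep that follows the same pre-computation pattern as the benchmark suite (a sketch; the tap counts are arbitrary):
```python
import timeit
import numpy as np
from torchfx import Wave
from torchfx.filter import DesignableFIR

SAMPLE_RATE = 44100
REP = 20

# 30 seconds of stereo noise, same recipe as the benchmark signal
signal = np.random.randn(2, SAMPLE_RATE * 30).astype(np.float32)
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
wave = Wave(signal, SAMPLE_RATE)

for num_taps in (51, 101, 201, 401):
    fir = DesignableFIR(num_taps=num_taps, cutoff=1000, fs=SAMPLE_RATE)
    fir.compute_coefficients()  # exclude design time, as in the benchmarks
    avg = timeit.timeit(lambda: wave | fir, number=REP) / REP
    print(f"{num_taps:4d} taps: {avg:.6f}s per run")
```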
### SciPy FIR Baseline Implementation
The benchmark includes a SciPy baseline for CPU performance comparison:
```python
from scipy.signal import firwin, lfilter
# Design FIR coefficients using scipy
b1 = firwin(101, 1000, fs=SAMPLE_RATE)
b2 = firwin(102, 5000, fs=SAMPLE_RATE)
b3 = firwin(103, 1500, fs=SAMPLE_RATE)
b4 = firwin(104, 1800, fs=SAMPLE_RATE)
b5 = firwin(105, 1850, fs=SAMPLE_RATE)
# Apply filters sequentially
a = [1] # FIR filters have a = [1]
filtered = lfilter(b1, a, signal)
filtered = lfilter(b2, a, filtered)
filtered = lfilter(b3, a, filtered)
filtered = lfilter(b4, a, filtered)
filtered = lfilter(b5, a, filtered)
```
**SciPy Characteristics**:
- **CPU-only**: No GPU acceleration
- **Optimized**: Uses NumPy's optimized C/Fortran backend
- **Reference**: Establishes baseline CPU performance
- **Compatibility**: Requires NumPy arrays (not PyTorch tensors)
## IIR Filter Performance
The `iir_bench.py` benchmark evaluates IIR (Infinite Impulse Response) filter performance using the same test matrix as FIR benchmarks but with recursive filters.
### IIR Test Configuration
**Filter Chain**: 4 cascaded IIR filters
- HiButterworth: 1000 Hz cutoff, order 2
- LoButterworth: 5000 Hz cutoff, order 2
- HiChebyshev1: 1500 Hz cutoff, order 2
- LoChebyshev1: 1800 Hz cutoff, order 2
**Test Parameters**:
- **Durations**: 1s, 5s, 180s, 300s, 600s (10 minutes)
- **Channels**: 1, 2, 4, 8, 12
- **Backends**: GPU (CUDA), CPU (PyTorch), SciPy (NumPy)
- **Repetitions**: 50 per configuration
```{mermaid}
graph TB
subgraph "IIR Benchmark Architecture"
Input["Input Signal
Variable duration & channels"]
subgraph Chain["IIR Filter Chain"]
F1["HiButterworth
1000 Hz, order 2"]
F2["LoButterworth
5000 Hz, order 2"]
F3["HiChebyshev1
1500 Hz, order 2"]
F4["LoChebyshev1
1800 Hz, order 2"]
end
subgraph Setup["IIR-Specific Setup"]
Compute["compute_coefficients()
Design b, a coefficients"]
MoveCoeff["move_coeff('cuda'/'cpu')
Transfer to device"]
end
subgraph Backends["Execution Backends"]
GPU["GPU: fchain(wave.ys)"]
CPU["CPU: fchain(wave.ys)"]
SciPy["SciPy: lfilter(b, a, signal)"]
end
Input --> F1
F1 --> F2
F2 --> F3
F3 --> F4
F1 --> Compute
Compute --> MoveCoeff
MoveCoeff --> GPU
MoveCoeff --> CPU
F1 --> SciPy
end
style Input fill:#e1f5ff
style Chain fill:#fff5e1
style Setup fill:#e8f5e1
```
**IIR Benchmark Structure** - Shows coefficient management and execution backends.
### IIR Coefficient Management
Unlike FIR filters, IIR filters have both numerator (`b`) and denominator (`a`) coefficients that must be explicitly managed:
```python
import torch.nn as nn
from torchfx import Wave
from torchfx.filter import HiButterworth, LoButterworth
SAMPLE_RATE = 44100
# Create IIR filter chain
fchain = nn.Sequential(
HiButterworth(cutoff=1000, order=2, fs=SAMPLE_RATE),
LoButterworth(cutoff=5000, order=2, fs=SAMPLE_RATE),
)
# Move wave and module to GPU
wave.to("cuda")
fchain.to("cuda")
# IIR-specific: compute and move coefficients
for f in fchain:
f.compute_coefficients() # Design b, a coefficients
f.move_coeff("cuda") # Move coefficients to GPU
# Now ready for GPU processing
result = fchain(wave.ys)
```
**Two-Step Device Transfer**:
1. **Module transfer**: `fchain.to("cuda")` moves module parameters
2. **Coefficient transfer**: `f.move_coeff("cuda")` moves filter coefficients
```{warning}
For IIR filters, you must **both** move the module to the device **and** call {meth}`move_coeff()`. Forgetting the second step will cause runtime errors.
```
### IIR vs FIR Performance Trade-offs
IIR and FIR filters have fundamentally different performance characteristics:
| Aspect | IIR Filters | FIR Filters |
|--------|-------------|-------------|
| **Computational Cost** | Lower (fewer operations per sample) | Higher (convolution with many taps) |
| **Memory Footprint** | Small (few coefficients: b, a) | Large (many tap coefficients) |
| **GPU Advantage** | Moderate (less parallelism) | High (highly parallel convolution) |
| **Phase Response** | Non-linear | Can be linear (symmetric taps) |
| **Stability** | Can be unstable if poorly designed | Always stable |
| **Filter Order** | Achieves sharp cutoff with low order | Requires many taps for sharp cutoff |
**Performance Comparison Example**:
```python
# IIR: Order 8 Butterworth (18 coefficients total)
iir_filter = LoButterworth(cutoff=1000, order=8, fs=44100)
# Coefficients: b (9 values) + a (9 values) = 18 total
# Equivalent FIR: ~150+ taps for similar frequency response
fir_filter = DesignableFIR(num_taps=151, cutoff=1000, fs=44100)
# Coefficients: 151 tap values
# IIR is ~8x more memory-efficient and faster on CPU
# FIR has better GPU parallelism and linear phase
```
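The same point can be checked numerically with SciPy alone by comparing magnitude responses above the cutoff (a standalone sketch, independent of TorchFX; the 151-tap count is illustrative rather than a matched design):
```python
import numpy as np
from scipy.signal import butter, firwin, freqz

fs = 44100
b_iir, a_iir = butter(8, 1000, btype='low', fs=fs)  # 9 + 9 coefficients
b_fir = firwin(151, 1000, fs=fs)                    # 151 coefficients

w, h_iir = freqz(b_iir, a_iir, worN=2048, fs=fs)
_, h_fir = freqz(b_fir, [1.0], worN=2048, fs=fs)

# Attenuation one octave above the 1 kHz cutoff
idx = np.argmin(np.abs(w - 2000))
print(f"IIR (order 8):  {20 * np.log10(abs(h_iir[idx])):.1f} dB at 2 kHz")
print(f"FIR (151 taps): {20 * np.log10(abs(h_fir[idx])):.1f} dB at 2 kHz")
```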
**Choosing Between IIR and FIR**:
```{mermaid}
flowchart TD
Start["Choose Filter Type"]
LinearPhase{"Linear phase
required?"}
Stability{"Stability
critical?"}
GPU{"GPU
available?"}
Memory{"Memory
constrained?"}
UseFIR["Use FIR Filters
✓ Linear phase
✓ Always stable
✓ GPU-friendly"]
UseIIR["Use IIR Filters
✓ Low memory
✓ Low latency
✓ Efficient CPU"]
Start --> LinearPhase
LinearPhase -->|Yes| UseFIR
LinearPhase -->|No| Stability
Stability -->|Critical| UseFIR
Stability -->|Not critical| GPU
GPU -->|Yes, long signals| UseFIR
GPU -->|No or short signals| Memory
Memory -->|Yes| UseIIR
Memory -->|No| UseFIR
style UseFIR fill:#e1ffe1
style UseIIR fill:#e1f5ff
```
**Filter Type Selection Decision Tree** - Choose based on phase, stability, and resource constraints.
### IIR SciPy Baseline
The IIR benchmark includes SciPy baseline for comparison:
```python
from scipy.signal import butter, cheby1, lfilter
# Design IIR coefficients
b1, a1 = butter(2, 1000, btype='high', fs=SAMPLE_RATE)
b2, a2 = butter(2, 5000, btype='low', fs=SAMPLE_RATE)
b3, a3 = cheby1(2, 0.5, 1500, btype='high', fs=SAMPLE_RATE)
b4, a4 = cheby1(2, 0.5, 1800, btype='low', fs=SAMPLE_RATE)
# Apply filters sequentially
filtered = lfilter(b1, a1, signal)
filtered = lfilter(b2, a2, filtered)
filtered = lfilter(b3, a3, filtered)
filtered = lfilter(b4, a4, filtered)
```
**SciPy IIR Performance**:
- **CPU-optimized**: Highly optimized C implementation
- **No GPU**: SciPy doesn't support CUDA
- **Baseline**: Reference for CPU performance
- **Filter Design**: Uses standard signal processing algorithms
## Performance Optimization Guidelines
Based on the benchmarking results, follow these guidelines to optimize your TorchFX pipelines.
### When to Use GPU Acceleration
GPU acceleration provides the greatest benefit under specific conditions:
```{mermaid}
flowchart TD
Start["Audio Processing Task"]
CheckDuration{"Signal duration
> 60 seconds?"}
CheckChannels{"Channels ≥ 4?"}
CheckBatch{"Batch processing
multiple files?"}
CheckFIR{"Using FIR filters
with >100 taps?"}
CheckRealtime{"Real-time
low-latency requirement?"}
UseGPU["✓ Use GPU
wave.to('cuda')
fchain.to('cuda')"]
UseCPU["✓ Use CPU
Default or wave.to('cpu')"]
Start --> CheckDuration
CheckDuration -->|Yes| UseGPU
CheckDuration -->|No| CheckChannels
CheckChannels -->|Yes| UseGPU
CheckChannels -->|No| CheckBatch
CheckBatch -->|Yes| UseGPU
CheckBatch -->|No| CheckFIR
CheckFIR -->|Yes| UseGPU
CheckFIR -->|No| CheckRealtime
CheckRealtime -->|Yes| UseCPU
CheckRealtime -->|No| UseGPU
style UseGPU fill:#e1ffe1
style UseCPU fill:#e1f5ff
```
**GPU Decision Tree** - Follow this flowchart to determine optimal execution backend.
**GPU Performance Sweet Spot**:
| Factor | Threshold | Reasoning |
|--------|-----------|-----------|
| **Duration** | > 60 seconds | Amortizes data transfer overhead |
| **Channels** | ≥ 4 channels | Exploits parallel processing |
| **Batch Size** | > 5 files | Transfer overhead amortized across batch |
| **FIR Taps** | > 100 taps | Convolution highly parallelizable |
| **IIR Chain** | ≥ 3 filters | Accumulated compute benefits |
**CPU Preferred Cases**:
- **Real-time processing**: More predictable latency
- **Short signals** (<30s): Transfer overhead dominates
- **Single channel**: Insufficient parallelism
- **IIR filters only**: Less GPU benefit than FIR
```{tip}
When in doubt, benchmark your specific workload. Use the patterns from the benchmark suite as templates; a small helper that encodes the thresholds above is sketched below.
```
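If you want these thresholds applied up front, a small helper can pick the backend before any data is moved (a sketch; `choose_device` is a hypothetical helper built from the table above, not a TorchFX API, and the cut-offs are starting points rather than guarantees):
```python
import torch

def choose_device(duration_sec, num_channels, num_fir_taps=0, realtime=False):
    """Pick an execution backend from the rough thresholds in the table above."""
    if realtime or not torch.cuda.is_available():
        return "cpu"
    if duration_sec > 60 or num_channels >= 4 or num_fir_taps > 100:
        return "cuda"
    return "cpu"

# Example: a 3-minute stereo file through a 101-tap FIR chain
device = choose_device(duration_sec=180, num_channels=2, num_fir_taps=101)
print(device)  # "cuda" when a GPU is available
```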
### Filter Chain Optimization
Optimize filter chains by pre-computing coefficients and reusing filters:
```python
import torch
import torch.nn as nn
from torchfx import Wave
from torchfx.filter import DesignableFIR, HiButterworth
SAMPLE_RATE = 44100
# Create filter chain
fchain = nn.Sequential(
DesignableFIR(num_taps=101, cutoff=1000, fs=SAMPLE_RATE),
HiButterworth(cutoff=500, order=2, fs=SAMPLE_RATE),
)
# Pre-compute coefficients once during initialization
for f in fchain:
f.compute_coefficients()
# For IIR filters, also move coefficients to device
device = "cuda" if torch.cuda.is_available() else "cpu"
fchain.to(device)
for f in fchain:
if hasattr(f, 'move_coeff'):
f.move_coeff(device)
# Process multiple files without re-computing coefficients
audio_files = ["song1.wav", "song2.wav", "song3.wav"]
for audio_file in audio_files:
wave = Wave.from_file(audio_file).to(device)
result = wave | fchain # Uses cached coefficients
result.to("cpu").save(f"processed_{audio_file}")
```
**Optimization Benefits**:
1. **Coefficient caching**: Compute once, reuse for all files
2. **Device pinning**: Keep filters on GPU across iterations
3. **Batch amortization**: Setup cost amortized over multiple files
### Device Placement Strategy
Minimize device transfers by keeping processing on a single device:
```python
import torch
import torchfx as fx
# Strategy 1: Single device throughout (RECOMMENDED)
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load and move to device once
wave = fx.Wave.from_file("audio.wav").to(device)
# Create filter chain on same device
fchain = torch.nn.Sequential(
fx.filter.HiButterworth(cutoff=80, order=2),
fx.filter.LoButterworth(cutoff=12000, order=4),
).to(device)
# All operations on same device
result = wave | fchain
# Move to CPU only for final I/O
result.to("cpu").save("output.wav")
```
**Avoid Inefficient Transfers** (Anti-pattern):
```python
# ❌ WRONG: Unnecessary device transfers
wave = fx.Wave.from_file("audio.wav").to("cuda")
result1 = wave.to("cpu") | cpu_filter # Transfer 1
result2 = result1.to("cuda") | gpu_filter # Transfer 2
result3 = result2.to("cpu") | cpu_filter2 # Transfer 3
# ✅ CORRECT: Single device
device = "cuda" if torch.cuda.is_available() else "cpu"
wave = fx.Wave.from_file("audio.wav").to(device)
cpu_filter.to(device)
gpu_filter.to(device)
cpu_filter2.to(device)
result = wave | cpu_filter | gpu_filter | cpu_filter2
```
**Device Transfer Costs**:
- **CPU → GPU**: O(n) where n = number of samples
- **GPU → CPU**: O(n) where n = number of samples
- **Impact**: Transfer time can dominate total runtime for short signals; the sketch below shows how to measure it directly
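To see how much the round trip costs for your signal sizes, you can time it directly on a raw tensor shaped like your audio (a sketch using plain PyTorch tensors rather than `Wave`, so only the transfer itself is measured):
```python
import timeit
import torch

SAMPLE_RATE = 44100

def roundtrip_time(duration_sec, num_channels=2, rep=20):
    """Average CPU -> GPU -> CPU round-trip time for an audio-sized tensor."""
    x = torch.randn(num_channels, int(SAMPLE_RATE * duration_sec))

    def roundtrip():
        y = x.to("cuda")
        _ = y.to("cpu")
        torch.cuda.synchronize()  # make sure the copies have completed

    return timeit.timeit(roundtrip, number=rep) / rep

if torch.cuda.is_available():
    for dur in (1, 10, 60):
        print(f"{dur:3d}s signal: {roundtrip_time(dur) * 1e3:.2f} ms round trip")
```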
```{seealso}
{doc}`gpu-acceleration` - Comprehensive device management patterns
```
### Memory Management Best Practices
Optimize memory usage for large-scale processing:
| Optimization | Implementation | Impact |
|-------------|----------------|---------|
| **In-place operations** | Use effects that modify tensors in-place where possible | Reduces memory allocations |
| **Chunked processing** | Process long audio in chunks | Prevents GPU OOM errors |
| **Coefficient caching** | Pre-compute and reuse filter coefficients | Eliminates redundant computation |
| **Device pinning** | Keep frequently-used filters on device | Reduces transfer overhead |
| **Batch size tuning** | Adjust batch size to fit GPU memory | Maximizes throughput |
**Memory-Efficient Chunked Processing Example**:
```python
import torch
import torchfx as fx
def process_long_audio_chunked(wave, fchain, chunk_duration=60):
"""Process very long audio in chunks to manage GPU memory.
Parameters
----------
wave : Wave
Input audio (can be on CPU or GPU)
fchain : nn.Module
Filter chain (moved to the processing device inside this function)
chunk_duration : float
Chunk duration in seconds
Returns
-------
Wave
Processed audio
"""
chunk_samples = int(chunk_duration * wave.fs)
num_chunks = (wave.ys.size(-1) + chunk_samples - 1) // chunk_samples
device = "cuda" if torch.cuda.is_available() else "cpu"
fchain.to(device)
results = []
for i in range(num_chunks):
start = i * chunk_samples
end = min((i + 1) * chunk_samples, wave.ys.size(-1))
# Extract chunk and move to GPU
chunk = fx.Wave(wave.ys[..., start:end], wave.fs).to(device)
# Process chunk on GPU
processed_chunk = chunk | fchain
# Move back to CPU and store
results.append(processed_chunk.ys.cpu())
# Clear GPU cache
if device == "cuda":
torch.cuda.empty_cache()
# Concatenate results
return fx.Wave(torch.cat(results, dim=-1), wave.fs)
# Usage
wave = fx.Wave.from_file("10_hour_recording.wav")
fchain = fx.filter.LoButterworth(cutoff=1000, order=4)
result = process_long_audio_chunked(wave, fchain, chunk_duration=60)
result.save("processed.wav")
```
**Chunked Processing Benefits**:
- Processes arbitrarily long audio without OOM errors
- Keeps GPU utilization high
- Balances memory usage with throughput
- Caveat: naive chunking resets filter state at each boundary, which can introduce small discontinuities at chunk edges; if this matters, use chunks much longer than the filter's impulse response or overlap adjacent chunks
## Benchmarking Your Own Pipelines
Use these patterns to benchmark your custom TorchFX pipelines.
### Basic Benchmarking Template
```python
import timeit
import torch
import torchfx as fx
import numpy as np
# Configuration
SAMPLE_RATE = 44100
DURATION = 60 # seconds
NUM_CHANNELS = 4
REP = 50 # repetitions for timing
# Generate test signal
signal = np.random.randn(NUM_CHANNELS, int(SAMPLE_RATE * DURATION))
signal = signal.astype(np.float32)
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
wave = fx.Wave(signal, SAMPLE_RATE)
# Create your processing pipeline
pipeline = torch.nn.Sequential(
fx.filter.HiButterworth(cutoff=100, order=2),
fx.filter.LoButterworth(cutoff=10000, order=4),
fx.effect.Normalize(peak=0.9),
)
# Pre-compute coefficients
for module in pipeline:
if hasattr(module, 'compute_coefficients'):
module.compute_coefficients()
# Benchmark CPU
wave.to("cpu")
pipeline.to("cpu")
cpu_time = timeit.timeit(lambda: wave | pipeline, number=REP)
# Benchmark GPU (if available)
if torch.cuda.is_available():
wave.to("cuda")
pipeline.to("cuda")
# Move IIR coefficients if needed
for module in pipeline:
if hasattr(module, 'move_coeff'):
module.move_coeff("cuda")
gpu_time = timeit.timeit(lambda: wave | pipeline, number=REP)
print(f"CPU time: {cpu_time/REP:.6f}s")
print(f"GPU time: {gpu_time/REP:.6f}s")
print(f"Speedup: {cpu_time/gpu_time:.2f}x")
else:
print(f"CPU time: {cpu_time/REP:.6f}s")
print("GPU not available")
```
### Multi-Configuration Benchmark
Test performance across multiple configurations:
```python
import timeit
import torch
import torchfx as fx
import numpy as np
import pandas as pd
SAMPLE_RATE = 44100
REP = 50
# Test configurations
durations = [5, 30, 60, 120, 300] # seconds
channel_counts = [1, 2, 4, 8]
# Create filter chain
pipeline = torch.nn.Sequential(
fx.filter.LoButterworth(cutoff=1000, order=4),
fx.filter.HiButterworth(cutoff=100, order=2),
)
# Pre-compute coefficients
for module in pipeline:
if hasattr(module, 'compute_coefficients'):
module.compute_coefficients()
# Benchmark grid
results = []
for duration in durations:
for channels in channel_counts:
# Generate test signal
signal = np.random.randn(channels, int(SAMPLE_RATE * duration))
signal = signal.astype(np.float32)
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
wave = fx.Wave(signal, SAMPLE_RATE)
        # CPU benchmark
        wave.to("cpu")
        pipeline.to("cpu")
        for module in pipeline:
            if hasattr(module, 'move_coeff'):
                module.move_coeff("cpu")  # keep IIR coefficients on the active device
        cpu_time = timeit.timeit(lambda: wave | pipeline, number=REP) / REP
        # GPU benchmark
        if torch.cuda.is_available():
            wave.to("cuda")
            pipeline.to("cuda")
            for module in pipeline:
                if hasattr(module, 'move_coeff'):
                    module.move_coeff("cuda")
            gpu_time = timeit.timeit(lambda: wave | pipeline, number=REP) / REP
            speedup = cpu_time / gpu_time
else:
gpu_time = None
speedup = None
results.append({
'duration': duration,
'channels': channels,
'cpu_time': cpu_time,
'gpu_time': gpu_time,
'speedup': speedup
})
# Convert to DataFrame for analysis
df = pd.DataFrame(results)
print(df.to_string(index=False))
# Save to CSV
df.to_csv("benchmark_results.csv", index=False)
```
### Profiling with PyTorch Profiler
For detailed performance analysis, use PyTorch's built-in profiler:
```python
import torch
import torchfx as fx
# Create pipeline
wave = fx.Wave.from_file("audio.wav").to("cuda")
pipeline = torch.nn.Sequential(
fx.filter.LoButterworth(cutoff=1000, order=4),
fx.filter.HiButterworth(cutoff=100, order=2),
).to("cuda")
# Profile the pipeline
with torch.profiler.profile(
activities=[
torch.profiler.ProfilerActivity.CPU,
torch.profiler.ProfilerActivity.CUDA,
],
record_shapes=True,
profile_memory=True,
with_stack=True
) as prof:
result = wave | pipeline
# Print profiling results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Export for visualization
prof.export_chrome_trace("trace.json")
# Open trace.json in chrome://tracing for detailed visualization
```
```{seealso}
[PyTorch Profiler Documentation](https://pytorch.org/docs/stable/profiler.html) - Official guide to PyTorch profiling tools
```
## Complete Benchmarking Examples
These complete examples demonstrate how to run comprehensive benchmarks for your specific use cases.
### Example 1: API Pattern Comparison
Compare different API patterns for your filter chain:
```python
import timeit
import numpy as np
from torch import nn
from torchfx import Wave
from torchfx.filter import HiChebyshev1, LoButterworth
SAMPLE_RATE = 44100
DURATION = 120 # 2 minutes
NUM_CHANNELS = 8
REP = 50
# Generate test signal
signal_data = np.random.randn(NUM_CHANNELS, int(SAMPLE_RATE * DURATION))
signal_data = signal_data.astype(np.float32)
signal_data /= np.max(np.abs(signal_data), axis=1, keepdims=True)
wave = Wave(signal_data, SAMPLE_RATE)
# Pattern 1: Custom nn.Module class
class FilterChain(nn.Module):
def __init__(self, fs):
super().__init__()
self.f1 = HiChebyshev1(20, fs=fs)
self.f2 = LoButterworth(5000, fs=fs)
def forward(self, x):
x = self.f1(x)
x = self.f2(x)
return x
def test_class():
fchain = FilterChain(wave.fs)
return fchain(wave.ys)
# Pattern 2: nn.Sequential
def test_sequential():
fchain = nn.Sequential(
HiChebyshev1(20, fs=wave.fs),
LoButterworth(5000, fs=wave.fs),
)
return fchain(wave.ys)
# Pattern 3: Pipe operator
def test_pipe():
return wave | HiChebyshev1(20) | LoButterworth(5000)
# Benchmark each pattern
class_time = timeit.timeit(test_class, number=REP)
seq_time = timeit.timeit(test_sequential, number=REP)
pipe_time = timeit.timeit(test_pipe, number=REP)
print(f"Custom class: {class_time/REP:.6f}s")
print(f"nn.Sequential: {seq_time/REP:.6f}s")
print(f"Pipe operator: {pipe_time/REP:.6f}s")
```
### Example 2: FIR Filter Performance Analysis
Comprehensive FIR filter benchmarking across durations and channel counts:
```python
import timeit
import numpy as np
import torch.nn as nn
import pandas as pd
from torchfx import Wave
from torchfx.filter import DesignableFIR
SAMPLE_RATE = 44100
REP = 50
# Test matrix
durations = [5, 60, 180, 300, 600]
channel_counts = [1, 2, 4, 8, 12]
results = []
for duration in durations:
for channels in channel_counts:
# Generate test signal
signal = np.random.randn(channels, int(SAMPLE_RATE * duration))
signal = signal.astype(np.float32)
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
wave = Wave(signal, SAMPLE_RATE)
# Create FIR filter chain
fchain = nn.Sequential(
DesignableFIR(num_taps=101, cutoff=1000, fs=SAMPLE_RATE),
DesignableFIR(num_taps=102, cutoff=5000, fs=SAMPLE_RATE),
DesignableFIR(num_taps=103, cutoff=1500, fs=SAMPLE_RATE),
)
# Pre-compute coefficients
for f in fchain:
f.compute_coefficients()
# GPU benchmark
wave.to("cuda")
fchain.to("cuda")
gpu_time = timeit.timeit(lambda: wave | fchain, number=REP) / REP
# CPU benchmark
wave.to("cpu")
fchain.to("cpu")
cpu_time = timeit.timeit(lambda: wave | fchain, number=REP) / REP
results.append({
'duration_sec': duration,
'channels': channels,
'gpu_time_sec': gpu_time,
'cpu_time_sec': cpu_time,
'speedup': cpu_time / gpu_time
})
print(f"Duration: {duration}s, Channels: {channels}, "
f"GPU: {gpu_time:.6f}s, CPU: {cpu_time:.6f}s, "
f"Speedup: {cpu_time/gpu_time:.2f}x")
# Save results
df = pd.DataFrame(results)
df.to_csv("fir_benchmark.csv", index=False)
print("\nResults saved to fir_benchmark.csv")
```
### Example 3: IIR Filter Performance Analysis
Complete IIR filter benchmarking with coefficient management:
```python
import timeit
import numpy as np
import torch.nn as nn
import pandas as pd
from torchfx import Wave
from torchfx.filter import HiButterworth, LoButterworth, HiChebyshev1, LoChebyshev1
SAMPLE_RATE = 44100
REP = 50
# Test matrix
durations = [1, 5, 180, 300, 600]
channel_counts = [1, 2, 4, 8, 12]
results = []
for duration in durations:
for channels in channel_counts:
# Generate test signal
signal = np.random.randn(channels, int(SAMPLE_RATE * duration))
signal = signal.astype(np.float32)
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
wave = Wave(signal, SAMPLE_RATE)
# Create IIR filter chain
fchain = nn.Sequential(
HiButterworth(cutoff=1000, order=2, fs=SAMPLE_RATE),
LoButterworth(cutoff=5000, order=2, fs=SAMPLE_RATE),
HiChebyshev1(cutoff=1500, order=2, fs=SAMPLE_RATE),
LoChebyshev1(cutoff=1800, order=2, fs=SAMPLE_RATE),
)
# GPU benchmark
wave.to("cuda")
fchain.to("cuda")
# Compute and move coefficients
for f in fchain:
f.compute_coefficients()
f.move_coeff("cuda")
gpu_time = timeit.timeit(
lambda: fchain(wave.ys),
number=REP
) / REP
# CPU benchmark
wave.to("cpu")
fchain.to("cpu")
for f in fchain:
f.move_coeff("cpu")
cpu_time = timeit.timeit(
lambda: fchain(wave.ys),
number=REP
) / REP
results.append({
'duration_sec': duration,
'channels': channels,
'gpu_time_sec': gpu_time,
'cpu_time_sec': cpu_time,
'speedup': cpu_time / gpu_time
})
print(f"Duration: {duration}s, Channels: {channels}, "
f"GPU: {gpu_time:.6f}s, CPU: {cpu_time:.6f}s, "
f"Speedup: {cpu_time/gpu_time:.2f}x")
# Save results
df = pd.DataFrame(results)
df.to_csv("iir_benchmark.csv", index=False)
print("\nResults saved to iir_benchmark.csv")
```
## Summary
Key takeaways for optimizing TorchFX performance:
1. **GPU Acceleration**: Use GPU for long signals (>60s), multi-channel audio (≥4 channels), and batch processing
2. **Filter Choice**: FIR filters excel on GPU with parallel convolution; IIR filters are more CPU-efficient
3. **API Pattern**: Pipeline operator provides best ergonomics with automatic sample rate configuration and minimal overhead
4. **Coefficient Caching**: Pre-compute filter coefficients once and reuse for multiple files
5. **Device Management**: Minimize transfers by keeping all processing on one device
6. **Memory**: Use chunked processing for very long audio files to prevent OOM errors
7. **Benchmarking**: Use the provided templates to measure performance of your specific pipelines
GPU acceleration can provide 5-20x speedups for appropriate workloads. Follow the decision trees and best practices in this guide to maximize throughput in your audio processing pipelines.
## Related Guides
- {doc}`gpu-acceleration` - Comprehensive GPU device management guide
- {doc}`../filters/iir-filters` - IIR filter design and usage
- {doc}`../filters/fir-filters` - FIR filter design and usage
- {doc}`pytorch-integration` - Integration with PyTorch ecosystem
- {doc}`multi-channel` - Multi-channel processing patterns
## External Resources
- [PyTorch Profiler](https://pytorch.org/docs/stable/profiler.html) - Profiling PyTorch code
- [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/) - NVIDIA optimization guide
- [SciPy Signal Processing](https://docs.scipy.org/doc/scipy/reference/signal.html) - SciPy signal processing reference
- [PyTorch Performance Tuning](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html) - PyTorch optimization guide