Benchmarking#
A comprehensive guide to the TorchFX benchmarking suite for measuring and comparing the performance of audio processing operations.
Overview#
The TorchFX benchmarking suite evaluates performance across three key dimensions:
- API patterns: Comparing different usage patterns (FilterChain, Sequential, pipe operator)
- FIR filter performance: GPU vs CPU vs SciPy implementations
- IIR filter performance: GPU vs CPU vs SciPy implementations
All benchmarks compare TorchFX implementations against SciPy baselines to validate performance characteristics and identify optimization opportunities.
See also

- Testing - Testing infrastructure
- Performance Optimization and Benchmarking - Performance optimization guide
- GPU Acceleration - GPU acceleration usage
Benchmark Suite Structure#
The benchmarking suite consists of three independent scripts:
| Script | Purpose | Comparisons | Output File |
|---|---|---|---|
| `api_bench.py` | API pattern comparison | FilterChain, Sequential, pipe operator, SciPy | `api_bench.out` |
| `fir_bench.py` | FIR filter performance | GPU vs CPU vs SciPy across varying durations and channels | `fir.out` |
| `iir_bench.py` | IIR filter performance | GPU vs CPU vs SciPy across varying durations and channels | `iir.out` |
All benchmarks use Python’s timeit module for precise timing measurements and output results in CSV format for analysis and visualization.
Benchmark Architecture#
```mermaid
graph TB
    subgraph "Benchmark Scripts"
        API["api_bench.py<br/>API pattern comparison"]
        FIR["fir_bench.py<br/>FIR filter performance"]
        IIR["iir_bench.py<br/>IIR filter performance"]
    end
    subgraph "Test Signal Generation"
        CreateAudio["create_audio()<br/>np.random.randn()<br/>Normalized to [-1, 1]"]
    end
    subgraph "Implementations Under Test"
        TorchFX_GPU["torchfx on CUDA<br/>Wave.to('cuda')<br/>filter.to('cuda')"]
        TorchFX_CPU["torchfx on CPU<br/>Wave.to('cpu')<br/>filter.to('cpu')"]
        SciPy_Baseline["SciPy baseline<br/>scipy.signal.lfilter()"]
    end
    subgraph "Timing Infrastructure"
        TimeitModule["timeit.timeit()<br/>REP=50 repetitions"]
    end
    subgraph "Output"
        CSV["CSV files:<br/>api_bench.out<br/>fir.out<br/>iir.out"]
        Visualization["draw3.py<br/>Generates PNG plots"]
    end
    API --> CreateAudio
    FIR --> CreateAudio
    IIR --> CreateAudio
    CreateAudio --> TorchFX_GPU
    CreateAudio --> TorchFX_CPU
    CreateAudio --> SciPy_Baseline
    TorchFX_GPU --> TimeitModule
    TorchFX_CPU --> TimeitModule
    SciPy_Baseline --> TimeitModule
    TimeitModule --> CSV
    CSV --> Visualization
```
Common Infrastructure#
All benchmark scripts share common infrastructure for test signal generation and timing measurement.
Test Signal Generation#
Each benchmark uses the create_audio() function to generate synthetic test signals:
```python
def create_audio(duration, num_channels):
    """Create random audio signal for testing.

    Parameters
    ----------
    duration : int
        Duration in seconds
    num_channels : int
        Number of audio channels

    Returns
    -------
    np.ndarray
        Audio signal with shape (num_channels, samples)
    """
    samples = int(duration * SAMPLE_RATE)
    audio = np.random.randn(num_channels, samples)
    return audio / np.max(np.abs(audio))  # Normalize to [-1, 1]
```
Normalization: Signals are normalized to the range [-1, 1] to simulate realistic audio levels.
Timing Methodology#
All benchmarks use Python’s timeit.timeit() function with consistent parameters:
```python
REP = 50  # Number of repetitions

# Measure execution time
time = timeit.timeit(lambda: function_under_test(), number=REP)
average_time = time / REP
```
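One caveat for the GPU measurements (an addition here, not something the scripts are shown doing): CUDA kernels launch asynchronously, so a plain `timeit` call can measure kernel launch time rather than completion time. Synchronizing inside the timed function gives honest wall-clock numbers:

```python
import timeit

import torch

def run():
    function_under_test()
    # Wait for queued CUDA kernels to finish before the timer stops
    if torch.cuda.is_available():
        torch.cuda.synchronize()

time = timeit.timeit(run, number=REP)
```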
Why 50 repetitions?
- Provides stable averages by reducing variance
- Balances accuracy with total benchmark runtime
- Minimizes impact of system noise and cache effects
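If averages still look noisy, the standard library's `timeit.repeat()` can report the spread across several batches. A minimal sketch, not part of the current scripts:

```python
import timeit

# Run 5 batches of REP calls; the minimum batch average is the one
# least affected by background system noise.
times = timeit.repeat(lambda: function_under_test(), number=REP, repeat=5)
best_avg = min(times) / REP
print(f"Best average over {REP} calls: {best_avg:.4f}s")
```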
Standard Parameters#
| Parameter | Value | Description |
|---|---|---|
| `SAMPLE_RATE` | 44100 Hz | Standard CD-quality sampling rate |
| `REP` | 50 | Number of timing repetitions for averaging |
| `duration` | Varies | Audio duration in seconds (benchmark-specific) |
| `num_channels` | Varies | Number of audio channels (benchmark-specific) |
API Benchmark#
The API benchmark (benchmark/api_bench.py) compares different approaches to chaining filters, evaluating both ergonomics and performance.
Tested Implementations#
```mermaid
graph LR
    Signal["Wave object<br/>8 channels<br/>120 seconds"]
    subgraph "Four API Patterns"
        Method1["FilterChain class<br/>nn.Module subclass<br/>explicit forward()"]
        Method2["Sequential<br/>torch.nn.Sequential<br/>functional composition"]
        Method3["Pipe operator<br/>wave | filter1 | filter2"]
        Method4["SciPy baseline<br/>scipy.signal.lfilter()"]
    end
    Output["Filtered signal"]
    Signal --> Method1
    Signal --> Method2
    Signal --> Method3
    Signal --> Method4
    Method1 --> Output
    Method2 --> Output
    Method3 --> Output
    Method4 --> Output
```
Implementation Patterns#
FilterChain Class Pattern#
Traditional PyTorch module composition with explicit forward() method:
```python
from torch import nn

class FilterChain(nn.Module):
    """Custom filter chain implementation."""

    def __init__(self, filters):
        super().__init__()
        self.filters = nn.ModuleList(filters)

    def forward(self, x):
        for f in self.filters:
            x = f(x)
        return x

# Usage
chain = FilterChain([filter1, filter2, filter3])
output = chain(wave.ys)
```
Characteristics:

- Explicit control over execution
- Standard PyTorch pattern
- Requires boilerplate code
Sequential Pattern#
PyTorch’s built-in sequential container:
```python
from torch import nn

# Create sequential chain
chain = nn.Sequential(filter1, filter2, filter3)

# Apply to audio
output = chain(wave.ys)
```
Characteristics:

- Built-in PyTorch functionality
- Minimal boilerplate
- Standard functional composition
Pipe Operator Pattern#
TorchFX’s idiomatic API with automatic configuration:
```python
# Chain filters using pipe operator
output = wave | filter1 | filter2 | filter3
```
Characteristics:

- Most ergonomic syntax
- Automatic sample rate configuration
- Pythonic and readable
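For context, the pattern end to end might look like the following. This is a sketch only: the import paths, the `Wave.from_file()` constructor, and the filter arguments are assumptions about the TorchFX API, not code taken from the benchmark.

```python
import torch
from torchfx import Wave                                  # assumed import path
from torchfx.filter import HiButterworth, LoButterworth   # assumed import path

wave = Wave.from_file("input.wav")   # assumed constructor
wave.to(torch.device("cuda"))        # optional: run the chain on GPU

# The pipe propagates the wave's sample rate to each filter automatically
output = wave | HiButterworth(cutoff=60) | LoButterworth(cutoff=5000)
```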
SciPy Baseline#
Pure NumPy/SciPy implementation for comparison:
```python
import scipy.signal as signal

# Design filter coefficients
b1, a1 = signal.butter(N=order, Wn=cutoff, btype='high', fs=fs)

# Apply filter
output = signal.lfilter(b1, a1, audio)
```
Characteristics:

- CPU-only implementation
- No PyTorch overhead
- Industry-standard baseline
Filter Configuration#
All patterns apply the same six filters in series:
| Filter | Type | Cutoff Frequency | Purpose |
|---|---|---|---|
| HiChebyshev1 | High-pass | 20 Hz | Remove subsonic content |
| HiChebyshev1 | High-pass | 60 Hz | Remove hum |
| HiChebyshev1 | High-pass | 65 Hz | Additional hum removal |
| LoButterworth | Low-pass | 5000 Hz | Anti-aliasing |
| LoButterworth | Low-pass | 4900 Hz | Transition band shaping |
| LoButterworth | Low-pass | 4850 Hz | Final rolloff |
Test Parameters#
- Duration: 120 seconds (2 minutes)
- Channels: 8
- Sample Rate: 44100 Hz
- Repetitions: 50
Output Format#
CSV with the following structure:
```
filter_chain,sequential,pipe,scipy
<class_time>,<seq_time>,<pipe_time>,<scipy_time>
```
Each time value represents average execution time in seconds.
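Since the output is a single-row CSV, a few lines of standard-library Python are enough to load it and express each pattern relative to the SciPy baseline. A sketch; the column names come from the format above:

```python
import csv

with open("api_bench.out", newline="") as fh:
    row = next(csv.DictReader(fh))

scipy_time = float(row["scipy"])
for pattern in ("filter_chain", "sequential", "pipe"):
    ratio = float(row[pattern]) / scipy_time
    print(f"{pattern}: {ratio:.2f}x of SciPy time")
```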
Running API Benchmark#
```bash
python benchmark/api_bench.py
```
Expected output:
```
API Benchmark
Duration: 120s, Channels: 8, Sample Rate: 44100Hz
Repetitions: 50

FilterChain: 1.234 seconds
Sequential: 1.235 seconds
Pipe: 1.236 seconds
SciPy: 1.450 seconds

Results saved to api_bench.out
```
FIR Filter Benchmark#
The FIR filter benchmark (benchmark/fir_bench.py) evaluates FIR filter performance across different audio durations and channel counts.
Test Matrix#
The benchmark tests across two dimensions:
| Dimension | Values |
|---|---|
| Durations | 5, 60, 180, 300, 600 seconds |
| Channels | 1, 2, 4, 8, 12 |
Total test cases: 5 durations × 5 channel counts = 25 data points
```mermaid
graph TB
    subgraph "Test Variables"
        Durations["Durations (seconds)<br/>5, 60, 180, 300, 600"]
        Channels["Channels<br/>1, 2, 4, 8, 12"]
    end
    subgraph "Filter Chain"
        F1["DesignableFIR<br/>101 taps, 1000 Hz"]
        F2["DesignableFIR<br/>102 taps, 5000 Hz"]
        F3["DesignableFIR<br/>103 taps, 1500 Hz"]
        F4["DesignableFIR<br/>104 taps, 1800 Hz"]
        F5["DesignableFIR<br/>105 taps, 1850 Hz"]
    end
    subgraph "Implementations"
        GPU["GPU Implementation<br/>wave.to('cuda')<br/>fchain.to('cuda')"]
        CPU["CPU Implementation<br/>wave.to('cpu')<br/>fchain.to('cpu')"]
        SciPy["SciPy Implementation<br/>scipy.signal.firwin()<br/>scipy.signal.lfilter()"]
    end
    Durations --> F1
    Channels --> F1
    F1 --> F2
    F2 --> F3
    F3 --> F4
    F4 --> F5
    F5 --> GPU
    F5 --> CPU
    F5 --> SciPy
```
Filter Configuration#
The benchmark applies five DesignableFIR filters in series:
```python
# Create filter chain
fchain = nn.Sequential(
    DesignableFIR(numtaps=101, cutoff=1000, fs=44100),
    DesignableFIR(numtaps=102, cutoff=5000, fs=44100),
    DesignableFIR(numtaps=103, cutoff=1500, fs=44100),
    DesignableFIR(numtaps=104, cutoff=1800, fs=44100),
    DesignableFIR(numtaps=105, cutoff=1850, fs=44100),
)

# Pre-compute coefficients (excluded from timing)
for f in fchain:
    f.compute_coefficients()
```
Important: Filter coefficients are pre-computed before timing to measure only filtering performance, not coefficient design.
Implementation Functions#
GPU FIR Function#
```python
def gpu_fir(wave):
    """Apply FIR filter chain on GPU."""
    return (wave | fchain).ys
```
Applies the filter chain with the pipe operator; the wave and the chain are moved to the GPU before this function is timed.
CPU FIR Function#
```python
def cpu_fir(wave):
    """Apply FIR filter chain on CPU."""
    return (wave | fchain).ys
```
Identical body; the wave and the chain are moved back to the CPU before this function is timed.
SciPy FIR Function#
```python
def scipy_fir(audio):
    """Apply FIR filters using SciPy."""
    for f in fchain:
        b = f.coefficients.cpu().numpy()
        audio = signal.lfilter(b, [1.0], audio)
    return audio
```
Uses scipy.signal.lfilter() for baseline comparison.
Test Execution Flow#
For each combination of duration and channel count:
1. Generate a test signal with `create_audio(duration, channels)`
2. Create a `Wave` object from the signal
3. Build the filter chain with `nn.Sequential`
4. Pre-compute all filter coefficients
5. Move to GPU, time GPU execution
6. Move to CPU, time CPU execution
7. Design SciPy coefficients, time SciPy execution
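Put together, the flow looks roughly like this. A sketch only: `DURATIONS`, `CHANNELS`, and the `Wave` constructor signature are assumptions, not the script verbatim.

```python
DURATIONS = [5, 60, 180, 300, 600]
CHANNELS = [1, 2, 4, 8, 12]

results = []
for duration in DURATIONS:
    for channels in CHANNELS:
        audio = create_audio(duration, channels)
        wave = Wave(audio, fs=SAMPLE_RATE)  # assumed constructor signature

        for f in fchain:
            f.compute_coefficients()  # design excluded from timing

        # GPU pass
        wave.to("cuda")
        fchain.to("cuda")
        gpu_t = timeit.timeit(lambda: gpu_fir(wave), number=REP) / REP

        # CPU pass
        wave.to("cpu")
        fchain.to("cpu")
        cpu_t = timeit.timeit(lambda: cpu_fir(wave), number=REP) / REP

        # SciPy baseline on the raw NumPy array
        scipy_t = timeit.timeit(lambda: scipy_fir(audio), number=REP) / REP

        results.append((duration, channels, gpu_t, cpu_t, scipy_t))
```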
Output Format#
CSV with the following structure:
```
time,channels,gpu,cpu,scipy
5,1,0.012,0.015,0.018
5,2,0.013,0.016,0.020
...
600,12,1.234,1.567,1.890
```
Running FIR Benchmark#
```bash
python benchmark/fir_bench.py
```
Expected output:
```
FIR Filter Benchmark
Sample Rate: 44100Hz
Repetitions: 50

Testing: 5s, 1 channel...
GPU: 0.012s
CPU: 0.015s
SciPy: 0.018s

Testing: 5s, 2 channels...
GPU: 0.013s
CPU: 0.016s
SciPy: 0.020s

...
Results saved to fir.out
```
IIR Filter Benchmark#
The IIR filter benchmark (benchmark/iir_bench.py) evaluates IIR filter performance with similar methodology to the FIR benchmark.
Test Matrix#
| Dimension | Values |
|---|---|
| Durations | 1, 5, 180, 300, 600 seconds |
| Channels | 1, 2, 4, 8, 12 |
Total test cases: 5 durations × 5 channel counts = 25 data points
Filter Configuration#
The benchmark applies four IIR filters in series:
```python
fchain = nn.Sequential(
    HiButterworth(cutoff=1000, order=2, fs=44100),
    LoButterworth(cutoff=5000, order=2, fs=44100),
    HiChebyshev1(cutoff=1500, order=2, ripple=0.5, fs=44100),
    LoChebyshev1(cutoff=1800, order=2, ripple=0.5, fs=44100),
)
```
| Filter | Type | Cutoff | Order | Purpose |
|---|---|---|---|---|
| HiButterworth | High-pass | 1000 Hz | 2 | Remove low frequencies |
| LoButterworth | Low-pass | 5000 Hz | 2 | Remove high frequencies |
| HiChebyshev1 | High-pass | 1500 Hz | 2 | Additional high-pass |
| LoChebyshev1 | Low-pass | 1800 Hz | 2 | Additional low-pass |
Implementation Details#
```mermaid
graph TB
    subgraph "GPU Execution Path"
        GPU_Wave["Wave.to('cuda')"]
        GPU_Chain["fchain.to('cuda')"]
        GPU_Coeff["f.compute_coefficients()<br/>f.move_coeff('cuda')<br/>for each filter"]
        GPU_Execute["fchain(wave.ys)"]
        GPU_Wave --> GPU_Chain
        GPU_Chain --> GPU_Coeff
        GPU_Coeff --> GPU_Execute
    end
    subgraph "CPU Execution Path"
        CPU_Wave["Wave.to('cpu')"]
        CPU_Chain["fchain.to('cpu')"]
        CPU_Coeff["f.move_coeff('cpu')<br/>for each filter"]
        CPU_Execute["fchain(wave.ys)"]
        CPU_Wave --> CPU_Chain
        CPU_Chain --> CPU_Coeff
        CPU_Coeff --> CPU_Execute
    end
    subgraph "SciPy Execution Path"
        SciPy_Design["butter() / cheby1()<br/>Design filter coefficients"]
        SciPy_Filter["lfilter()<br/>Apply filters"]
        SciPy_Design --> SciPy_Filter
    end
```
GPU Filter Function#
```python
def gpu_iir(wave):
    """Apply IIR filter chain on GPU."""
    # CRITICAL: Move both module and coefficients to GPU
    for f in fchain:
        f.compute_coefficients()
        f.move_coeff("cuda")
    return (wave | fchain).ys
```
Important: IIR filters require explicit coefficient movement to GPU.
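Since this device bookkeeping recurs for every filter in the chain, a small helper can keep it in one place. Illustrative only, not part of TorchFX; it uses only the `compute_coefficients()` and `move_coeff()` calls shown above:

```python
def move_chain(fchain, device):
    """Move a chain's module parameters and designed coefficients to device."""
    fchain.to(device)
    for f in fchain:
        f.compute_coefficients()
        f.move_coeff(device)
    return fchain
```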
CPU Filter Function#
```python
def cpu_iir(wave):
    """Apply IIR filter chain on CPU."""
    # Move coefficients back to CPU
    for f in fchain:
        f.move_coeff("cpu")
    return (wave | fchain).ys
```
SciPy Filter Function#
```python
def scipy_iir(audio):
    """Apply IIR filters using SciPy."""
    # Design coefficients for the same four filters as the TorchFX chain
    b1, a1 = signal.butter(N=2, Wn=1000, btype='high', fs=44100)
    b2, a2 = signal.butter(N=2, Wn=5000, btype='low', fs=44100)
    b3, a3 = signal.cheby1(N=2, rp=0.5, Wn=1500, btype='high', fs=44100)
    b4, a4 = signal.cheby1(N=2, rp=0.5, Wn=1800, btype='low', fs=44100)
    # Apply filters sequentially
    for b, a in ((b1, a1), (b2, a2), (b3, a3), (b4, a4)):
        audio = signal.lfilter(b, a, audio)
    return audio
```
Output Format#
CSV with the following structure:
```
time,channels,gpu,cpu,scipy
1,1,0.005,0.008,0.010
1,2,0.006,0.009,0.012
...
600,12,0.987,1.234,1.567
```
Running IIR Benchmark#
```bash
python benchmark/iir_bench.py
```
Expected output:
```
IIR Filter Benchmark
Sample Rate: 44100Hz
Repetitions: 50

Testing: 1s, 1 channel...
GPU: 0.005s
CPU: 0.008s
SciPy: 0.010s

Testing: 1s, 2 channels...
GPU: 0.006s
CPU: 0.009s
SciPy: 0.012s

...
Results saved to iir.out
```
Interpreting Results#
Performance Metrics#
All timing values are reported in seconds, representing average execution time over 50 repetitions. Lower values indicate better performance.
Expected Performance Characteristics#
| Scenario | Expected Behavior |
|---|---|
| Short audio, few channels | CPU may outperform GPU due to transfer overhead |
| Long audio, many channels | GPU should significantly outperform CPU |
| Simple operations | SciPy may be competitive with the CPU implementation |
| Complex filter chains | TorchFX benefits from vectorization and batching |
API Benchmark Interpretation#
The API benchmark compares ergonomics and performance:
- FilterChain: Traditional PyTorch pattern with explicit control
- Sequential: Standard PyTorch composition with automatic forwarding
- Pipe operator: Most ergonomic, with automatic configuration
- SciPy: CPU-only baseline
Expected results:
- Performance differences between FilterChain, Sequential, and Pipe should be minimal (same underlying operations)
- The pipe operator additionally provides automatic sampling-rate configuration
- SciPy may be slower due to lack of GPU acceleration
FIR/IIR Benchmark Interpretation#
These benchmarks generate multi-dimensional data for analysis:
- Duration scaling: How performance scales with audio length
  - Linear scaling expected for both CPU and GPU
  - GPU overhead amortized over longer durations
- Channel scaling: How performance scales with channel count
  - GPU should show better scaling for many channels
  - CPU performance degrades more with channel count
- GPU vs CPU: When GPU acceleration provides benefits
  - Crossover point varies by filter complexity
  - Generally favorable for >2 channels and >60s duration
- TorchFX vs SciPy: Overhead of the PyTorch abstraction
  - TorchFX on CPU should be competitive with SciPy
  - GPU should outperform SciPy for suitable workloads
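One way to locate the GPU/CPU crossover point from the CSV output is a short pandas query. A sketch, assuming pandas is installed; the column names match the output format shown earlier:

```python
import pandas as pd

df = pd.read_csv("fir.out")
df["gpu_speedup"] = df["cpu"] / df["gpu"]

# Rows where the GPU actually pays off
print(df[df["gpu_speedup"] > 1.0][["time", "channels", "gpu_speedup"]])
```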
Visualization#
The draw3.py script generates PNG plots from CSV output files:
```bash
python benchmark/draw3.py
```
Generated plots:

- `api_bench.png`: Bar chart comparing API patterns
- `fir_bench.png`: Performance curves across durations/channels
- `iir_bench.png`: Performance curves across durations/channels
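For a quick custom view without draw3.py, a minimal matplotlib sketch over the same CSV columns (matplotlib and pandas assumed installed):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("fir.out")
subset = df[df["channels"] == 8]  # fix one channel count, vary duration

for impl in ("gpu", "cpu", "scipy"):
    plt.plot(subset["time"], subset[impl], marker="o", label=impl)

plt.xlabel("Duration (s)")
plt.ylabel("Average execution time (s)")
plt.legend()
plt.savefig("fir_custom.png")
```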
Running All Benchmarks#
Prerequisites#
Ensure development environment is set up:
```bash
# Sync dependencies
uv sync

# Verify CUDA availability (for GPU benchmarks)
python -c "import torch; print(torch.cuda.is_available())"
```
Execution Script#
Run all benchmarks sequentially:
```bash
# Run individual benchmarks
python benchmark/api_bench.py
python benchmark/fir_bench.py
python benchmark/iir_bench.py

# Generate visualizations
python benchmark/draw3.py
```
GPU Configuration#
To disable GPU benchmarks, comment out CUDA calls:
```python
# In benchmark script
# wave.to("cuda")  # Comment out to disable GPU
```
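A more flexible alternative (a sketch, not what the scripts currently do) is to pick the device at runtime, so the same script also runs on CPU-only machines:

```python
import torch

# Fall back to CPU automatically when CUDA is absent
device = "cuda" if torch.cuda.is_available() else "cpu"
wave.to(device)
fchain.to(device)
```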
Benchmark Maintenance#
Adding New Benchmarks#
To add a new benchmark:
1. Create a new Python file in the `benchmark/` directory
2. Implement `create_audio()` for test signal generation
3. Use `timeit.timeit()` with `REP = 50` for timing
4. Compare against the SciPy baseline when applicable
5. Output results in CSV format
6. Update this documentation
Template:
```python
import timeit

import numpy as np

SAMPLE_RATE = 44100
REP = 50

def create_audio(duration, num_channels):
    samples = int(duration * SAMPLE_RATE)
    audio = np.random.randn(num_channels, samples)
    return audio / np.max(np.abs(audio))

def benchmark():
    # Setup
    audio = create_audio(duration=60, num_channels=2)

    # Time execution
    def run():
        # Code to benchmark
        pass

    time = timeit.timeit(run, number=REP)
    avg_time = time / REP
    print(f"Average time: {avg_time:.4f}s")

if __name__ == "__main__":
    benchmark()
```
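The template prints its result; to match the CSV convention of the other benchmarks, results can also be written to a file. An illustrative addition (file and column names are placeholders):

```python
import csv

def save_results(path, rows):
    """Write (duration, channels, avg_seconds) rows as CSV."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["time", "channels", "avg_seconds"])
        writer.writerows(rows)

# Example usage with a single measured data point
save_results("my_bench.out", [(60, 2, 0.123)])
```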
Modifying Test Parameters#
Common parameters to adjust:
```python
# Sample rate (default: 44100 Hz)
SAMPLE_RATE = 48000  # Change to 48 kHz

# Repetitions (default: 50)
REP = 100  # Increase for more stable results

# Duration range (default varies by benchmark)
DURATIONS = [1, 10, 30, 60, 120]  # Custom duration range

# Channel counts (default varies by benchmark)
CHANNELS = [1, 2, 4, 8, 16]  # Custom channel counts
```
Coefficient Pre-computation#
For fair comparison, filter coefficients should be pre-computed:
```python
# Pre-compute coefficients before timing
for f in fchain:
    f.compute_coefficients()

# Now time only the filtering operation
time = timeit.timeit(lambda: fchain(wave.ys), number=REP)
```
This ensures timing measures filtering performance, not coefficient design.
Best Practices#
Fair Comparisons#
```python
# ✅ GOOD: Pre-compute coefficients, then time only the filtering
for f in fchain:
    f.compute_coefficients()
time = timeit.timeit(lambda: fchain(wave.ys), number=REP)

# ❌ BAD: Coefficient design runs inside the timed code
def run():
    for f in fchain:
        f.compute_coefficients()
    return fchain(wave.ys)

time = timeit.timeit(run, number=REP)
```
Sufficient Repetitions#
```python
# ✅ GOOD: Use 50+ repetitions
REP = 50
time = timeit.timeit(func, number=REP) / REP

# ❌ BAD: Too few repetitions (high variance)
REP = 5
time = timeit.timeit(func, number=REP) / REP
```
Realistic Test Data#
```python
# ✅ GOOD: Normalized random noise
audio = np.random.randn(channels, samples)
audio = audio / np.max(np.abs(audio))  # [-1, 1]

# ❌ BAD: Unrealistic data
audio = np.ones((channels, samples))  # All ones
```
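For run-to-run reproducibility, the signal generator can also be seeded. An optional refinement, not something the current scripts do:

```python
import numpy as np

# Seeded generator: the same "random" signal on every run
rng = np.random.default_rng(seed=0)
audio = rng.standard_normal((channels, samples))
audio = audio / np.max(np.abs(audio))
```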