Benchmarking#

Comprehensive guide to the TorchFX benchmarking suite for measuring and comparing performance of audio processing operations.

Overview#

The TorchFX benchmarking suite evaluates performance across three key dimensions:

  1. API patterns: Comparing different usage patterns (FilterChain, Sequential, pipe operator)

  2. FIR filter performance: GPU vs CPU vs SciPy implementations

  3. IIR filter performance: GPU vs CPU vs SciPy implementations

All benchmarks compare TorchFX implementations against SciPy baselines to validate performance characteristics and identify optimization opportunities.

See also

  • Testing - Testing infrastructure

  • Performance Optimization and Benchmarking - Performance optimization guide

  • GPU Acceleration - GPU acceleration usage

Benchmark Suite Structure#

The benchmarking suite consists of three independent scripts:

| Script | Purpose | Comparisons | Output File |
| --- | --- | --- | --- |
| api_bench.py | API pattern comparison | FilterChain, Sequential, pipe operator, SciPy | api_bench.out |
| fir_bench.py | FIR filter performance | GPU vs CPU vs SciPy across varying durations and channels | fir.out |
| iir_bench.py | IIR filter performance | GPU vs CPU vs SciPy across varying durations and channels | iir.out |

All benchmarks use Python’s timeit module for precise timing measurements and output results in CSV format for analysis and visualization.

Benchmark Architecture#

graph TB
    subgraph "Benchmark Scripts"
        API["api_bench.py<br/>API pattern comparison"]
        FIR["fir_bench.py<br/>FIR filter performance"]
        IIR["iir_bench.py<br/>IIR filter performance"]
    end

    subgraph "Test Signal Generation"
        CreateAudio["create_audio()<br/>np.random.randn()<br/>Normalized to [-1, 1]"]
    end

    subgraph "Implementations Under Test"
        TorchFX_GPU["torchfx on CUDA<br/>Wave.to('cuda')<br/>filter.to('cuda')"]
        TorchFX_CPU["torchfx on CPU<br/>Wave.to('cpu')<br/>filter.to('cpu')"]
        SciPy_Baseline["SciPy baseline<br/>scipy.signal.lfilter()"]
    end

    subgraph "Timing Infrastructure"
        TimeitModule["timeit.timeit()<br/>REP=50 repetitions"]
    end

    subgraph "Output"
        CSV["CSV files:<br/>api_bench.out<br/>fir.out<br/>iir.out"]
        Visualization["draw3.py<br/>Generates PNG plots"]
    end

    API --> CreateAudio
    FIR --> CreateAudio
    IIR --> CreateAudio

    CreateAudio --> TorchFX_GPU
    CreateAudio --> TorchFX_CPU
    CreateAudio --> SciPy_Baseline

    TorchFX_GPU --> TimeitModule
    TorchFX_CPU --> TimeitModule
    SciPy_Baseline --> TimeitModule

    TimeitModule --> CSV
    CSV --> Visualization
    

Common Infrastructure#

All benchmark scripts share common infrastructure for test signal generation and timing measurement.

Test Signal Generation#

Each benchmark uses the create_audio() function to generate synthetic test signals:

def create_audio(duration, num_channels):
    """Create random audio signal for testing.

    Parameters
    ----------
    duration : int
        Duration in seconds
    num_channels : int
        Number of audio channels

    Returns
    -------
    np.ndarray
        Audio signal with shape (num_channels, samples)
    """
    samples = int(duration * SAMPLE_RATE)
    audio = np.random.randn(num_channels, samples)
    return audio / np.max(np.abs(audio))  # Normalize to [-1, 1]

Normalization: Signals are normalized to the range [-1, 1] to simulate realistic audio levels.

Timing Methodology#

All benchmarks use Python’s timeit.timeit() function with consistent parameters:

REP = 50  # Number of repetitions

# Measure execution time
time = timeit.timeit(lambda: function_under_test(), number=REP)
average_time = time / REP

Why 50 repetitions?

  • Provides stable averages by reducing variance

  • Balances accuracy with total benchmark runtime

  • Minimizes impact of system noise and cache effects
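Note that when the code under test runs on CUDA, kernel launches return before the work finishes, so the timed callable should force completion. A minimal sketch of a synchronized timing wrapper (an assumption about how you might harden the measurement, not code from the benchmark scripts):

import timeit
import torch

REP = 50

def timed_gpu(fn):
    """Average fn's wall time over REP runs, waiting for CUDA to finish."""
    torch.cuda.synchronize()  # drain any pending work before timing
    def run():
        fn()
        torch.cuda.synchronize()  # block until all queued kernels complete
    return timeit.timeit(run, number=REP) / REP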

Standard Parameters#

| Parameter | Value | Description |
| --- | --- | --- |
| SAMPLE_RATE | 44100 Hz | Standard CD-quality sampling rate |
| REP | 50 | Number of timing repetitions for averaging |
| DURATION | Varies | Audio duration in seconds (benchmark-specific) |
| NUM_CHANNELS | Varies | Number of audio channels (benchmark-specific) |

API Benchmark#

The API benchmark (benchmark/api_bench.py) compares different approaches to chaining filters, evaluating both ergonomics and performance.

Tested Implementations#

graph LR
    Signal["Wave object<br/>8 channels<br/>120 seconds"]

    subgraph "Four API Patterns"
        Method1["FilterChain class<br/>nn.Module subclass<br/>explicit forward()"]
        Method2["Sequential<br/>torch.nn.Sequential<br/>functional composition"]
        Method3["Pipe operator<br/>wave | filter1 | filter2"]
        Method4["SciPy baseline<br/>scipy.signal.lfilter()"]
    end

    Output["Filtered signal"]

    Signal --> Method1
    Signal --> Method2
    Signal --> Method3
    Signal --> Method4

    Method1 --> Output
    Method2 --> Output
    Method3 --> Output
    Method4 --> Output
    

Implementation Patterns#

FilterChain Class Pattern#

Traditional PyTorch module composition with explicit forward() method:

from torch import nn

class FilterChain(nn.Module):
    """Custom filter chain implementation."""
    def __init__(self, filters):
        super().__init__()
        self.filters = nn.ModuleList(filters)

    def forward(self, x):
        for f in self.filters:
            x = f(x)
        return x

# Usage
chain = FilterChain([filter1, filter2, filter3])
output = chain(wave.ys)

Characteristics:

  • Explicit control over execution

  • Standard PyTorch pattern

  • Requires boilerplate code

Sequential Pattern#

PyTorch’s built-in sequential container:

from torch import nn

# Create sequential chain
chain = nn.Sequential(filter1, filter2, filter3)

# Apply to audio
output = chain(wave.ys)

Characteristics:

  • Built-in PyTorch functionality

  • Minimal boilerplate

  • Standard functional composition

Pipe Operator Pattern#

TorchFX’s idiomatic API with automatic configuration:

# Chain filters using pipe operator
output = wave | filter1 | filter2 | filter3

Characteristics:

  • Most ergonomic syntax

  • Automatic sample rate configuration

  • Pythonic and readable

SciPy Baseline#

Pure NumPy/SciPy implementation for comparison:

import scipy.signal as signal

# Example parameters (placeholders, not the benchmark's actual values)
order, cutoff, fs = 2, 1000, 44100

# Design filter coefficients
b1, a1 = signal.butter(N=order, Wn=cutoff, btype='high', fs=fs)

# Apply filter
output = signal.lfilter(b1, a1, audio)

Characteristics:

  • CPU-only implementation

  • No PyTorch overhead

  • Industry-standard baseline

Filter Configuration#

All patterns apply the same six filters in series:

| Filter | Type | Cutoff Frequency | Purpose |
| --- | --- | --- | --- |
| HiChebyshev1 | High-pass | 20 Hz | Remove subsonic content |
| HiChebyshev1 | High-pass | 60 Hz | Remove hum |
| HiChebyshev1 | High-pass | 65 Hz | Additional hum removal |
| LoButterworth | Low-pass | 5000 Hz | Anti-aliasing |
| LoButterworth | Low-pass | 4900 Hz | Transition band shaping |
| LoButterworth | Low-pass | 4850 Hz | Final rolloff |
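Expressed with the pipe operator, the chain under test looks roughly like the sketch below. The cutoffs come from the table above; the order and ripple values are illustrative assumptions, so treat the benchmark script as authoritative:

output = (
    wave
    | HiChebyshev1(cutoff=20, order=2, ripple=0.5, fs=44100)
    | HiChebyshev1(cutoff=60, order=2, ripple=0.5, fs=44100)
    | HiChebyshev1(cutoff=65, order=2, ripple=0.5, fs=44100)
    | LoButterworth(cutoff=5000, order=2, fs=44100)
    | LoButterworth(cutoff=4900, order=2, fs=44100)
    | LoButterworth(cutoff=4850, order=2, fs=44100)
)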

Test Parameters#

  • Duration: 120 seconds (2 minutes)

  • Channels: 8

  • Sample Rate: 44100 Hz

  • Repetitions: 50

Output Format#

CSV with the following structure:

filter_chain,sequential,pipe,scipy
<class_time>,<seq_time>,<pipe_time>,<scipy_time>

Each time value represents average execution time in seconds.
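A minimal sketch of how such a row might be written (the variable names are hypothetical):

results = {
    "filter_chain": t_chain,  # average seconds per API pattern
    "sequential": t_seq,
    "pipe": t_pipe,
    "scipy": t_scipy,
}

with open("api_bench.out", "w") as f:
    f.write(",".join(results) + "\n")
    f.write(",".join(f"{t:.6f}" for t in results.values()) + "\n")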

Running API Benchmark#

python benchmark/api_bench.py

Expected output:

API Benchmark
Duration: 120s, Channels: 8, Sample Rate: 44100Hz
Repetitions: 50

FilterChain: 1.234 seconds
Sequential:  1.235 seconds
Pipe:        1.236 seconds
SciPy:       1.450 seconds

Results saved to api_bench.out

FIR Filter Benchmark#

The FIR filter benchmark (benchmark/fir_bench.py) evaluates FIR filter performance across different audio durations and channel counts.

Test Matrix#

The benchmark tests across two dimensions:

| Dimension | Values |
| --- | --- |
| Durations | 5, 60, 180, 300, 600 seconds |
| Channels | 1, 2, 4, 8, 12 |

Total test cases: 5 durations × 5 channel counts = 25 data points

graph TB
    subgraph "Test Variables"
        Durations["Durations (seconds)<br/>5, 60, 180, 300, 600"]
        Channels["Channels<br/>1, 2, 4, 8, 12"]
    end

    subgraph "Filter Chain"
        F1["DesignableFIR<br/>101 taps, 1000 Hz"]
        F2["DesignableFIR<br/>102 taps, 5000 Hz"]
        F3["DesignableFIR<br/>103 taps, 1500 Hz"]
        F4["DesignableFIR<br/>104 taps, 1800 Hz"]
        F5["DesignableFIR<br/>105 taps, 1850 Hz"]
    end

    subgraph "Implementations"
        GPU["GPU Implementation<br/>wave.to('cuda')<br/>fchain.to('cuda')"]
        CPU["CPU Implementation<br/>wave.to('cpu')<br/>fchain.to('cpu')"]
        SciPy["SciPy Implementation<br/>scipy.signal.firwin()<br/>scipy.signal.lfilter()"]
    end

    Durations --> F1
    Channels --> F1

    F1 --> F2
    F2 --> F3
    F3 --> F4
    F4 --> F5

    F5 --> GPU
    F5 --> CPU
    F5 --> SciPy
    

Filter Configuration#

The benchmark applies five DesignableFIR filters in series:

# Create filter chain
fchain = nn.Sequential(
    DesignableFIR(numtaps=101, cutoff=1000, fs=44100),
    DesignableFIR(numtaps=102, cutoff=5000, fs=44100),
    DesignableFIR(numtaps=103, cutoff=1500, fs=44100),
    DesignableFIR(numtaps=104, cutoff=1800, fs=44100),
    DesignableFIR(numtaps=105, cutoff=1850, fs=44100),
)

# Pre-compute coefficients (excluded from timing)
for f in fchain:
    f.compute_coefficients()

Important: Filter coefficients are pre-computed before timing to measure only filtering performance, not coefficient design.

Implementation Functions#

GPU FIR Function#

def gpu_fir(wave):
    """Apply FIR filter chain on GPU."""
    return (wave | fchain).ys

Applies the filter chain via the pipe operator. The wave and chain are moved to CUDA before timing (step 5 in the execution flow below), so the function body is identical to the CPU variant; only device placement differs.

CPU FIR Function#

def cpu_fir(wave):
    """Apply FIR filter chain on CPU."""
    return (wave | fchain).ys

Applies the same chain after the wave and chain have been moved back to the CPU.

SciPy FIR Function#

def scipy_fir(audio):
    """Apply FIR filters using SciPy."""
    for f in fchain:
        b = f.coefficients.cpu().numpy()
        audio = signal.lfilter(b, [1.0], audio)
    return audio

Uses scipy.signal.lfilter() for baseline comparison.
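The same taps could also be designed directly in SciPy rather than reusing the TorchFX coefficients. A sketch using scipy.signal.firwin(), assuming the DesignableFIR filters are plain low-pass designs:

import scipy.signal as signal

TAPS = [(101, 1000), (102, 5000), (103, 1500), (104, 1800), (105, 1850)]

def scipy_fir_firwin(audio, fs=44100):
    """Design and apply the five low-pass FIRs entirely in SciPy."""
    for numtaps, cutoff in TAPS:
        b = signal.firwin(numtaps, cutoff, fs=fs)
        audio = signal.lfilter(b, [1.0], audio)
    return audio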

Test Execution Flow#

For each combination of duration and channel count:

  1. Generate test signal with create_audio(duration, channels)

  2. Create Wave object from signal

  3. Build filter chain with nn.Sequential

  4. Pre-compute all filter coefficients

  5. Move to GPU, time GPU execution

  6. Move to CPU, time CPU execution

  7. Design SciPy coefficients, time SciPy execution
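Put together, the driver loop looks roughly like this condensed sketch (Wave construction and device handling are abbreviated, and make_wave() is a hypothetical helper; see the actual script for details):

DURATIONS = [5, 60, 180, 300, 600]
CHANNELS = [1, 2, 4, 8, 12]

with open("fir.out", "w") as out:
    out.write("time,channels,gpu,cpu,scipy\n")
    for duration in DURATIONS:
        for channels in CHANNELS:
            audio = create_audio(duration, channels)
            wave = make_wave(audio)  # hypothetical Wave construction helper
            for f in fchain:
                f.compute_coefficients()  # design excluded from timing

            wave = wave.to("cuda")  # .to() usage mirrors the diagram above
            fchain.to("cuda")
            t_gpu = timeit.timeit(lambda: gpu_fir(wave), number=REP) / REP

            wave = wave.to("cpu")
            fchain.to("cpu")
            t_cpu = timeit.timeit(lambda: cpu_fir(wave), number=REP) / REP

            t_scipy = timeit.timeit(lambda: scipy_fir(audio), number=REP) / REP
            out.write(f"{duration},{channels},{t_gpu},{t_cpu},{t_scipy}\n")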

Output Format#

CSV with the following structure:

time,channels,gpu,cpu,scipy
5,1,0.012,0.015,0.018
5,2,0.013,0.016,0.020
...
600,12,1.234,1.567,1.890

Running FIR Benchmark#

python benchmark/fir_bench.py

Expected output:

FIR Filter Benchmark
Sample Rate: 44100Hz
Repetitions: 50

Testing: 5s, 1 channel...
  GPU:   0.012s
  CPU:   0.015s
  SciPy: 0.018s

Testing: 5s, 2 channels...
  GPU:   0.013s
  CPU:   0.016s
  SciPy: 0.020s

...

Results saved to fir.out

IIR Filter Benchmark#

The IIR filter benchmark (benchmark/iir_bench.py) evaluates IIR filter performance using a methodology similar to the FIR benchmark's.

Test Matrix#

| Dimension | Values |
| --- | --- |
| Durations | 1, 5, 180, 300, 600 seconds |
| Channels | 1, 2, 4, 8, 12 |

Total test cases: 5 durations × 5 channel counts = 25 data points

Filter Configuration#

The benchmark applies four IIR filters in series:

fchain = nn.Sequential(
    HiButterworth(cutoff=1000, order=2, fs=44100),
    LoButterworth(cutoff=5000, order=2, fs=44100),
    HiChebyshev1(cutoff=1500, order=2, ripple=0.5, fs=44100),
    LoChebyshev1(cutoff=1800, order=2, ripple=0.5, fs=44100),
)

| Filter | Type | Cutoff | Order | Purpose |
| --- | --- | --- | --- | --- |
| HiButterworth | High-pass | 1000 Hz | 2 | Remove low frequencies |
| LoButterworth | Low-pass | 5000 Hz | 2 | Remove high frequencies |
| HiChebyshev1 | High-pass | 1500 Hz | 2 | Additional high-pass |
| LoChebyshev1 | Low-pass | 1800 Hz | 2 | Additional low-pass |

Implementation Details#

graph TB
    subgraph "GPU Execution Path"
        GPU_Wave["Wave.to('cuda')"]
        GPU_Chain["fchain.to('cuda')"]
        GPU_Coeff["f.compute_coefficients()<br/>f.move_coeff('cuda')<br/>for each filter"]
        GPU_Execute["fchain(wave.ys)"]

        GPU_Wave --> GPU_Chain
        GPU_Chain --> GPU_Coeff
        GPU_Coeff --> GPU_Execute
    end

    subgraph "CPU Execution Path"
        CPU_Wave["Wave.to('cpu')"]
        CPU_Chain["fchain.to('cpu')"]
        CPU_Coeff["f.move_coeff('cpu')<br/>for each filter"]
        CPU_Execute["fchain(wave.ys)"]

        CPU_Wave --> CPU_Chain
        CPU_Chain --> CPU_Coeff
        CPU_Coeff --> CPU_Execute
    end

    subgraph "SciPy Execution Path"
        SciPy_Design["butter() / cheby1()<br/>Design filter coefficients"]
        SciPy_Filter["lfilter()<br/>Apply filters"]

        SciPy_Design --> SciPy_Filter
    end
    

GPU Filter Function#

def gpu_iir(wave):
    """Apply IIR filter chain on GPU."""
    # CRITICAL: moving the module with fchain.to('cuda') is not enough;
    # the designed coefficients must also be moved explicitly
    for f in fchain:
        f.compute_coefficients()
        f.move_coeff("cuda")
    return (wave | fchain).ys

Important: IIR filters require explicit coefficient movement to GPU.
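A small helper makes that device handoff explicit and reusable across both paths (a sketch built on the move_coeff() calls shown above; whether compute_coefficients() is safe to call repeatedly is an assumption):

def place_iir_chain(fchain, device):
    """Ensure every filter's designed coefficients live on `device`."""
    for f in fchain:
        f.compute_coefficients()  # design once (assumed idempotent)
        f.move_coeff(device)
    return fchain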

CPU Filter Function#

def cpu_iir(wave):
    """Apply IIR filter chain on CPU."""
    # Move coefficients back to CPU
    for f in fchain:
        f.move_coeff("cpu")
    return (wave | fchain).ys

SciPy Filter Function#

def scipy_iir(audio):
    """Apply IIR filters using SciPy."""
    # Design coefficients matching the TorchFX chain above
    b1, a1 = signal.butter(N=2, Wn=1000, btype='high', fs=44100)
    b2, a2 = signal.butter(N=2, Wn=5000, btype='low', fs=44100)
    b3, a3 = signal.cheby1(N=2, rp=0.5, Wn=1500, btype='high', fs=44100)
    b4, a4 = signal.cheby1(N=2, rp=0.5, Wn=1800, btype='low', fs=44100)

    # Apply the four filters sequentially
    for b, a in ((b1, a1), (b2, a2), (b3, a3), (b4, a4)):
        audio = signal.lfilter(b, a, audio)
    return audio

Output Format#

CSV with the following structure:

time,channels,gpu,cpu,scipy
1,1,0.005,0.008,0.010
1,2,0.006,0.009,0.012
...
600,12,0.987,1.234,1.567

Running IIR Benchmark#

python benchmark/iir_bench.py

Expected output:

IIR Filter Benchmark
Sample Rate: 44100Hz
Repetitions: 50

Testing: 1s, 1 channel...
  GPU:   0.005s
  CPU:   0.008s
  SciPy: 0.010s

Testing: 1s, 2 channels...
  GPU:   0.006s
  CPU:   0.009s
  SciPy: 0.012s

...

Results saved to iir.out

Interpreting Results#

Performance Metrics#

All timing values are reported in seconds, representing average execution time over 50 repetitions. Lower values indicate better performance.

Expected Performance Characteristics#

| Scenario | Expected Behavior |
| --- | --- |
| Short audio, few channels | CPU may outperform GPU due to transfer overhead |
| Long audio, many channels | GPU should significantly outperform CPU |
| Simple operations | SciPy may be competitive with CPU implementation |
| Complex filter chains | TorchFX benefits from vectorization and batching |

API Benchmark Interpretation#

The API benchmark compares ergonomics and performance:

  • FilterChain: Traditional PyTorch pattern with explicit control

  • Sequential: Standard PyTorch composition with automatic forwarding

  • Pipe operator: Most ergonomic with automatic configuration

  • SciPy: CPU-only baseline

Expected results:

  • Performance differences between FilterChain, Sequential, and Pipe should be minimal (same underlying operations)

  • Pipe operator provides automatic sampling rate configuration

  • SciPy may be slower due to lack of GPU acceleration

FIR/IIR Benchmark Interpretation#

These benchmarks generate multi-dimensional data for analysis:

  1. Duration scaling: How performance scales with audio length

    • Linear scaling expected for both CPU and GPU

    • GPU overhead amortized over longer durations

  2. Channel scaling: How performance scales with channel count

    • GPU should show better scaling for many channels

    • CPU performance degrades more with channel count

  3. GPU vs CPU: When GPU acceleration provides benefits

    • Crossover point varies by filter complexity

    • Generally favorable for >2 channels and >60s duration

  4. TorchFX vs SciPy: Overhead of PyTorch abstraction

    • TorchFX CPU should be competitive with SciPy

    • GPU should outperform SciPy for suitable workloads
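These questions are easy to probe directly from the CSV output; a sketch with pandas, using the column names from the output format above:

import pandas as pd

df = pd.read_csv("fir.out")

# Speedup of GPU relative to the CPU and SciPy baselines
df["gpu_vs_cpu"] = df["cpu"] / df["gpu"]
df["gpu_vs_scipy"] = df["scipy"] / df["gpu"]

# Configurations where the GPU wins (speedup > 1)
print(df[df["gpu_vs_cpu"] > 1][["time", "channels", "gpu_vs_cpu"]])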

Visualization#

The draw3.py script generates PNG plots from CSV output files:

python benchmark/draw3.py

Generated plots:

  • api_bench.png: Bar chart comparing API patterns

  • fir_bench.png: Performance curves across durations/channels

  • iir_bench.png: Performance curves across durations/channels
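draw3.py is the project's plotting script; a minimal standalone sketch of the same idea with matplotlib (not the actual script):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("fir.out")

# One curve per implementation at a fixed channel count
sub = df[df["channels"] == 8]
for col in ("gpu", "cpu", "scipy"):
    plt.plot(sub["time"], sub[col], marker="o", label=col)

plt.xlabel("Duration (s)")
plt.ylabel("Average time (s)")
plt.legend()
plt.savefig("fir_bench.png")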

Running All Benchmarks#

Prerequisites#

Ensure the development environment is set up:

# Sync dependencies
uv sync

# Verify CUDA availability (for GPU benchmarks)
python -c "import torch; print(torch.cuda.is_available())"

Execution Script#

Run all benchmarks sequentially:

# Run individual benchmarks
python benchmark/api_bench.py
python benchmark/fir_bench.py
python benchmark/iir_bench.py

# Generate visualizations
python benchmark/draw3.py

GPU Configuration#

To disable GPU benchmarks, comment out CUDA calls:

# In benchmark script
# wave.to("cuda")  # Comment to disable GPU
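Alternatively (a suggestion, not part of the current scripts), a capability check lets the same script fall back to the CPU on machines without CUDA:

import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

wave = wave.to(DEVICE)
fchain = fchain.to(DEVICE)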

Benchmark Maintenance#

Adding New Benchmarks#

To add a new benchmark:

  1. Create new Python file in benchmark/ directory

  2. Implement create_audio() for test signal generation

  3. Use timeit.timeit() with REP=50 for timing

  4. Compare against SciPy baseline when applicable

  5. Output results in CSV format

  6. Update this documentation

Template:

import timeit
import numpy as np

SAMPLE_RATE = 44100
REP = 50

def create_audio(duration, num_channels):
    samples = int(duration * SAMPLE_RATE)
    audio = np.random.randn(num_channels, samples)
    return audio / np.max(np.abs(audio))

def benchmark():
    # Setup
    audio = create_audio(duration=60, num_channels=2)

    # Time execution
    def run():
        # Code to benchmark
        pass

    time = timeit.timeit(run, number=REP)
    avg_time = time / REP

    print(f"Average time: {avg_time:.4f}s")

if __name__ == "__main__":
    benchmark()

Modifying Test Parameters#

Common parameters to adjust:

# Sample rate (default: 44100 Hz)
SAMPLE_RATE = 48000  # Change to 48kHz

# Repetitions (default: 50)
REP = 100  # Increase for more stable results

# Duration range (default varies by benchmark)
DURATIONS = [1, 10, 30, 60, 120]  # Custom duration range

# Channel counts (default varies by benchmark)
CHANNELS = [1, 2, 4, 8, 16]  # Custom channel counts

Coefficient Pre-computation#

For fair comparison, filter coefficients should be pre-computed:

# Pre-compute coefficients before timing
for f in fchain:
    f.compute_coefficients()

# Now time only the filtering operation
time = timeit.timeit(lambda: fchain(wave.ys), number=REP)

This ensures timing measures filtering performance, not coefficient design.

Best Practices#

Fair Comparisons#

# ✅ GOOD: Pre-compute coefficients
for f in fchain:
    f.compute_coefficients()
time = timeit.timeit(lambda: fchain(wave.ys), number=REP)

# ❌ BAD: Include coefficient design in timing
# (coefficients not pre-computed, so design work lands inside the timed region)
time = timeit.timeit(lambda: fchain(wave.ys), number=REP)

Sufficient Repetitions#

# ✅ GOOD: Use 50+ repetitions
REP = 50
time = timeit.timeit(func, number=REP) / REP

# ❌ BAD: Too few repetitions (high variance)
REP = 5
time = timeit.timeit(func, number=REP) / REP

Realistic Test Data#

# ✅ GOOD: Normalized random noise
audio = np.random.randn(channels, samples)
audio = audio / np.max(np.abs(audio))  # [-1, 1]

# ❌ BAD: Unrealistic data
audio = np.ones((channels, samples))  # All ones