Benchmarking#

Comprehensive guide to the TorchFX benchmarking suite for measuring and comparing performance of audio processing operations.

Overview#

The TorchFX benchmarking suite evaluates performance across three key dimensions:

  1. API patterns: Comparing different usage patterns (FilterChain, Sequential, pipe operator)

  2. FIR filter performance: GPU vs CPU vs SciPy implementations

  3. IIR filter performance: GPU vs CPU vs SciPy implementations

All benchmarks compare TorchFX implementations against SciPy baselines to validate performance characteristics and identify optimization opportunities.

See also

  • Testing - Testing infrastructure

  • Performance Optimization and Benchmarking - Performance optimization guide

  • GPU Acceleration - GPU acceleration usage

Benchmark Suite Structure#

The benchmarking suite consists of three independent scripts:

| Script | Purpose | Comparisons | Output File |
| --- | --- | --- | --- |
| api_bench.py | API pattern comparison | FilterChain, Sequential, pipe operator, SciPy | api_bench.out |
| fir_bench.py | FIR filter performance | GPU vs CPU vs SciPy across varying durations and channels | fir.out |
| iir_bench.py | IIR filter performance | GPU vs CPU vs SciPy across varying durations and channels | iir.out |

All benchmarks use Python’s timeit module for precise timing measurements and output results in CSV format for analysis and visualization.

Benchmark Architecture#

graph TB
    subgraph "Benchmark Scripts"
        API["api_bench.py<br/>API pattern comparison"]
        FIR["fir_bench.py<br/>FIR filter performance"]
        IIR["iir_bench.py<br/>IIR filter performance"]
    end

    subgraph "Test Signal Generation"
        CreateAudio["create_audio()<br/>np.random.randn()<br/>Normalized to [-1, 1]"]
    end

    subgraph "Implementations Under Test"
        TorchFX_GPU["torchfx on CUDA<br/>Wave.to('cuda')<br/>filter.to('cuda')"]
        TorchFX_CPU["torchfx on CPU<br/>Wave.to('cpu')<br/>filter.to('cpu')"]
        SciPy_Baseline["SciPy baseline<br/>scipy.signal.lfilter()"]
    end

    subgraph "Timing Infrastructure"
        TimeitModule["timeit.timeit()<br/>REP=50 repetitions"]
    end

    subgraph "Output"
        CSV["CSV files:<br/>api_bench.out<br/>fir.out<br/>iir.out"]
        Visualization["draw3.py<br/>Generates PNG plots"]
    end

    API --> CreateAudio
    FIR --> CreateAudio
    IIR --> CreateAudio

    CreateAudio --> TorchFX_GPU
    CreateAudio --> TorchFX_CPU
    CreateAudio --> SciPy_Baseline

    TorchFX_GPU --> TimeitModule
    TorchFX_CPU --> TimeitModule
    SciPy_Baseline --> TimeitModule

    TimeitModule --> CSV
    CSV --> Visualization
    

Common Infrastructure#

All benchmark scripts share common infrastructure for test signal generation and timing measurement.

Test Signal Generation#

Each benchmark uses the create_audio() function to generate synthetic test signals:

def create_audio(duration, num_channels):
    """Create random audio signal for testing.

    Parameters
    ----------
    duration : int
        Duration in seconds
    num_channels : int
        Number of audio channels

    Returns
    -------
    np.ndarray
        Audio signal with shape (num_channels, samples)
    """
    samples = int(duration * SAMPLE_RATE)
    audio = np.random.randn(num_channels, samples)
    return audio / np.max(np.abs(audio))  # Normalize to [-1, 1]

Normalization: Signals are normalized to the range [-1, 1] to simulate realistic audio levels.

Timing Methodology#

All benchmarks use Python’s timeit.timeit() function with consistent parameters:

REP = 50  # Number of repetitions

# Measure execution time
time = timeit.timeit(lambda: function_under_test(), number=REP)
average_time = time / REP

Why 50 repetitions?

  • Provides stable averages by reducing variance

  • Balances accuracy with total benchmark runtime

  • Minimizes impact of system noise and cache effects
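Note that when the code under test runs on CUDA, kernel launches return before the work finishes, so the timed callable should force completion. A minimal sketch of a synchronized timing wrapper (an assumption about how you might harden the measurement, not code from the benchmark scripts):

import timeit
import torch

REP = 50

def timed_gpu(fn):
    """Average fn's wall time over REP runs, waiting for CUDA to finish."""
    torch.cuda.synchronize()  # drain any pending work before timing
    def run():
        fn()
        torch.cuda.synchronize()  # block until all queued kernels complete
    return timeit.timeit(run, number=REP) / REP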

Standard Parameters#

| Parameter | Value | Description |
| --- | --- | --- |
| SAMPLE_RATE | 44100 Hz | Standard CD-quality sampling rate |
| REP | 50 | Number of timing repetitions for averaging |
| DURATION | Varies | Audio duration in seconds (benchmark-specific) |
| NUM_CHANNELS | Varies | Number of audio channels (benchmark-specific) |

API Benchmark#

The API benchmark (benchmark/api_bench.py) compares different approaches to chaining filters, evaluating both ergonomics and performance.

Tested Implementations#

graph LR
    Signal["Wave object<br/>8 channels<br/>120 seconds"]

    subgraph "Four API Patterns"
        Method1["FilterChain class<br/>nn.Module subclass<br/>explicit forward()"]
        Method2["Sequential<br/>torch.nn.Sequential<br/>functional composition"]
        Method3["Pipe operator<br/>wave | filter1 | filter2"]
        Method4["SciPy baseline<br/>scipy.signal.lfilter()"]
    end

    Output["Filtered signal"]

    Signal --> Method1
    Signal --> Method2
    Signal --> Method3
    Signal --> Method4

    Method1 --> Output
    Method2 --> Output
    Method3 --> Output
    Method4 --> Output
    

Implementation Patterns#

FilterChain Class Pattern#

Traditional PyTorch module composition with explicit forward() method:

from torch import nn

class FilterChain(nn.Module):
    """Custom filter chain implementation."""
    def __init__(self, filters):
        super().__init__()
        self.filters = nn.ModuleList(filters)

    def forward(self, x):
        for f in self.filters:
            x = f(x)
        return x

# Usage
chain = FilterChain([filter1, filter2, filter3])
output = chain(wave.ys)

Characteristics:

  • Explicit control over execution

  • Standard PyTorch pattern

  • Requires boilerplate code

Sequential Pattern#

PyTorch’s built-in sequential container:

from torch import nn

# Create sequential chain
chain = nn.Sequential(filter1, filter2, filter3)

# Apply to audio
output = chain(wave.ys)

Characteristics:

  • Built-in PyTorch functionality

  • Minimal boilerplate

  • Standard functional composition

Pipe Operator Pattern#

TorchFX’s idiomatic API with automatic configuration:

# Chain filters using pipe operator
output = wave | filter1 | filter2 | filter3

Characteristics:

  • Most ergonomic syntax

  • Automatic sample rate configuration

  • Pythonic and readable

SciPy Baseline#

Pure NumPy/SciPy implementation for comparison:

import scipy.signal as signal

# Example parameters (placeholders, not the benchmark's actual values)
order, cutoff, fs = 2, 1000, 44100

# Design filter coefficients
b1, a1 = signal.butter(N=order, Wn=cutoff, btype='high', fs=fs)

# Apply filter
output = signal.lfilter(b1, a1, audio)

Characteristics:

  • CPU-only implementation

  • No PyTorch overhead

  • Industry-standard baseline

Filter Configuration#

All patterns apply the same six filters in series:

| Filter | Type | Cutoff Frequency | Purpose |
| --- | --- | --- | --- |
| HiChebyshev1 | High-pass | 20 Hz | Remove subsonic content |
| HiChebyshev1 | High-pass | 60 Hz | Remove hum |
| HiChebyshev1 | High-pass | 65 Hz | Additional hum removal |
| LoButterworth | Low-pass | 5000 Hz | Anti-aliasing |
| LoButterworth | Low-pass | 4900 Hz | Transition band shaping |
| LoButterworth | Low-pass | 4850 Hz | Final rolloff |
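Expressed with the pipe operator, the chain under test looks roughly like the sketch below. The cutoffs come from the table above; the order and ripple values are illustrative assumptions, so treat the benchmark script as authoritative:

output = (
    wave
    | HiChebyshev1(cutoff=20, order=2, ripple=0.5, fs=44100)
    | HiChebyshev1(cutoff=60, order=2, ripple=0.5, fs=44100)
    | HiChebyshev1(cutoff=65, order=2, ripple=0.5, fs=44100)
    | LoButterworth(cutoff=5000, order=2, fs=44100)
    | LoButterworth(cutoff=4900, order=2, fs=44100)
    | LoButterworth(cutoff=4850, order=2, fs=44100)
)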

Test Parameters#

  • Duration: 120 seconds (2 minutes)

  • Channels: 8

  • Sample Rate: 44100 Hz

  • Repetitions: 50

Output Format#

CSV with the following structure:

filter_chain,sequential,pipe,scipy
<class_time>,<seq_time>,<pipe_time>,<scipy_time>

Each time value represents average execution time in seconds.
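A minimal sketch of how such a row might be written (the variable names are hypothetical):

results = {
    "filter_chain": t_chain,  # average seconds per API pattern
    "sequential": t_seq,
    "pipe": t_pipe,
    "scipy": t_scipy,
}

with open("api_bench.out", "w") as f:
    f.write(",".join(results) + "\n")
    f.write(",".join(f"{t:.6f}" for t in results.values()) + "\n")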

Running API Benchmark#

python benchmark/api_bench.py

Expected output:

API Benchmark
Duration: 120s, Channels: 8, Sample Rate: 44100Hz
Repetitions: 50

FilterChain: 1.234 seconds
Sequential:  1.235 seconds
Pipe:        1.236 seconds
SciPy:       1.450 seconds

Results saved to api_bench.out

FIR Filter Benchmark#

The FIR filter benchmark (benchmark/fir_bench.py) evaluates FIR filter performance across different audio durations and channel counts.

Test Matrix#

The benchmark tests across two dimensions:

| Dimension | Values |
| --- | --- |
| Durations | 5, 60, 180, 300, 600 seconds |
| Channels | 1, 2, 4, 8, 12 |

Total test cases: 5 durations × 5 channel counts = 25 data points

graph TB
    subgraph "Test Variables"
        Durations["Durations (seconds)<br/>5, 60, 180, 300, 600"]
        Channels["Channels<br/>1, 2, 4, 8, 12"]
    end

    subgraph "Filter Chain"
        F1["DesignableFIR<br/>101 taps, 1000 Hz"]
        F2["DesignableFIR<br/>102 taps, 5000 Hz"]
        F3["DesignableFIR<br/>103 taps, 1500 Hz"]
        F4["DesignableFIR<br/>104 taps, 1800 Hz"]
        F5["DesignableFIR<br/>105 taps, 1850 Hz"]
    end

    subgraph "Implementations"
        GPU["GPU Implementation<br/>wave.to('cuda')<br/>fchain.to('cuda')"]
        CPU["CPU Implementation<br/>wave.to('cpu')<br/>fchain.to('cpu')"]
        SciPy["SciPy Implementation<br/>scipy.signal.firwin()<br/>scipy.signal.lfilter()"]
    end

    Durations --> F1
    Channels --> F1

    F1 --> F2
    F2 --> F3
    F3 --> F4
    F4 --> F5

    F5 --> GPU
    F5 --> CPU
    F5 --> SciPy
    

Filter Configuration#

The benchmark applies five DesignableFIR filters in series:

# Create filter chain
fchain = nn.Sequential(
    DesignableFIR(numtaps=101, cutoff=1000, fs=44100),
    DesignableFIR(numtaps=102, cutoff=5000, fs=44100),
    DesignableFIR(numtaps=103, cutoff=1500, fs=44100),
    DesignableFIR(numtaps=104, cutoff=1800, fs=44100),
    DesignableFIR(numtaps=105, cutoff=1850, fs=44100),
)

# Pre-compute coefficients (excluded from timing)
for f in fchain:
    f.compute_coefficients()

Important: Filter coefficients are pre-computed before timing to measure only filtering performance, not coefficient design.

Implementation Functions#

GPU FIR Function#

def gpu_fir(wave):
    """Apply FIR filter chain on GPU."""
    return (wave | fchain).ys

Applies the filter chain via the pipe operator. The wave and chain are moved to CUDA before timing (step 5 in the execution flow below), so the function body is identical to the CPU variant; only device placement differs.

CPU FIR Function#

def cpu_fir(wave):
    """Apply FIR filter chain on CPU."""
    return (wave | fchain).ys

Applies the same chain after the wave and chain have been moved back to the CPU.

SciPy FIR Function#

def scipy_fir(audio):
    """Apply FIR filters using SciPy."""
    for f in fchain:
        b = f.coefficients.cpu().numpy()
        audio = signal.lfilter(b, [1.0], audio)
    return audio

Uses scipy.signal.lfilter() for baseline comparison.
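The same taps could also be designed directly in SciPy rather than reusing the TorchFX coefficients. A sketch using scipy.signal.firwin(), assuming the DesignableFIR filters are plain low-pass designs:

import scipy.signal as signal

TAPS = [(101, 1000), (102, 5000), (103, 1500), (104, 1800), (105, 1850)]

def scipy_fir_firwin(audio, fs=44100):
    """Design and apply the five low-pass FIRs entirely in SciPy."""
    for numtaps, cutoff in TAPS:
        b = signal.firwin(numtaps, cutoff, fs=fs)
        audio = signal.lfilter(b, [1.0], audio)
    return audio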

Test Execution Flow#

For each combination of duration and channel count:

  1. Generate test signal with create_audio(duration, channels)

  2. Create Wave object from signal

  3. Build filter chain with nn.Sequential

  4. Pre-compute all filter coefficients

  5. Move to GPU, time GPU execution

  6. Move to CPU, time CPU execution

  7. Design SciPy coefficients, time SciPy execution
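Put together, the driver loop looks roughly like this condensed sketch (Wave construction and device handling are abbreviated, and make_wave() is a hypothetical helper; see the actual script for details):

DURATIONS = [5, 60, 180, 300, 600]
CHANNELS = [1, 2, 4, 8, 12]

with open("fir.out", "w") as out:
    out.write("time,channels,gpu,cpu,scipy\n")
    for duration in DURATIONS:
        for channels in CHANNELS:
            audio = create_audio(duration, channels)
            wave = make_wave(audio)  # hypothetical Wave construction helper
            for f in fchain:
                f.compute_coefficients()  # design excluded from timing

            wave = wave.to("cuda")  # .to() usage mirrors the diagram above
            fchain.to("cuda")
            t_gpu = timeit.timeit(lambda: gpu_fir(wave), number=REP) / REP

            wave = wave.to("cpu")
            fchain.to("cpu")
            t_cpu = timeit.timeit(lambda: cpu_fir(wave), number=REP) / REP

            t_scipy = timeit.timeit(lambda: scipy_fir(audio), number=REP) / REP
            out.write(f"{duration},{channels},{t_gpu},{t_cpu},{t_scipy}\n")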

Output Format#

CSV with the following structure:

time,channels,gpu,cpu,scipy
5,1,0.012,0.015,0.018
5,2,0.013,0.016,0.020
...
600,12,1.234,1.567,1.890

Running FIR Benchmark#

python benchmark/fir_bench.py

Expected output:

FIR Filter Benchmark
Sample Rate: 44100Hz
Repetitions: 50

Testing: 5s, 1 channel...
  GPU:   0.012s
  CPU:   0.015s
  SciPy: 0.018s

Testing: 5s, 2 channels...
  GPU:   0.013s
  CPU:   0.016s
  SciPy: 0.020s

...

Results saved to fir.out

IIR Filter Benchmark#

The IIR filter benchmark (benchmark/iir_bench.py) evaluates IIR filter performance using a methodology similar to the FIR benchmark's.

Test Matrix#

| Dimension | Values |
| --- | --- |
| Durations | 1, 5, 180, 300, 600 seconds |
| Channels | 1, 2, 4, 8, 12 |

Total test cases: 5 durations × 5 channel counts = 25 data points

Filter Configuration#

The benchmark applies four IIR filters in series:

fchain = nn.Sequential(
    HiButterworth(cutoff=1000, order=2, fs=44100),
    LoButterworth(cutoff=5000, order=2, fs=44100),
    HiChebyshev1(cutoff=1500, order=2, ripple=0.5, fs=44100),
    LoChebyshev1(cutoff=1800, order=2, ripple=0.5, fs=44100),
)

| Filter | Type | Cutoff | Order | Purpose |
| --- | --- | --- | --- | --- |
| HiButterworth | High-pass | 1000 Hz | 2 | Remove low frequencies |
| LoButterworth | Low-pass | 5000 Hz | 2 | Remove high frequencies |
| HiChebyshev1 | High-pass | 1500 Hz | 2 | Additional high-pass |
| LoChebyshev1 | Low-pass | 1800 Hz | 2 | Additional low-pass |

Implementation Details#

graph TB
    subgraph "GPU Execution Path"
        GPU_Wave["Wave.to('cuda')"]
        GPU_Chain["fchain.to('cuda')"]
        GPU_Coeff["f.compute_coefficients()<br/>f.move_coeff('cuda')<br/>for each filter"]
        GPU_Execute["fchain(wave.ys)"]

        GPU_Wave --> GPU_Chain
        GPU_Chain --> GPU_Coeff
        GPU_Coeff --> GPU_Execute
    end

    subgraph "CPU Execution Path"
        CPU_Wave["Wave.to('cpu')"]
        CPU_Chain["fchain.to('cpu')"]
        CPU_Coeff["f.move_coeff('cpu')<br/>for each filter"]
        CPU_Execute["fchain(wave.ys)"]

        CPU_Wave --> CPU_Chain
        CPU_Chain --> CPU_Coeff
        CPU_Coeff --> CPU_Execute
    end

    subgraph "SciPy Execution Path"
        SciPy_Design["butter() / cheby1()<br/>Design filter coefficients"]
        SciPy_Filter["lfilter()<br/>Apply filters"]

        SciPy_Design --> SciPy_Filter
    end
    

GPU Filter Function#

def gpu_iir(wave):
    """Apply IIR filter chain on GPU."""
    # CRITICAL: moving the module with fchain.to('cuda') is not enough;
    # the designed coefficients must also be moved explicitly
    for f in fchain:
        f.compute_coefficients()
        f.move_coeff("cuda")
    return (wave | fchain).ys

Important: IIR filters require explicit coefficient movement to GPU.
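A small helper makes that device handoff explicit and reusable across both paths (a sketch built on the move_coeff() calls shown above; whether compute_coefficients() is safe to call repeatedly is an assumption):

def place_iir_chain(fchain, device):
    """Ensure every filter's designed coefficients live on `device`."""
    for f in fchain:
        f.compute_coefficients()  # design once (assumed idempotent)
        f.move_coeff(device)
    return fchain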

CPU Filter Function#

def cpu_iir(wave):
    """Apply IIR filter chain on CPU."""
    # Move coefficients back to CPU
    for f in fchain:
        f.move_coeff("cpu")
    return (wave | fchain).ys

SciPy Filter Function#

def scipy_iir(audio):
    """Apply IIR filters using SciPy."""
    # Design coefficients matching the TorchFX chain above
    b1, a1 = signal.butter(N=2, Wn=1000, btype='high', fs=44100)
    b2, a2 = signal.butter(N=2, Wn=5000, btype='low', fs=44100)
    b3, a3 = signal.cheby1(N=2, rp=0.5, Wn=1500, btype='high', fs=44100)
    b4, a4 = signal.cheby1(N=2, rp=0.5, Wn=1800, btype='low', fs=44100)

    # Apply the four filters sequentially
    for b, a in ((b1, a1), (b2, a2), (b3, a3), (b4, a4)):
        audio = signal.lfilter(b, a, audio)
    return audio

Output Format#

CSV with the following structure:

time,channels,gpu,cpu,scipy
1,1,0.005,0.008,0.010
1,2,0.006,0.009,0.012
...
600,12,0.987,1.234,1.567

Running IIR Benchmark#

python benchmark/iir_bench.py

Expected output:

IIR Filter Benchmark
Sample Rate: 44100Hz
Repetitions: 50

Testing: 1s, 1 channel...
  GPU:   0.005s
  CPU:   0.008s
  SciPy: 0.010s

Testing: 1s, 2 channels...
  GPU:   0.006s
  CPU:   0.009s
  SciPy: 0.012s

...

Results saved to iir.out

Interpreting Results#

Performance Metrics#

All timing values are reported in seconds, representing average execution time over 50 repetitions. Lower values indicate better performance.

Expected Performance Characteristics#

| Scenario | Expected Behavior |
| --- | --- |
| Short audio, few channels | CPU may outperform GPU due to transfer overhead |
| Long audio, many channels | GPU should significantly outperform CPU |
| Simple operations | SciPy may be competitive with CPU implementation |
| Complex filter chains | TorchFX benefits from vectorization and batching |

API Benchmark Interpretation#

The API benchmark compares ergonomics and performance:

  • FilterChain: Traditional PyTorch pattern with explicit control

  • Sequential: Standard PyTorch composition with automatic forwarding

  • Pipe operator: Most ergonomic with automatic configuration

  • SciPy: CPU-only baseline

Expected results:

  • Performance differences between FilterChain, Sequential, and Pipe should be minimal (same underlying operations)

  • Pipe operator provides automatic sampling rate configuration

  • SciPy may be slower due to lack of GPU acceleration

FIR/IIR Benchmark Interpretation#

These benchmarks generate multi-dimensional data for analysis:

  1. Duration scaling: How performance scales with audio length

    • Linear scaling expected for both CPU and GPU

    • GPU overhead amortized over longer durations

  2. Channel scaling: How performance scales with channel count

    • GPU should show better scaling for many channels

    • CPU performance degrades more with channel count

  3. GPU vs CPU: When GPU acceleration provides benefits

    • Crossover point varies by filter complexity

    • Generally favorable for >2 channels and >60s duration

  4. TorchFX vs SciPy: Overhead of PyTorch abstraction

    • TorchFX CPU should be competitive with SciPy

    • GPU should outperform SciPy for suitable workloads
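These questions are easy to probe directly from the CSV output; a sketch with pandas, using the column names from the output format above:

import pandas as pd

df = pd.read_csv("fir.out")

# Speedup of GPU relative to the CPU and SciPy baselines
df["gpu_vs_cpu"] = df["cpu"] / df["gpu"]
df["gpu_vs_scipy"] = df["scipy"] / df["gpu"]

# Configurations where the GPU wins (speedup > 1)
print(df[df["gpu_vs_cpu"] > 1][["time", "channels", "gpu_vs_cpu"]])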

Visualization#

The draw3.py script generates PNG plots from CSV output files:

python benchmark/draw3.py

Generated plots:

  • api_bench.png: Bar chart comparing API patterns

  • fir_bench.png: Performance curves across durations/channels

  • iir_bench.png: Performance curves across durations/channels
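draw3.py is the project's plotting script; a minimal standalone sketch of the same idea with matplotlib (not the actual script):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("fir.out")

# One curve per implementation at a fixed channel count
sub = df[df["channels"] == 8]
for col in ("gpu", "cpu", "scipy"):
    plt.plot(sub["time"], sub[col], marker="o", label=col)

plt.xlabel("Duration (s)")
plt.ylabel("Average time (s)")
plt.legend()
plt.savefig("fir_bench.png")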

Running All Benchmarks#

Prerequisites#

Ensure the development environment is set up:

# Sync dependencies
uv sync

# Verify CUDA availability (for GPU benchmarks)
python -c "import torch; print(torch.cuda.is_available())"

Execution Script#

Run all benchmarks sequentially:

# Run individual benchmarks
python benchmark/api_bench.py
python benchmark/fir_bench.py
python benchmark/iir_bench.py

# Generate visualizations
python benchmark/draw3.py

GPU Configuration#

To disable GPU benchmarks, comment out CUDA calls:

# In benchmark script
# wave.to("cuda")  # Comment to disable GPU
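Alternatively (a suggestion, not part of the current scripts), a capability check lets the same script fall back to the CPU on machines without CUDA:

import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

wave = wave.to(DEVICE)
fchain = fchain.to(DEVICE)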

Benchmark Maintenance#

Adding New Benchmarks#

To add a new benchmark:

  1. Create new Python file in benchmark/ directory

  2. Implement create_audio() for test signal generation

  3. Use timeit.timeit() with REP=50 for timing

  4. Compare against SciPy baseline when applicable

  5. Output results in CSV format

  6. Update this documentation

Template:

import timeit
import numpy as np

SAMPLE_RATE = 44100
REP = 50

def create_audio(duration, num_channels):
    samples = int(duration * SAMPLE_RATE)
    audio = np.random.randn(num_channels, samples)
    return audio / np.max(np.abs(audio))

def benchmark():
    # Setup
    audio = create_audio(duration=60, num_channels=2)

    # Time execution
    def run():
        # Code to benchmark
        pass

    time = timeit.timeit(run, number=REP)
    avg_time = time / REP

    print(f"Average time: {avg_time:.4f}s")

if __name__ == "__main__":
    benchmark()

Modifying Test Parameters#

Common parameters to adjust:

# Sample rate (default: 44100 Hz)
SAMPLE_RATE = 48000  # Change to 48kHz

# Repetitions (default: 50)
REP = 100  # Increase for more stable results

# Duration range (default varies by benchmark)
DURATIONS = [1, 10, 30, 60, 120]  # Custom duration range

# Channel counts (default varies by benchmark)
CHANNELS = [1, 2, 4, 8, 16]  # Custom channel counts

Coefficient Pre-computation#

For fair comparison, filter coefficients should be pre-computed:

# Pre-compute coefficients before timing
for f in fchain:
    f.compute_coefficients()

# Now time only the filtering operation
time = timeit.timeit(lambda: fchain(wave.ys), number=REP)

This ensures timing measures filtering performance, not coefficient design.

Best Practices#

Fair Comparisons#

# ✅ GOOD: Pre-compute coefficients
for f in fchain:
    f.compute_coefficients()
time = timeit.timeit(lambda: fchain(wave.ys), number=REP)

# ❌ BAD: Include coefficient design in timing
# (coefficients not pre-computed, so design work lands inside the timed region)
time = timeit.timeit(lambda: fchain(wave.ys), number=REP)

Sufficient Repetitions#

# ✅ GOOD: Use 50+ repetitions
REP = 50
time = timeit.timeit(func, number=REP) / REP

# ❌ BAD: Too few repetitions (high variance)
REP = 5
time = timeit.timeit(func, number=REP) / REP

Realistic Test Data#

# ✅ GOOD: Normalized random noise
audio = np.random.randn(channels, samples)
audio = audio / np.max(np.abs(audio))  # [-1, 1]

# ❌ BAD: Unrealistic data
audio = np.ones((channels, samples))  # All ones