Performance Optimization and Benchmarking#

Learn how to measure and optimize the performance of your TorchFX audio processing pipelines. This guide covers benchmarking methodology, GPU vs CPU performance comparisons, filter type trade-offs, and best practices for building high-throughput audio processing systems.

Prerequisites#

Before starting this guide, you should be familiar with:

Overview#

Performance optimization in TorchFX involves understanding three key trade-offs:

| Dimension | Trade-off | Optimization Strategy |
|---|---|---|
| Execution Backend | GPU vs CPU vs SciPy | Choose based on duration, channels, batch size |
| Filter Type | FIR vs IIR | Balance computational cost vs phase response |
| API Pattern | Classes vs Sequential vs Pipe | Select based on ergonomics vs performance needs |

        graph TB
    subgraph "Performance Optimization Dimensions"
        AudioTask["Audio Processing Task"]

        subgraph Backend["Backend Selection"]
            GPU["GPU (CUDA)<br/>High parallelism"]
            CPU["CPU (PyTorch)<br/>Moderate performance"]
            SciPy["SciPy<br/>Baseline reference"]
        end

        subgraph FilterType["Filter Type"]
            FIR["FIR Filters<br/>Linear phase<br/>Higher compute"]
            IIR["IIR Filters<br/>Non-linear phase<br/>Lower compute"]
        end

        subgraph API["API Pattern"]
            Pipe["Pipeline Operator<br/>Auto sample rate"]
            Sequential["nn.Sequential<br/>Standard PyTorch"]
            Custom["Custom Module<br/>Maximum control"]
        end

        AudioTask --> Backend
        AudioTask --> FilterType
        AudioTask --> API

        Backend --> Optimize["Optimized Pipeline"]
        FilterType --> Optimize
        API --> Optimize
    end

    style AudioTask fill:#e1f5ff
    style Optimize fill:#e1ffe1
    style GPU fill:#fff5e1
    style FIR fill:#f5e1ff
    style Pipe fill:#ffe1e1
    

Performance Optimization Framework - Three key dimensions for optimizing TorchFX pipelines.

See also

For detailed GPU acceleration patterns, see GPU Acceleration. For comprehensive benchmarking infrastructure, see the benchmark suite in the benchmark/ directory.

Benchmark Methodology#

TorchFX includes a comprehensive benchmarking suite that evaluates performance across different dimensions. The suite consists of three benchmark scripts, each targeting a specific aspect of performance.

Benchmark Suite Architecture#

        graph TB
    subgraph "TorchFX Benchmark Suite"
        API["api_bench.py<br/>API Pattern Comparison"]
        FIR["fir_bench.py<br/>FIR Filter Performance"]
        IIR["iir_bench.py<br/>IIR Filter Performance"]
    end

    subgraph "Common Test Parameters"
        SR["Sample Rate: 44.1 kHz"]
        REP["Repetitions: 50"]
        Signal["Signal: Random noise<br/>Float32, normalized"]
    end

    subgraph "Variable Parameters"
        Duration["Duration: 1s - 10 min"]
        Channels["Channels: 1, 2, 4, 8, 12"]
        Backends["Backends: GPU, CPU, SciPy"]
    end

    subgraph "Output"
        CSV["CSV Results<br/>.out files"]
    end

    API --> SR
    FIR --> SR
    IIR --> SR

    API --> REP
    FIR --> REP
    IIR --> REP

    FIR --> Duration
    FIR --> Channels
    IIR --> Duration
    IIR --> Channels

    API --> Backends
    FIR --> Backends
    IIR --> Backends

    API --> CSV
    FIR --> CSV
    IIR --> CSV

    style API fill:#e1f5ff
    style FIR fill:#e8f5e1
    style IIR fill:#fff5e1
    

Benchmark Suite Organization - Three complementary benchmarks measuring different performance aspects.

Test Signal Generation#

All benchmarks use consistent signal generation to ensure comparable results:

import numpy as np

def create_audio(sample_rate, duration, num_channels):
    """Generate multi-channel random noise for benchmarking.

    Parameters
    ----------
    sample_rate : int
        Sample rate in Hz (typically 44100)
    duration : float
        Duration in seconds
    num_channels : int
        Number of audio channels

    Returns
    -------
    signal : np.ndarray
        Shape (num_channels, num_samples), float32, normalized to [-1, 1]
    """
    signal = np.random.randn(num_channels, int(sample_rate * duration))
    signal = signal.astype(np.float32)
    # Normalize each channel independently
    signal /= np.max(np.abs(signal), axis=1, keepdims=True)
    return signal

Key Characteristics:

  • Distribution: Gaussian random noise (np.random.randn)

  • Data Type: Float32 for GPU compatibility

  • Normalization: Per-channel normalization to [-1, 1] range

  • Deterministic: Seeding the random generator makes repeated runs use the same test signal (see the sketch below)
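
If repeated runs should see the exact same test signal, seed NumPy's global generator before calling create_audio. A minimal sketch (the seed value is arbitrary):

import numpy as np

np.random.seed(42)  # fixed seed: every benchmark run generates identical noise
signal = create_audio(sample_rate=44100, duration=5, num_channels=4)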

Tip

Using random noise ensures benchmarks test worst-case performance without special structure or patterns that could be optimized by the hardware.

Timing Methodology#

All benchmarks use Python’s timeit.timeit() for accurate timing measurements:

import timeit

# Standard pattern across all benchmarks
REP = 50  # Number of repetitions

# Time the operation
elapsed = timeit.timeit(
    lambda: process_audio(wave, filter_chain),
    number=REP
)

# Report average time per iteration
avg_time = elapsed / REP
print(f"Average time: {avg_time:.6f}s")

Timing Best Practices:

  1. Multiple Repetitions: 50 repetitions minimize variance and warm-up effects

  2. Lambda Wrapper: The lambda closes over pre-built objects, so setup work is not re-executed inside the timed loop

  3. Average Reporting: Reports per-iteration time for easy comparison

  4. Excluded Overhead: Setup (loading, coefficient computation) excluded from timing

Important

Timing only measures the core processing operation. Setup costs like loading audio files, computing filter coefficients, and device transfers are performed before timing begins.
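
One caveat when adapting this pattern for GPU runs: CUDA kernels launch asynchronously, so without a synchronization point the timer can measure kernel launch rather than actual execution. A minimal sketch of a sync-aware wrapper (the helper name timed_gpu is illustrative, not part of TorchFX):

import timeit
import torch

def timed_gpu(fn, rep=50):
    """Average wall-clock time of fn(), forcing all queued CUDA work to finish."""
    fn()  # warm-up call, excluded from timing
    torch.cuda.synchronize()
    elapsed = timeit.timeit(lambda: (fn(), torch.cuda.synchronize()), number=rep)
    return elapsed / rep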

API Performance Comparison#

The api_bench.py benchmark compares different API patterns for applying the same filter chain to audio. This helps users understand the performance and ergonomic trade-offs of different coding styles.

Test Configuration#

Filter Chain: 6 cascaded IIR filters

  • 3 × High-pass Chebyshev Type I filters (20 Hz, 60 Hz, 65 Hz)

  • 3 × Low-pass Butterworth filters (5000 Hz, 4900 Hz, 4850 Hz)

Test Signal: 8-channel audio, 2 minutes duration, 44.1 kHz sample rate

        graph LR
    Input["Input Audio<br/>8 channels, 2 min"] --> F1["HiChebyshev1<br/>20 Hz"]
    F1 --> F2["HiChebyshev1<br/>60 Hz"]
    F2 --> F3["HiChebyshev1<br/>65 Hz"]
    F3 --> F4["LoButterworth<br/>5000 Hz"]
    F4 --> F5["LoButterworth<br/>4900 Hz"]
    F5 --> F6["LoButterworth<br/>4850 Hz"]
    F6 --> Output["Filtered Output<br/>8 channels, 2 min"]

    style Input fill:#e1f5ff
    style Output fill:#e1ffe1
    

API Benchmark Filter Chain - Six cascaded IIR filters applied to 8-channel audio.

API Pattern 1: Custom nn.Module Class#

A traditional PyTorch approach using a custom torch.nn.Module:

from torch import nn
from torchfx.filter import HiChebyshev1, LoButterworth

class FilterChain(nn.Module):
    """Custom module for filter chain."""

    def __init__(self, fs):
        super().__init__()
        self.f1 = HiChebyshev1(20, fs=fs)
        self.f2 = HiChebyshev1(60, fs=fs)
        self.f3 = HiChebyshev1(65, fs=fs)
        self.f4 = LoButterworth(5000, fs=fs)
        self.f5 = LoButterworth(4900, fs=fs)
        self.f6 = LoButterworth(4850, fs=fs)

    def forward(self, x):
        x = self.f1(x)
        x = self.f2(x)
        x = self.f3(x)
        x = self.f4(x)
        x = self.f5(x)
        x = self.f6(x)
        return x

# Usage
fchain = FilterChain(signal.fs)
result = fchain(signal.ys)

Characteristics:

  • Sample Rate: Must be passed explicitly to __init__

  • Reusability: Can be instantiated once and reused

  • Flexibility: Full control over forward pass logic

  • Overhead: Standard PyTorch nn.Module call overhead

API Pattern 2: nn.Sequential#

Using PyTorch’s built-in torch.nn.Sequential container:

from torch.nn import Sequential
from torchfx.filter import HiChebyshev1, LoButterworth

# Create filter chain inline
fchain = Sequential(
    HiChebyshev1(20, fs=signal.fs),
    HiChebyshev1(60, fs=signal.fs),
    HiChebyshev1(65, fs=signal.fs),
    LoButterworth(5000, fs=signal.fs),
    LoButterworth(4900, fs=signal.fs),
    LoButterworth(4850, fs=signal.fs),
)

# Apply to audio tensor
result = fchain(signal.ys)

Characteristics:

  • Sample Rate: Must be passed explicitly to each filter

  • Simplicity: No custom class needed

  • Performance: Identical to custom nn.Module (same underlying mechanism)

  • Flexibility: Can add/remove filters easily

Note

nn.Sequential and custom nn.Module classes have identical performance characteristics. The choice between them is purely about code organization and readability.

API Pattern 3: Pipeline Operator (Pipe)#

TorchFX’s idiomatic pipeline operator pattern:

from torchfx import Wave
from torchfx.filter import HiChebyshev1, LoButterworth

# Apply filters using pipe operator
result = (
    signal
    | HiChebyshev1(20)
    | HiChebyshev1(60)
    | HiChebyshev1(65)
    | LoButterworth(5000)
    | LoButterworth(4900)
    | LoButterworth(4850)
)

Characteristics:

  • Sample Rate: Automatically configured from Wave object

  • Ergonomics: Most readable and concise

  • Safety: Eliminates sample rate mismatch errors

  • Performance: Minimal overhead for automatic configuration

Tip

The pipeline operator is the recommended pattern for TorchFX. It provides the best balance of readability, safety (automatic sample rate configuration), and performance.

API Pattern 4: SciPy Baseline#

Pure NumPy/SciPy implementation for baseline comparison:

from scipy.signal import butter, cheby1, lfilter

# Pre-compute filter coefficients
b1, a1 = cheby1(2, 0.5, 20, btype='high', fs=SAMPLE_RATE)
b2, a2 = cheby1(2, 0.5, 60, btype='high', fs=SAMPLE_RATE)
b3, a3 = cheby1(2, 0.5, 65, btype='high', fs=SAMPLE_RATE)
b4, a4 = butter(2, 5000, btype='low', fs=SAMPLE_RATE)
b5, a5 = butter(2, 4900, btype='low', fs=SAMPLE_RATE)
b6, a6 = butter(2, 4850, btype='low', fs=SAMPLE_RATE)

# Apply filters sequentially
filtered = lfilter(b1, a1, signal)
filtered = lfilter(b2, a2, filtered)
filtered = lfilter(b3, a3, filtered)
filtered = lfilter(b4, a4, filtered)
filtered = lfilter(b5, a5, filtered)
filtered = lfilter(b6, a6, filtered)

Characteristics:

  • Performance: Optimized C implementation

  • GPU: No GPU acceleration available

  • Integration: Requires NumPy arrays (no PyTorch tensors)

  • Baseline: Reference for CPU performance comparison

API Performance Summary#

| API Pattern | Sample Rate Config | Ergonomics | Performance | Use Case |
|---|---|---|---|---|
| Custom nn.Module | Manual (fs=) | Good | Fast | Complex custom logic |
| nn.Sequential | Manual (fs=) | Very Good | Fast | Standard PyTorch integration |
| Pipeline Operator | Automatic | Excellent | Fast | Recommended for TorchFX |
| SciPy lfilter | Manual (fs=) | Fair | Fast (CPU only) | Baseline comparison |
Key Insight: The pipeline operator provides automatic sample rate configuration with negligible performance overhead, making it the most ergonomic choice without sacrificing speed.

See also

Pipeline Operator - Functional Composition - Detailed documentation on the pipeline operator pattern

FIR Filter Performance#

The fir_bench.py benchmark evaluates FIR filter performance across varying signal durations, channel counts, and execution backends (GPU, CPU, SciPy).

FIR Test Configuration#

Filter Chain: 5 cascaded FIR filters with varying tap counts

  • DesignableFIR: 101 taps, 1000 Hz cutoff

  • DesignableFIR: 102 taps, 5000 Hz cutoff

  • DesignableFIR: 103 taps, 1500 Hz cutoff

  • DesignableFIR: 104 taps, 1800 Hz cutoff

  • DesignableFIR: 105 taps, 1850 Hz cutoff

Test Parameters:

  • Durations: 5s, 60s, 180s, 300s, 600s (10 minutes)

  • Channels: 1, 2, 4, 8, 12

  • Backends: GPU (CUDA), CPU (PyTorch), SciPy (NumPy)

  • Repetitions: 50 per configuration

        graph TB
    subgraph "FIR Benchmark Test Matrix"
        Input["Input Signal<br/>Variable duration & channels"]

        subgraph Filters["FIR Filter Chain"]
            F1["FIR 101 taps<br/>1000 Hz"]
            F2["FIR 102 taps<br/>5000 Hz"]
            F3["FIR 103 taps<br/>1500 Hz"]
            F4["FIR 104 taps<br/>1800 Hz"]
            F5["FIR 105 taps<br/>1850 Hz"]
        end

        subgraph Durations["Duration Sweep"]
            D1["5 seconds"]
            D2["60 seconds"]
            D3["180 seconds"]
            D4["300 seconds"]
            D5["600 seconds"]
        end

        subgraph Channels["Channel Sweep"]
            C1["1 channel"]
            C2["2 channels"]
            C3["4 channels"]
            C4["8 channels"]
            C5["12 channels"]
        end

        Input --> F1
        F1 --> F2
        F2 --> F3
        F3 --> F4
        F4 --> F5

        Input --> Durations
        Input --> Channels
    end

    style Input fill:#e1f5ff
    style Filters fill:#fff5e1
    

FIR Benchmark Configuration - Tests performance across duration and channel count dimensions.

FIR Coefficient Pre-Computation#

FIR filters require coefficient computation before filtering. The benchmark explicitly pre-computes coefficients to exclude design time from performance measurements:

import torch.nn as nn
from torchfx import Wave
from torchfx.filter import DesignableFIR

SAMPLE_RATE = 44100

# Create filter chain
fchain = nn.Sequential(
    DesignableFIR(num_taps=101, cutoff=1000, fs=SAMPLE_RATE),
    DesignableFIR(num_taps=102, cutoff=5000, fs=SAMPLE_RATE),
    DesignableFIR(num_taps=103, cutoff=1500, fs=SAMPLE_RATE),
    DesignableFIR(num_taps=104, cutoff=1800, fs=SAMPLE_RATE),
    DesignableFIR(num_taps=105, cutoff=1850, fs=SAMPLE_RATE),
)

# Pre-compute coefficients before timing
for f in fchain:
    f.compute_coefficients()

# Now ready for benchmarking (coefficient design excluded)

Why Pre-Compute?

  1. Separation of Concerns: Design time vs filtering time are separate

  2. Realistic Use Case: Coefficients are typically computed once and reused

  3. Fair Comparison: SciPy baseline also pre-computes coefficients

Important

In production code, call compute_coefficients() once during initialization, then reuse the filter for processing multiple audio files.

FIR Device Transfer Pattern#

The benchmark demonstrates proper device management for GPU acceleration:

import timeit

# GPU benchmarking
wave.to("cuda")
fchain.to("cuda")
gpu_time = timeit.timeit(lambda: wave | fchain, number=REP)

# CPU benchmarking (transfer back)
wave.to("cpu")
fchain.to("cpu")
cpu_time = timeit.timeit(lambda: wave | fchain, number=REP)

# Calculate speedup
speedup = cpu_time / gpu_time
print(f"GPU speedup: {speedup:.2f}x")

Device Transfer Rules:

  1. Move both Wave and filter chain to the same device

  2. Pre-compute coefficients before moving to GPU

  3. Time only the filtering operation (exclude transfers)

  4. Move back to CPU for result saving

See also

GPU Acceleration - Comprehensive guide to GPU device management

FIR Performance Characteristics#

FIR filters have distinct performance characteristics compared to IIR filters:

| Characteristic | FIR Filters | Reason |
|---|---|---|
| Computational Cost | Higher | Convolution with many taps |
| GPU Advantage | Excellent | High parallelism in convolution |
| Memory Footprint | Larger | Must store all tap coefficients |
| Scaling with Taps | Linear O(N) | More taps = more multiply-accumulate operations |
| Scaling with Channels | Excellent | Independent per-channel convolution |

When FIR Filters Excel:

  • Long signals (>60s): Amortizes setup overhead

  • Many channels (≥4): Parallel convolution across channels

  • GPU available: Convolution is highly parallel

  • Linear phase required: Only FIR can provide linear phase

When to Avoid FIR:

  • Real-time processing: IIR filters have lower latency

  • Limited memory: FIR coefficients consume more memory

  • CPU-only, short signals: IIR may be faster

Tip

For steep frequency responses, FIR filters require many taps (100+). Consider IIR filters if phase linearity is not critical.
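
To estimate how many taps a given transition width demands, SciPy's Kaiser-window rule of thumb is a convenient guide. A sketch (cutoff, transition width, and attenuation values are illustrative):

from scipy.signal import firwin, kaiserord

fs = 44100
transition_hz = 200   # width of the transition band
ripple_db = 60        # desired stop-band attenuation

# Estimate the tap count, then design the filter with the matching Kaiser window
num_taps, beta = kaiserord(ripple_db, transition_hz / (fs / 2))
b = firwin(num_taps, 1000, window=("kaiser", beta), fs=fs)
print(f"Estimated taps: {num_taps}")

With these example values the estimate runs to several hundred taps, which is exactly the regime where an IIR design becomes attractive if phase linearity is not required.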

SciPy FIR Baseline Implementation#

The benchmark includes a SciPy baseline for CPU performance comparison:

from scipy.signal import firwin, lfilter

# Design FIR coefficients using scipy
b1 = firwin(101, 1000, fs=SAMPLE_RATE)
b2 = firwin(102, 5000, fs=SAMPLE_RATE)
b3 = firwin(103, 1500, fs=SAMPLE_RATE)
b4 = firwin(104, 1800, fs=SAMPLE_RATE)
b5 = firwin(105, 1850, fs=SAMPLE_RATE)

# Apply filters sequentially
a = [1]  # FIR filters have a = [1]
filtered = lfilter(b1, a, signal)
filtered = lfilter(b2, a, filtered)
filtered = lfilter(b3, a, filtered)
filtered = lfilter(b4, a, filtered)
filtered = lfilter(b5, a, filtered)

SciPy Characteristics:

  • CPU-only: No GPU acceleration

  • Optimized: Uses NumPy’s optimized C/Fortran backend

  • Reference: Establishes baseline CPU performance

  • Compatibility: Requires NumPy arrays (not PyTorch tensors)

IIR Filter Performance#

The iir_bench.py benchmark evaluates IIR (Infinite Impulse Response) filter performance using the same test matrix as FIR benchmarks but with recursive filters.

IIR Test Configuration#

Filter Chain: 4 cascaded IIR filters

  • HiButterworth: 1000 Hz cutoff, order 2

  • LoButterworth: 5000 Hz cutoff, order 2

  • HiChebyshev1: 1500 Hz cutoff, order 2

  • LoChebyshev1: 1800 Hz cutoff, order 2

Test Parameters:

  • Durations: 1s, 5s, 180s, 300s, 600s (10 minutes)

  • Channels: 1, 2, 4, 8, 12

  • Backends: GPU (CUDA), CPU (PyTorch), SciPy (NumPy)

  • Repetitions: 50 per configuration

        graph TB
    subgraph "IIR Benchmark Architecture"
        Input["Input Signal<br/>Variable duration & channels"]

        subgraph Chain["IIR Filter Chain"]
            F1["HiButterworth<br/>1000 Hz, order 2"]
            F2["LoButterworth<br/>5000 Hz, order 2"]
            F3["HiChebyshev1<br/>1500 Hz, order 2"]
            F4["LoChebyshev1<br/>1800 Hz, order 2"]
        end

        subgraph Setup["IIR-Specific Setup"]
            Compute["compute_coefficients()<br/>Design b, a coefficients"]
            MoveCoeff["move_coeff('cuda'/'cpu')<br/>Transfer to device"]
        end

        subgraph Backends["Execution Backends"]
            GPU["GPU: fchain(wave.ys)"]
            CPU["CPU: fchain(wave.ys)"]
            SciPy["SciPy: lfilter(b, a, signal)"]
        end

        Input --> F1
        F1 --> F2
        F2 --> F3
        F3 --> F4

        F1 --> Compute
        Compute --> MoveCoeff
        MoveCoeff --> GPU
        MoveCoeff --> CPU

        F1 --> SciPy
    end

    style Input fill:#e1f5ff
    style Chain fill:#fff5e1
    style Setup fill:#e8f5e1
    

IIR Benchmark Structure - Shows coefficient management and execution backends.

IIR Coefficient Management#

Unlike FIR filters, IIR filters have both numerator (b) and denominator (a) coefficients that must be explicitly managed:

import torch.nn as nn
from torchfx import Wave
from torchfx.filter import HiButterworth, LoButterworth

SAMPLE_RATE = 44100

# Create IIR filter chain
fchain = nn.Sequential(
    HiButterworth(cutoff=1000, order=2, fs=SAMPLE_RATE),
    LoButterworth(cutoff=5000, order=2, fs=SAMPLE_RATE),
)

# Move wave and module to GPU
wave.to("cuda")
fchain.to("cuda")

# IIR-specific: compute and move coefficients
for f in fchain:
    f.compute_coefficients()  # Design b, a coefficients
    f.move_coeff("cuda")       # Move coefficients to GPU

# Now ready for GPU processing
result = fchain(wave.ys)

Two-Step Device Transfer:

  1. Module transfer: fchain.to("cuda") moves module parameters

  2. Coefficient transfer: f.move_coeff("cuda") moves filter coefficients

Warning

For IIR filters, you must both move the module to the device and call move_coeff(). Forgetting the second step will cause runtime errors.
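
A small helper keeps the two steps together so the second one is never forgotten (the function name move_chain is illustrative, not part of the TorchFX API):

def move_chain(fchain, device):
    """Move a filter chain and, for IIR filters, their coefficients to a device."""
    fchain.to(device)
    for f in fchain:
        if hasattr(f, "move_coeff"):
            f.move_coeff(device)
    return fchain

# Usage
fchain = move_chain(fchain, "cuda")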

IIR vs FIR Performance Trade-offs#

IIR and FIR filters have fundamentally different performance characteristics:

| Aspect | IIR Filters | FIR Filters |
|---|---|---|
| Computational Cost | Lower (fewer operations per sample) | Higher (convolution with many taps) |
| Memory Footprint | Small (few coefficients: b, a) | Large (many tap coefficients) |
| GPU Advantage | Moderate (less parallelism) | High (highly parallel convolution) |
| Phase Response | Non-linear | Can be linear (symmetric taps) |
| Stability | Can be unstable if poorly designed | Always stable |
| Filter Order | Achieves sharp cutoff with low order | Requires many taps for sharp cutoff |
Performance Comparison Example:

# IIR: Order 8 Butterworth (18 coefficients total)
iir_filter = LoButterworth(cutoff=1000, order=8, fs=44100)
# Coefficients: b (9 values) + a (9 values) = 18 total

# Equivalent FIR: ~150+ taps for similar frequency response
fir_filter = DesignableFIR(num_taps=151, cutoff=1000, fs=44100)
# Coefficients: 151 tap values

# IIR is ~8x more memory-efficient and faster on CPU
# FIR has better GPU parallelism and linear phase
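
The linear-phase property of the symmetric FIR design can be verified directly with SciPy's group-delay function; a quick sketch (tap count and frequency grid are illustrative):

import numpy as np
from scipy.signal import firwin, group_delay

fs = 44100
b = firwin(151, 1000, fs=fs)              # symmetric (linear-phase) low-pass FIR
w = np.linspace(10, 900, 200)             # pass-band frequencies in Hz
_, gd = group_delay((b, [1.0]), w=w, fs=fs)
# A symmetric FIR has constant group delay of (num_taps - 1) / 2 = 75 samples
print(gd.min(), gd.max())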

Choosing Between IIR and FIR:

        flowchart TD
    Start["Choose Filter Type"]

    LinearPhase{"Linear phase<br/>required?"}
    Stability{"Stability<br/>critical?"}
    GPU{"GPU<br/>available?"}
    Memory{"Memory<br/>constrained?"}

    UseFIR["Use FIR Filters<br/>✓ Linear phase<br/>✓ Always stable<br/>✓ GPU-friendly"]
    UseIIR["Use IIR Filters<br/>✓ Low memory<br/>✓ Low latency<br/>✓ Efficient CPU"]

    Start --> LinearPhase
    LinearPhase -->|Yes| UseFIR
    LinearPhase -->|No| Stability
    Stability -->|Critical| UseFIR
    Stability -->|Not critical| GPU
    GPU -->|Yes, long signals| UseFIR
    GPU -->|No or short signals| Memory
    Memory -->|Yes| UseIIR
    Memory -->|No| UseFIR

    style UseFIR fill:#e1ffe1
    style UseIIR fill:#e1f5ff
    

Filter Type Selection Decision Tree - Choose based on phase, stability, and resource constraints.

IIR SciPy Baseline#

The IIR benchmark includes SciPy baseline for comparison:

from scipy.signal import butter, cheby1, lfilter

# Design IIR coefficients
b1, a1 = butter(2, 1000, btype='high', fs=SAMPLE_RATE)
b2, a2 = butter(2, 5000, btype='low', fs=SAMPLE_RATE)
b3, a3 = cheby1(2, 0.5, 1500, btype='high', fs=SAMPLE_RATE)
b4, a4 = cheby1(2, 0.5, 1800, btype='low', fs=SAMPLE_RATE)

# Apply filters sequentially
filtered = lfilter(b1, a1, signal)
filtered = lfilter(b2, a2, filtered)
filtered = lfilter(b3, a3, filtered)
filtered = lfilter(b4, a4, filtered)

SciPy IIR Performance:

  • CPU-optimized: Highly optimized C implementation

  • No GPU: SciPy doesn’t support CUDA

  • Baseline: Reference for CPU performance

  • Filter Design: Uses standard signal processing algorithms

Performance Optimization Guidelines#

Based on the benchmarking results, follow these guidelines to optimize your TorchFX pipelines.

When to Use GPU Acceleration#

GPU acceleration provides the greatest benefit under specific conditions:

        flowchart TD
    Start["Audio Processing Task"]

    CheckDuration{"Signal duration<br/>> 60 seconds?"}
    CheckChannels{"Channels ≥ 4?"}
    CheckBatch{"Batch processing<br/>multiple files?"}
    CheckFIR{"Using FIR filters<br/>with >100 taps?"}
    CheckRealtime{"Real-time<br/>low-latency requirement?"}

    UseGPU["✓ Use GPU<br/>wave.to('cuda')<br/>fchain.to('cuda')"]
    UseCPU["✓ Use CPU<br/>Default or wave.to('cpu')"]

    Start --> CheckDuration
    CheckDuration -->|Yes| UseGPU
    CheckDuration -->|No| CheckChannels
    CheckChannels -->|Yes| UseGPU
    CheckChannels -->|No| CheckBatch
    CheckBatch -->|Yes| UseGPU
    CheckBatch -->|No| CheckFIR
    CheckFIR -->|Yes| UseGPU
    CheckFIR -->|No| CheckRealtime
    CheckRealtime -->|Yes| UseCPU
    CheckRealtime -->|No| UseGPU

    style UseGPU fill:#e1ffe1
    style UseCPU fill:#e1f5ff
    

GPU Decision Tree - Follow this flowchart to determine optimal execution backend.

GPU Performance Sweet Spot:

| Factor | Threshold | Reasoning |
|---|---|---|
| Duration | > 60 seconds | Amortizes data transfer overhead |
| Channels | ≥ 4 channels | Exploits parallel processing |
| Batch Size | > 5 files | Transfer overhead amortized across batch |
| FIR Taps | > 100 taps | Convolution highly parallelizable |
| IIR Chain | ≥ 3 filters | Accumulated compute benefits |

CPU Preferred Cases:

  • Real-time processing: More predictable latency

  • Short signals (<30s): Transfer overhead dominates

  • Single channel: Insufficient parallelism

  • IIR filters only: Less GPU benefit than FIR

Tip

When in doubt, benchmark your specific workload. Use the patterns from the benchmark suite as templates.
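
As a starting point before benchmarking, the thresholds above can be folded into a simple heuristic. A sketch (the function name choose_device and the exact cut-offs are illustrative, not a TorchFX API):

import torch

def choose_device(duration_s, num_channels, batch_size=1, fir_taps=0):
    """Pick a backend from the sweet-spot thresholds above."""
    if not torch.cuda.is_available():
        return "cpu"
    if duration_s > 60 or num_channels >= 4 or batch_size > 5 or fir_taps > 100:
        return "cuda"
    return "cpu"

# Usage
device = choose_device(duration_s=120, num_channels=8)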

Filter Chain Optimization#

Optimize filter chains by pre-computing coefficients and reusing filters:

import torch
import torch.nn as nn
from torchfx import Wave
from torchfx.filter import DesignableFIR, HiButterworth

SAMPLE_RATE = 44100

# Create filter chain
fchain = nn.Sequential(
    DesignableFIR(num_taps=101, cutoff=1000, fs=SAMPLE_RATE),
    HiButterworth(cutoff=500, order=2, fs=SAMPLE_RATE),
)

# Pre-compute coefficients once during initialization
for f in fchain:
    f.compute_coefficients()

# For IIR filters, also move coefficients to device
device = "cuda" if torch.cuda.is_available() else "cpu"
fchain.to(device)

for f in fchain:
    if hasattr(f, 'move_coeff'):
        f.move_coeff(device)

# Process multiple files without re-computing coefficients
audio_files = ["song1.wav", "song2.wav", "song3.wav"]

for audio_file in audio_files:
    wave = Wave.from_file(audio_file).to(device)
    result = wave | fchain  # Uses cached coefficients
    result.to("cpu").save(f"processed_{audio_file}")

Optimization Benefits:

  1. Coefficient caching: Compute once, reuse for all files

  2. Device pinning: Keep filters on GPU across iterations

  3. Batch amortization: Setup cost amortized over multiple files

Device Placement Strategy#

Minimize device transfers by keeping processing on a single device:

import torch
import torchfx as fx

# Strategy 1: Single device throughout (RECOMMENDED)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load and move to device once
wave = fx.Wave.from_file("audio.wav").to(device)

# Create filter chain on same device
fchain = torch.nn.Sequential(
    fx.filter.HiButterworth(cutoff=80, order=2),
    fx.filter.LoButterworth(cutoff=12000, order=4),
).to(device)

# All operations on same device
result = wave | fchain

# Move to CPU only for final I/O
result.to("cpu").save("output.wav")

Avoid Inefficient Transfers (Anti-pattern):

# ❌ WRONG: Unnecessary device transfers
wave = fx.Wave.from_file("audio.wav").to("cuda")
result1 = wave.to("cpu") | cpu_filter  # Transfer 1
result2 = result1.to("cuda") | gpu_filter  # Transfer 2
result3 = result2.to("cpu") | cpu_filter2  # Transfer 3

# ✅ CORRECT: Single device
device = "cuda" if torch.cuda.is_available() else "cpu"
wave = fx.Wave.from_file("audio.wav").to(device)
cpu_filter.to(device)
gpu_filter.to(device)
cpu_filter2.to(device)
result = wave | cpu_filter | gpu_filter | cpu_filter2

Device Transfer Costs:

  • CPU → GPU: O(n) where n = number of samples

  • GPU → CPU: O(n) where n = number of samples

  • Impact: Can dominate total time for short signals (see the measurement sketch below)
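
To quantify the transfer cost on your own hardware, a quick sketch (the tensor shape is illustrative):

import timeit
import torch

x = torch.randn(8, 44100 * 60)  # 8 channels, 60 s at 44.1 kHz, float32

def round_trip():
    y = x.to("cuda")
    torch.cuda.synchronize()
    _ = y.to("cpu")

print(f"Round-trip transfer: {timeit.timeit(round_trip, number=10) / 10:.4f}s")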

See also

GPU Acceleration - Comprehensive device management patterns

Memory Management Best Practices#

Optimize memory usage for large-scale processing:

| Optimization | Implementation | Impact |
|---|---|---|
| In-place operations | Use effects that modify tensors in-place where possible | Reduces memory allocations |
| Chunked processing | Process long audio in chunks | Prevents GPU OOM errors |
| Coefficient caching | Pre-compute and reuse filter coefficients | Eliminates redundant computation |
| Device pinning | Keep frequently-used filters on device | Reduces transfer overhead |
| Batch size tuning | Adjust batch size to fit GPU memory | Maximizes throughput |

Memory-Efficient Chunked Processing Example:

import torch
import torchfx as fx

def process_long_audio_chunked(wave, fchain, chunk_duration=60):
    """Process very long audio in chunks to manage GPU memory.

    Parameters
    ----------
    wave : Wave
        Input audio (can be on CPU or GPU)
    fchain : nn.Module
        Filter chain (must be on same device as intended chunks)
    chunk_duration : float
        Chunk duration in seconds

    Returns
    -------
    Wave
        Processed audio
    """
    chunk_samples = int(chunk_duration * wave.fs)
    num_chunks = (wave.ys.size(-1) + chunk_samples - 1) // chunk_samples

    device = "cuda" if torch.cuda.is_available() else "cpu"
    fchain.to(device)

    results = []
    for i in range(num_chunks):
        start = i * chunk_samples
        end = min((i + 1) * chunk_samples, wave.ys.size(-1))

        # Extract chunk and move to GPU
        chunk = fx.Wave(wave.ys[..., start:end], wave.fs).to(device)

        # Process chunk on GPU
        processed_chunk = chunk | fchain

        # Move back to CPU and store
        results.append(processed_chunk.ys.cpu())

        # Clear GPU cache
        if device == "cuda":
            torch.cuda.empty_cache()

    # Concatenate results
    return fx.Wave(torch.cat(results, dim=-1), wave.fs)

# Usage
wave = fx.Wave.from_file("10_hour_recording.wav")
fchain = fx.filter.LoButterworth(cutoff=1000, order=4)
result = process_long_audio_chunked(wave, fchain, chunk_duration=60)
result.save("processed.wav")

Chunked Processing Benefits:

  • Processes arbitrarily long audio without OOM errors

  • Keeps GPU utilization high

  • Balances memory usage with throughput

Benchmarking Your Own Pipelines#

Use these patterns to benchmark your custom TorchFX pipelines.

Basic Benchmarking Template#

import timeit
import torch
import torchfx as fx
import numpy as np

# Configuration
SAMPLE_RATE = 44100
DURATION = 60  # seconds
NUM_CHANNELS = 4
REP = 50  # repetitions for timing

# Generate test signal
signal = np.random.randn(NUM_CHANNELS, int(SAMPLE_RATE * DURATION))
signal = signal.astype(np.float32)
signal /= np.max(np.abs(signal), axis=1, keepdims=True)
wave = fx.Wave(signal, SAMPLE_RATE)

# Create your processing pipeline
pipeline = torch.nn.Sequential(
    fx.filter.HiButterworth(cutoff=100, order=2),
    fx.filter.LoButterworth(cutoff=10000, order=4),
    fx.effect.Normalize(peak=0.9),
)

# Pre-compute coefficients
for module in pipeline:
    if hasattr(module, 'compute_coefficients'):
        module.compute_coefficients()

# Benchmark CPU
wave.to("cpu")
pipeline.to("cpu")
cpu_time = timeit.timeit(lambda: wave | pipeline, number=REP)

# Benchmark GPU (if available)
if torch.cuda.is_available():
    wave.to("cuda")
    pipeline.to("cuda")

    # Move IIR coefficients if needed
    for module in pipeline:
        if hasattr(module, 'move_coeff'):
            module.move_coeff("cuda")

    gpu_time = timeit.timeit(lambda: wave | pipeline, number=REP)

    print(f"CPU time: {cpu_time/REP:.6f}s")
    print(f"GPU time: {gpu_time/REP:.6f}s")
    print(f"Speedup: {cpu_time/gpu_time:.2f}x")
else:
    print(f"CPU time: {cpu_time/REP:.6f}s")
    print("GPU not available")

Multi-Configuration Benchmark#

Test performance across multiple configurations:

import timeit
import torch
import torchfx as fx
import numpy as np
import pandas as pd

SAMPLE_RATE = 44100
REP = 50

# Test configurations
durations = [5, 30, 60, 120, 300]  # seconds
channel_counts = [1, 2, 4, 8]

# Create filter chain
pipeline = torch.nn.Sequential(
    fx.filter.LoButterworth(cutoff=1000, order=4),
    fx.filter.HiButterworth(cutoff=100, order=2),
)

# Pre-compute coefficients
for module in pipeline:
    if hasattr(module, 'compute_coefficients'):
        module.compute_coefficients()

# Benchmark grid
results = []

for duration in durations:
    for channels in channel_counts:
        # Generate test signal
        signal = np.random.randn(channels, int(SAMPLE_RATE * duration))
        signal = signal.astype(np.float32)
        signal /= np.max(np.abs(signal), axis=1, keepdims=True)
        wave = fx.Wave(signal, SAMPLE_RATE)

        # CPU benchmark (move IIR coefficients back in case a previous
        # iteration left them on the GPU)
        wave.to("cpu")
        pipeline.to("cpu")
        for module in pipeline:
            if hasattr(module, 'move_coeff'):
                module.move_coeff("cpu")
        cpu_time = timeit.timeit(lambda: wave | pipeline, number=REP) / REP

        # GPU benchmark
        if torch.cuda.is_available():
            wave.to("cuda")
            pipeline.to("cuda")
            for module in pipeline:
                if hasattr(module, 'move_coeff'):
                    module.move_coeff("cuda")
            gpu_time = timeit.timeit(lambda: wave | pipeline, number=REP) / REP
            speedup = cpu_time / gpu_time
        else:
            gpu_time = None
            speedup = None

        results.append({
            'duration': duration,
            'channels': channels,
            'cpu_time': cpu_time,
            'gpu_time': gpu_time,
            'speedup': speedup
        })

# Convert to DataFrame for analysis
df = pd.DataFrame(results)
print(df.to_string(index=False))

# Save to CSV
df.to_csv("benchmark_results.csv", index=False)

Profiling with PyTorch Profiler#

For detailed performance analysis, use PyTorch’s built-in profiler:

import torch
import torchfx as fx

# Create pipeline
wave = fx.Wave.from_file("audio.wav").to("cuda")
pipeline = torch.nn.Sequential(
    fx.filter.LoButterworth(cutoff=1000, order=4),
    fx.filter.HiButterworth(cutoff=100, order=2),
).to("cuda")

# Profile the pipeline
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    result = wave | pipeline

# Print profiling results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Export for visualization
prof.export_chrome_trace("trace.json")
# Open trace.json in chrome://tracing for detailed visualization

See also

PyTorch Profiler Documentation - Official guide to PyTorch profiling tools

Complete Benchmarking Examples#

These complete examples demonstrate how to run comprehensive benchmarks for your specific use cases.

Example 1: API Pattern Comparison#

Compare different API patterns for your filter chain:

import timeit
import numpy as np
from torch import nn
from torchfx import Wave
from torchfx.filter import HiChebyshev1, LoButterworth

SAMPLE_RATE = 44100
DURATION = 120  # 2 minutes
NUM_CHANNELS = 8
REP = 50

# Generate test signal
signal_data = np.random.randn(NUM_CHANNELS, int(SAMPLE_RATE * DURATION))
signal_data = signal_data.astype(np.float32)
signal_data /= np.max(np.abs(signal_data), axis=1, keepdims=True)
wave = Wave(signal_data, SAMPLE_RATE)

# Pattern 1: Custom nn.Module class
class FilterChain(nn.Module):
    def __init__(self, fs):
        super().__init__()
        self.f1 = HiChebyshev1(20, fs=fs)
        self.f2 = LoButterworth(5000, fs=fs)

    def forward(self, x):
        x = self.f1(x)
        x = self.f2(x)
        return x

def test_class():
    fchain = FilterChain(wave.fs)
    return fchain(wave.ys)

# Pattern 2: nn.Sequential
def test_sequential():
    fchain = nn.Sequential(
        HiChebyshev1(20, fs=wave.fs),
        LoButterworth(5000, fs=wave.fs),
    )
    return fchain(wave.ys)

# Pattern 3: Pipe operator
def test_pipe():
    return wave | HiChebyshev1(20) | LoButterworth(5000)

# Benchmark each pattern
class_time = timeit.timeit(test_class, number=REP)
seq_time = timeit.timeit(test_sequential, number=REP)
pipe_time = timeit.timeit(test_pipe, number=REP)

print(f"Custom class: {class_time/REP:.6f}s")
print(f"nn.Sequential: {seq_time/REP:.6f}s")
print(f"Pipe operator: {pipe_time/REP:.6f}s")

Example 2: FIR Filter Performance Analysis#

Comprehensive FIR filter benchmarking across durations and channel counts:

import timeit
import numpy as np
import torch.nn as nn
import pandas as pd
from torchfx import Wave
from torchfx.filter import DesignableFIR

SAMPLE_RATE = 44100
REP = 50

# Test matrix
durations = [5, 60, 180, 300, 600]
channel_counts = [1, 2, 4, 8, 12]

results = []

for duration in durations:
    for channels in channel_counts:
        # Generate test signal
        signal = np.random.randn(channels, int(SAMPLE_RATE * duration))
        signal = signal.astype(np.float32)
        signal /= np.max(np.abs(signal), axis=1, keepdims=True)
        wave = Wave(signal, SAMPLE_RATE)

        # Create FIR filter chain
        fchain = nn.Sequential(
            DesignableFIR(num_taps=101, cutoff=1000, fs=SAMPLE_RATE),
            DesignableFIR(num_taps=102, cutoff=5000, fs=SAMPLE_RATE),
            DesignableFIR(num_taps=103, cutoff=1500, fs=SAMPLE_RATE),
        )

        # Pre-compute coefficients
        for f in fchain:
            f.compute_coefficients()

        # GPU benchmark
        wave.to("cuda")
        fchain.to("cuda")
        gpu_time = timeit.timeit(lambda: wave | fchain, number=REP) / REP

        # CPU benchmark
        wave.to("cpu")
        fchain.to("cpu")
        cpu_time = timeit.timeit(lambda: wave | fchain, number=REP) / REP

        results.append({
            'duration_sec': duration,
            'channels': channels,
            'gpu_time_sec': gpu_time,
            'cpu_time_sec': cpu_time,
            'speedup': cpu_time / gpu_time
        })

        print(f"Duration: {duration}s, Channels: {channels}, "
              f"GPU: {gpu_time:.6f}s, CPU: {cpu_time:.6f}s, "
              f"Speedup: {cpu_time/gpu_time:.2f}x")

# Save results
df = pd.DataFrame(results)
df.to_csv("fir_benchmark.csv", index=False)
print("\nResults saved to fir_benchmark.csv")

Example 3: IIR Filter Performance Analysis#

Complete IIR filter benchmarking with coefficient management:

import timeit
import numpy as np
import torch.nn as nn
import pandas as pd
from torchfx import Wave
from torchfx.filter import HiButterworth, LoButterworth, HiChebyshev1, LoChebyshev1

SAMPLE_RATE = 44100
REP = 50

# Test matrix
durations = [1, 5, 180, 300, 600]
channel_counts = [1, 2, 4, 8, 12]

results = []

for duration in durations:
    for channels in channel_counts:
        # Generate test signal
        signal = np.random.randn(channels, int(SAMPLE_RATE * duration))
        signal = signal.astype(np.float32)
        signal /= np.max(np.abs(signal), axis=1, keepdims=True)
        wave = Wave(signal, SAMPLE_RATE)

        # Create IIR filter chain
        fchain = nn.Sequential(
            HiButterworth(cutoff=1000, order=2, fs=SAMPLE_RATE),
            LoButterworth(cutoff=5000, order=2, fs=SAMPLE_RATE),
            HiChebyshev1(cutoff=1500, order=2, fs=SAMPLE_RATE),
            LoChebyshev1(cutoff=1800, order=2, fs=SAMPLE_RATE),
        )

        # GPU benchmark
        wave.to("cuda")
        fchain.to("cuda")

        # Compute and move coefficients
        for f in fchain:
            f.compute_coefficients()
            f.move_coeff("cuda")

        gpu_time = timeit.timeit(
            lambda: fchain(wave.ys),
            number=REP
        ) / REP

        # CPU benchmark
        wave.to("cpu")
        fchain.to("cpu")

        for f in fchain:
            f.move_coeff("cpu")

        cpu_time = timeit.timeit(
            lambda: fchain(wave.ys),
            number=REP
        ) / REP

        results.append({
            'duration_sec': duration,
            'channels': channels,
            'gpu_time_sec': gpu_time,
            'cpu_time_sec': cpu_time,
            'speedup': cpu_time / gpu_time
        })

        print(f"Duration: {duration}s, Channels: {channels}, "
              f"GPU: {gpu_time:.6f}s, CPU: {cpu_time:.6f}s, "
              f"Speedup: {cpu_time/gpu_time:.2f}x")

# Save results
df = pd.DataFrame(results)
df.to_csv("iir_benchmark.csv", index=False)
print("\nResults saved to iir_benchmark.csv")

Summary#

Key takeaways for optimizing TorchFX performance:

  1. GPU Acceleration: Use GPU for long signals (>60s), multi-channel audio (≥4 channels), and batch processing

  2. Filter Choice: FIR filters excel on GPU with parallel convolution; IIR filters are more CPU-efficient

  3. API Pattern: Pipeline operator provides best ergonomics with automatic sample rate configuration and minimal overhead

  4. Coefficient Caching: Pre-compute filter coefficients once and reuse for multiple files

  5. Device Management: Minimize transfers by keeping all processing on one device

  6. Memory: Use chunked processing for very long audio files to prevent OOM errors

  7. Benchmarking: Use the provided templates to measure performance of your specific pipelines

GPU acceleration can provide 5-20x speedups for appropriate workloads. Follow the decision trees and best practices in this guide to maximize throughput in your audio processing pipelines.

External Resources#