(gpu-acceleration)=
# GPU Acceleration
Learn how to leverage CUDA-enabled GPUs for accelerated audio processing in TorchFX. This tutorial covers device management, performance optimization, and best practices for moving audio processing workflows to the GPU.
## Prerequisites
Before starting this tutorial, you should be familiar with:
- {doc}`../core-concepts/wave` - Wave class fundamentals
- {doc}`../core-concepts/pipeline-operator` - Pipeline operator basics
- [PyTorch CUDA Semantics](https://pytorch.org/docs/stable/notes/cuda.html) - PyTorch device management
- Basic understanding of GPU computing concepts
## Overview
TorchFX leverages PyTorch's device management system to enable GPU acceleration for audio processing. All audio data ({class}`~torchfx.Wave` objects) and filter coefficients can be seamlessly moved between CPU and GPU memory using standard PyTorch device APIs.
### When to Use GPU Acceleration
GPU acceleration provides significant performance benefits in specific scenarios:
| Scenario | GPU Advantage | Reason |
|----------|---------------|---------|
| Long audio files (>60 seconds) | **High** | Amortizes data transfer overhead |
| Multi-channel audio (≥4 channels) | **High** | Parallel processing across channels |
| Complex filter chains (≥3 filters) | **Medium-High** | Accumulated compute savings |
| Short audio (<5 seconds) | **Low** | Data transfer overhead dominates |
| Single channel, simple processing | **Low-Medium** | Insufficient parallelism |
```{tip}
For batch processing of many audio files, GPU acceleration can provide substantial speedups even for shorter files, as the overhead is amortized across the entire batch.
```
## Device Management Architecture
TorchFX uses PyTorch's device management system for both {class}`~torchfx.Wave` objects and filter modules.
```{mermaid}
graph TB
subgraph CPU["CPU Memory Space"]
WaveCPU["Wave Object<br/>ys: Tensor (CPU)<br/>fs: int<br/>device: 'cpu'"]
FilterCPU["Filter Modules<br/>coefficients on CPU"]
end
subgraph GPU["GPU Memory Space (CUDA)"]
WaveGPU["Wave Object<br/>ys: Tensor (CUDA)<br/>fs: int<br/>device: 'cuda'"]
FilterGPU["Filter Modules<br/>coefficients on CUDA"]
end
subgraph API["Device Management API"]
ToMethod["Wave.to(device)"]
DeviceProp["Wave.device property"]
ModuleTo["nn.Module.to(device)"]
end
WaveCPU -->|"wave.to('cuda')"| ToMethod
ToMethod -->|"moves ys tensor"| WaveGPU
WaveGPU -->|"wave.to('cpu')"| ToMethod
ToMethod -->|"moves ys tensor"| WaveCPU
DeviceProp -->|"setter calls to()"| ToMethod
FilterCPU -->|"filter.to('cuda')"| ModuleTo
ModuleTo -->|"moves parameters"| FilterGPU
FilterGPU -->|"filter.to('cpu')"| ModuleTo
ModuleTo -->|"moves parameters"| FilterCPU
style WaveCPU fill:#e1f5ff
style WaveGPU fill:#e1ffe1
style FilterCPU fill:#fff5e1
style FilterGPU fill:#fff5e1
```
**Device Transfer Architecture** - Wave objects and filters can be moved between CPU and GPU memory using standard PyTorch APIs.
## Moving Wave Objects to GPU
The {class}`~torchfx.Wave` class provides two ways to manage device placement: the `to()` method and the `device` property setter.
### The `to()` Method
The primary method for moving a {class}`~torchfx.Wave` object between devices is {meth}`~torchfx.Wave.to`, which returns the modified {class}`~torchfx.Wave` object for method chaining:
```python
import torchfx as fx
# Load audio file (defaults to CPU)
wave = fx.Wave.from_file("audio.wav")
print(wave.device) # 'cpu'
# Move to GPU
wave.to("cuda")
print(wave.device) # 'cuda'
# Move back to CPU
wave.to("cpu")
print(wave.device) # 'cpu'
```
The `to()` method performs two operations:
1. Updates the internal `__device` field to track the current device
2. Moves the underlying `ys` tensor using PyTorch's `Tensor.to(device)` method
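To make these two steps concrete, the following quick check (a sketch that assumes a CUDA device is available) relies only on the `Wave.to`, `Wave.ys`, and `Wave.device` behavior described above:
```python
import torchfx as fx
wave = fx.Wave.from_file("audio.wav")
moved = wave.to("cuda")
assert moved is wave                    # to() returns the same Wave object
assert wave.ys.device.type == "cuda"    # underlying tensor was moved to the GPU
assert wave.device == "cuda"            # internal device field was updated
```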
```{seealso}
{meth}`torchfx.Wave.to` - API documentation for the `to()` method
```
### The `device` Property
The {attr}`~torchfx.Wave.device` property provides both getter and setter functionality:
```python
import torchfx as fx
wave = fx.Wave.from_file("audio.wav")
# Reading current device
current_device = wave.device # Returns "cpu" or "cuda"
print(f"Wave is on: {current_device}")
# Setting device via property (equivalent to wave.to("cuda"))
wave.device = "cuda"
print(f"Wave moved to: {wave.device}")
```
The property setter internally calls `to()`, so both approaches are equivalent. Use whichever is more readable in your code.
### Method Chaining
The `to()` method returns `self`, enabling method chaining with other {class}`~torchfx.Wave` operations:
```python
import torchfx as fx
# Method chaining with device transfer
result = (
fx.Wave.from_file("audio.wav")
.to("cuda") # Move to GPU
| fx.filter.LoButterworth(cutoff=1000, order=4)
| fx.effect.Normalize(peak=0.9)
)
# Result stays on the input's device (CUDA); move to CPU before saving
result.to("cpu").save("output.wav")
```
## Filter and Effect Device Management
All filters and effects in TorchFX inherit from {class}`torch.nn.Module`, enabling standard PyTorch device management for their parameters and buffers.
### Moving Filters to GPU
Filters store their coefficients as PyTorch tensors or buffers. To enable GPU-accelerated filtering, move these coefficients to the GPU:
```python
import torchfx as fx
# Create and configure filter
lowpass = fx.filter.LoButterworth(cutoff=1000, order=4, fs=44100)
lowpass.compute_coefficients() # Compute coefficients on CPU
# Move filter to GPU
lowpass.to("cuda")
# Now the filter is ready for GPU processing
```
### Moving Filter Chains to GPU
When using {class}`torch.nn.Sequential` or other PyTorch containers, all modules in the chain are moved together:
```python
import torch.nn as nn
import torchfx as fx
# Create filter chain
filter_chain = nn.Sequential(
fx.filter.HiButterworth(cutoff=100, order=2),
fx.filter.LoButterworth(cutoff=5000, order=4),
fx.effect.Normalize(peak=0.9)
)
# Move entire chain to GPU
filter_chain.to("cuda") # All filters and effects now on CUDA
```
The `to()` method propagates through all child modules, ensuring consistent device placement.
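Continuing the example above, you can confirm the propagation by inspecting the chain's tensors (a quick sketch, assuming the filter coefficients are registered as parameters or buffers, as is standard for `nn.Module` subclasses):
```python
# Every parameter and buffer in the chain should now report a CUDA device
assert all(p.device.type == "cuda" for p in filter_chain.parameters())
assert all(b.device.type == "cuda" for b in filter_chain.buffers())
```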
## Device Coordination in Processing Pipelines
When using the {term}`pipeline operator` (`|`), device compatibility is the user's responsibility. Both the {class}`~torchfx.Wave` object and the filter/effect must be on the same device.
```{mermaid}
sequenceDiagram
participant User
participant Wave as "Wave Object"
participant Filter as "Filter Module"
participant GPU as "CUDA Device"
User->>Wave: Wave.from_file("audio.wav")
Note over Wave: ys on CPU<br/>device = "cpu"
User->>Wave: wave.to("cuda")
Wave->>GPU: Transfer ys tensor
Note over Wave: ys on CUDA<br/>device = "cuda"
User->>Filter: filter.to("cuda")
Filter->>GPU: Transfer coefficients
Note over Filter: coefficients on CUDA
User->>Wave: wave | filter
Note over Wave,Filter: Both on same device ✓
Wave->>Filter: forward(ys)
Filter->>GPU: Execute convolution on GPU
GPU-->>Filter: Result tensor (CUDA)
Filter-->>Wave: Return new Wave (CUDA)
Note over Wave: New Wave object<br/>ys on CUDA
```
**Pipeline Processing Flow with GPU** - Shows the sequence of device transfers and processing operations.
### Device Compatibility Rules
Device compatibility is enforced at runtime when the pipeline executes:
| Wave Device | Filter/Effect Device | Result |
|-------------|----------------------|---------|
| `"cuda"` | `"cuda"` | ✅ Processing on GPU |
| `"cpu"` | `"cpu"` | ✅ Processing on CPU |
| `"cuda"` | `"cpu"` | ❌ Runtime error |
| `"cpu"` | `"cuda"` | ❌ Runtime error |
```{warning}
Device mismatches will raise a runtime error from PyTorch. Always ensure the {class}`~torchfx.Wave` object and all filters/effects in the pipeline are on the same device.
```
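To fail fast with a clearer message than the default PyTorch error, a small helper along these lines can check devices before piping (a sketch; `assert_same_device` is not part of TorchFX, and it assumes the Wave's `device` attribute is a string such as `"cpu"` or `"cuda"` as shown above):
```python
import torch
def assert_same_device(wave, module):
    """Hypothetical helper: raise early if wave and module live on different devices."""
    tensors = list(module.parameters()) + list(module.buffers())
    module_dev = tensors[0].device.type if tensors else "cpu"
    wave_dev = torch.device(wave.device).type
    if wave_dev != module_dev:
        raise RuntimeError(
            f"Device mismatch: Wave on '{wave_dev}', filter on '{module_dev}'"
        )
```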
### Automatic Device Propagation Pattern
While TorchFX doesn't automatically move filters to match the Wave's device, you can establish a consistent pattern:
```python
import torch
import torchfx as fx
# Determine device
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load and move Wave to device
wave = fx.Wave.from_file("audio.wav").to(device)
# Create filters (they start on CPU by default)
lowpass = fx.filter.LoButterworth(cutoff=1000, order=4)
highpass = fx.filter.HiButterworth(cutoff=100, order=2)
# Move filters to match Wave's device
lowpass.to(device)
highpass.to(device)
# Now processing works on the selected device
result = wave | lowpass | highpass
```
```{tip}
The tensor returned by the filter's `forward()` method maintains the same device as the input tensor, so all intermediate {class}`~torchfx.Wave` objects in a pipeline chain stay on the same device.
```
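Continuing the example above, this behavior can be verified directly on the intermediate results:
```python
stage1 = wave | lowpass      # intermediate Wave, same device as the input
stage2 = stage1 | highpass   # still on that device
assert stage1.device == stage2.device == wave.device
```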
## Performance Considerations
GPU acceleration provides the greatest benefits when data transfer overhead is amortized by significant computation.
### Data Transfer Overhead
Moving data between CPU and GPU incurs overhead from PCIe bus transfers:
| Operation | Cost | Notes |
|-----------|------|-------|
| `Wave.to("cuda")` | O(n) where n = sample count | Transfer audio data to GPU |
| `nn.Module.to("cuda")` | O(p) where p = parameter count | Transfer filter coefficients |
| `Tensor.cpu()` | O(n) where n = tensor size | Transfer results back to CPU |
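To get a feel for this overhead on your own hardware, a rough measurement of the host-to-device copy looks like this (a sketch; actual timings depend on audio length and PCIe bandwidth):
```python
import time
import torch
import torchfx as fx
wave = fx.Wave.from_file("audio.wav")    # starts on CPU
start = time.perf_counter()
wave.to("cuda")
torch.cuda.synchronize()                 # wait until the copy has completed
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"Host-to-device transfer took {elapsed_ms:.2f} ms")
```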
**Optimization Strategy**: Minimize device transfers by:
1. Loading and moving to GPU **once** at the start
2. Performing **all processing** on GPU
3. Moving back to CPU **only** for final I/O operations
### Benchmarking Example
The following example demonstrates proper device management for performance:
```python
import torch
import torchfx as fx
from torchfx.filter import DesignableFIR
import torch.nn as nn
import timeit
# Configuration
SAMPLE_RATE = 44100
DURATION = 60 # seconds
NUM_CHANNELS = 4
# Create test audio
signal = torch.randn(NUM_CHANNELS, int(SAMPLE_RATE * DURATION))
wave = fx.Wave(signal, SAMPLE_RATE)
# Create filter chain
filter_chain = nn.Sequential(
DesignableFIR(num_taps=101, cutoff=1000, fs=SAMPLE_RATE),
DesignableFIR(num_taps=102, cutoff=5000, fs=SAMPLE_RATE),
DesignableFIR(num_taps=103, cutoff=1500, fs=SAMPLE_RATE),
)
# Compute coefficients before moving to GPU
for f in filter_chain:
f.compute_coefficients()
# Benchmark GPU processing (synchronize so asynchronous CUDA kernels are
# included in the measured time)
wave.to("cuda")
filter_chain.to("cuda")
def run_gpu():
    wave | filter_chain
    torch.cuda.synchronize()
gpu_time = timeit.timeit(run_gpu, number=10)
# Benchmark CPU processing
wave.to("cpu")
filter_chain.to("cpu")
cpu_time = timeit.timeit(lambda: wave | filter_chain, number=10)
print(f"GPU time: {gpu_time/10:.4f}s")
print(f"CPU time: {cpu_time/10:.4f}s")
print(f"Speedup: {cpu_time/gpu_time:.2f}x")
```
### When GPU Provides Maximum Benefit
Based on empirical benchmarking, GPU acceleration is most beneficial when:
**Audio Duration**
- Files longer than 60 seconds see significant speedups
- Transfer overhead is amortized over longer computation time
**Number of Channels**
- 4+ channels leverage GPU's parallel processing capabilities
- Single-channel audio sees modest gains
**Filter Complexity**
- FIR filters with >100 taps benefit significantly
- IIR filter chains (3+ cascaded stages) show good speedups
- Parallel filter combinations ({doc}`series-parallel-filters`) see excellent performance
**Batch Processing**
- Processing multiple files in a batch maximizes GPU utilization
- Transfer overhead amortized across entire batch
```{seealso}
{doc}`performance` - Comprehensive performance benchmarks and optimization guidelines
```
### Memory Considerations
GPU memory is typically more limited than system RAM:
| Constraint | Typical Limit | Mitigation Strategy |
|------------|---------------|---------------------|
| GPU VRAM capacity | 4-24 GB (consumer GPUs) | Process audio in chunks |
| Audio file size | Limited by VRAM | Stream processing for very long files |
| Filter coefficient storage | Usually negligible | Pre-compute coefficients before transfer |
| Batch size | Limited by VRAM | Reduce batch size if OOM errors occur |
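Before choosing a chunk or batch size, you can check how much headroom the GPU has using standard PyTorch calls:
```python
import torch
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()   # free/total VRAM in bytes
    print(f"Free VRAM: {free_bytes / 1e9:.2f} GB of {total_bytes / 1e9:.2f} GB")
    print(f"Allocated by PyTorch: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```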
For very long audio files (e.g., >1 hour), consider chunked processing:
```python
import torch
import torch.nn as nn
import torchfx as fx
def process_in_chunks(wave, filter_chain, chunk_duration=60):
"""Process audio in chunks to manage GPU memory."""
chunk_samples = int(chunk_duration * wave.fs)
num_chunks = (wave.ys.size(-1) + chunk_samples - 1) // chunk_samples
results = []
for i in range(num_chunks):
start = i * chunk_samples
end = min((i + 1) * chunk_samples, wave.ys.size(-1))
# Extract chunk
chunk = fx.Wave(wave.ys[..., start:end], wave.fs)
chunk.to("cuda")
# Process chunk
processed_chunk = chunk | filter_chain
# Move back to CPU and store
results.append(processed_chunk.ys.cpu())
# Concatenate results
return fx.Wave(torch.cat(results, dim=-1), wave.fs)
# Usage
wave = fx.Wave.from_file("very_long_audio.wav")
filter_chain = nn.Sequential(
fx.filter.LoButterworth(cutoff=1000, order=4),
fx.filter.HiButterworth(cutoff=100, order=2),
).to("cuda")
result = process_in_chunks(wave, filter_chain, chunk_duration=60)
result.save("processed.wav")
```
## Best Practices
### Conditional Device Selection
Production code should handle systems without CUDA support gracefully:
```python
import torch
import torchfx as fx
# Conditional device selection
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Load and move to selected device
wave = fx.Wave.from_file("audio.wav").to(device)
# Create and move filters
filter_chain = torch.nn.Sequential(
fx.filter.LoButterworth(cutoff=1000, order=4),
fx.filter.HiButterworth(cutoff=100, order=2),
).to(device)
# Process on appropriate device
result = wave | filter_chain
```
This pattern:
- Checks for CUDA availability at runtime
- Falls back to CPU if CUDA is unavailable
- Enables cross-platform compatibility
```{tip}
For multi-GPU systems, you can specify a specific GPU using `"cuda:0"`, `"cuda:1"`, etc. Use {func}`torch.cuda.device_count()` to check available GPUs.
```
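For example, a simple (illustrative) policy that prefers the second GPU when more than one is present:
```python
import torch
if torch.cuda.device_count() > 1:
    device = "cuda:1"                    # use the second GPU
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
```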
### CPU Transfer for I/O Operations
File I/O operations require CPU tensors. Always move tensors to CPU before saving:
```python
import torchfx as fx
import torchaudio
# Process on GPU (filter_chain can be any TorchFX filter or nn.Sequential chain)
wave = fx.Wave.from_file("input.wav").to("cuda")
filter_chain = fx.filter.LoButterworth(cutoff=1000, order=4).to("cuda")
result = wave | filter_chain # Processing on GPU
# Option 1: Use ys.cpu() for saving
torchaudio.save("output.wav", result.ys.cpu(), result.fs)
# Option 2: Move entire Wave to CPU
result.to("cpu").save("output.wav")
```
The `Tensor.cpu()` method creates a copy on CPU without modifying the original GPU tensor, while `Wave.to("cpu")` moves the Wave object's internal state.
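A quick check makes the difference visible (assuming `result` is a {class}`~torchfx.Wave` still on the GPU):
```python
cpu_copy = result.ys.cpu()    # new tensor on CPU
assert result.ys.is_cuda      # the original tensor is still on the GPU
assert not cpu_copy.is_cuda
```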
### Complete Processing Pipeline Pattern
Here's a complete example demonstrating best practices for GPU-accelerated audio processing:
```python
import torch
import torch.nn as nn
import torchfx as fx
import torchaudio
def process_audio_gpu(input_path, output_path):
"""Process audio with GPU acceleration and proper device management."""
# Step 1: Determine device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Step 2: Load audio
wave = fx.Wave.from_file(input_path)
print(f"Loaded audio: {wave.ys.shape}, fs={wave.fs}")
# Step 3: Create processing chain
processing_chain = nn.Sequential(
# Pre-processing: remove rumble and noise
fx.filter.HiButterworth(cutoff=80, order=2),
fx.filter.LoButterworth(cutoff=15000, order=4),
# Main processing: EQ and dynamics
fx.effect.Normalize(peak=0.8),
)
# Step 4: Move to selected device
wave = wave.to(device)
processing_chain = processing_chain.to(device)
# Step 5: Process audio (all on same device)
result = wave | processing_chain
print(f"Processing completed on {device}")
# Step 6: Save result (move to CPU if needed)
if result.device == "cuda":
result = result.to("cpu")
result.save(output_path)
print(f"Saved to: {output_path}")
# Usage
process_audio_gpu("input.wav", "output.wav")
```
### Processing Pipeline Visualization
```{mermaid}
graph TD
Start([Start]) --> CheckGPU{"torch.cuda<br/>.is_available?"}
CheckGPU -->|Yes| SetCUDA["device = 'cuda'"]
CheckGPU -->|No| SetCPU["device = 'cpu'"]
SetCUDA --> Load["Load Audio<br/>Wave.from_file"]
SetCPU --> Load
Load --> CreateChain["Create Processing Chain<br/>nn.Sequential"]
CreateChain --> MoveData["Move to Device<br/>wave.to(device)<br/>chain.to(device)"]
MoveData --> Process["Process Audio<br/>wave | chain"]
Process --> CheckDevice{"result.device<br/>== 'cuda'?"}
CheckDevice -->|Yes| MoveCPU["Move to CPU<br/>result.to('cpu')"]
CheckDevice -->|No| Save
MoveCPU --> Save["Save to File<br/>result.save"]
Save --> End([End])
style Start fill:#e1f5ff
style End fill:#e1f5ff
style Process fill:#e1ffe1
style CheckGPU fill:#fff5e1
style CheckDevice fill:#fff5e1
```
**Complete GPU Processing Workflow** - Shows the full lifecycle from device selection to final output.
### Reusable Device Management Wrapper
For production code, consider creating a wrapper class:
```python
import torch
import torchfx as fx
from pathlib import Path
class GPUAudioProcessor:
"""Wrapper for GPU-accelerated audio processing."""
def __init__(self, processing_chain, device=None):
"""Initialize processor with a processing chain.
Parameters
----------
processing_chain : nn.Module
PyTorch module for audio processing
device : str or None
Device to use ('cuda', 'cpu', or None for auto-detect)
"""
if device is None:
device = "cuda" if torch.cuda.is_available() else "cpu"
self.device = device
self.processing_chain = processing_chain.to(device)
print(f"Initialized on device: {device}")
def process_file(self, input_path, output_path):
"""Process a single audio file.
Parameters
----------
input_path : str or Path
Path to input audio file
output_path : str or Path
Path to save processed audio
"""
# Load and move to device
wave = fx.Wave.from_file(input_path).to(self.device)
# Process
result = wave | self.processing_chain
# Save (automatically moves to CPU)
result.to("cpu").save(output_path)
def process_batch(self, input_files, output_dir):
"""Process multiple audio files.
Parameters
----------
input_files : list of str or Path
List of input audio files
output_dir : str or Path
Directory to save processed files
"""
output_dir = Path(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
for input_path in input_files:
input_path = Path(input_path)
output_path = output_dir / f"processed_{input_path.name}"
print(f"Processing: {input_path.name}")
self.process_file(input_path, output_path)
# Usage
import torch.nn as nn
# Create processing chain
chain = nn.Sequential(
fx.filter.HiButterworth(cutoff=80, order=2),
fx.filter.LoButterworth(cutoff=12000, order=4),
fx.effect.Normalize(peak=0.9),
)
# Create processor (auto-detects GPU)
processor = GPUAudioProcessor(chain)
# Process single file
processor.process_file("input.wav", "output.wav")
# Process batch
files = ["song1.wav", "song2.wav", "song3.wav"]
processor.process_batch(files, "processed/")
```
## Working Examples
### Example 1: Basic GPU Processing
```python
import torch
import torchfx as fx
# Check GPU availability
if torch.cuda.is_available():
print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
print("No GPU available, using CPU")
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load audio and move to GPU
wave = fx.Wave.from_file("audio.wav").to(device)
# Create and move filter to GPU
lowpass = fx.filter.LoButterworth(cutoff=1000, order=4).to(device)
# Process on GPU
result = wave | lowpass
# Save (move to CPU first)
result.to("cpu").save("filtered.wav")
```
### Example 2: Multi-Stage Pipeline
```python
import torch
import torch.nn as nn
import torchfx as fx
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load audio
wave = fx.Wave.from_file("vocal.wav").to(device)
# Create complex processing chain
processing = nn.Sequential(
# Stage 1: Remove rumble
fx.filter.HiButterworth(cutoff=80, order=2),
# Stage 2: Parallel filters for thickness
fx.filter.HiButterworth(cutoff=2000, order=4) +
fx.filter.HiChebyshev1(cutoff=2000, order=2),
# Stage 3: Normalize
fx.effect.Normalize(peak=0.9),
).to(device)
# Process
result = wave | processing
# Save
result.to("cpu").save("processed_vocal.wav")
```
### Example 3: Batch Processing with Progress
```python
import torch
import torchfx as fx
from pathlib import Path
from tqdm import tqdm
def batch_process_gpu(input_files, output_dir, filter_chain):
"""Process multiple audio files on GPU with progress bar."""
device = "cuda" if torch.cuda.is_available() else "cpu"
filter_chain = filter_chain.to(device)
output_dir = Path(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
for input_path in tqdm(input_files, desc="Processing"):
# Load and process
wave = fx.Wave.from_file(input_path).to(device)
result = wave | filter_chain
# Save
output_path = output_dir / Path(input_path).name
result.to("cpu").save(output_path)
# Usage
files = list(Path("audio_dataset").glob("*.wav"))
chain = fx.filter.LoButterworth(cutoff=1000, order=4)
batch_process_gpu(files, "processed_dataset", chain)
```
### Example 4: Memory-Efficient Chunked Processing
```python
import torch
import torchfx as fx
def process_long_audio(input_path, output_path, filter_chain, chunk_seconds=30):
"""Process very long audio files in chunks to manage GPU memory."""
device = "cuda" if torch.cuda.is_available() else "cpu"
filter_chain = filter_chain.to(device)
# Load entire audio on CPU
wave = fx.Wave.from_file(input_path)
chunk_samples = int(chunk_seconds * wave.fs)
processed_chunks = []
num_chunks = (wave.ys.size(-1) + chunk_samples - 1) // chunk_samples
print(f"Processing {num_chunks} chunks on {device}")
for i in range(num_chunks):
start = i * chunk_samples
end = min((i + 1) * chunk_samples, wave.ys.size(-1))
# Extract, process, and move back to CPU
chunk = fx.Wave(wave.ys[..., start:end], wave.fs)
chunk.to(device)
processed = chunk | filter_chain
processed_chunks.append(processed.ys.cpu())
# Clear GPU cache periodically
if device == "cuda":
torch.cuda.empty_cache()
# Combine chunks and save
result = fx.Wave(torch.cat(processed_chunks, dim=-1), wave.fs)
result.save(output_path)
print(f"Saved to {output_path}")
# Usage
chain = fx.filter.LoButterworth(cutoff=1000, order=4)
process_long_audio("long_recording.wav", "processed.wav", chain, chunk_seconds=30)
```
## Common Pitfalls and Solutions
### Pitfall 1: Device Mismatch Errors
**Problem**: RuntimeError when Wave and filter are on different devices
```python
# ❌ WRONG: Device mismatch
wave = fx.Wave.from_file("audio.wav") # CPU
filter = fx.filter.LoButterworth(cutoff=1000).to("cuda") # GPU
result = wave | filter # RuntimeError!
```
**Solution**: Ensure both are on the same device
```python
# ✅ CORRECT: Both on same device
device = "cuda" if torch.cuda.is_available() else "cpu"
wave = fx.Wave.from_file("audio.wav").to(device)
filter = fx.filter.LoButterworth(cutoff=1000).to(device)
result = wave | filter # Works!
```
### Pitfall 2: Forgetting to Move Back to CPU for I/O
**Problem**: Error when trying to save GPU tensors
```python
# ❌ WRONG: Trying to save GPU tensor
wave = fx.Wave.from_file("audio.wav").to("cuda")
result = wave | filter_chain
result.save("output.wav") # May fail depending on backend
```
**Solution**: Always move to CPU before saving
```python
# ✅ CORRECT: Move to CPU before saving
wave = fx.Wave.from_file("audio.wav").to("cuda")
result = wave | filter_chain
result.to("cpu").save("output.wav") # Works!
# Or use ys.cpu() directly with torchaudio
import torchaudio
torchaudio.save("output.wav", result.ys.cpu(), result.fs)
```
### Pitfall 3: Inefficient Repeated Transfers
**Problem**: Moving data back and forth unnecessarily
```python
# ❌ WRONG: Inefficient transfers
wave = fx.Wave.from_file("audio.wav").to("cuda")
result1 = wave.to("cpu") | filter1 # CPU
result2 = result1.to("cuda") | filter2 # GPU
result3 = result2.to("cpu") | filter3 # CPU
```
**Solution**: Do all processing on one device
```python
# ✅ CORRECT: Single device for entire pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
wave = fx.Wave.from_file("audio.wav").to(device)
filter1.to(device)
filter2.to(device)
filter3.to(device)
result = wave | filter1 | filter2 | filter3 # All on same device
```
### Pitfall 4: Out of Memory on GPU
**Problem**: CUDA out of memory error with large audio files
```python
# ❌ WRONG: Loading entire 2-hour file on GPU
wave = fx.Wave.from_file("2_hour_recording.wav").to("cuda") # OOM!
```
**Solution**: Use chunked processing (see Example 4 above) or reduce batch size
```python
# ✅ CORRECT: Process in chunks
process_long_audio("2_hour_recording.wav", "output.wav", filter_chain, chunk_seconds=30)
```
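As an additional safety net, the out-of-memory error can be caught explicitly to fall back to CPU processing (a sketch, assuming a recent PyTorch release that exposes `torch.cuda.OutOfMemoryError`; `filter_chain` is any chain as in the examples above):
```python
import torch
import torchfx as fx
wave = fx.Wave.from_file("2_hour_recording.wav")   # load on CPU first
try:
    result = wave.to("cuda") | filter_chain.to("cuda")
except torch.cuda.OutOfMemoryError:
    torch.cuda.empty_cache()                       # release cached GPU memory
    result = wave.to("cpu") | filter_chain.to("cpu")
```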
## Related Concepts
- {doc}`../core-concepts/wave` - Wave class architecture and methods
- {doc}`series-parallel-filters` - Combining filters in complex chains
- {doc}`performance` - Performance benchmarks and optimization
- {doc}`pytorch-integration` - Integration with PyTorch ecosystem
## External Resources
- [PyTorch CUDA Semantics](https://pytorch.org/docs/stable/notes/cuda.html) - Official PyTorch CUDA documentation
- [NVIDIA CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/) - CUDA programming fundamentals
- [PyTorch Device Management](https://pytorch.org/docs/stable/tensor_attributes.html#torch.device) - Device attribute documentation
- [torchaudio GPU Tutorial](https://pytorch.org/audio/stable/tutorials/device_avsr.html) - GPU acceleration in torchaudio
## Summary
Key takeaways for GPU acceleration in TorchFX:
1. **Device Management**: Use `Wave.to(device)` and `Module.to(device)` for consistent device placement
2. **Compatibility**: Ensure Wave objects and filters are on the same device
3. **Performance**: GPU acceleration is most beneficial for long audio, multi-channel files, and complex filter chains
4. **I/O Operations**: Always move tensors to CPU before saving to disk
5. **Best Practices**: Use conditional device selection and minimize data transfers
GPU acceleration can provide significant speedups for audio processing workflows when used correctly. Follow the patterns and best practices in this tutorial to leverage CUDA-enabled GPUs effectively in your TorchFX pipelines.