The core issue appears to be architecture and library compatibility on aarch64 / Grace-Blackwell, particularly when working with audio ML stacks that rely on:
- torchaudio CUDA kernels
- Conformer-based TTS models
- s3tokenizer / unit vocoders
- Riva ASR deployment containers
- CUDA/cuDNN-accelerated audio feature extraction
- Stable ARM64 wheels for PyTorch, torchaudio, and related ecosystem packages

In practice, most of these packages either pull x86_64-only wheels or JIT-compile CUDA kernels that are not yet optimized for GB200 / ARM64 targets, resulting in one or more of the following:
- Missing or incompatible wheels during installation
- CUDA kernel fallback to CPU during runtime
- Inability to launch certain Riva services
- torchaudio codec failures due to missing torchcodec GPU bindings
- Huge performance degradation (10–50× slower) compared to x86 systems
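
To narrow down which of the symptoms above applies on a given machine, a quick probe along these lines may help. This is only a minimal sketch, not an official diagnostic: the function name `probe_audio_stack` and the shape of the report are made up for illustration, and the default package list is an assumption about what your stack imports. It reports the host architecture, whether each package imports at all, and whether the installed PyTorch build lists the GPU's compute architecture among the ones it was compiled for.

```python
import importlib
import platform


def probe_audio_stack(packages=("torch", "torchaudio", "torchcodec")):
    """Collect basic facts about the host and the installed audio ML packages.

    Hypothetical helper for illustration; adjust the package list to your stack.
    """
    report = {"machine": platform.machine(), "packages": {}, "cuda": None}
    modules = {}

    for name in packages:
        try:
            modules[name] = importlib.import_module(name)
        except Exception as exc:
            # Typical causes: no wheel installed, or a shared library failed to load.
            report["packages"][name] = f"unavailable ({type(exc).__name__})"
        else:
            report["packages"][name] = getattr(modules[name], "__version__", "unknown")

    torch = modules.get("torch")
    if torch is not None:
        if torch.cuda.is_available():
            major, minor = torch.cuda.get_device_capability(0)
            device_arch = f"sm_{major}{minor}"
            built_for = torch.cuda.get_arch_list()
            report["cuda"] = {
                "device": torch.cuda.get_device_name(0),
                "device_arch": device_arch,
                "built_for": built_for,
                # False suggests the build ships no native kernels for this GPU.
                "native_kernels": device_arch in built_for,
            }
        else:
            report["cuda"] = "not available (CPU-only build or no visible GPU)"

    return report
```

Calling `probe_audio_stack()` and printing the result on the affected machine should show at a glance whether the problem is a missing wheel (a package reported as unavailable), a CPU-only PyTorch build (CUDA reported as not available), or a build that was not compiled for the GPU's architecture (`native_kernels` is `False`). It does not measure performance, so the slowdown itself still needs a separate benchmark.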