ONNX Runtime GPU inference on DGX Spark (GX10) — build guide and prebuilt binaries

I’ve successfully built ONNX Runtime 1.24.4 with CUDA Execution Provider on the DGX Spark (GX10, Grace + GB10 Blackwell, sm_121, CUDA 13.0, cuDNN 9.20, Ubuntu 24.04).

The short version: no official prebuilt ONNX Runtime GPU binaries exist for aarch64 Linux as of April 2026, so you have to build from source.

Prebuilt binaries and full build instructions:

Key findings:

  • ORT v1.20.1 is incompatible with CUDA 13 (thrust::unary_function removed in CCCL/Thrust). Use v1.24.4+.
  • GB10 is compute capability 12.1, not 12.0. Set CMAKE_CUDA_ARCHITECTURES=121.
  • INT8 quantized ONNX models lack sm_121 CUDA kernels. Use FP32 models for GPU inference.
  • cuDNN 9.x must be installed separately: sudo apt install libcudnn9-cuda-13 libcudnn9-dev-cuda-13
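
If you want to confirm the architecture value on your own machine before building, something like the following may help. It is only a sketch: `cc_to_arch` is a helper name I made up for illustration, and it assumes your driver's `nvidia-smi` supports the `compute_cap` query field.

```shell
# Convert a compute capability string such as "12.1" into the
# form CMAKE_CUDA_ARCHITECTURES expects ("121").
# cc_to_arch is a hypothetical helper, not part of any toolkit.
cc_to_arch() {
  printf '%s\n' "$(printf '%s' "$1" | tr -d '. ')"
}

# Query the first GPU only when nvidia-smi is present.
# Assumption: the compute_cap query field exists in this driver version.
if command -v nvidia-smi >/dev/null 2>&1; then
  cc="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
  echo "CMAKE_CUDA_ARCHITECTURES=$(cc_to_arch "$cc")"
fi
```

On a GB10 this should report 121, matching the finding above; if it reports something else, use that value in the build command instead.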

Performance (embedding model: snowflake-arctic-embed-m-v2.0, 768-dim):

  • CPU (tract-onnx, pure Rust): 3,400ms
  • CPU (ORT 1.24.4): 135ms
  • GPU (ORT 1.24.4 + CUDA, GB10): 149ms cold start

Build command:
git clone --recursive --branch v1.24.4 --depth 1 https://github.com/microsoft/onnxruntime
cd onnxruntime
./build.sh --config Release \
  --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr \
  --build_shared_lib --parallel $(nproc) --skip_tests \
  --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=121
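
After the build finishes, a quick sanity check is to see whether the shared libraries actually landed. This is a rough sketch: `check_ort_build` is a hypothetical helper, and `build/Linux/Release` is the output directory I would expect for the command above (adjust it if you passed a custom build directory).

```shell
# Report which of the expected ORT shared libraries exist in a directory.
# Returns non-zero if any of them is missing.
# check_ort_build is a hypothetical helper name.
check_ort_build() {
  dir="$1"
  missing=0
  for lib in libonnxruntime.so \
             libonnxruntime_providers_shared.so \
             libonnxruntime_providers_cuda.so; do
    if [ -e "$dir/$lib" ]; then
      echo "found:   $lib"
    else
      echo "missing: $lib"
      missing=1
    fi
  done
  return "$missing"
}

# Assumed default output location for a Release build on Linux.
if [ -d build/Linux/Release ]; then
  check_ort_build build/Linux/Release
fi
```

If `libonnxruntime_providers_cuda.so` is reported missing, the CUDA Execution Provider was not built and inference will run on CPU only.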

For Rust users: the ort crate’s native CUDA EP registration works correctly with the load-dynamic feature. Critical tip: add tracing-subscriber to your app and install a subscriber at startup. Without one, RUST_LOG produces no output and you have zero visibility into whether CUDA is active. Full usage notes are in the repo README.
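
For the load-dynamic setup, the environment side might look roughly like this. It is a sketch under assumptions: `ort_env` is a hypothetical helper, the library path in the example is a placeholder, and it relies on the ort crate reading `ORT_DYLIB_PATH` when load-dynamic is enabled. The Rust side still has to install a tracing-subscriber subscriber for `RUST_LOG` to have any effect.

```shell
# Point the ort crate (load-dynamic feature) at a locally built
# libonnxruntime.so and enable logging. Fails early if the file is absent.
# ort_env is a hypothetical helper name.
ort_env() {
  lib="$1"
  if [ ! -f "$lib" ]; then
    echo "libonnxruntime.so not found at: $lib" >&2
    return 1
  fi
  export ORT_DYLIB_PATH="$lib"
  # Keep an existing RUST_LOG if the caller already set one.
  export RUST_LOG="${RUST_LOG:-debug}"
}

# Example (placeholder path, adjust to your checkout):
#   ort_env "$HOME/onnxruntime/build/Linux/Release/libonnxruntime.so" && cargo run --release
```

Setting `RUST_LOG=debug` is noisy; once you have confirmed from the logs that the CUDA provider is registered, you can narrow the filter.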

Happy to help anyone else getting ML inference working on the Spark.
