Hi everyone,
I’m hoping someone here can help me — I’ve been stuck on this for a few days and I’m running out of ideas.
The short version
I’m trying to use cuPQC SDK 0.4.1 on an NVIDIA DGX Spark for a GPU-accelerated PQC project. The library loads fine and CUDA initializes, but the program crashes with double free or corruption the moment the first PQC function is called. After a lot of debugging, I found out the GB10 GPU has compute capability sm_121 — which isn’t in the cuPQC SDK’s supported list (it goes up to sm_90).
My system
- Machine: NVIDIA DGX Spark (Rev A.7)
- GPU: NVIDIA GB10 (nvidia-smi reports compute capability 12.1)
- CPU: ARM aarch64 (Cortex-X925 / A725)
- Host CUDA: 13.1, Driver 580.95.05
- cuPQC SDK: 0.4.1 (aarch64)
- Docker container: nvidia/cuda:12.8.0-devel-ubuntu22.04
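For anyone who wants to compare their own machine, here is a minimal Python sketch of how the compute capability reported by nvidia-smi maps to an sm_XX target name. The helper names `cc_to_sm` and `query_sm_targets` are made up for this post, and it assumes a driver recent enough for nvidia-smi to support the `compute_cap` query field.

```python
import subprocess


def cc_to_sm(compute_cap: str) -> str:
    """Convert a compute capability string like '12.1' into 'sm_121'."""
    major, minor = compute_cap.strip().split(".")
    return f"sm_{int(major)}{int(minor)}"


def query_sm_targets() -> list[str]:
    """Ask nvidia-smi for the compute capability of each visible GPU.

    Hypothetical helper: raises if nvidia-smi is missing or fails.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # One line per GPU, e.g. "12.1"
    return [cc_to_sm(line) for line in out.splitlines() if line.strip()]
```

Calling `query_sm_targets()` inside the container is also a quick way to check that the GPU is visible there at all.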
What I see
Loaded cuPQC library from: /opt/cupqc-lib/libcupqc_wrapper.so
CUDA initialized
double free or corruption (!prev)
Exited with code 133 (SIGABRT)
This happens on the very first call to cupqc_kem_keypair() — right after cudaSetDevice(0) succeeds. No kernel output, just a crash.
Root cause I identified
cuPQC SDK 0.4.1 ships its internal static libraries (cupqc-pk_static, cupqc-hash_static) precompiled as LTO code for architectures sm_70 through sm_90. The GB10 is sm_121, and as far as I can tell the SDK contains neither native code nor PTX for it. So when the first kernel launches, the runtime either picks incompatible code or finds none at all.
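To make the reasoning concrete, here is a small Python sketch of the general CUDA compatibility rules as I understand them: a native binary (SASS) built for sm_XY runs only on devices with the same major version and an equal or higher minor version, while PTX for compute_XY can be JIT-compiled on any device with an equal or higher compute capability. The function names are hypothetical, the arch list is the one described above rather than something read out of the SDK, and the sketch ignores arch-specific targets such as sm_90a as well as LTO intermediates, which only become SASS or PTX at device-link time.

```python
from typing import Iterable, Tuple

CC = Tuple[int, int]  # (major, minor), e.g. (12, 1) for sm_121


def parse_arch(name: str) -> CC:
    """Turn 'sm_90', 'compute_90' or 'sm_121' into (major, minor)."""
    prefix, _, digits = name.partition("_")
    if prefix not in ("sm", "compute") or len(digits) < 2 or not digits.isdigit():
        raise ValueError(f"unsupported arch name: {name!r}")
    # The last digit is the minor version, everything before it is the major.
    return int(digits[:-1]), int(digits[-1])


def sass_runs_on(device: CC, sass: CC) -> bool:
    """SASS is binary compatible only within one major version."""
    return device[0] == sass[0] and device[1] >= sass[1]


def ptx_runs_on(device: CC, ptx: CC) -> bool:
    """PTX is forward compatible: any device at or above its version can JIT it."""
    return device >= ptx


def pick_code_path(device: str, sass_archs: Iterable[str], ptx_archs: Iterable[str]) -> str:
    """Simplified model of which embedded code a device could use."""
    dev = parse_arch(device)
    for arch in sass_archs:
        if sass_runs_on(dev, parse_arch(arch)):
            return f"native SASS ({arch})"
    for arch in ptx_archs:
        if ptx_runs_on(dev, parse_arch(arch)):
            return f"JIT from PTX ({arch})"
    return "no kernel image for device"


# SASS for sm_70..sm_90 only: nothing usable on sm_121.
print(pick_code_path("sm_121", ["sm_70", "sm_80", "sm_90"], []))
# Adding compute_90 PTX gives the driver something it can JIT.
print(pick_code_path("sm_121", ["sm_70", "sm_80", "sm_90"], ["compute_90"]))
```

In the first case the CUDA runtime would normally report cudaErrorNoKernelImageForDevice from the launch, so checking the return codes around the first kernel call may show whether that is what precedes the abort.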
Related: cuPQC examples fail to compile on Jetson Orin Nano — similar pattern of cuPQC being incompatible with specific platforms.
What I’ve tried
- Updated the CUDA base image in the Dockerfile from 12.6.2 → 12.8.0 (the SDK requires 12.8+).
- Added a PTX fallback to CMakeLists.txt:
  --generate-code=arch=compute_90,code=compute_90
  This embeds compute_90 PTX in our wrapper so that CUDA might JIT it for sm_121. I'm still waiting to confirm whether this works with the cuPQC LTO internals.
- Confirmed that the aarch64 SDK matches the machine architecture ✓
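To spell out what the PTX fallback flag does, here is a short Python sketch that builds nvcc `--generate-code` flags: `code=sm_XX` embeds native SASS for that architecture, while `code=compute_XX` embeds PTX that the driver can JIT on newer GPUs. The helper `gencode_flags` is hypothetical and only illustrates the flag syntax. It says nothing about whether the cuPQC LTO libraries cooperate.

```python
from typing import List, Optional, Sequence


def _arch_number(name: str) -> str:
    """Extract '90' from 'sm_90' or 'compute_90'."""
    prefix, _, digits = name.partition("_")
    if prefix not in ("sm", "compute") or not digits.isdigit():
        raise ValueError(f"unsupported arch name: {name!r}")
    return digits


def gencode_flags(sass_archs: Sequence[str], ptx_arch: Optional[str] = None) -> List[str]:
    """Build nvcc flags: one SASS target per arch, plus an optional PTX fallback."""
    flags = []
    for arch in sass_archs:
        num = _arch_number(arch)
        # code=sm_XX -> native binary for exactly this architecture family
        flags.append(f"--generate-code=arch=compute_{num},code=sm_{num}")
    if ptx_arch is not None:
        num = _arch_number(ptx_arch)
        # code=compute_XX -> PTX kept in the binary for JIT on newer devices
        flags.append(f"--generate-code=arch=compute_{num},code=compute_{num}")
    return flags


for flag in gencode_flags(["sm_90"], ptx_arch="compute_90"):
    print(flag)
```

In CMake 3.18 or later, setting CMAKE_CUDA_ARCHITECTURES to "90-real;90-virtual" should request the same SASS plus PTX pair without hand-written flags.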
My questions for the community / NVIDIA team
- Has anyone successfully run cuPQC SDK on a Blackwell GPU (sm_100, sm_121)? What did you do?
- Does embedding sm_90 PTX in the wrapper help? Or will the cuPQC LTO libraries still fail to JIT on sm_121?
- Is there a newer SDK version being worked on with Blackwell support?
- Any other workaround you’d recommend for getting GPU-accelerated PQC running on a DGX Spark?
Thanks in advance — any help is hugely appreciated!