On the DGX Spark, you can install a newer CUDA toolkit package (e.g. cuda-toolkit-13-2) together with the matching cuda-compat-13-2 forward-compatibility package, and then:
export LD_LIBRARY_PATH=/usr/local/cuda-13.2/compat
This gives you a newer CUDA user-mode driver without updating the whole system/driver stack.
Note that OpenGL/Vulkan interop is broken in that setup, as described in Forward Compatibility — CUDA Compatibility, but CUDA itself works fine.
Alternatively, if you don’t rely on runtime PTX compilation, Minor Version Compatibility — CUDA Compatibility is an option too, without needing the compat package.
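Putting the steps above together, a minimal sketch of the setup (package names and the compat path are taken from the note above; it assumes the NVIDIA apt repository is already configured on the machine):

```shell
# Install a newer toolkit plus the matching forward-compatibility package
sudo apt-get update
sudo apt-get install -y cuda-toolkit-13-2 cuda-compat-13-2

# Prefer the forward-compat user-mode driver over the system libcuda
export LD_LIBRARY_PATH=/usr/local/cuda-13.2/compat

# CUDA apps launched from this shell now use the 13.2 user-mode driver.
# Remember: OpenGL/Vulkan interop will not work in this mode (see above).
```

Note that the `export` only affects the current shell; add it to your shell profile (or the service environment) if you want it to persist.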
Now there is an NVIDIA image with CUDA 13.2.
I built llama.cpp with 13.1.1, so I can compare the same llama.cpp build under 13.1.1 and 13.2:
docker run -it -v /home/pont/.cache/llama.cpp:/root/.cache/llama.cpp --gpus=all ghcr.io/pontostroy/llama.cpp:full-cuda13 --bench --model /root/.cache/llama.cpp/unsloth_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-MXFP4_MOE.gguf -fa 1 --mmap 0 -d 0,16000,64000
13.1.1
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | pp512 | 4106.71 ± 43.47 |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | tg128 | 70.80 ± 0.10 |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | pp512 @ d16000 | 3154.20 ± 4.91 |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | tg128 @ d16000 | 61.56 ± 0.14 |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | pp512 @ d64000 | 1764.61 ± 10.73 |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | tg128 @ d64000 | 45.20 ± 0.04 |
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | pp512 | 597.03 ± 2.77 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | tg128 | 20.42 ± 0.01 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | pp512 @ d16000 | 587.79 ± 1.99 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | tg128 @ d16000 | 20.11 ± 0.03 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | pp512 @ d64000 | 546.69 ± 7.41 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | tg128 @ d64000 | 19.20 ± 0.03 |
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | pp512 | 1488.48 ± 5.17 |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | tg128 | 50.16 ± 0.09 |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | pp512 @ d16000 | 1373.83 ± 4.78 |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | tg128 @ d16000 | 45.05 ± 0.30 |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | pp512 @ d64000 | 1142.44 ± 9.48 |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | tg128 @ d64000 | 35.38 ± 0.19 |
13.2
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | pp512 | 4223.89 ± 50.50 |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | tg128 | 72.01 ± 0.05 |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | pp512 @ d16000 | 3207.67 ± 18.94 |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | tg128 @ d16000 | 61.52 ± 0.19 |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | pp512 @ d64000 | 1953.58 ± 18.24 |
| gpt-oss 20B MXFP4 MoE | 11.77 GiB | 20.91 B | CUDA | 99 | 1 | 0 | tg128 @ d64000 | 45.75 ± 0.03 |
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | pp512 | 600.68 ± 2.50 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | tg128 | 20.84 ± 0.02 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | pp512 @ d16000 | 588.74 ± 1.05 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | tg128 @ d16000 | 20.20 ± 0.05 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | pp512 @ d64000 | 551.32 ± 2.63 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 65.10 GiB | 120.67 B | CUDA | 99 | 1 | 0 | tg128 @ d64000 | 19.50 ± 0.03 |
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | pp512 | 1480.93 ± 7.48 |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | tg128 | 52.38 ± 0.16 |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | pp512 @ d16000 | 1382.59 ± 11.23 |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | tg128 @ d16000 | 45.19 ± 0.28 |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | pp512 @ d64000 | 1149.42 ± 11.09 |
| qwen3next 80B.A3B MXFP4 MoE | 40.73 GiB | 79.67 B | CUDA | 99 | 1 | 0 | tg128 @ d64000 | 36.61 ± 0.04 |
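For a quick read of the tables above, here is a short sketch that computes the relative change from the 13.1.1 run to the 13.2 run for a few of the reported t/s values (the numbers are copied from the tables; the model/test labels are just dictionary keys here):

```python
# t/s values copied from the benchmark tables above: (CUDA 13.1.1, CUDA 13.2)
results = {
    ("gpt-oss 20B MXFP4 MoE", "pp512"):          (4106.71, 4223.89),
    ("gpt-oss 20B MXFP4 MoE", "pp512 @ d64000"): (1764.61, 1953.58),
    ("gpt-oss 20B MXFP4 MoE", "tg128"):          (70.80, 72.01),
    ("qwen3next 80B.A3B MXFP4 MoE", "tg128"):    (50.16, 52.38),
    ("nemotron_h_moe 120B.A12B", "tg128"):       (20.42, 20.84),
}

def pct_change(old, new):
    """Relative change in percent from the 13.1.1 value to the 13.2 value."""
    return 100.0 * (new - old) / old

for (model, test), (old, new) in results.items():
    print(f"{model:32s} {test:16s} {pct_change(old, new):+.1f}%")
```

On these numbers, 13.2 is consistently a bit faster, with the largest gain on gpt-oss pp512 at the 64k-token depth (about +10.7%).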