1) Executive Summary
We are seeing an extreme performance cliff on a new DGX Spark with NVIDIA GB10: performing many small CPU→GPU (H2D) copies from pageable host memory is orders of magnitude slower than from pinned host memory. This is large enough to turn real model loading (thousands of tensors moved individually to CUDA) from a few seconds into tens of seconds.
This appears to be a platform/driver path issue (page-locking / DMA mapping / page fault interaction on ARM64/UMA), because:
- large-copy pinned bandwidth is healthy,
- the slowdown is specifically tied to pageable + many-small copies, and
- toggling pinning at the framework level fully flips the behavior.
We request NVIDIA to confirm whether this is expected for GB10/DGX Spark, and if not, investigate the driver/kernel interaction.
2) Environment (as observed)
- System: DGX Spark (new purchase)
- CPU arch: aarch64
- GPU: NVIDIA GB10, compute capability 12.1
- Unified memory: 128 GB (platform UMA / “unified memory system”)
- SSD: 4 TB local NVMe
- Filesystem: ext4, weights stored locally on NVMe (/dev/nvme0n1p2, mounted at /)
- Driver: 580.95.05
- CUDA: 13.0 (nvcc 13.0.88)
- Python: 3.12.12
- PyTorch: custom build 2.10.0a0+gitae7d5b8 with sm_121 enabled (torch.cuda.get_arch_list(): ['sm_121', 'compute_121'])
3) Observed Symptoms / User Impact
- Any workload that does thousands of small tensor H2D copies from CPU to CUDA becomes extremely slow on GB10.
- A real-world example: quantized transformer weight loading (many small tensors) can take ~30–40 s when the host tensors are pageable, but drops to ~2–4 s once the host tensors are pinned before .to("cuda").
4) Key Measurements (strong evidence)
4.1 Large-copy H2D bandwidth (1 GiB)
H2D pageable: 4.139 GB/s (1.00GiB, dt=0.259s)
H2D pinned: 20.681 GB/s (1.00GiB, dt=0.052s)
Pinned bandwidth looks fine.
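For reference, the 1 GiB numbers above can be reproduced with a simple PyTorch timing loop along the following lines (a minimal sketch; our actual harness may differ in warm-up and buffer details):

import time
import torch

device = torch.device("cuda:0")
torch.cuda.synchronize(device)  # make sure the CUDA context is initialized before timing
nbytes = 1 << 30  # 1 GiB

def h2d_bandwidth(host_tensor, label):
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    host_tensor.to(device, non_blocking=True)
    torch.cuda.synchronize(device)
    dt = time.perf_counter() - t0
    print(f"H2D {label}: {nbytes / dt / 1e9:.3f} GB/s (1.00GiB, dt={dt:.3f}s)")

pageable = torch.empty(nbytes, dtype=torch.uint8, device="cpu")
h2d_bandwidth(pageable, "pageable")
h2d_bandwidth(pageable.pin_memory(), "pinned")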
4.2 Many-small copies: 2500 × (1024×1024 fp16)
small copies pageable: n=2500 dt=10.284s
small copies pinned: n=2500 dt=0.198s
That is a ~52× slowdown for pageable vs pinned.
4.3 Real model-load-like benchmark (page cache dropped each run)
We measured from_pretrained() load time for a quantized transformer (safetensors) under several configurations; we also controlled safetensors mmap behavior (disable_mmap) and explicitly dropped the Linux page cache before each run:
drop_caches=OK (echo 3 > /proc/sys/vm/drop_caches)
baseline(defaults): 5.240s
pin_memory=auto: 2.613s
disable_mmap=True (pin default/auto): 2.648s
disable_mmap=True (pin_memory=False): 38.451s <-- pathological
pin_memory=auto + disable_mmap=True: 2.588s
The key point: forcing pin_memory=False reliably reproduces the large slowdown (~38s).
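The harness was roughly of the following shape (a sketch only: MODEL_PATH and load_model are placeholders for our model directory and for the library's from_pretrained() call, which accepts the pin_memory / disable_mmap options listed above):

import subprocess
import time

MODEL_PATH = "/path/to/quantized/model"  # placeholder

def drop_page_cache():
    # Requires root: flush dirty pages, then drop the Linux page cache so each
    # run re-reads the safetensors files from NVMe.
    subprocess.run(["sync"], check=True)
    subprocess.run(["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True)

def load_model(path, **kwargs):
    # Placeholder for the library's from_pretrained() call; kwargs are the
    # pin_memory / disable_mmap options shown in the results above.
    raise NotImplementedError

def timed_load(**kwargs):
    drop_page_cache()
    t0 = time.perf_counter()
    load_model(MODEL_PATH, **kwargs)
    return time.perf_counter() - t0

for cfg in ({}, {"disable_mmap": True}, {"disable_mmap": True, "pin_memory": False}):
    print(cfg, f"{timed_load(**cfg):.3f}s")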
5) Minimal Reproduction (PyTorch only)
import time
import torch
device = torch.device("cuda:0")
torch.cuda.synchronize(device)
n = 2500
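# each tensor is 1024*1024 fp16 values = 2 MiB, so ~4.9 GiB of host data in total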
tensors = [torch.empty((1024, 1024), dtype=torch.float16, device="cpu") for _ in range(n)]
# pageable copies
torch.cuda.synchronize(device)
t0 = time.perf_counter()
_ = [t.to(device) for t in tensors]
torch.cuda.synchronize(device)
print("pageable dt =", time.perf_counter() - t0)
# pinned copies
tensors_pin = [t.pin_memory() for t in tensors]
torch.cuda.synchronize(device)
t0 = time.perf_counter()
_ = [t.to(device, non_blocking=True) for t in tensors_pin]
torch.cuda.synchronize(device)
print("pinned dt =", time.perf_counter() - t0)
6) Why we believe this might be driver/platform related (not just “user code”)
The workload pattern “many small copies” is common in ML model loading. The performance cliff suggests high per-copy overhead in one or more of:
- the pageable→device path using internal staging buffers,
- page-locking/page-pinning operations per transfer,
- DMA mapping overhead (ARM SMMU/IOMMU path),
- interaction with UMA memory management / page-fault cost, and
- kernel/driver synchronization behavior for many small transfers.
On other GPU platforms (e.g., RTX 3090), the same load patterns do not exhibit such extreme slowdown, suggesting GB10/DGX Spark is amplifying the overhead.
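A diagnostic that could help localize the overhead is a size sweep over the same .to(device) path: if a fixed per-copy cost (pinning, DMA mapping, faulting) dominates, the effective pageable bandwidth should collapse at small sizes while the pinned path stays roughly flat. A minimal sketch, using the same PyTorch API as the repro above:

import time
import torch

device = torch.device("cuda:0")
torch.cuda.synchronize(device)

for n_elems in (16 * 1024, 256 * 1024, 1024 * 1024, 16 * 1024 * 1024):
    src = torch.empty(n_elems, dtype=torch.float16, device="cpu")
    for label, t in (("pageable", src), ("pinned", src.pin_memory())):
        reps = 100
        torch.cuda.synchronize(device)
        t0 = time.perf_counter()
        for _ in range(reps):
            t.to(device, non_blocking=True)
        torch.cuda.synchronize(device)
        dt = time.perf_counter() - t0
        gbps = reps * n_elems * 2 / dt / 1e9  # fp16 = 2 bytes per element
        print(f"{label:8s} {n_elems * 2 // 1024:8d} KiB  {gbps:7.2f} GB/s  {dt / reps * 1e3:.3f} ms/copy")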
7) Requests to NVIDIA (what we want help with)
- Is this behavior expected on GB10/DGX Spark?
- If not expected, can NVIDIA investigate/optimize:
  - the pageable small-copy H2D path
  - IOMMU/SMMU/DMA mapping overhead
  - any driver/kernel tuning recommendations
- Are there recommended best practices/settings for GB10 to avoid this cliff without application-level pinning?
We can run additional targeted tests if you suggest specific diagnostics or driver debug flags.
8) Additional System Info Requested (we can provide)
Please let us know what you need; we can attach:
- uname -a
- cat /etc/os-release
- nvidia-smi -q
- dmesg | grep -i iommu (and full dmesg if needed)
- lspci -vv (GB10-related devices)
- /proc/meminfo, /proc/cmdline
- any NVIDIA-recommended profiling traces
9) Notes / Workarounds found
- Pinned host memory (tensor.pin_memory() + non_blocking=True) resolves the performance cliff; see the sketch after this list for how we apply it.
- For workloads using safetensors, disable_mmap=True can help in some environments, but the dominant factor on this system is pinning for many-small copies.
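For reference, the application-level workaround we currently use looks roughly like the sketch below (the helper name is ours; it assumes the weights arrive as a CPU state_dict):

import torch

def pinned_state_dict_to_cuda(state_dict, device="cuda:0"):
    # Pin each CPU tensor first, then issue the H2D copy with non_blocking=True.
    # This avoids the slow pageable small-copy path observed on GB10.
    out = {}
    for name, t in state_dict.items():
        if t.device.type == "cpu":
            t = t.pin_memory()
        out[name] = t.to(device, non_blocking=True)
    torch.cuda.synchronize(device)  # make sure all async copies have landed
    return out

The resulting dict of CUDA tensors can then be passed to load_state_dict() on a model that has already been moved to the GPU.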