(DGX Spark, ARM64, CUDA 13) pathological slowdown for many-small H2D copies from pageable CPU memory (≈50× vs pinned); impacts PyTorch model load patterns

1) Executive Summary

We are seeing an extreme performance cliff on a new DGX Spark with NVIDIA GB10: performing many small CPU→GPU (H2D) copies from pageable host memory is orders of magnitude slower than from pinned host memory. This is large enough to turn real model loading (thousands of tensors moved individually to CUDA) from a few seconds into tens of seconds.

This appears to be a platform/driver path issue (page-locking / DMA mapping / page fault interaction on ARM64/UMA), because:

  • large-copy pinned bandwidth is healthy,

  • the slowdown is specifically tied to pageable + many-small copies, and

  • toggling pinning at the framework level fully flips the behavior.

We request NVIDIA to confirm whether this is expected for GB10/DGX Spark, and if not, investigate the driver/kernel interaction.


2) Environment (as observed)

  • System: DGX Spark (new purchase)

  • CPU arch: aarch64

  • GPU: NVIDIA GB10, compute capability 12.1

  • Unified memory: 128GB (platform UMA / “unified memory system”)

  • SSD: 4TB local NVMe

  • Filesystem: ext4, weights stored locally on NVMe (/dev/nvme0n1p2, mount /)

  • Driver: 580.95.05

  • CUDA: 13.0 (nvcc 13.0.88)

  • Python: 3.12.12

  • PyTorch: custom build 2.10.0a0+gitae7d5b8 with sm_121 enabled (torch.cuda.get_arch_list(): ['sm_121', 'compute_121'])


3) Observed Symptoms / User Impact

  • Any workload that does thousands of small tensor H2D copies from CPU to CUDA becomes extremely slow on GB10.

  • A real-world example: quantized transformer weight loading (many small tensors) can take ~30–40s when the host tensors are pageable, but drops to ~2–4s once host tensors are pinned before .to("cuda").


4) Key Measurements (strong evidence)

4.1 Large-copy H2D bandwidth (1 GiB)

H2D pageable: 4.139 GB/s (1.00GiB, dt=0.259s)
H2D pinned:   20.681 GB/s (1.00GiB, dt=0.052s)

Pinned bandwidth looks fine.
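
For reference, a measurement of this kind can be reproduced with a minimal sketch like the one below (illustrative only, not necessarily the exact harness used for the numbers above; the buffers are pre-touched so the timing is not dominated by first-touch faults):

import time
import torch

device = torch.device("cuda:0")
nbytes = 1 << 30  # 1 GiB

src_pageable = torch.empty(nbytes, dtype=torch.uint8)                 # pageable host buffer
src_pinned = torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)  # pinned host buffer
src_pageable.fill_(1)  # touch pages so physical memory is actually mapped
src_pinned.fill_(1)

for name, src in [("pageable", src_pageable), ("pinned", src_pinned)]:
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    src.to(device, non_blocking=True)  # non_blocking is effectively a no-op for the pageable source
    torch.cuda.synchronize(device)
    dt = time.perf_counter() - t0
    print(f"H2D {name}: {nbytes / dt / 1e9:.3f} GB/s (dt={dt:.3f}s)")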

4.2 Many-small copies: 2500 × (1024×1024 fp16)

small copies pageable: n=2500 dt=10.284s
small copies pinned:   n=2500 dt=0.198s

That is a ~52× slowdown for pageable vs pinned: roughly 4.1 ms per 2 MiB copy on the pageable path versus ~0.08 ms when pinned.

4.3 Real model-load-like benchmark (page cache dropped each run)

We measured from_pretrained() load time for a quantized transformer (safetensors) under several configurations; we also controlled safetensors mmap behavior (disable_mmap) and explicitly dropped the Linux page cache before each run:

drop_caches=OK (echo 3 > /proc/sys/vm/drop_caches)

baseline(defaults):                     5.240s
pin_memory=auto:                        2.613s
disable_mmap=True (pin default/auto):   2.648s
disable_mmap=True (pin_memory=False):  38.451s   <-- pathological
pin_memory=auto + disable_mmap=True:    2.588s

The key point: forcing pin_memory=False reliably reproduces the large slowdown (~38s).
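
For completeness, the timing pattern for these runs was "drop page cache, then time the load"; a minimal sketch is below. load_model is a placeholder for the actual from_pretrained() call with the configuration under test, and dropping caches requires root:

import os
import time

def drop_page_cache():
    # Flush dirty pages, then drop page cache, dentries and inodes (requires root)
    os.system("sync")
    os.system("echo 3 > /proc/sys/vm/drop_caches")

def timed_load(load_model, label):
    drop_page_cache()
    t0 = time.perf_counter()
    model = load_model()  # placeholder: the from_pretrained() variant being measured
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
    return model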


5) Minimal Reproduction (PyTorch only)

import time
import torch

device = torch.device("cuda:0")
torch.cuda.synchronize(device)

n = 2500
tensors = [torch.empty((1024, 1024), dtype=torch.float16, device="cpu") for _ in range(n)]

# pageable copies
torch.cuda.synchronize(device)
t0 = time.perf_counter()
_ = [t.to(device) for t in tensors]
torch.cuda.synchronize(device)
print("pageable dt =", time.perf_counter() - t0)

# pinned copies
tensors_pin = [t.pin_memory() for t in tensors]
torch.cuda.synchronize(device)
t0 = time.perf_counter()
_ = [t.to(device, non_blocking=True) for t in tensors_pin]
torch.cuda.synchronize(device)
print("pinned dt =", time.perf_counter() - t0)


6) Why we believe this might be driver/platform related (not just “user code”)

The workload pattern “many small copies” is common in ML model loading. The performance cliff suggests high per-copy overhead in one or more of:

  • pageable→device path using internal staging buffers

  • page-locking/page-pin operations per transfer

  • DMA mapping overhead (ARM SMMU/IOMMU path)

  • interaction with UMA memory management / page fault cost

  • kernel/driver synchronization behavior for many small transfers

On other GPU platforms (e.g., RTX 3090), the same load patterns do not exhibit such extreme slowdown, suggesting GB10/DGX Spark is amplifying the overhead.
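
To help localize the per-copy overhead, we can capture profiler traces on request; a sketch of the kind of capture we would run (torch.profiler here, Nsight Systems works equally well) is:

import torch
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda:0")
tensors = [torch.empty((1024, 1024), dtype=torch.float16) for _ in range(200)]

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for t in tensors:
        t.to(device)  # pageable H2D copy (the slow path)
    torch.cuda.synchronize(device)

# Splits time between host-side calls (e.g. cudaMemcpyAsync / staging) and device copy activity
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))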


7) Requests to NVIDIA (what we want help with)

  1. Is this behavior expected on GB10/DGX Spark?

  2. If not expected, can NVIDIA investigate/optimize:

    • pageable small-copy H2D path

    • IOMMU/SMMU/DMA mapping overhead

    • any driver/kernel tuning recommendations

  3. Are there recommended best practices/settings for GB10 to avoid this cliff without application-level pinning?

We can run additional targeted tests if you suggest specific diagnostics or driver debug flags.


8) Additional System Info Requested (we can provide)

Please let us know what you need; we can attach:

  • uname -a

  • cat /etc/os-release

  • nvidia-smi -q

  • dmesg | grep -i iommu (and full dmesg if needed)

  • lspci -vv (GB10-related devices)

  • /proc/meminfo, /proc/cmdline

  • Any NVIDIA recommended profiling traces


9) Notes / Workarounds found

  • Pinned host memory (tensor.pin_memory() + non_blocking=True) resolves the performance cliff; a loader-level sketch is shown after this list.

  • For workloads using safetensors, disable_mmap=True can help in some environments, but the dominant factor on this system is pinning for many-small copies.
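
As an application-level mitigation, a loader can pin each host tensor just before the device copy. A minimal sketch for a generic state_dict (state_dict_to_cuda is a hypothetical helper, not part of any library):

import torch

def state_dict_to_cuda(state_dict, device="cuda:0"):
    # Pin each CPU tensor so the H2D copy takes the fast pinned path,
    # issue the copies asynchronously, and synchronize once at the end.
    out = {}
    for name, t in state_dict.items():
        if t.device.type == "cpu":
            t = t.pin_memory()
        out[name] = t.to(device, non_blocking=True)
    torch.cuda.synchronize(device)
    return out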

Out of curiosity, what is the performance when you use the PyTorch container?

My CUDA13 benchmark showed that CPU to GPU was still decent.

I only pasted the relevant parts into the thread, but the benchmark did test local CPU-to-GPU copies.

Ran your sample code using the official NVIDIA PyTorch container and confirmed the issue. Crazy! I never really bothered to compare, so I bet my code can be improved too.

torch.empty() reserves RAM, but physical pages aren’t mapped until you start using them. When you load data into RAM, it’s like having a reservation at a restaurant where the host didn’t actually check if tables are available: they have to confirm real seating exists.

On a regular PC with a discrete RTX card, the PC and GPU are like two separate restaurants. They don’t care who’s sitting in the other one. The CPU confirms all its reservations in one pass, then hands the data to the GPU over PCIe.

On GB10, it’s like two restaurants sharing one seating area. When the Grace CPU takes a table, it has to walk over to the Blackwell GPU maître d’ and synchronize, for every single table. That per-page coordination is what kills performance.

I got it way faster by using “pin_memory=True” at allocation time:

Before (demand-paged):
tensors = [torch.empty((1024, 1024), dtype=torch.float16, device="cpu") for _ in range(n)]

After (pre-faulted and pinned):
tensors = [torch.empty((1024, 1024), dtype=torch.float16, pin_memory=True) for _ in range(n)]

The difference is that I’m pinning at allocation time, not calling pin_memory() on the tensors afterwards.
