(DGX Spark, ARM64, CUDA 13) pathological slowdown for many-small H2D copies from pageable CPU memory (≈50× vs pinned); impacts PyTorch model load patterns

1) Executive Summary

We are seeing an extreme performance cliff on a new DGX Spark with NVIDIA GB10: performing many small CPU→GPU (H2D) copies from pageable host memory is orders of magnitude slower than from pinned host memory. This is large enough to turn real model loading (thousands of tensors moved individually to CUDA) from a few seconds into tens of seconds.

This appears to be a platform/driver path issue (page-locking / DMA mapping / page fault interaction on ARM64/UMA), because:

  • large-copy pinned bandwidth is healthy,

  • the slowdown is specifically tied to pageable + many-small copies, and

  • toggling pinning at the framework level fully flips the behavior.

We request NVIDIA to confirm whether this is expected for GB10/DGX Spark, and if not, investigate the driver/kernel interaction.


2) Environment (as observed)

  • System: DGX Spark (new purchase)

  • CPU arch: aarch64

  • GPU: NVIDIA GB10, compute capability 12.1

  • Unified memory: 128GB (platform UMA / “unified memory system”)

  • SSD: 4TB local NVMe

  • Filesystem: ext4, weights stored locally on NVMe (/dev/nvme0n1p2, mount /)

  • Driver: 580.95.05

  • CUDA: 13.0 (nvcc 13.0.88)

  • Python: 3.12.12

  • PyTorch: custom build 2.10.0a0+gitae7d5b8 with sm_121 enabled (torch.cuda.get_arch_list(): ['sm_121', 'compute_121'])


3) Observed Symptoms / User Impact

  • Any workload that does thousands of small tensor H2D copies from CPU to CUDA becomes extremely slow on GB10.

  • A real-world example: quantized transformer weight loading (many small tensors) can take ~30–40s when the host tensors are pageable, but drops to ~2–4s once host tensors are pinned before .to("cuda").


4) Key Measurements (strong evidence)

4.1 Large-copy H2D bandwidth (1 GiB)

H2D pageable: 4.139 GB/s (1.00GiB, dt=0.259s)
H2D pinned:   20.681 GB/s (1.00GiB, dt=0.052s)

Pinned bandwidth looks fine.
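
For reference, a measurement of this kind can be reproduced with a minimal sketch like the one below (illustrative only, not necessarily the exact harness used for the numbers above; the buffers are pre-touched so the timing is not dominated by first-touch faults):

import time
import torch

device = torch.device("cuda:0")
nbytes = 1 << 30  # 1 GiB

src_pageable = torch.empty(nbytes, dtype=torch.uint8)                 # pageable host buffer
src_pinned = torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)  # pinned host buffer
src_pageable.fill_(1)  # touch pages so physical memory is actually mapped
src_pinned.fill_(1)

for name, src in [("pageable", src_pageable), ("pinned", src_pinned)]:
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    src.to(device, non_blocking=True)  # non_blocking is effectively a no-op for the pageable source
    torch.cuda.synchronize(device)
    dt = time.perf_counter() - t0
    print(f"H2D {name}: {nbytes / dt / 1e9:.3f} GB/s (dt={dt:.3f}s)")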

4.2 Many-small copies: 2500 × (1024×1024 fp16)

small copies pageable: n=2500 dt=10.284s
small copies pinned:   n=2500 dt=0.198s

That is a ~52× slowdown for pageable vs pinned: roughly 4.1 ms per 2 MiB copy on the pageable path versus ~0.08 ms when pinned.

4.3 Real model-load-like benchmark (page cache dropped each run)

We measured from_pretrained() load time for a quantized transformer (safetensors) under several configurations; we also controlled safetensors mmap behavior (disable_mmap) and explicitly dropped the Linux page cache before each run:

drop_caches=OK (echo 3 > /proc/sys/vm/drop_caches)

baseline(defaults):                     5.240s
pin_memory=auto:                        2.613s
disable_mmap=True (pin default/auto):   2.648s
disable_mmap=True (pin_memory=False):  38.451s   <-- pathological
pin_memory=auto + disable_mmap=True:    2.588s

The key point: forcing pin_memory=False reliably reproduces the large slowdown (~38s).
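
For completeness, the timing pattern for these runs was "drop page cache, then time the load"; a minimal sketch is below. load_model is a placeholder for the actual from_pretrained() call with the configuration under test, and dropping caches requires root:

import os
import time

def drop_page_cache():
    # Flush dirty pages, then drop page cache, dentries and inodes (requires root)
    os.system("sync")
    os.system("echo 3 > /proc/sys/vm/drop_caches")

def timed_load(load_model, label):
    drop_page_cache()
    t0 = time.perf_counter()
    model = load_model()  # placeholder: the from_pretrained() variant being measured
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
    return model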


5) Minimal Reproduction (PyTorch only)

import time
import torch

device = torch.device("cuda:0")
torch.cuda.synchronize(device)

n = 2500
tensors = [torch.empty((1024, 1024), dtype=torch.float16, device="cpu") for _ in range(n)]

# pageable copies
torch.cuda.synchronize(device)
t0 = time.perf_counter()
_ = [t.to(device) for t in tensors]
torch.cuda.synchronize(device)
print("pageable dt =", time.perf_counter() - t0)

# pinned copies
tensors_pin = [t.pin_memory() for t in tensors]
torch.cuda.synchronize(device)
t0 = time.perf_counter()
_ = [t.to(device, non_blocking=True) for t in tensors_pin]
torch.cuda.synchronize(device)
print("pinned dt =", time.perf_counter() - t0)


6) Why we believe this might be driver/platform related (not just “user code”)

The workload pattern “many small copies” is common in ML model loading. The performance cliff suggests high per-copy overhead in one or more of:

  • pageable→device path using internal staging buffers

  • page-locking/page-pin operations per transfer

  • DMA mapping overhead (ARM SMMU/IOMMU path)

  • interaction with UMA memory management / page fault cost

  • kernel/driver synchronization behavior for many small transfers

On other GPU platforms (e.g., RTX 3090), the same load patterns do not exhibit such extreme slowdown, suggesting GB10/DGX Spark is amplifying the overhead.
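
To help localize the per-copy overhead, we can capture profiler traces on request; a sketch of the kind of capture we would run (torch.profiler here, Nsight Systems works equally well) is:

import torch
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda:0")
tensors = [torch.empty((1024, 1024), dtype=torch.float16) for _ in range(200)]

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for t in tensors:
        t.to(device)  # pageable H2D copy (the slow path)
    torch.cuda.synchronize(device)

# Splits time between host-side calls (e.g. cudaMemcpyAsync / staging) and device copy activity
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))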


7) Requests to NVIDIA (what we want help with)

  1. Is this behavior expected on GB10/DGX Spark?

  2. If not expected, can NVIDIA investigate/optimize:

    • pageable small-copy H2D path

    • IOMMU/SMMU/DMA mapping overhead

    • any driver/kernel tuning recommendations

  3. Are there recommended best practices/settings for GB10 to avoid this cliff without application-level pinning?

We can run additional targeted tests if you suggest specific diagnostics or driver debug flags.


8) Additional System Info Requested (we can provide)

Please let us know what you need; we can attach:

  • uname -a

  • cat /etc/os-release

  • nvidia-smi -q

  • dmesg | grep -i iommu (and full dmesg if needed)

  • lspci -vv (GB10-related devices)

  • /proc/meminfo, /proc/cmdline

  • Any NVIDIA recommended profiling traces


9) Notes / Workarounds found

  • Pinned host memory (tensor.pin_memory() + non_blocking=True) resolves the performance cliff; a loader-level sketch is shown after this list.

  • For workloads using safetensors, disable_mmap=True can help in some environments, but the dominant factor on this system is pinning for many-small copies.
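
As an application-level mitigation, a loader can pin each host tensor just before the device copy. A minimal sketch for a generic state_dict (state_dict_to_cuda is a hypothetical helper, not part of any library):

import torch

def state_dict_to_cuda(state_dict, device="cuda:0"):
    # Pin each CPU tensor so the H2D copy takes the fast pinned path,
    # issue the copies asynchronously, and synchronize once at the end.
    out = {}
    for name, t in state_dict.items():
        if t.device.type == "cpu":
            t = t.pin_memory()
        out[name] = t.to(device, non_blocking=True)
    torch.cuda.synchronize(device)
    return out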

Out of curiosity, what is the performance when you use the PyTorch container?

My CUDA13 benchmark showed that CPU to GPU was still decent.

I only pasted the relevant parts into the thread, but the benchmark did test local CPU-to-GPU copies.

Ran your sample code using the official NVIDIA PyTorch container and confirmed the issue. Crazy! I never really bothered to compare, so I bet my code can be improved too.

torch.empty() reserves RAM, but physical pages aren’t mapped until you start using them. When you load data into RAM, it’s like having a reservation at a restaurant where the host didn’t actually check if tables are available: they have to confirm real seating exists.

On a regular PC with a discrete RTX card, the PC and GPU are like two separate restaurants. They don’t care who’s sitting in the other one. The CPU confirms all its reservations in one pass, then hands the data to the GPU over PCIe.

On GB10, it’s like two restaurants sharing one seating area. When the Grace CPU takes a table, it has to walk over to the Blackwell GPU maître d’ and synchronize, for every single table. That per-page coordination is what kills performance.

I got it way faster by using “pin_memory=True” at allocation time:

Before (demand-paged):
tensors = [torch.empty((1024, 1024), dtype=torch.float16, device="cpu") for _ in range(n)]

After (pre-faulted and pinned):
tensors = [torch.empty((1024, 1024), dtype=torch.float16, pin_memory=True) for _ in range(n)]

The difference is that I’m pinning at allocation time, not calling pin_memory() on the tensors afterwards.
