1) Executive Summary

We are seeing an extreme performance cliff on a new DGX Spark with NVIDIA GB10: performing many small CPU→GPU (H2D) copies from pageable host memory is orders of magnitude slower than from pinned host memory. This is large enough to turn real model loading (thousands of tensors…
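For anyone who wants to check this on their own machine, here is a minimal sketch of the comparison described above: the same set of small host tensors copied to the GPU one at a time, first from pageable memory and then from pinned memory. It is not the original poster's benchmark. The tensor count, tensor size, and the function names `time_small_h2d_copies`, `throughput_mib_s`, and `compare` are assumptions chosen for illustration, and it assumes PyTorch with a working CUDA device.

```python
import time


def throughput_mib_s(total_bytes: int, seconds: float) -> float:
    """Convert a byte count and an elapsed time into MiB/s."""
    return total_bytes / (1024 * 1024) / seconds


def time_small_h2d_copies(num_tensors: int = 2000, numel: int = 4096,
                          pin: bool = False) -> float:
    """Copy num_tensors small float32 tensors CPU->GPU one by one; return seconds."""
    import torch  # imported lazily so throughput_mib_s works without PyTorch

    # Assumed workload: many small tensors, loosely like a checkpoint's parameters.
    host = [torch.empty(numel, dtype=torch.float32) for _ in range(num_tensors)]
    if pin:
        # pin_memory() returns a page-locked copy of each tensor.
        host = [t.pin_memory() for t in host]

    # Warm-up copy so CUDA context creation is not included in the timing.
    host[0].to("cuda")
    torch.cuda.synchronize()

    start = time.perf_counter()
    # non_blocking only has an effect for pinned sources, so it follows `pin`.
    on_gpu = [t.to("cuda", non_blocking=pin) for t in host]
    torch.cuda.synchronize()  # wait for all queued copies before stopping the clock
    elapsed = time.perf_counter() - start

    del on_gpu
    return elapsed


def compare(num_tensors: int = 2000, numel: int = 4096) -> None:
    """Print pageable vs pinned timings for the same workload."""
    total_bytes = num_tensors * numel * 4  # float32 is 4 bytes per element
    pageable_s = time_small_h2d_copies(num_tensors, numel, pin=False)
    pinned_s = time_small_h2d_copies(num_tensors, numel, pin=True)
    print(f"pageable: {pageable_s:.4f} s ({throughput_mib_s(total_bytes, pageable_s):.1f} MiB/s)")
    print(f"pinned:   {pinned_s:.4f} s ({throughput_mib_s(total_bytes, pinned_s):.1f} MiB/s)")
    print(f"pageable / pinned time ratio: {pageable_s / pinned_s:.1f}x")
```

Calling `compare()` on a CUDA machine prints both timings and their ratio. The pinning step itself is deliberately left outside the timed region, so the numbers reflect only the copies. A real model-loading path would also pay the pinning cost.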

(DGX Spark, ARM64, CUDA 13) pathological slowdown for many-small H2D copies from pageable CPU memory (≈50× vs pinned); impacts PyTorch model load patt

alan.dang December 13, 2025, 3:24am 2

Out of curiosity, what is the performance when you use the PyTorch container?

My CUDA 13 benchmark showed that CPU-to-GPU transfers were still decent.

I only pasted the relevant parts into the thread, but the benchmark did test local CPU-to-GPU transfers.
