(DGX Spark, ARM64, CUDA 13) pathological slowdown for many-small H2D copies from pageable CPU memory (≈50× vs pinned); impacts PyTorch model load patt

Out of curiosity, what is the performance when you use the PyTorch container?

My CUDA 13 benchmark showed that CPU-to-GPU transfer performance was still decent.

I only pasted the relevant parts into the thread, but the benchmark did test local CPU-to-GPU transfers.
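
For anyone who wants to compare numbers inside the container, here is a rough sketch of how the many-small-copies case could be timed for pageable versus pinned host memory. This is not the benchmark mentioned above; `time_h2d`, the tensor count, and the tensor size are made-up values for illustration, and the function simply returns `None` if PyTorch or CUDA is not available.

```python
import time

try:
    import torch
except ImportError:  # let the script load even without PyTorch installed
    torch = None


def time_h2d(num_tensors=256, numel=4096, pin=False):
    """Time copying many small CPU tensors to the GPU.

    Returns elapsed seconds, or None when PyTorch/CUDA is unavailable.
    The sizes are illustrative assumptions, not taken from the thread.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    # torch.ones writes to every element, so the host pages are really allocated.
    src = [torch.ones(numel, dtype=torch.float32) for _ in range(num_tensors)]
    if pin:
        src = [t.pin_memory() for t in src]  # page-locked host copies
    # One warm-up copy so CUDA context setup is not counted in the timing.
    src[0].to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    # non_blocking only gives asynchronous copies when the source is pinned.
    dst = [t.to("cuda", non_blocking=pin) for t in src]
    torch.cuda.synchronize()  # wait for all copies before stopping the clock
    return time.perf_counter() - start


if __name__ == "__main__":
    pageable = time_h2d(pin=False)
    pinned = time_h2d(pin=True)
    if pageable is None:
        print("PyTorch or CUDA not available; nothing measured")
    else:
        print(f"pageable: {pageable * 1e3:.2f} ms, pinned: {pinned * 1e3:.2f} ms")
```

Running the same script on the host and inside the PyTorch container should show whether the gap between the two numbers depends on the environment.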