Tesla P100 asynchronous prefetching of managed memory slow - < 1GB/s

zjw518 · May 5, 2018, 6:15pm

When I run the bandwidth test (in the samples), both directions of transfer for pageable memory report ~11GB/s. But I’m only achieving ~1GB/s on average when using cudaMemPrefetchAsync to move memory allocated with cudaMallocManaged to and from the host (processing the full data set in portions). See the nvprof output:

==160136== Unified Memory profiling result:
Device "Tesla P100-PCIE-16GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   11907  1.9995MB  4.0000KB  2.0000MB  23.25006GB   2.403346s  Host To Device
   15875  1.9996MB  4.0000KB  2.0000MB  31.00006GB   2.558402s  Device To Host
       2         -         -         -           -  395.7120us  Gpu page fault groups
Total CPU Page faults: 1

I assume that CUDA doesn’t transfer more than 2MB at a time (given the above results). I am prefetching 64 or 128MB of contiguous data at around 100 addresses (for each iteration). The transfers are currently serialized with each other (I have yet to implement different streams) - but the kernel I’m using takes O(10ms) to process data that takes O(.5s) to transfer in one direction. Obviously, even if the kernel were running concurrently with both transfers, I would need at least 10x more bandwidth to get anywhere close to balanced transfer/compute time.

I know I haven’t included any code - but I’m just calling cudaMemPrefetchAsync as I describe, plus a simple kernel. Am I missing some reason that would cause these transfers to be so slow?

Robert_Crovella · May 5, 2018, 6:28pm

Total size: 23.25GB

Total time: 2.4033s

That looks like roughly 10GB/s throughput. And the Device To Host throughput looks higher.

zjw518 · May 5, 2018, 6:32pm

I need coffee…

zjw518 · May 5, 2018, 8:57pm

Is there any canonical strategy for applications that are highly (host-to-device) memory-bound, if I cannot create (or combine) more work to be done in between memory transfers? Now I can see (with correct arithmetic…) that this particular kernel in fact takes 1/50th of the time needed to transfer the memory it requires. (I did test that running this kernel on 8 threads on the CPU is still ~5x slower than the memory transfer to the GPU itself.) I have no ideas - is this just hopeless?
