When I run the bandwidthTest sample from the CUDA samples, both transfer directions report ~11 GB/s for pageable memory. But I'm only achieving ~1 GB/s on average when using cudaMemPrefetchAsync to move cudaMallocManaged memory to and from the host (processing the full data set in portions). See the nvprof output:
==160136== Unified Memory profiling result:
Device "Tesla P100-PCIE-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
11907 1.9995MB 4.0000KB 2.0000MB 23.25006GB 2.403346s Host To Device
15875 1.9996MB 4.0000KB 2.0000MB 31.00006GB 2.558402s Device To Host
2 - - - - 395.7120us Gpu page fault groups
Total CPU Page faults: 1
Given the results above, I assume CUDA doesn't migrate more than 2 MB at a time. Each iteration, I prefetch 64 or 128 MB of contiguous data at around 100 addresses. The transfers are currently serialized with respect to each other (I haven't implemented separate streams yet), but the kernel takes on the order of 10 ms to process data that takes on the order of 0.5 s to transfer in one direction. So even if the kernel were running concurrently with both transfers, I would need at least 10x more bandwidth to get anywhere close to balanced transfer/compute time.
I haven't posted my actual code, but it's essentially just cudaMemPrefetchAsync calls as described plus a simple kernel; a stripped-down sketch of the pattern is included below. Am I missing some reason that would cause these transfers to be so slow?
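For reference, here is a minimal sketch of the pattern. It is illustrative only, not my real code: the kernel body, variable names, chunk size, and region count are stand-ins.

// Illustrative sketch only: chunk size, region count, and kernel body are assumptions.
#include <cuda_runtime.h>

__global__ void process(float *data, size_t n)              // stand-in for my ~10 ms kernel
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i] + 1.0f;
}

int main()
{
    int dev = 0;
    cudaSetDevice(dev);

    const size_t chunkBytes = 128u << 20;                    // 64 or 128 MB per region (illustrative)
    const int    regions    = 100;                           // ~100 contiguous regions per iteration
    const size_t nPerChunk  = chunkBytes / sizeof(float);

    float *buf = nullptr;
    cudaMallocManaged(&buf, chunkBytes * (size_t)regions);   // managed allocation

    for (size_t i = 0; i < nPerChunk * (size_t)regions; ++i) // host writes the data set first
        buf[i] = 1.0f;

    for (int r = 0; r < regions; ++r) {
        float *chunk = buf + (size_t)r * nPerChunk;

        cudaMemPrefetchAsync(chunk, chunkBytes, dev, 0);             // migrate region to the GPU

        process<<<(unsigned)((nPerChunk + 255) / 256), 256>>>(chunk, nPerChunk);

        cudaMemPrefetchAsync(chunk, chunkBytes, cudaCpuDeviceId, 0); // migrate results back to host

        cudaDeviceSynchronize();   // everything is on the default stream for now, so no overlap yet
    }

    cudaFree(buf);
    return 0;
}

(The real data set and iteration structure differ; this is only meant to show the prefetch/kernel/prefetch call pattern.)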