Tesla P100 asynchronous prefetching of managed memory slow - < 1GB/s

When I run the bandwidth test (from the CUDA samples), both transfer directions report ~11GB/s for pageable memory. But I’m only achieving ~1GB/s on average when using cudaMemPrefetchAsync to move cudaMallocManaged memory to and from the host (processing the full data set in portions). See the nvprof output:

==160136== Unified Memory profiling result:
Device "Tesla P100-PCIE-16GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   11907  1.9995MB  4.0000KB  2.0000MB  23.25006GB   2.403346s  Host To Device
   15875  1.9996MB  4.0000KB  2.0000MB  31.00006GB   2.558402s  Device To Host
       2         -         -         -           -  395.7120us  Gpu page fault groups
Total CPU Page faults: 1

I assume that CUDA doesn’t transfer more than 2MB at a time (given the above results). In each iteration I am prefetching 64 or 128MB of contiguous data at around 100 addresses. The transfers are currently serialized with respect to each other (I have yet to use separate streams), but the kernel I’m using takes O(10ms) to process data that takes O(0.5s) to transfer in one direction. Obviously, even if the kernel were running concurrently with both transfers, I would need at least 10x more bandwidth to get anywhere close to balanced transfer/compute time.

I know I haven’t included any code, but I’m just calling cudaMemPrefetchAsync as described, followed by a simple kernel. Is there some reason I’m missing that would cause these transfers to be so slow?

Total size: 23.25GB

Total time: 2.4033s

That works out to roughly 10GB/s of Host To Device throughput, and the Device To Host throughput looks higher still.

I need coffee…

Is there any canonical strategy for applications that are highly (host-to-device) memory-bound, if I cannot create (or combine) more work to be done between memory transfers? Now I can see (with correct arithmetic…) that this particular kernel in fact takes 1/50th of the time needed to transfer the memory it requires. (I did test that running this kernel on 8 threads on the CPU is still ~5x slower than the memory transfer to the GPU itself.) I have no ideas. Is this just hopeless?