I have a problem with D2H bandwidth on compute capability 1.3 HW when using pinned memory.
On all my other hardware (1.0, 1.1), pinned memory improves bandwidth drastically both ways (H2D & D2H). On my 1.3 HW, it only improves H2D; with synchronous copies, D2H bandwidth is no better than with pageable memory. This can be seen with the bandwidthTest binary.
If I replace cudaMemcpy with cudaMemcpyAsync on a non-zero stream (still copying to pinned memory), without overlapping the copy with anything else, I get a 2x boost in D2H bandwidth. The same change on 1.1 or 1.0 HW doesn’t improve performance.
Apparently, on 1.3 HW, cudaMemcpy and cudaMemcpyAsync on stream 0 don’t take advantage of pinned memory for D2H copies. Or am I missing something?
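
A minimal sketch of the comparison described above, for anyone who wants to reproduce it. This is not the bandwidthTest source: the helper name d2h_bandwidth_gbps, the 32 MB buffer size and the 10 repetitions are arbitrary choices made for illustration. It times D2H copies into a pinned host buffer, once with cudaMemcpy and once with cudaMemcpyAsync on a non-zero stream, with nothing else running concurrently.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message on any CUDA runtime error.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(1);                                              \
        }                                                         \
    } while (0)

// Hypothetical helper: returns the measured D2H bandwidth in GB/s for
// `reps` copies of `bytes` bytes into pinned host memory.
//   use_async == false: synchronous cudaMemcpy
//   use_async == true : cudaMemcpyAsync on a non-zero stream
static float d2h_bandwidth_gbps(size_t bytes, int reps, bool use_async)
{
    void *h_buf = NULL, *d_buf = NULL;
    cudaStream_t stream;
    cudaEvent_t start, stop;

    CHECK(cudaMallocHost(&h_buf, bytes));  // pinned (page-locked) host buffer
    CHECK(cudaMalloc(&d_buf, bytes));
    CHECK(cudaMemset(d_buf, 0, bytes));
    CHECK(cudaStreamCreate(&stream));      // the non-zero stream
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Untimed warm-up copy.
    CHECK(cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost));

    // Record the timing events in the same stream that carries the copies.
    cudaStream_t timed = use_async ? stream : 0;
    CHECK(cudaEventRecord(start, timed));
    for (int i = 0; i < reps; ++i) {
        if (use_async) {
            CHECK(cudaMemcpyAsync(h_buf, d_buf, bytes,
                                  cudaMemcpyDeviceToHost, stream));
        } else {
            CHECK(cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost));
        }
    }
    CHECK(cudaEventRecord(stop, timed));
    CHECK(cudaEventSynchronize(stop));     // wait until all copies are done

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_buf));

    double gigabytes = (double)bytes * reps / 1e9;
    return (float)(gigabytes / (ms / 1e3));
}

int main()
{
    const size_t bytes = 32u << 20;  // 32 MB per copy (arbitrary)
    const int reps = 10;             // arbitrary

    printf("D2H pinned, cudaMemcpy:                       %.2f GB/s\n",
           d2h_bandwidth_gbps(bytes, reps, false));
    printf("D2H pinned, cudaMemcpyAsync (non-zero stream): %.2f GB/s\n",
           d2h_bandwidth_gbps(bytes, reps, true));
    return 0;
}
```

Build with nvcc and run it on each card. If the two numbers match on 1.0/1.1 HW but differ by about 2x on 1.3 HW, that reproduces the behaviour in question.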