Low Memory Throughput (D2H)

I am experiencing a dramatic drop in throughput in a device-to-host data transfer. My (simplified) code flow is as follows:

First, I use cudaMemcpyAsync to take advantage of streams (the host memory involved is allocated beforehand with cudaHostAlloc), overlapping these four data transfers with four kernels.

After that, I continue executing other kernels (5) until everything is finished, and at the end I do four more data transfers (cudaMemcpy, D2H; the host memory involved in this case is not pinned). This is where I run into the problem.
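In code, the structure is roughly this (a simplified sketch; the kernel, buffer names, and sizes are only placeholders, not my real code):

#include <cuda_runtime.h>
#include <cstdlib>

__global__ void kernelA(float *p, int n) {            // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    const int NSTREAMS = 4;
    const int N = 1 << 24;                             // placeholder per-buffer size
    const size_t bytes = N * sizeof(float);

    cudaStream_t stream[NSTREAMS];
    float *h_pinned[NSTREAMS], *d_buf[NSTREAMS], *h_pageable[NSTREAMS];

    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaHostAlloc((void**)&h_pinned[i], bytes, cudaHostAllocDefault);  // pinned
        cudaMalloc((void**)&d_buf[i], bytes);
        h_pageable[i] = (float*)malloc(bytes);                             // not pinned
    }

    // (1) Four async copies overlapped with four kernels, one per stream.
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaMemcpyAsync(d_buf[i], h_pinned[i], bytes, cudaMemcpyHostToDevice, stream[i]);
        kernelA<<<(N + 255) / 256, 256, 0, stream[i]>>>(d_buf[i], N);
    }

    // (2) Other kernels run here until the computation is finished.
    cudaDeviceSynchronize();

    // (3) Four final D2H copies into pageable host memory -- this is where
    //     I see the low throughput.
    for (int i = 0; i < NSTREAMS; ++i)
        cudaMemcpy(h_pageable[i], d_buf[i], bytes, cudaMemcpyDeviceToHost);

    return 0;
}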

The first transfer is very slow (down from 5-6 GB/s to 1.4 GB/s), and the following three increase in throughput to 2.2, 2.7, and 5.8 GB/s (this last one is very fast).

The problem is that the advantage I gain from using streams is lost to this drop in memory performance, and I do not know why it is happening.

I am using an NVIDIA GTX 480 card on Linux (Ubuntu 10.04). Thanks in advance.

What timing methodology are you using to compute the transfer rate? You may be inadvertently including kernel execution time in your measurement of transfer time.
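For example (just a sketch, with an arbitrary buffer size and names), timing only the copy itself after a full synchronization would look like this:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t bytes = 256 << 20;           // 256 MB, arbitrary size for the example
    float *d_buf;
    float *h_buf = (float*)malloc(bytes);     // pageable, as in the problematic copies
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaDeviceSynchronize();                  // make sure no kernel is still running
    cudaEventRecord(start, 0);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H throughput: %.2f GB/s\n", (bytes / 1.0e9) / (ms / 1.0e3));
    return 0;
}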

I am reading the durations of the memcpys in the NVIDIA Visual Profiler, so I don't think it is a timing mistake…

I really need help with this, please… I’ve tried everything and it is not working.

“the host memory involved in this case is not pinned”

But isn't it expected to see much lower throughput on transfers involving unpinned memory? That memory is pageable, so significant delays can occur when the operating system has to page it in first.
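A quick, untested sketch to check this on your own card (the buffer size and names here are arbitrary): compare the same D2H copy into a malloc'd buffer and into a cudaHostAlloc'd buffer.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time a single D2H copy of 'bytes' from d_src into h_dst and return GB/s.
static double d2hRate(void *h_dst, const void *d_src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(h_dst, d_src, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    return (bytes / 1.0e9) / (ms / 1.0e3);
}

int main() {
    const size_t bytes = 128 << 20;           // 128 MB, arbitrary
    void *d_src, *h_pageable = malloc(bytes), *h_pinned;
    cudaMalloc(&d_src, bytes);
    cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);

    printf("pageable: %.2f GB/s\n", d2hRate(h_pageable, d_src, bytes));
    printf("pinned:   %.2f GB/s\n", d2hRate(h_pinned, d_src, bytes));
    return 0;
}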

Christian

For Christian:

“The first transfer is very slow (down from 5-6 GB/s to 1.4 GB/s), and the following three increase in throughput to 2.2, 2.7, and 5.8 GB/s (this last one is very fast)”

The first throughput is very slow even for pageable memory, so I do not think that explains it, especially since the following transfers improve their throughput. Moreover, I reordered the four data transfers and the problem still occurs with the transfer that was formerly first.

Thanks a lot :)

Any new ideas???

Post a short, complete app that reproduces the problem.

Are the buffers so large that using pinned host memory buffers is not an option?

If so, try creating two smaller pinned staging buffers (buffer A and buffer B) per CUDA stream. You get the full speed for the D2H copies, but you will have to copy in smaller chunks, e.g. 128 MB chunks.

initiate an async D2H copy to buffer A
synchronize CUDA stream
initiate an async D2H copy to buffer B
perform an H2H memcpy from buffer A into the final (unpinned) destination address
synchronize CUDA stream
perform an H2H memcpy from buffer B into the final (unpinned) destination address
repeat until the whole transfer is done

I am not sure whether it is still possible to nicely overlap two or more CUDA streams doing this kind of transfer. The double-buffered transfer logic could also be handled in CUDA event callbacks to keep the complexity of the double buffering away from the main program code.
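Something along these lines (an untested sketch of the steps above; the function and variable names and the sizes in main are just placeholders):

#include <cuda_runtime.h>
#include <algorithm>
#include <cstdlib>
#include <cstring>

// Double-buffered D2H copy into an unpinned destination using two pinned
// staging buffers, as outlined in the steps above. 'chunk' is the staging
// buffer size (e.g. 128 MB).
static void d2hStaged(char *dst, const char *d_src, size_t total,
                      size_t chunk, cudaStream_t stream)
{
    char *stage[2];                          // pinned staging buffers A and B
    cudaHostAlloc((void**)&stage[0], chunk, cudaHostAllocDefault);
    cudaHostAlloc((void**)&stage[1], chunk, cudaHostAllocDefault);

    size_t issued  = std::min(chunk, total); // bytes submitted as async D2H copies
    size_t drained = 0;                      // bytes copied out of the staging buffers
    int buf = 0;

    // Prime buffer A with the first chunk.
    cudaMemcpyAsync(stage[0], d_src, issued, cudaMemcpyDeviceToHost, stream);

    while (drained < total) {
        cudaStreamSynchronize(stream);       // wait until stage[buf] is filled
        size_t ready = std::min(chunk, total - drained);
        int next = 1 - buf;

        // Start filling the other staging buffer while we drain this one.
        if (issued < total) {
            size_t n = std::min(chunk, total - issued);
            cudaMemcpyAsync(stage[next], d_src + issued, n,
                            cudaMemcpyDeviceToHost, stream);
            issued += n;
        }

        // H2H copy: pinned staging buffer -> final (unpinned) destination.
        memcpy(dst + drained, stage[buf], ready);
        drained += ready;
        buf = next;
    }

    cudaFreeHost(stage[0]);
    cudaFreeHost(stage[1]);
}

int main() {
    const size_t total = 512ULL << 20;            // 512 MB total, arbitrary
    const size_t chunk = 128ULL << 20;            // 128 MB staging chunks
    char *d_src, *h_dst = (char*)malloc(total);   // unpinned destination
    cudaMalloc((void**)&d_src, total);

    cudaStream_t s;
    cudaStreamCreate(&s);
    d2hStaged(h_dst, d_src, total, chunk, s);
    cudaStreamDestroy(s);
    return 0;
}

The point is that the async copy into one staging buffer overlaps with the host-side memcpy out of the other, so the GPU's DMA engine only ever writes to pinned memory and the CPU handles the pageable destination.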