I am experiencing a dramatic reduction of throughput in a device-to-host data transfer. My (simplified) code context is as follows:
First I use cudaMemcpyAsync to take advantage of streams (the host memory involved was allocated beforehand with cudaHostAlloc), overlapping these four data transfers with four kernels.
After that, I keep executing other kernels (five of them) until all the device work is done, and at the end I perform four data transfers with cudaMemcpy D2H; the host memory involved in this case is not pinned. This is where I run into the problem.
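In outline, the structure is something like this (a simplified sketch only; kernel names, buffer sizes and launch configurations are placeholders, not the real code):

// Simplified sketch of the structure described above.
#include <cstdlib>
#include <cuda_runtime.h>

#define N_STREAMS 4
#define BUF_BYTES ((size_t)64 * 1024 * 1024)   // placeholder buffer size

__global__ void stageOneKernel(float *d, size_t n) { /* placeholder body */ }
__global__ void stageTwoKernel(float *d, size_t n) { /* placeholder body */ }

int main(void)
{
    cudaStream_t streams[N_STREAMS];
    float *h_pinned[N_STREAMS], *h_pageable[N_STREAMS], *d_buf[N_STREAMS];
    size_t n = BUF_BYTES / sizeof(float);

    for (int i = 0; i < N_STREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaHostAlloc((void **)&h_pinned[i], BUF_BYTES, cudaHostAllocDefault);
        cudaMalloc((void **)&d_buf[i], BUF_BYTES);
        h_pageable[i] = (float *)malloc(BUF_BYTES);   // not pinned
    }

    // (1) four kernels overlapped with four async D2H copies (pinned memory)
    for (int i = 0; i < N_STREAMS; ++i) {
        stageOneKernel<<<256, 256, 0, streams[i]>>>(d_buf[i], n);
        cudaMemcpyAsync(h_pinned[i], d_buf[i], BUF_BYTES,
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    // (2) more kernels (five in the real code) until the device work is done
    for (int i = 0; i < N_STREAMS; ++i)
        stageTwoKernel<<<256, 256, 0, streams[i]>>>(d_buf[i], n);

    // (3) final four cudaMemcpy D2H into pageable memory -- the slow part
    for (int i = 0; i < N_STREAMS; ++i)
        cudaMemcpy(h_pageable[i], d_buf[i], BUF_BYTES,
                   cudaMemcpyDeviceToHost);

    // cleanup omitted
    return 0;
}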
The first transfer is very slow (it drops from 5-6 GB/s to 1.4 GB/s), and the following three increase their throughput to 2.2, 2.7 and 5.8 GB/s (this last one is very fast).
The problem is that whatever advantage I gain from using streams, I lose to this drop in transfer performance, and I do not know why it is happening.
I am using an NVIDIA GTX 480 card with Ubuntu Linux 10.04. Thanks in advance.
What timing methodology are you using to compute the transfer rate? You may be inadvertently including kernel execution time in your measurement of the transfer time.
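For example, something along these lines (an illustrative sketch only; h_dst and d_src are placeholders for your buffers) measures just the copy itself, by draining outstanding GPU work first:

#include <cuda_runtime.h>

// Measures only the D2H copy: synchronize first so pending kernel work
// is not attributed to the transfer. Error checking omitted.
static float measure_d2h_gbps(void *h_dst, const void *d_src, size_t bytes)
{
    cudaDeviceSynchronize();              // flush outstanding kernels/copies

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(h_dst, d_src, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (float)((bytes / (ms * 1.0e-3)) / 1.0e9);   // GB/s
}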
“the host memory involved in this case is not pinned”
but isn’t it expected to see much lower throughput on transfers involving unpinned memory? That memory is pageable, so significant delays will occur when the operating system kernel has to page it in first.
“The first transfer is very slow (from 5~6 GB/s to 1.4 GB/s) and the following three increase their throughput to 2.2, 2.7 and 5.8 GB/s (this last one is very fast)”
That first transfer is very slow even for pageable memory, and I do not think that is the explanation, because the following transfers (also to pageable memory) improve their throughput. Moreover, I reordered the four data transfers and the problem still occurs with the transfer that was originally first.
Are the buffers so large that using pinned host memory buffers is not an option?
If so, try creating two smaller pinned staging buffers (buffers A and B) per CUDA stream. You get the full speed for the D2H copies, but you will have to copy in smaller chunks, e.g. 128 MB chunks.
initiate an async D2H copy to buffer A
synchronize CUDA stream
initiate an async D2H copy to buffer B
perform an H2H memcpy from buffer A into the final (unpinned) destination address
synchronize CUDA stream
perform an H2H memcpy from buffer B into the final (unpinned) destination address
repeat until the whole transfer is done
I am not sure if it is still possible to nicely overlap two or more CUDA streams doing this kind of transfer. The double-buffered transfer logic could also be driven from CUDA stream callbacks (e.g. cudaStreamAddCallback) to keep the complexity of the double buffering away from the main program code.
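Roughly, a single-stream version of the double-buffered loop above could look like this (an untested sketch; the 128 MB chunk size and all names are placeholders, and error checking is omitted):

#include <cstring>
#include <cuda_runtime.h>

#define CHUNK_BYTES ((size_t)128 * 1024 * 1024)   // placeholder chunk size

// Copies `bytes` from device memory to a pageable host buffer via two
// pinned staging buffers, overlapping the H2H memcpy of one chunk with
// the D2H copy of the next.
static void d2h_via_staging(char *dst_pageable, const char *d_src,
                            size_t bytes, cudaStream_t stream)
{
    char *stage[2];                               // pinned staging buffers A and B
    cudaHostAlloc((void **)&stage[0], CHUNK_BYTES, cudaHostAllocDefault);
    cudaHostAlloc((void **)&stage[1], CHUNK_BYTES, cudaHostAllocDefault);

    size_t copied = 0;     // bytes already queued for D2H
    size_t drained = 0;    // bytes already memcpy'd into the pageable destination
    int buf = 0;           // staging buffer receiving the current chunk

    // prime the pipeline: first chunk goes into buffer A
    size_t len = bytes < CHUNK_BYTES ? bytes : CHUNK_BYTES;
    cudaMemcpyAsync(stage[buf], d_src, len, cudaMemcpyDeviceToHost, stream);
    copied = len;

    while (drained < bytes) {
        cudaStreamSynchronize(stream);            // chunk in stage[buf] has arrived

        int ready = buf;                          // buffer we can drain now
        size_t ready_off = drained;
        size_t ready_len = copied - drained;

        // start the next D2H copy into the other buffer, if anything is left
        if (copied < bytes) {
            buf ^= 1;
            len = (bytes - copied < CHUNK_BYTES) ? bytes - copied : CHUNK_BYTES;
            cudaMemcpyAsync(stage[buf], d_src + copied, len,
                            cudaMemcpyDeviceToHost, stream);
            copied += len;
        }

        // this H2H memcpy overlaps with the D2H copy that was just queued
        memcpy(dst_pageable + ready_off, stage[ready], ready_len);
        drained = ready_off + ready_len;
    }

    cudaFreeHost(stage[0]);
    cudaFreeHost(stage[1]);
}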