I am seeing a dramatic drop in throughput for device-to-host data transfers. Simplified, my code does the following:
First, I use cudaMemcpyAsync with streams to overlap four data transfers with four kernels; the host memory involved was previously allocated with cudaHostAlloc, so it is pinned.
After that, I run five more kernels until all the computation has finished, and finally I perform four device-to-host transfers with cudaMemcpy; the host memory involved in this case is not pinned. This is where the problem appears.
The first transfer is very slow (1.4 GB/s instead of 5-6 GB/s), and the following three speed up to 2.2, 2.7 and 5.8 GB/s (this last one is very fast).
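For reference, the sequence described above can be sketched roughly as follows. This is only an illustration of the pattern, not my actual code: the kernel `dummyKernel`, the `CHECK` macro, the buffer size, the direction of the asynchronous copies (assumed device to host) and the use of the default stream for the later kernels are all placeholders chosen for the sketch. Each final cudaMemcpy is timed with CUDA events to obtain a per-transfer throughput.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper, not part of the original code.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real ones.
__global__ void dummyKernel(float *d, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int N = 4;                          // four streams / four transfers
    const size_t n = (size_t)1 << 22;         // assumed size: 4M floats (16 MiB) per buffer
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);

    cudaStream_t streams[N];
    float *dBuf[N], *hPinned[N], *hPageable[N];
    for (int i = 0; i < N; ++i) {
        CHECK(cudaStreamCreate(&streams[i]));
        CHECK(cudaMalloc((void **)&dBuf[i], bytes));
        CHECK(cudaMemset(dBuf[i], 0, bytes));
        CHECK(cudaHostAlloc((void **)&hPinned[i], bytes, cudaHostAllocDefault)); // pinned
        hPageable[i] = (float *)malloc(bytes);                                   // not pinned
        if (!hPageable[i]) { fprintf(stderr, "malloc failed\n"); return EXIT_FAILURE; }
    }

    // Phase 1: four kernels, each followed by an async copy in its own stream
    // (copy direction assumed to be device to host).
    for (int i = 0; i < N; ++i) {
        dummyKernel<<<blocks, threads, 0, streams[i]>>>(dBuf[i], n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hPinned[i], dBuf[i], bytes,
                              cudaMemcpyDeviceToHost, streams[i]));
    }

    // Phase 2: five more kernels (launched in the default stream for this sketch).
    for (int k = 0; k < 5; ++k) {
        dummyKernel<<<blocks, threads>>>(dBuf[k % N], n);
        CHECK(cudaGetLastError());
    }
    CHECK(cudaDeviceSynchronize());

    // Phase 3: four blocking device-to-host copies into pageable memory, timed individually.
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    for (int i = 0; i < N; ++i) {
        float ms = 0.0f;
        CHECK(cudaEventRecord(start, 0));
        CHECK(cudaMemcpy(hPageable[i], dBuf[i], bytes, cudaMemcpyDeviceToHost));
        CHECK(cudaEventRecord(stop, 0));
        CHECK(cudaEventSynchronize(stop));
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("transfer %d: %.2f GB/s\n", i, (bytes / 1e9) / (ms / 1e3));
    }

    // Cleanup.
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    for (int i = 0; i < N; ++i) {
        CHECK(cudaStreamDestroy(streams[i]));
        CHECK(cudaFree(dBuf[i]));
        CHECK(cudaFreeHost(hPinned[i]));
        free(hPageable[i]);
    }
    return 0;
}
```

In this sketch the pageable destination buffers are not touched by the host before the final copies, which matches what I do. The figures above come from my real program, not from this sketch.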
As a result, the time I gain by using streams is lost to this drop in transfer performance, and I do not know why it happens.
I am using an NVIDIA GTX 480 card on Ubuntu Linux 10.04. Thanks in advance.