Getting data from GPU to CPU without blocking calls

I decided to switch from a blocking cudaMemcpy to an asynchronous cudaMemcpyAsync.

At first, to my surprise, the async calls took just as long as the blocking ones. In other words: cudaMemcpyAsync was blocking and did not return before the transfer was done.

But that blocking went away once I found cudaMallocHost(): I now copy into a pinned staging area.

However, the delay just moved from the async copy to the libc memcpy I do afterwards to move the data out of the staging area and into its final destination.

I am seeing 28 ms to memcpy 64 MB, roughly 2.3 GB/s, which seems far too slow for a host-to-host copy.

The calls I do:

cuLaunchKernel takes 0.03ms

cudaStreamSynchronize takes 23ms

cudaMemcpyAsync (copy into staging area) takes 0.02ms

memcpy to copy from staging area to CPU buffer takes 28ms.

NOTE: I space these calls out over video frames. So first the launch; then, one frame (16.7 ms) later, the sync and the async copy; and another 16.7 ms after that, the memcpy. A simplified sketch of this sequence is below.
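
Roughly, the per-frame sequence looks like this (a simplified sketch; the real launch goes through the driver API with cuLaunchKernel, and the buffer names and the 64 MB size are placeholders):

#include <cuda_runtime.h>
#include <string.h>

#define FIELD_BYTES (64u * 1024u * 1024u)

static float        *d_field;    /* device buffer written by the kernel       */
static float        *h_staging;  /* pinned staging area from cudaMallocHost() */
static float        *h_final;    /* ordinary (pageable) destination buffer    */
static cudaStream_t  stream;

/* Frame N: launch only (~0.03 ms on the host). */
static void frame_launch(void)
{
    /* osino_computefield is launched into 'stream' via cuLaunchKernel here. */
}

/* Frame N+1: wait for the kernel, then queue the DtoH copy (~23 ms + 0.02 ms). */
static void frame_copy(void)
{
    cudaStreamSynchronize(stream);                    /* blocks until the kernel is done */
    cudaMemcpyAsync(h_staging, d_field, FIELD_BYTES,
                    cudaMemcpyDeviceToHost, stream);  /* returns almost immediately      */
}

/* Frame N+2: drain the staging area into the final buffer (~28 ms). */
static void frame_drain(void)
{
    memcpy(h_final, h_staging, FIELD_BYTES);
}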

Why does the libc memcpy take so much time?

==6729== Profiling application: ./noisetuner
==6729== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   77.75%  483.27ms        27  17.899ms  15.182ms  22.150ms  osino_computefield
                   22.25%  138.26ms        26  5.3177ms  5.2778ms  5.3691ms  [CUDA memcpy DtoH]
      API calls:   59.50%  128.51ms         1  128.51ms  128.51ms  128.51ms  cuCtxCreate
                   19.05%  41.139ms         3  13.713ms  13.387ms  14.215ms  cudaMallocHost
                   10.12%  21.866ms         3  7.2886ms  2.3930us  21.860ms  cudaStreamCreate
                    9.84%  21.256ms        26  817.55us  16.473us  20.827ms  cudaStreamSynchronize
                    0.44%  939.68us        27  34.802us  24.998us  175.89us  cuLaunchKernel
                    0.30%  639.50us         3  213.17us  142.55us  283.38us  cuMemAlloc
                    0.29%  623.50us        26  23.980us  21.318us  35.516us  cudaMemcpyAsync
                    0.12%  255.21us         1  255.21us  255.21us  255.21us  cuModuleLoad
                    0.12%  253.92us         1  253.92us  253.92us  253.92us  cuDeviceTotalMem
                    0.11%  237.40us        97  2.4470us     283ns  97.925us  cuDeviceGetAttribute
                    0.09%  197.82us         1  197.82us  197.82us  197.82us  cudaGetDeviceProperties
                    0.02%  37.884us         1  37.884us  37.884us  37.884us  cuDeviceGetName
                    0.00%  3.5210us         1  3.5210us  3.5210us  3.5210us  cuInit
                    0.00%  3.1800us         1  3.1800us  3.1800us  3.1800us  cuDeviceGetPCIBusId
                    0.00%  2.2540us         3     751ns     361ns     960ns  cuDeviceGet
                    0.00%  2.0840us         3     694ns     346ns  1.2740us  cuDeviceGetCount
                    0.00%  1.6030us         2     801ns     678ns     925ns  cudaDeviceGetAttribute
                    0.00%  1.0020us         1  1.0020us  1.0020us  1.0020us  cudaGetDeviceCount
                    0.00%     731ns         1     731ns     731ns     731ns  cuModuleGetFunction
                    0.00%     617ns         1     617ns     617ns     617ns  cuDeviceGetUuid
                    0.00%     592ns         1     592ns     592ns     592ns  cuDriverGetVersion

Why don't you allocate the target buffer itself with cudaMallocHost() instead of going through a staging area? Also, you will need a cudaStreamSynchronize after the cudaMemcpyAsync; to take advantage of the asynchronous copy, do some useful host computation before that synchronization.

Also, the cudaStreamSynchronize between the kernel and the cudaMemcpyAsync may not be necessary if you issue both on the same stream: they will execute in order on the device (copy after kernel), but asynchronously from the host's point of view.
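
Something like this sketch, assuming your final CPU buffer can itself be allocated with cudaMallocHost() (names and the 64 MB size are placeholders):

#include <cuda_runtime.h>

#define FIELD_BYTES (64u * 1024u * 1024u)

static float        *d_field;   /* device buffer written by the kernel */
static float        *h_final;   /* final CPU buffer, allocated pinned  */
static cudaStream_t  stream;

static void setup(void)
{
    cudaStreamCreate(&stream);
    cudaMalloc((void **)&d_field, FIELD_BYTES);
    cudaMallocHost((void **)&h_final, FIELD_BYTES);   /* pinned destination, no staging */
}

static void frame(void)
{
    /* Kernel and copy go into the same stream, so the copy starts only
       after the kernel finishes; no explicit sync is needed between them. */
    /* launch osino_computefield into 'stream' here */
    cudaMemcpyAsync(h_final, d_field, FIELD_BYTES,
                    cudaMemcpyDeviceToHost, stream);

    /* Do useful host work here to overlap with the transfer ...            */

    /* ... then synchronize once, before reading h_final.                   */
    cudaStreamSynchronize(stream);
}

With the destination pinned, the DtoH copy lands directly in the buffer you want, and the 28 ms libc memcpy disappears entirely.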