Odd cudaMemcpyAsync() behavior with Kepler K20c and CUDA 5.0

Hello,

I am seeing some very odd behavior for streamed cudaMemcpyAsync() with a K20c. I am using CUDA 5.0 with drivers 304.54 and 310.19 (tried both, but got the same results). I have a large application that uses streams heavily, so I will have code like the following (error checking omitted):

SomeKernel<<<grid,block,smemSize,stream>>>(devOutput, ...);
cudaMemcpyAsync(hostOutput, devOutput, ..., stream);
cudaStreamSynchronize(stream);
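For anyone who wants to try to reproduce this, a self-contained sketch of the pattern follows. The kernel body, launch configuration, and buffer sizes here are placeholders (my real kernel and data are much larger); error checking is included, and hostOutput is pinned so the copy can actually run asynchronously:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Simple error-checking macro (omitted from the fragment above for brevity).
#define CHECK(call) do {                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err));                         \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Stand-in for the real kernel: writes a single int result.
__global__ void SomeKernel(int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        out[0] = 42;
}

int main()
{
    cudaStream_t stream;
    int *devOutput, *hostOutput;

    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc(&devOutput, sizeof(int)));
    // Pinned host memory, so cudaMemcpyAsync() is genuinely asynchronous.
    CHECK(cudaHostAlloc(&hostOutput, sizeof(int), cudaHostAllocDefault));

    // Same pattern as in the application: kernel, 4-byte DtoH copy, sync,
    // all on the same non-default stream.
    SomeKernel<<<1024, 64, 256, stream>>>(devOutput);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(hostOutput, devOutput, sizeof(int),
                          cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    printf("result: %d\n", hostOutput[0]);

    CHECK(cudaFreeHost(hostOutput));
    CHECK(cudaFree(devOutput));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}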

This works fine on Fermi-generation hardware, even with CUDA 5 and the drivers above. On the K20c, however, I get highly variable run-times; in particular, a large amount of time is sometimes spent inside the cudaMemcpyAsync() API call itself. In one use case, only 4 bytes are copied from device to host, which typically completes in well under 1 ms, yet on the K20c it sometimes takes 50-100 ms. Running with nvprof --normalized-time-unit ms --print-api-trace --print-gpu-trace, I get the following on different calls with identical input:

Normal (expected) case:

1.14e+03 0.07843 (1024 1 1) (64 1 1) 20 0B 256B - - 0 1 7 SomeKernel(int*, int const *, int, int, int)
1.14e+03 1.00e-03 - - - - - - - - - - cudaConfigureCall
1.14e+03 1.00e-03 - - - - - - - - - - cudaSetupArgument
1.14e+03 0.00000 - - - - - - - - - - cudaSetupArgument
1.14e+03 1.00e-03 - - - - - - - - - - cudaSetupArgument
1.14e+03 0.00000 - - - - - - - - - - cudaSetupArgument
1.14e+03 0.00000 - - - - - - - - - - cudaSetupArgument
1.14e+03 0.01600 - - - - - - - - - - cudaLaunch
1.14e+03 1.00e-03 - - - - - - - - - - cudaGetLastError
1.14e+03 0.27500 - - - - - - - - - - cudaMemcpyAsync
1.14e+03 0.01178 - - - - - 4B 339.67KB/s 0 1 7 [CUDA memcpy DtoH]
1.14e+03 0.03200 - - - - - - - - - - cudaStreamSynchronize

Bizarre case:
3.01e+03 0.08611 (1024 1 1) (64 1 1) 20 0B 256B - - 0 1 7 SomeKernel(int*, int const *, int, int, int)
3.01e+03 0.02100 - - - - - - - - - - cudaMemsetAsync
3.01e+03 0.00000 - - - - - - - - - - cudaGetLastError
3.01e+03 0.03000 - - - - - - - - - - cudaStreamSynchronize
3.01e+03 0.00000 - - - - - - - - - - cudaConfigureCall
3.01e+03 1.00e-03 - - - - - - - - - - cudaSetupArgument
3.01e+03 0.00000 - - - - - - - - - - cudaSetupArgument
3.01e+03 0.00000 - - - - - - - - - - cudaSetupArgument
3.01e+03 0.00000 - - - - - - - - - - cudaSetupArgument
3.01e+03 0.00000 - - - - - - - - - - cudaSetupArgument
3.01e+03 0.02200 - - - - - - - - - - cudaLaunch
3.01e+03 3.68e-03 - - - - - 4B 1.09MB/s 0 1 7 [CUDA memcpy DtoH]
3.01e+03 0.00000 - - - - - - - - - - cudaGetLastError
3.01e+03 68.20200 - - - - - - - - - - cudaMemcpyAsync

(The calls in the bizarre case appear out of order, but I expect this is just an nvprof quirk.) So a huge amount of time was spent in cudaMemcpyAsync(). Inserting a cudaStreamSynchronize() between SomeKernel() and cudaMemcpyAsync() seems to work around the issue, but as it stands, cases like this are causing major problems for my application.
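For completeness, the workaround I described looks like this (hypothetical fragment matching the pattern above); note that the extra synchronize serializes the stream, which defeats the point of the async copy:

```cuda
SomeKernel<<<grid, block, smemSize, stream>>>(devOutput, ...);
cudaStreamSynchronize(stream);   // extra sync -- should not be needed,
                                 // but avoids the long cudaMemcpyAsync stall
cudaMemcpyAsync(hostOutput, devOutput, size, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
```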

Does anyone have any idea what is happening here?

Regards,
Thomas