Influence of multi-threading on cudaMemcpy?

Platform: TX2
I have two threads.

In experiment 1:
Thread 1: run the kernel to generate the CUDA result, then use cudaMemcpy to copy the result to the CPU. The copy takes 5 ms.
Thread 2: does nothing.

In experiment 2:
Thread 1: run the kernel to generate the CUDA result.
Thread 2: use cudaMemcpy to copy the result to the CPU. The copy takes 20 ms.

Why does the copy take so much longer when I move it to thread 2? How can I reduce the copy time in thread 2? Thank you.

So you have an asynchronous call on T1 and a synchronous one on T2. T2 will have to wait for T1 to schedule the kernel, for the kernel to finish, and for the memcpy to finish before returning from the call.

If you measure these things with the Visual Profiler, what does the timeline look like?

Perhaps you could also mark the range so that it shows up in the profiler?
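One way to mark ranges is NVTX. The following is only a sketch, not code from the project under discussion: `fillKernel` and `runAndCopy` are hypothetical names, and it assumes the `nvToolsExt.h` header from the CUDA toolkit is available (link with `-lnvToolsExt`). The kernel and the copy each get a named range, which should then appear as labelled spans on the profiler timeline.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nvToolsExt.h>  // NVTX; assumed to be installed with the toolkit

// Hypothetical stand-in for the real kernel: out[i] = 2 * i.
__global__ void fillKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Runs the kernel, then copies the result back, each inside a named NVTX range.
bool runAndCopy(std::vector<float> &host) {
    const int n = static_cast<int>(host.size());
    const size_t bytes = n * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    nvtxRangePushA("kernel");
    fillKernel<<<(n + 255) / 256, 256>>>(dev, n);
    cudaDeviceSynchronize();  // keep the kernel's run time out of the copy range
    nvtxRangePop();

    nvtxRangePushA("copy D2H");
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    nvtxRangePop();

    cudaFree(dev);
    return err == cudaSuccess;
}

int main() {
    std::vector<float> host(1 << 20);
    if (!runAndCopy(host)) {
        printf("CUDA error\n");
        return 1;
    }
    printf("host[3] = %f\n", host[3]);
    return 0;
}
```

Comparing the "copy D2H" range between the two experiments would show whether the extra time is spent inside the copy itself or in waiting before it.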

Hi Jimmy,

In both experiment 1 and experiment 2, I didn’t use the asynchronous stream mechanism. Both 5 ms and 20 ms are profiling results for the pure copy time. They don’t include the wait time before the memcpy executes.

Hi Jimmy,

Even in experiment 1, I think the memcpy has to wait for the CUDA computation to finish before it takes effect. Let’s assume there is only one loop iteration.

What steps did you take to ensure the call to the kernel was synchronous?
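For reference, a kernel launch returns to the host before the kernel has finished, so a blocking cudaMemcpy issued right after it also absorbs the remaining kernel time. A minimal sketch of separating the two, assuming a hypothetical `fillKernel` and helper `timedCopyMs` (neither is from the original project): synchronize explicitly after the launch, then bracket only the copy with CUDA events.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: out[i] = 2 * i.
__global__ void fillKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Returns the device-to-host copy time in ms, or a negative value on error.
float timedCopyMs(std::vector<float> &host) {
    const int n = static_cast<int>(host.size());
    const size_t bytes = n * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    fillKernel<<<(n + 255) / 256, 256>>>(dev, n);  // returns to the host immediately
    cudaDeviceSynchronize();                       // block until the kernel has finished

    // Only the copy sits between the two events.
    cudaEventRecord(start);
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return ms;
}

int main() {
    std::vector<float> host(1 << 20);
    float ms = timedCopyMs(host);
    if (ms < 0.0f) {
        printf("CUDA error\n");
        return 1;
    }
    printf("copy took %f ms\n", ms);
    return 0;
}
```

Without the `cudaDeviceSynchronize()` call, a host-side timer around the cudaMemcpy would likely include whatever kernel time was still outstanding when the copy was issued.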

Your case seems pretty simple. Maybe you can post a repro so it's easy for others to test.

Thanks for the comment. It’s part of a larger project; I will extract it later.

In experiment 1, I just run the kernel and the memcpy sequentially. I thought the memcpy can only start after the kernel is done and has generated the result in CUDA memory (no asynchronous streaming is used). The memcpy copies the whole result in one call.

In experiment 2, thread 2 also waits until the kernel function in thread 1 is done, so experiments 1 and 2 look the same to me. Either way, the memcpy can execute only once the kernel function is done.
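A standalone repro of experiment 2 might look roughly like the sketch below. This is an assumption about the setup, not the actual project code: `fillKernel` and `copyFromSecondThreadMs` are hypothetical names, thread 1 signals completion through a `std::promise` after an explicit `cudaDeviceSynchronize()`, and thread 2 times only the cudaMemcpy with a host clock.

```cuda
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: out[i] = 2 * i.
__global__ void fillKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Thread 1 runs the kernel; thread 2 waits for it, then times the copy.
// Returns the host wall-clock copy time in ms, or a negative value on error.
double copyFromSecondThreadMs(std::vector<float> &host) {
    const int n = static_cast<int>(host.size());
    const size_t bytes = n * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0;

    std::promise<void> kernelDone;
    std::future<void> ready = kernelDone.get_future();
    double ms = -1.0;

    std::thread producer([&] {
        fillKernel<<<(n + 255) / 256, 256>>>(dev, n);
        cudaDeviceSynchronize();  // make sure the kernel has finished before signalling
        kernelDone.set_value();
    });

    std::thread consumer([&] {
        ready.wait();  // the wait is excluded from the measurement
        auto begin = std::chrono::steady_clock::now();
        cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
        auto end = std::chrono::steady_clock::now();
        if (err == cudaSuccess)
            ms = std::chrono::duration<double, std::milli>(end - begin).count();
    });

    producer.join();
    consumer.join();
    cudaFree(dev);
    return ms;
}

int main() {
    std::vector<float> host(1 << 20);
    double ms = copyFromSecondThreadMs(host);
    if (ms < 0.0) {
        printf("CUDA error\n");
        return 1;
    }
    printf("copy from second thread took %f ms\n", ms);
    return 0;
}
```

If the real code signals thread 2 without synchronizing first, the measured copy time in thread 2 could include kernel time, which is one thing a repro like this would help rule in or out.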


Could you profile your application (both experiments) with nvprof first?