In both experiment 1 and experiment 2, I didn’t use any asynchronous stream mechanism. Both 5 ms and 20 ms are the profiled times of the copy itself. They don’t include the time spent waiting before memcpy executes.
Thanks for the comment. The code is part of a larger project; I will extract it later.
In experiment 1, I just run the kernel and memcpy sequentially. My understanding is that memcpy can only start after the kernel is done and has written its result to CUDA memory (no asynchronous streams are used). memcpy then copies the whole result in one call.
In experiment 2, thread 2 also waits until the kernel function in thread 1 is done, so experiments 1 and 2 look the same in this respect: either way, memcpy executes only once the kernel function is done.
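For reference, here is a minimal sketch of the sequential pattern described for experiment 1, with the copy timed separately from any wait on the kernel. It is not the project code: the kernel `fillKernel`, the buffer size, the use of pageable host memory, and the choice of CUDA events for timing are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n",                       \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real kernel, which is not shown in the thread.
__global__ void fillKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main() {
    const int n = 1 << 24;                  // assumed element count
    const size_t bytes = n * sizeof(float);

    float *d_out = nullptr;
    CHECK(cudaMalloc(&d_out, bytes));
    // Pageable host buffer (assumption; the original allocation is unknown).
    float *h_out = static_cast<float *>(malloc(bytes));
    if (h_out == nullptr) return EXIT_FAILURE;

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Kernel and copy are issued one after the other on the default stream.
    fillKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    CHECK(cudaGetLastError());
    // Wait for the kernel here so the timed region below covers only the copy.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("device-to-host copy: %.3f ms for %zu bytes\n", ms, bytes);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_out));
    free(h_out);
    return 0;
}
```

Without the explicit `cudaDeviceSynchronize()`, the blocking `cudaMemcpy` would still wait for the kernel, but that wait would be folded into the measured interval, which is the distinction between copy time and wait time made above.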