Asynchronous HtoD memory transfer: need it asynchronous for the CPU, but synchronous for the GPU

Hi,

Here is the situation:

I need this logic to be executed on the GPU asynchronously with respect to the CPU:

  1. Memory transfer from host to device
  2. Kernel launch using the result of 1.

The only memory transfer function I can use that doesn't block the CPU is memcpyHtoDAsync. But the header files also contain a comment that “if the hardware is available, may execute in parallel with the GPU”, which I take to mean that step 2 could start before step 1 has completed. How can I synchronize step 1 with step 2 without stalling the CPU?

Thanks

Go back to the programming guide and re-read the section on streams. Two async calls in the same stream are executed sequentially on the GPU.
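
For illustration, a minimal sketch of that same-stream ordering. It assumes the CUDA runtime API (`cudaMemcpyAsync`) rather than the `memcpyHtoDAsync` call from the question; the kernel `scale`, the buffer names, and the sizes are made up for the example, and error checking is left out for brevity:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_data = nullptr, *d_data = nullptr;
    // Pinned host memory; with pageable memory the copy may not be truly asynchronous.
    cudaMallocHost((void **)&h_data, bytes);
    cudaMalloc((void **)&d_data, bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Step 1: queue the host-to-device copy in the stream.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    // Step 2: queue the kernel in the SAME stream, so the GPU runs it after the copy.
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    // Queue the copy back, again ordered after the kernel.
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);

    // The CPU can do unrelated work here while the stream drains.

    // Block only at the point where the result is actually needed.
    cudaStreamSynchronize(stream);
    printf("h_data[0] = %f\n", h_data[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

The ordering comes from the stream itself, so no explicit synchronization is needed between the copy and the kernel.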

OK, so if I want to preload data in the background and get an overlapped copy, I have to use a separate stream for that, correct?

Edit: The same behaviour should then apply to concurrent kernel execution. Does that mean I can benefit from manually putting independent kernels into as many streams as I can, to get the most out of parallel kernel execution? That would make a lot of sense for small kernels that cannot fully occupy the GPU, but in practice I expect it to be very inconvenient to code (I could be wrong, though).

Yes

Correct again on all points. I certainly find it difficult to take advantage of concurrent kernels in my applications. Look at it this way: your kernels need to be extremely small to get much of a benefit from this anyway.
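
To make the separate-stream idea concrete, here is a rough sketch of background preloading with two streams and double buffering. It is only one possible arrangement, again assuming the runtime API; the kernel `addOne`, the stream and event names, and the chunk sizes are hypothetical, and error checking is omitted. Events express the dependency between the streams on the GPU, so the CPU never waits inside the loop:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element of a chunk.
__global__ void addOne(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

int main() {
    const int chunk = 1 << 18, numChunks = 8, n = chunk * numChunks;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h_in, *h_out, *d_in[2], *d_out;
    cudaMallocHost((void **)&h_in, n * sizeof(float));   // pinned, so copies can overlap
    cudaMallocHost((void **)&h_out, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    cudaEvent_t uploaded[2], consumed[2];
    for (int k = 0; k < 2; ++k) {
        cudaMalloc((void **)&d_in[k], chunkBytes);
        cudaEventCreateWithFlags(&uploaded[k], cudaEventDisableTiming);
        cudaEventCreateWithFlags(&consumed[k], cudaEventDisableTiming);
    }

    for (int c = 0; c < numChunks; ++c) {
        const int k = c % 2;  // alternate between the two device input buffers
        // Do not overwrite d_in[k] until the kernel that last read it has finished.
        if (c >= 2) cudaStreamWaitEvent(copyStream, consumed[k], 0);
        cudaMemcpyAsync(d_in[k], h_in + (size_t)c * chunk, chunkBytes,
                        cudaMemcpyHostToDevice, copyStream);
        cudaEventRecord(uploaded[k], copyStream);

        // The kernel waits (on the GPU) only for its own chunk's upload.
        cudaStreamWaitEvent(computeStream, uploaded[k], 0);
        addOne<<<(chunk + 255) / 256, 256, 0, computeStream>>>(
            d_in[k], d_out + (size_t)c * chunk, chunk);
        cudaEventRecord(consumed[k], computeStream);
    }

    // Copy everything back in the compute stream, then wait once at the end.
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float),
                    cudaMemcpyDeviceToHost, computeStream);
    cudaStreamSynchronize(computeStream);

    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i] + 1.0f) ++errors;
    printf("mismatches: %d\n", errors);

    for (int k = 0; k < 2; ++k) {
        cudaEventDestroy(uploaded[k]);
        cudaEventDestroy(consumed[k]);
        cudaFree(d_in[k]);
    }
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return errors != 0;
}
```

Whether the upload of one chunk really overlaps with the kernel on the previous one depends on the hardware (the "if the hardware is available" caveat from the header comment). Independent kernels can be spread over several streams in the same way, with the bookkeeping growing accordingly.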
