My program creates two CPU threads using the same GPU.
Thread 1 launches CUDA kernels on the default CUDA context and the default stream.
Thread 2 creates a new CUDA context (with the driver API) and a new stream. It performs only H2D data transfers.
My goal is for the operations launched on the GPU by the two CPU threads to occur concurrently.
Does a cudaDeviceSynchronize() in thread 1 wait for the CUDA calls of thread 2 to finish?
Moreover, how many concurrent cudaMemcpy operations (H2D or D2H) can be performed between a given CPU and a given GPU?
Thank you for your help.
Generally speaking, for the GPU to switch from doing work for one context to doing work for another context, a context switch is required. This is an expensive operation; it does not happen in a nanosecond or a picosecond. It might take microseconds or even milliseconds.
A design involving 2 different contexts is not a smart choice if your desire is to overlap or run activities in both contexts concurrently.
cudaDeviceSynchronize() certainly waits until the device is idle, i.e. until all previously issued work in all streams of the calling thread's context has completed.
And that is a runtime API call, not a driver API call. I’d generally advise non-experts not to try to carefully interleave driver API and runtime API. It brings additional complexity, and in the general case I don’t know why it would be needed. Certainly there is no indication in your posting why 2 different contexts are needed.
Overall this looks like a bad design to me.
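For reference, here is a minimal sketch of the single-context design I would suggest instead: one runtime-API context shared by all host threads, two streams, pinned host memory, and cudaMemcpyAsync so the H2D copy can overlap the kernel. The kernel, buffer names, and sizes are illustrative only.

```cuda
#include <cuda_runtime.h>

// Trivial illustrative kernel.
__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *h_buf, *d_a, *d_b;

    // Pinned host memory is required for copies to overlap kernel execution.
    cudaHostAlloc(&h_buf, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernel in stream s1, H2D copy in stream s2: these can overlap
    // because they share one context but use different non-default streams.
    work<<<(N + 255) / 256, 256, 0, s1>>>(d_a, N);
    cudaMemcpyAsync(d_b, h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, s2);

    // Waits for all previously issued work on the device, in every stream.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_buf);
    return 0;
}
```

The two host threads in your design could each simply use their own stream in this shared context; no driver-API context management is needed, and no context switches are incurred.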