I have a system with two GPUs: one drives the display and the other is used for computation. Since I want to compute on both the CPU and the GPU at the same time, I queue several kernel launches, then create my CPU threads, join them, and finally synchronize with the GPU (which should already be done by then, since it is relatively faster in this case).
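For reference, here is a minimal sketch of the ordering I'm describing. The kernel `gpuWork`, the CPU task `cpuWork`, the function `runOverlapSketch`, and all sizes and counts are placeholders standing in for the real code, and selecting the compute GPU is left out for brevity:

```cuda
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real GPU work.
__global__ void gpuWork(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Placeholder task standing in for the real CPU work.
void cpuWork(std::vector<double> &out, int id)
{
    double acc = 0.0;
    for (int k = 0; k < 1000000; ++k) acc += k * 1e-6;
    out[id] = acc;
}

// Runs the launch / spawn / join / synchronize sequence and
// returns the first element of the GPU buffer.
float runOverlapSketch()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // 1. Queue the kernel launches. These calls return immediately.
    for (int iter = 0; iter < 4; ++iter)
        gpuWork<<<(n + 255) / 256, 256>>>(d_data, n);

    // 2. Create the CPU threads, hoping they overlap with the GPU.
    const int numThreads = 4;
    std::vector<double> results(numThreads);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back(cpuWork, std::ref(results), t);

    // 3. Join the CPU threads.
    for (auto &w : workers) w.join();

    // 4. Only now wait for the GPU.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));

    float first = 0.0f;
    cudaMemcpy(&first, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return first;
}

int main()
{
    std::printf("first element: %f\n", runOverlapSketch());
    return 0;
}
```

The intent is that steps 1 and 2 run concurrently, so that step 4 returns almost immediately.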
After checking with Nsight, I see that what actually happens is this: the CPU threads do their work first, the kernel launches only take place after I join the threads and hit cudaDeviceSynchronize, and the call returns once the GPU is done.
An inefficient workaround is to call cudaDeviceSynchronize right after creating the CPU threads, but then I lose some CPU time to polling the device.
Does anyone know a more efficient way to flush the command queue and force the kernels to start executing without calling synchronize?