Force the launch without blocking CPU threads

Hello.

I have a system with 2 GPUs: one drives the display and the second is used for computation. Since I want to do computation on both the CPU and the GPU, I queue some kernel launches, then create my CPU threads, join the CPU threads, and finally synchronize with the GPU (which should already be done computing, since it is relatively faster in this case…):
-stack kernels
-create threads
-join threads
-synchro GPU
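The sequence above, as a minimal sketch (the kernel body, the CPU workload, and the launch configuration are placeholders, not my actual code):

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void gpuWork(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // placeholder GPU computation
}

void cpuWork(int id) { /* placeholder CPU computation */ }

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 1. stack kernels: launches are asynchronous w.r.t. the host
    for (int k = 0; k < 4; ++k)
        gpuWork<<<(n + 255) / 256, 256>>>(d_data, n);

    // 2. create CPU threads
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(cpuWork, t);

    // 3. join CPU threads
    for (auto &t : threads) t.join();

    // 4. synchronize with the GPU (expected to return quickly,
    //    since the GPU should already be finished by now)
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```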

After checking with Nsight, I see that what actually happens is: the CPU threads do their work, then the kernel launch only takes place after I join them and hit cudaDeviceSynchronize, which returns when the GPU is done…

An inefficient workaround is to call cudaDeviceSynchronize right after creating the CPU threads, but then I lose some CPU time to the device polling:
-stack kernels
-create threads
-synchro gpu
-join threads

Does anyone know a more efficient way to flush the command queue and force kernel execution without calling synchronize?

Thanks.

Are you on Windows? Greg@NV is a user on this board who has given suggestions on this in the past, for example here:

https://devtalk.nvidia.com/default/topic/550280/will-cudathreadsynchronize-truly-break-up-kernel-launches-to-avoid-wdm-timeout-/

about using

cudaEventQuery(0);

to force the command queue to be emptied (on Windows, with a WDDM GPU).
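Applied to the sequence from the original question, the trick would look something like the sketch below. Passing 0 (the NULL event) to cudaEventQuery returns immediately, and as a side effect it prompts the driver to submit the batched WDDM command buffer, so the GPU starts executing without the host spinning in a synchronize call. The kernel and thread bodies are placeholders:

```cuda
#include <cuda_runtime.h>
#include <thread>

__global__ void gpuWork(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;  // placeholder GPU computation
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // stack kernels: on WDDM these may sit in a batched command buffer
    gpuWork<<<(n + 255) / 256, 256>>>(d, n);

    // non-blocking flush: querying the NULL event nudges the driver to
    // submit the pending command buffer, without blocking the host
    cudaEventQuery(0);

    // CPU work proceeds while the GPU computes
    std::thread t([] { /* placeholder CPU computation */ });
    t.join();

    // the GPU should already be done; this should return quickly
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```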

Thanks

Yes, I’m on Windows, and I did come across the post you point to.
However, I thought it did not apply to my case: since nothing is attached to this card, the queue of WDDM commands for this GPU should be empty, so the possibility that my launch is buried deep under many WDDM commands (thereby delaying execution) should not be a factor… or so I thought.