cudaDeviceSynchronize() doesn't wait for kernels launched by other CPU threads, why?

My program is written as follows:

worker thread1: launch some kernels, then go to sleep
worker thread2: launch some kernels, then go to sleep
main thread:  cudaDeviceSynchronize()   //this is expected to wait until the kernels finish, but doesn't work

I assume CPU threads within the same process share the same CUDA context. If so, why does the above code not work, and how do I make it work?

@albertazzf A minimal reproducible example or an NVVP/Nsight Systems report with the timeline would be really helpful.

What form of synchronization or notification are you using between the worker threads and the main thread to tell the main thread that the work has been submitted?

None …
I don’t have a minimal example at hand right now, but this is likely the culprit.
The main thread is calling cudaDeviceSynchronize() too soon, while the GPU is still idle because no kernels have been launched yet.

By the way, is it true that the CPU threads share the same CUDA context? If not, I’m concerned about the context-switching overhead.

Some form of synchronization/communication is required between the worker threads and the main thread. cudaDeviceSynchronize() only waits for work that has already been submitted to the device when it is called, and the main thread is very likely calling it before the worker threads have submitted their work.

In the CUDA Runtime, a thread submits work to the primary context of the current device unless otherwise directed, and that primary context is shared by all threads in the process. See CUDA Runtime API | Context Management.

I have a similar implementation which uses pthreads to launch kernels on multiple GPUs.

I call pthread_join on every worker thread to wait for all of them to complete before calling cudaStreamSynchronize on every device’s stream.
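
For anyone landing here later, the join-then-synchronize ordering described above could be sketched roughly as follows. This is only a host-side illustration, not the poster's actual code: it uses std::thread instead of pthreads, the kernel launches are replaced by a hypothetical `launch_kernels_stub` that just bumps a counter, and the real CUDA calls appear only as comments so the sketch builds with a plain C++ compiler. The name `submit_then_sync` is also made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Stand-in for the device work queue: counts launches that were submitted.
// In real code each increment would be a kernel launch, e.g.
//   my_kernel<<<grid, block, 0, stream>>>(...);
static std::atomic<int> g_submitted{0};

// Hypothetical stub for "launch some kernels" in a worker thread.
void launch_kernels_stub(int launches) {
    for (int i = 0; i < launches; ++i) {
        g_submitted.fetch_add(1);
    }
}

// Starts the workers, joins them, and returns how many launches had been
// submitted at the point where the main thread would synchronize.
int submit_then_sync(int num_workers, int launches_per_worker) {
    assert(num_workers >= 0 && launches_per_worker >= 0);
    g_submitted.store(0);

    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; ++i) {
        workers.emplace_back(launch_kernels_stub, launches_per_worker);
    }

    // join() returns only after each worker has finished submitting.
    for (auto& t : workers) {
        t.join();
    }

    // Real code would synchronize here, once all work is known to be queued:
    //   cudaDeviceSynchronize();            // single device
    //   cudaStreamSynchronize(stream[d]);   // or once per device stream
    return g_submitted.load();
}
```

Calling `submit_then_sync(2, 3)` returns 6, meaning every launch was already queued when the main thread reached the synchronize point. The catch is that this only works when the worker threads actually exit after submitting.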

Thanks! That’s a good idea. In my case the worker threads are persistent: after launching the kernels they go to sleep. So I plan to have the main thread wait for a notification from each worker thread instead.
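
A minimal sketch of that notification approach, assuming a mutex, a condition variable, and a count of workers that have not yet finished submitting. All names here (`SubmitGate`, `persistent_worker`, `wait_for_submissions_then_sync`) are hypothetical, and the kernel launches and cudaDeviceSynchronize() call are shown only as comments so the sketch builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Shared state between the main thread and the persistent workers.
struct SubmitGate {
    std::mutex m;
    std::condition_variable cv;
    int pending = 0;        // workers that have not finished submitting yet
    int submitted = 0;      // stand-in for the number of kernel launches issued
    bool shutdown = false;  // tells the sleeping workers to exit
};

void persistent_worker(SubmitGate& gate, int launches) {
    // Real code: the kernel launches go here, e.g.
    //   my_kernel<<<grid, block>>>(...);
    {
        std::lock_guard<std::mutex> lock(gate.m);
        gate.submitted += launches;
        --gate.pending;  // this worker's work is now queued
    }
    gate.cv.notify_all();  // notify the main thread

    // The worker "goes to sleep" and stays alive until told to exit.
    std::unique_lock<std::mutex> lock(gate.m);
    gate.cv.wait(lock, [&] { return gate.shutdown; });
}

// Returns how many launches were submitted when the main thread was released.
int wait_for_submissions_then_sync(int num_workers, int launches_per_worker) {
    assert(num_workers >= 0 && launches_per_worker >= 0);
    SubmitGate gate;
    gate.pending = num_workers;

    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; ++i) {
        workers.emplace_back(persistent_worker, std::ref(gate), launches_per_worker);
    }

    int seen = 0;
    {
        // Block until every worker has reported that its work is submitted.
        std::unique_lock<std::mutex> lock(gate.m);
        gate.cv.wait(lock, [&] { return gate.pending == 0; });
        seen = gate.submitted;
    }

    // Real code: now it is safe to call cudaDeviceSynchronize();

    // Wake the sleeping workers so they can exit, then clean up.
    {
        std::lock_guard<std::mutex> lock(gate.m);
        gate.shutdown = true;
    }
    gate.cv.notify_all();
    for (auto& t : workers) {
        t.join();
    }
    return seen;
}
```

With two workers submitting three launches each, `wait_for_submissions_then_sync(2, 3)` returns 6. The workers never have to exit for the main thread to know the work is queued, which is the difference from the pthread_join approach.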