Multiple GPUs and streams

I’m trying to use streams in an application that uses multiple GPUs. The code launches a selected number of threads, one per GPU. Each thread calls cudaSetDevice() to select its device, followed by cudaStreamCreate() and then cudaEventCreate(). If I launch only one thread, everything works. If I launch multiple threads, however, one thread works but the others fail with an “unknown” error in the cudaStreamCreate() call. I tried to work around this by eliminating the cudaStreamCreate() call and just using stream 0, but I see the same behavior with cudaEventCreate().

If I simply run everything in the default stream 0 and don’t use events, then all works well using multiple threads. Is it possible to use cudaStreamCreate() and cudaEventCreate() with multiple GPUs?

Never mind. I changed the threading to use pthreads, and all works well now.

How were you doing threading before, out of curiosity?

This is a conversion of an older app that used clone() to launch kernel threads. It worked fine until I tried to create streams or events. I switched to pthreads, and the problems disappeared.
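For reference, a minimal sketch of the per-GPU pthreads pattern that works for me (the two-GPU count and worker name are illustrative, and error handling is abbreviated):

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

#define NUM_GPUS 2  /* illustrative; query with cudaGetDeviceCount() in real code */

/* Each host thread binds one GPU, then creates its own stream and event there. */
static void *gpu_worker(void *arg)
{
    int dev = (int)(long)arg;

    if (cudaSetDevice(dev) != cudaSuccess)
        return (void *)1;

    cudaStream_t stream;
    cudaEvent_t  done;
    if (cudaStreamCreate(&stream) != cudaSuccess)
        return (void *)1;
    if (cudaEventCreate(&done) != cudaSuccess)
        return (void *)1;

    /* ... launch kernels into `stream`, record `done`, synchronize, etc. ... */

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_GPUS];
    for (long d = 0; d < NUM_GPUS; ++d)
        pthread_create(&threads[d], NULL, gpu_worker, (void *)d);
    for (int d = 0; d < NUM_GPUS; ++d)
        pthread_join(threads[d], NULL);
    return 0;
}
```

The key point is that the stream and event are created in the same thread that called cudaSetDevice() for that device, so each thread's CUDA context owns its own stream and event.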

Lately, I’ve started to use OpenMP for creating simple threaded apps. I have one CUDA app that uses OpenMP to access multiple GPUs, but that app doesn’t use streams or events either, so I don’t know whether it would break.
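If anyone wants to try it, an OpenMP version of the same per-GPU pattern might look like this (a sketch only; I haven't verified stream/event creation under OpenMP, and error handling is omitted):

```cuda
#include <cuda_runtime.h>
#include <omp.h>

int main(void)
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    /* One OpenMP thread per GPU; each binds its device and creates
       its own stream and event in that device's context. */
    #pragma omp parallel num_threads(ngpus)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        cudaStream_t stream;
        cudaEvent_t  done;
        cudaStreamCreate(&stream);
        cudaEventCreate(&done);

        /* ... per-GPU work issued into `stream` ... */

        cudaEventDestroy(done);
        cudaStreamDestroy(stream);
    }
    return 0;
}
```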