I “discovered” some unexpected behavior while using multiple GPUs with OpenMP for thread control.
Background: my machine is a quad-core Phenom running Fedora 10 and CUDA 2.2, with two GTX 275 GPUs and the 185.18.08 kernel module. My application uses OpenMP for shared-memory parallelization and frequently splits from one thread to two or four and back again. The same thread creation and destruction happens when execution enters a routine that uses CUDA and the GPUs (because I want to use both GPUs, and each context needs a separate thread).
Everything that I have read indicates that when a thread exits, the CUDA context (the GPU “process,” in a way) that it used is destroyed. As users of CUDA know, creating a new context is a relatively slow operation (0.1-0.4 sec), especially when multiple contexts are being created at once (CUDA seems to create contexts serially). I expected that, with two GPUs, I would need to create at least one new context every time I asked OpenMP to split from 1 to 2 threads and start my CUDA calculations on each thread. Not so. I “pay” the time price for each context the first time I split into two threads and start up CUDA jobs. But even though the extra thread is supposedly destroyed when that subroutine completes, and execution continues serially elsewhere in the program, the next time execution enters that subroutine and OpenMP splits into two threads, my contexts are still present and my CUDA kernels execute without any delay.
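For reference, here is a minimal sketch of the pattern I am describing; this is not my actual application, and the kernel and routine names (scale_kernel, work_on_gpus) are made up for illustration. Timing the serial loop in main is the easiest way to see that only the first call pays the context-creation price.

```
// Build with something like: nvcc -Xcompiler -fopenmp multi_gpu_omp.cu
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

__global__ void scale_kernel(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Routine the serial code enters repeatedly: one OpenMP thread per GPU.
// If the OpenMP runtime reuses its worker threads, only the first call
// should pay the context-creation cost.
void work_on_gpus(int num_gpus, int n)
{
    #pragma omp parallel num_threads(num_gpus)
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);            // same device for the same worker thread each time

        float *d_x = 0;
        cudaMalloc((void **)&d_x, n * sizeof(float));
        cudaMemset(d_x, 0, n * sizeof(float));
        scale_kernel<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
        cudaThreadSynchronize();       // CUDA 2.x-era synchronization call
        cudaFree(d_x);
    }
    // Back to one thread here; the contexts apparently survive until the next call.
}

int main()
{
    const int n = 1 << 20;
    for (int iter = 0; iter < 5; ++iter) {
        double t0 = omp_get_wtime();
        work_on_gpus(2, n);
        printf("iteration %d took %.3f s\n", iter, omp_get_wtime() - t0);
    }
    return 0;
}
```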
I am presenting this information here to help multi-GPU programmers understand this aspect of CUDA, because the documentation led me to expect something else. The simpleMultiGPU demo in the CUDA SDK is appropriate for some applications, but I’d like to confirm that OpenMP also works.
The only explanation I can imagine for why this works is that OpenMP (by the way, is this gcc’s implementation?) doesn’t terminate threads after a parallel block, but only idles them until the next parallel block. That would be a good idea in general, since the overhead of starting and stopping threads, while low in Linux, is not zero.
I’m almost sure it just creates a static thread pool, because doing it any other way wouldn’t make sense for OpenMP’s model (mark the sections you care about as parallel, don’t worry about the overhead, and so on).
Indeed, the OpenMP runtime is clever enough not to recreate threads. One possible implementation is to put the parallel-region work on a FIFO queue: whenever a thread is available, it asks the queue for more work.
In the sequential parts of the program (i.e. outside the parallel regions), the threads are still alive, just not used.
While the semantic model is that of repeated fork/join, the runtime implementation is quite different: the threads are created once and then reused.
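If anyone wants to check this on their own system, here is a quick sketch (assuming gcc/libgomp and compiling with something like gcc -fopenmp check_pool.c): print the OS thread identity inside two successive parallel regions. With a pooled implementation, the same pthread_self() values show up both times, which is consistent with the CUDA contexts surviving between regions.

```
#include <stdio.h>
#include <pthread.h>
#include <omp.h>

int main(void)
{
    for (int region = 0; region < 2; ++region) {
        #pragma omp parallel num_threads(2)
        {
            printf("region %d: OpenMP thread %d runs on pthread %lu\n",
                   region, omp_get_thread_num(),
                   (unsigned long)pthread_self());
        }
        /* Serial section: with a pooled runtime the extra worker is parked
           here rather than destroyed, so any per-thread state (such as a
           CUDA context) is still there when the next region starts. */
    }
    return 0;
}
```

Keep in mind that thread reuse is an implementation detail rather than something the OpenMP specification guarantees, even though gcc’s libgomp and most other runtimes behave this way.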