I “discovered” an unexpected behavior when using multiple GPUs with OpenMP for thread control.
Background: my machine is a quad-core Phenom running Fedora 10 and CUDA 2.2, with two GTX 275 GPUs and the 185.18.08 kernel module. My application uses OpenMP for shared-memory parallelization and frequently splits from 1 thread to 2 or 4, then back again. This thread creation and destruction also happens when execution enters a routine that uses CUDA and the GPUs (because I want to use both GPUs, and each CUDA context needs its own host thread).
Everything I have read indicates that when a thread exits, the CUDA context (the GPU “process,” in a way) that it used is destroyed. As CUDA users know, creating a new context is a relatively slow operation (0.1–0.4 sec), especially when multiple contexts are being created at once (CUDA seems to create contexts serially). I therefore expected that, with two GPUs, I would need to create at least one new context every time OpenMP split from 1 to 2 threads and I started CUDA work on each thread. Not so. I “pay” the time price for each context the first time I split into two threads and launch CUDA jobs. But even though the extra thread appears to be destroyed when that subroutine completes, and execution continues serially elsewhere in the program, the next time execution enters that subroutine and OpenMP splits into two threads, my contexts are still present and my CUDA kernels execute without any delay. (A likely explanation: most OpenMP runtimes keep their worker threads alive in a pool between parallel regions rather than destroying them, so the host threads, and the contexts bound to them, quietly survive.)
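For anyone who wants to reproduce this, here is a minimal sketch of the pattern described above. The kernel, array size, and timing are illustrative only, and it assumes a CUDA-2.2-era runtime (hence `cudaThreadSynchronize()` rather than the later `cudaDeviceSynchronize()`) plus a machine with at least two GPUs:

```
#include <stdio.h>
#include <omp.h>
#include <cuda_runtime.h>

// Trivial illustrative kernel.
__global__ void dummyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

static void runOnBothGpus(void)
{
    // Split into one host thread per GPU; each thread owns its own context.
    #pragma omp parallel num_threads(2)
    {
        int    tid = omp_get_thread_num();
        double t0  = omp_get_wtime();
        int    n   = 1 << 20;
        float *d;

        cudaSetDevice(tid);                 // bind this thread to GPU 'tid'
        cudaMalloc((void **)&d, n * sizeof(float)); // first CUDA call creates the context
        dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaThreadSynchronize();            // CUDA 2.2-era sync call
        cudaFree(d);

        printf("thread %d: %.3f s\n", tid, omp_get_wtime() - t0);
    }
    // Serial execution resumes here; the extra OpenMP thread appears "gone."
}

int main(void)
{
    runOnBothGpus();   // slow: contexts are created on first entry
    runOnBothGpus();   // fast: the contexts are evidently still alive
    return 0;
}
```

On my setup, the timings printed inside the second call are dramatically smaller than in the first, which is exactly the behavior described above.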
I am presenting this here to help multi-GPU programmers understand this aspect of CUDA, because the documentation misled me into expecting otherwise. While the simpleMultiGPU example in the CUDA SDK is appropriate for some applications, I can confirm that OpenMP works for multi-GPU thread control as well.