Also, it is a bit dangerous to do so, because the OpenMP runtime implementation is unlikely to guarantee that the same pthreads are reused across different parallel sections. This would only work if everything (memory allocation, context initialization, etc.) were done in the loop body, which is certainly not what you want in general.
CUDA makes some really “weird” assumptions in a multithreaded context, so you should take real care if you are going to combine OpenMP with CUDA (or CUBLAS).
I would second that advice. I really wouldn’t recommend using OpenMP for multi-GPU. There is an awful lot of implementation-specific/unstandardized stuff that has to go on between your #pragma directives and the context and data hitting the right GPU and working as designed. A threading model where you have a lot more control, like pthreads (or boost threads or similar), is probably a much safer and more reliable bet.
As I suggested in your other thread on this, use persistent threads based on the producer-consumer model. Condition variables and mutexes can be used to keep threads alive (and holding contexts for the lifespan of the application). The pthreads model includes both condition broadcast and unicast mechanisms which can be used to signal idle threads to wake up, and barriers to synchronize them. Work can be passed to the worker threads via function pointers and the condition variable itself.
Any reasonable reference on pthreads should have everything you need to get started. I have a tattered old copy of a Sun pthreads programmers’ handbook which has served me well; Sun should have a PDF version of it for download in their online reference library. There was also an O’Reilly book that a lot of people like, although I haven’t read it.