Bind multiple host threads to the same context?

My main question is in the title, and the details are below:

I have an application where we need to compute interaction lists.
For the CPU version, OpenMP threads assemble their own interaction lists in parallel (each list comes from walking a tree), and then each thread calls the interaction kernel on its own data. A rough sketch of that pattern is below.
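In rough pseudocode (buildInteractionList, evaluateList, tree, and numTargets are placeholder names, not our real identifiers), the CPU version looks like:

#pragma omp parallel for
for (int i = 0; i < numTargets; i++) {
    // each thread walks the tree and assembles its own list
    InteractionList list = buildInteractionList(tree, i);
    // ...then immediately evaluates it on the CPU with the interaction kernel
    evaluateList(list);
}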

For our GPU version, I want the OpenMP threads to again assemble the interaction lists in parallel, but instead of evaluating a list on the CPU, each thread would asynchronously copy it to the GPU, launch a kernel to evaluate it, and move on to assembling its next list.

Now, although I can assume I will run only one device per MPI task (meaning all of my threads will use the same device), it would be hard to designate one host thread to do all of the GPU tasks/calls: I want all of the OpenMP threads to continuously collect/execute/return interaction lists in parallel, so each of them needs to make kernel calls individually.

Now, I don't want to create a bunch of contexts (naively, one per thread), since I am using only one device and don't want the large penalty. So I thought the most natural way to run would be to create one context, where thread 0 initially allocates large arrays (say of length SIZE), and at run time each host thread copies/uses/returns its data on a designated portion of each array (of length SIZE/numCPUThreads).

So essentially I want to bind the thread 0 context to all of my host threads, so they can access the same device memory pointers (I am handling all of the collision logic myself by restricting each thread to its designated portion of those arrays). In fact, I am creating the same number of streams as I have threads, to give each thread its own stream in the context, in hopes of making all of the calls in parallel.

I thought the following code would work:

//Find which device I want and store the device ID in m_devID

cuInit(0);                            // driver API must be initialized first
CUdevice cuDevice;
CUcontext cuContext;
cuDeviceGet(&cuDevice, m_devID);      // get a device handle from the ordinal
cuCtxCreate(&cuContext, 0, cuDevice);

//Malloc all of my arrays with the primary thread
//Create numCPUThreads streams, one per host thread

#pragma omp parallel for
for (int i = 0; i < numLists; i++) {  // over all interaction lists I need to build
    cuCtxSetCurrent(cuContext);       // bind the shared context in this thread
    // host-copy this thread's data to its designated slice of the allocated
    // arrays (offset tid * SIZE/numCPUThreads) and run kernels, using this
    // thread's designated stream
}

What I found is that if I use device 0, everything works (even without specifying any contexts!). I think that's just a coincidence, since device 0 is the default for everything, so for some reason all my threads can see the context no problem. When I tell it to use another one of the devices on a node (my nodes have 4 GPUs), it crashes with ‘invalid resource handle’ errors on some of my kernels, and I assume it has to do with context issues.

I've tried many things: calling cudaSetDevice in the for loop, calling cuCtxPushCurrent, etc.

So my question is simple. Is there an easy way to just make ONE context on ONE device that is shared by ALL host threads? Do I need one context per thread? Is what I am attempting to do reasonable?

Yes, you can use a single context (per device) across multiple host threads. It’s reasonable. In fact, it is what the CUDA runtime API does, by default.
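For example, here is a minimal sketch of that pattern using only the runtime API (untested; evalKernel, runAllLists, and h_buf are placeholder names, while m_devID, SIZE, and numCPUThreads are taken from your post). Every thread that calls cudaSetDevice with the same device ordinal ends up in the same primary context, so a pointer returned by cudaMalloc on one thread is usable from all of them:

#include <cuda_runtime.h>
#include <omp.h>
#include <vector>

__global__ void evalKernel(float *data, int n)   // placeholder for the real evaluation kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                  // stand-in for the real work
}

void runAllLists(float *h_buf, int SIZE, int numCPUThreads, int m_devID)
{
    cudaSetDevice(m_devID);                      // main thread picks the device
    float *d_buf;
    cudaMalloc(&d_buf, SIZE * sizeof(float));    // one allocation, shared by all threads

    std::vector<cudaStream_t> streams(numCPUThreads);
    for (int t = 0; t < numCPUThreads; t++)
        cudaStreamCreate(&streams[t]);

    int chunk = SIZE / numCPUThreads;            // each thread's designated slice

    #pragma omp parallel num_threads(numCPUThreads)
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(m_devID);                  // same ordinal -> same primary context
        float *d_slice = d_buf + tid * chunk;
        cudaMemcpyAsync(d_slice, h_buf + tid * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[tid]);
        evalKernel<<<(chunk + 255) / 256, 256, 0, streams[tid]>>>(d_slice, chunk);
        cudaMemcpyAsync(h_buf + tid * chunk, d_slice, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[tid]);
        cudaStreamSynchronize(streams[tid]);     // wait only on this thread's stream
    }

    for (int t = 0; t < numCPUThreads; t++)
        cudaStreamDestroy(streams[t]);
    cudaFree(d_buf);
}

One caveat: for the cudaMemcpyAsync calls to actually overlap with kernel execution, h_buf needs to be pinned (allocated with cudaMallocHost); with pageable memory the copies fall back to being effectively synchronous.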

Thanks for the reply, txbob.

Then what is the correct syntax? If that is true by default, one would think that just calling cudaSetDevice (to the same device) in the pragma for loop would make it work for all of the threads, right? I assumed each thread made its own context by default, but then again, the fact that it works for device 0 does suggest that by default the threads share the same context on device 0.

Do I explicitly set the context with cuCtxSetCurrent, as I tried above? Is there another way to bind all the host threads to the same context explicitly?

Use the runtime API.
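That said, if you did want to stay with the driver API, the explicit equivalent of what the runtime does is to retain the device's primary context and make it current in every thread. A rough, untested sketch (m_devID is from your post; the primary-context calls require a reasonably recent CUDA version):

cuInit(0);
CUdevice dev;
CUcontext ctx;
cuDeviceGet(&dev, m_devID);
cuDevicePrimaryCtxRetain(&ctx, dev);   // the same context the runtime API uses

#pragma omp parallel
{
    cuCtxSetCurrent(ctx);              // bind the shared context in this thread
    // ...per-thread async copies and kernel launches on this thread's stream...
}

cuDevicePrimaryCtxRelease(dev);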

Problem solved.