CUDA, Context and Threading


I’m developing an application with two threads using the driver API.

Thread 1 preprocesses the data using CUDA.
Thread 2 uses the results from thread 1.

The problem is that I have to switch contexts all the time.
If I allocate memory, I do not know which context I am currently in,
so the memory ends up in whichever context happens to be active.
Is there a way to tell the cuMemAlloc… functions in which context
the memory should be allocated?

My first thought was to bind the context myself, but then I realized
that the other thread could steal the context. Is there a better way, or do I have to build some locking mechanism?


In CUDA 4.0 and newer, there is only one context per process. Or is that for the runtime API only?

No, I’m currently working with the driver API. Do you really mean per process, not per thread?
If, as you said, there is only one context per process, how are multiple devices handled? Are they all in that one context?
Some more details would help me very much.

Can multiple devices share a single context?

The question “Or is that for the runtime API only?” was a true question for someone on this forum who knows. I don’t know the answer (as I have never used the driver API).

What I was referring to in my post was that in CUDA earlier than 4.0, there was one context per thread, but in CUDA 4.0 and newer there is only one context per process. That one context can access all devices in the system.
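A minimal runtime-API sketch of that idea (one process touching every device; the allocation size and names here are just illustrative):

```cuda
#include <cuda_runtime.h>

int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);       // every device is visible to this one process
for (int dev = 0; dev < deviceCount; ++dev) {
    cudaSetDevice(dev);                 // make device `dev` current for this thread
    void *p = 0;
    cudaMalloc(&p, 1 << 20);            // this allocation belongs to device `dev`
    cudaFree(p);
}
```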
This is mentioned in section 3.2.6 of the programming guide (4.2), and the behavior is described in detail in the reference manual entry for cudaSetDevice().

I guess all I’m telling you is that it is possible to work around your problem by using one context for the whole process, because that is what the runtime API does. Someone who knows the driver API will have to comment on how exactly you can accomplish this.

Your other option would be to use the IPC calls, but those are really designed for sharing device memory with other processes entirely.
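For completeness, the IPC route (CUDA 4.1 runtime API) looks roughly like this; it is a sketch with error handling omitted, and the two halves run in different processes:

```cuda
#include <cuda_runtime.h>

// process A: export a device allocation
cudaIpcMemHandle_t handle;
float *d_buf;
cudaMalloc((void **)&d_buf, 1 << 20);
cudaIpcGetMemHandle(&handle, d_buf);      // ship `handle` to process B (pipe, shared memory, ...)

// process B: import and use it
float *d_view;
cudaIpcOpenMemHandle((void **)&d_view, handle, cudaIpcMemLazyEnablePeerAccess);
// ... read/write d_view ...
cudaIpcCloseMemHandle(d_view);
```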

OK, I see your point.

Since I have to create a context using cuCtxCreate, I also have to pass the device for the context as an argument.
So it seems the context is attached to a device, while also being bound to the process.

Maybe the runtime API has its own switching mechanism. Maybe someone from NVIDIA could help with this?


(long, semi-related explanation of the runtime API and contexts follows; skip to the end if you just want the answer to your question)

Since I designed this, I guess it falls to me to explain it.

Prior to CUDA 4.0, context management was simple: every thread had a TLS slot that identified which context was currently bound to that thread, and every context could only be bound to one thread at a time. Additionally, every context was only bound to a single device for the entire lifetime of the context. (I’m ignoring the context stack stuff; it doesn’t really matter)

In CUDA 4.0, we enabled multithreaded access to contexts so a single context could belong to more than one thread. So, as of 4.0:

  • a context belongs to a single device

  • a thread has a single context bound at a time (again, ignoring context stack stuff)

  • a context can be bound to multiple threads simultaneously
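Put concretely, those rules mean a single driver-API context can simply be bound in each thread that wants to use it (a sketch; error checking omitted, names illustrative):

```cuda
#include <cuda.h>

CUcontext ctx;

// once, at startup:
cuInit(0);
cuCtxCreate(&ctx, 0, 0);       // create one context on device 0

// in every worker thread that wants to use it:
cuCtxSetCurrent(ctx);          // bind the shared context to this thread's TLS slot
CUdeviceptr p;
cuMemAlloc(&p, 1 << 20);       // the allocation belongs to ctx, no matter which thread calls
```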

The driver API works exactly how you’d expect given these definitions, but the runtime API is more complicated. In particular, I felt it was very important that the following piece of code work exactly as you’d expect (a sketch; the kernel and launch configuration are placeholders):

```cuda
float *ptr;
cudaSetDevice(0);                        // device 0's context becomes current in this thread
cudaMalloc(&ptr, 1024 * sizeof(float));  // allocated in device 0's context
kernel<<<grid, threads>>>(ptr);          // launched into device 0's context
cudaSetDevice(1);                        // switch: device 1's context is now current
// ... do some work on device 1 ...
cudaSetDevice(0);                        // switch back to device 0's context
cudaDeviceSynchronize(); // wait for kernel to finish; in other words, go back to the same context as initially
```

Additionally, cudaSetDevice(0) in one thread needs to access the same context as cudaSetDevice(0) in another thread.

What the runtime API actually does is use a hidden API to create what’s called a primary context. Primary contexts are the same as any other contexts, except that there can be only one for a device at a time. (We’ve never exposed it because the API is ugly and we don’t like it, but we also don’t have a good way to fix it. It’s one of those places where we look at the API and think “damn, we really should have reference counted that thing instead of just having create/destroy.”)

The runtime API creates a context when there’s no context in the thread’s TLS context slot. So, if you do something like this, no primary context is created:

```cuda
cuCtxCreate(&ctx, 0, 0); // create a context and place it in the thread's TLS context slot
```


Instead, a standard context will be created on device 0.

Meanwhile, if you just call cudaMalloc as your first CUDA call and never call cuCtxCreate first, a primary context will be created on device 0. You can’t access that directly via the driver API, but you can do something like

```cuda
cudaMalloc();                 // create primary context
cuCtxGetCurrent(&primaryCtx); // store the primary context

// ...

cuCtxSetCurrent(primaryCtx);  // go back to the primary context created by the runtime
cuLaunchKernel(...);          // do more driver API calls on the primary context
```

The programming model that I generally recommend is one context per device per process. In 4.0, it’s really trivial to share these; just create them (either with driver or runtime API, doesn’t matter) and use them from whichever thread you want. The only time things get crazy is when you’re mixing runtime-created and driver-created contexts in the same app.
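A sketch of that one-context-per-device-per-process setup with the driver API (the names and the fixed array size are illustrative):

```cuda
#include <cuda.h>

#define MAX_DEVICES 16

CUcontext ctxs[MAX_DEVICES];     // one context per device, shared by all threads
int deviceCount;

cuInit(0);
cuDeviceGetCount(&deviceCount);
for (int i = 0; i < deviceCount && i < MAX_DEVICES; ++i) {
    CUdevice dev;
    cuDeviceGet(&dev, i);
    cuCtxCreate(&ctxs[i], 0, dev);  // created once; note this leaves ctxs[i] current here
}

// any thread, later: bind the context for the device it wants to use
cuCtxSetCurrent(ctxs[0]);
```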

If you don’t want to worry about primary contexts versus normal contexts, the easy thing to do is to always create and manage your contexts using the same API, either driver or runtime. If you do that, everything is straightforward and basically works as you’d expect.


Hi tmurray,

thank you very much for the detailed explanation. I think I have found a suitable solution now, and will keep one context for each device.
I will also do the context switch internally, so it seems I’m getting more and more into the runtime API…