(long, semi-related explanation of the runtime API and contexts follows; skip to the last paragraph if you just want the answer to your question)
Since I designed this, I guess it falls to me to explain it.
Prior to CUDA 4.0, context management was simple: every thread had a TLS slot that identified which context was currently bound to that thread, and every context could only be bound to one thread at a time. Additionally, every context was only bound to a single device for the entire lifetime of the context. (I’m ignoring the context stack stuff; it doesn’t really matter)
In CUDA 4.0, we enabled multithreaded access to contexts so a single context could belong to more than one thread. So, as of 4.0:
- a context belongs to a single device
- a thread has a single context bound at a time (again, ignoring context stack stuff)
- a context can be bound to multiple threads simultaneously (see the sketch after this list)
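For example, here's a minimal driver API sketch of that last rule: one context on device 0, made current in two threads at once (error checking omitted; the worker scaffolding is mine for illustration, not part of any CUDA API):

#include <cuda.h>
#include <pthread.h>

CUcontext ctx; // one context, tied to device 0 for its whole lifetime

void *worker(void *arg)
{
    cuCtxSetCurrent(ctx); // as of 4.0, both threads can have ctx current at once
    CUdeviceptr p;
    cuMemAlloc(&p, 1024);  // both threads allocate out of the same context
    cuMemFree(p);
    return NULL;
}

int main(void)
{
    cuInit(0);
    cuCtxCreate(&ctx, 0, 0); // created on device 0; also bound to this thread

    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    cuCtxDestroy(ctx);
    return 0;
}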
The driver API works exactly how you’d expect given these definitions, but the runtime API is more complicated. In particular, I felt it was very important that the following piece of code work exactly as you’d expect:
cudaSetDevice(0); // bind device 0's primary context to this thread
cudaMalloc(...); // allocate on device 0
kernel<<<...>>>(...); // launch on device 0
cudaSetDevice(1); // switch this thread to device 1's context
cudaMalloc(...); // allocate on device 1
kernel2<<<...>>>(...); // launch on device 1
cudaSetDevice(0); // go back to the same context as initially
cudaDeviceSynchronize(); // wait for kernel (not kernel2) to finish
Additionally, cudaSetDevice(0) in one thread needs to access the same context as cudaSetDevice(0) in another thread.
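A minimal sketch of that cross-thread sharing (error checking omitted; the producer/consumer names are mine): a pointer allocated under device 0's primary context in one thread is valid in any other thread that has called cudaSetDevice(0).

#include <cuda_runtime.h>
#include <pthread.h>

float *d_buf; // device pointer shared between the two threads

void *producer(void *arg)
{
    cudaSetDevice(0); // binds device 0's primary context to this thread
    cudaMalloc(&d_buf, 256 * sizeof(float)); // allocation lives in that context
    return NULL;
}

void *consumer(void *arg)
{
    cudaSetDevice(0); // same primary context as the producer thread
    cudaMemset(d_buf, 0, 256 * sizeof(float)); // so the pointer is valid here too
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);
    pthread_create(&t, NULL, consumer, NULL);
    pthread_join(t, NULL);
    cudaFree(d_buf);
    return 0;
}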
What the runtime API actually does is use a hidden API to create what’s called a primary context. Primary contexts are the same as any other contexts, except that there can be only one for a device at a time. (We’ve never exposed it because the API is ugly and we don’t like it, but we also don’t have a good way to fix it. It’s one of those places where we look at the API and think “damn, we really should have reference counted that thing instead of just having create/destroy.”)
The runtime API only creates a primary context when there's no context in the thread's TLS context slot at the time of the first runtime call. So, if you do something like this, no primary context is created:
cuCtxCreate(&ctx, 0, 0); // create a standard context on device 0 and place it in the thread's TLS context slot
cudaMalloc(...); // the TLS slot is occupied, so the runtime just uses ctx
Instead, all of the runtime calls run against the standard context that cuCtxCreate created on device 0.
Meanwhile, if you just call cudaMalloc as your first CUDA call and never call cuCtxCreate first, a primary context will be created on device 0. There's no driver API call to create or look up a primary context, but once the runtime has created one, you can grab it with the ordinary current-context calls:
cudaMalloc(...); // first runtime call; creates the primary context on device 0
cuCtxGetCurrent(&primaryCtx); // store the primary context
cuCtxSetCurrent(someCtxCreatedByTheDriverAPIElsewhereInTheApp);
...
cuCtxSetCurrent(primaryCtx); // go back to the primary context created by the runtime
cuLaunchKernel(...); // do more driver API calls on the primary context
The programming model that I generally recommend is one context per device per process. In 4.0, it’s really trivial to share these; just create them (either with driver or runtime API, doesn’t matter) and use them from whichever thread you want. The only time things get crazy is when you’re mixing runtime-created and driver-created contexts in the same app.
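Here's a sketch of that one-context-per-device-per-process model using only the runtime API (error checking omitted; the kernel and the 16-device cap are placeholders of mine):

#include <cuda_runtime.h>

__global__ void scale(float *p, float a) { p[threadIdx.x] *= a; }

int main(void)
{
    int n;
    cudaGetDeviceCount(&n);
    if (n > 16) n = 16; // this sketch assumes at most 16 devices
    float *bufs[16];

    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i); // device i's primary context from here on
        cudaMalloc(&bufs[i], 256 * sizeof(float));
        cudaMemset(bufs[i], 0, 256 * sizeof(float));
        scale<<<1, 256>>>(bufs[i], 2.0f); // launch is asynchronous
    }
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i); // back to device i's context
        cudaDeviceSynchronize(); // wait for that device's kernel
        cudaFree(bufs[i]);
    }
    return 0;
}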
If you don’t want to worry about primary contexts versus normal contexts, the easy thing to do is to always create and manage your contexts using the same API, either driver or runtime. If you do that, everything is straightforward and basically works as you’d expect.