(long, semi-related explanation of runtime API and contexts follows; skip to the line if you just want the answer to your question)
Since I designed this, I guess it falls to me to explain it.
Prior to CUDA 4.0, context management was simple: every thread had a TLS slot that identified which context was currently bound to that thread, and every context could only be bound to one thread at a time. Additionally, every context was only bound to a single device for the entire lifetime of the context. (I’m ignoring the context stack stuff; it doesn’t really matter)
In CUDA 4.0, we enabled multithreaded access to contexts so a single context could belong to more than one thread. So, as of 4.0:
a context belongs to a single device
a thread has a single context bound at a time (again, ignoring context stack stuff)
a context can be bound to multiple threads simultaneously
The driver API works exactly how you’d expect given these definitions, but the runtime API is more complicated. In particular, I felt it was very important that the following piece of code work exactly as you’d expect:
cudaDeviceSynchronize(); // wait for kernel to finish; in other words, go back to the same context as initially
Additionally, cudaSetDevice(0) in one thread needs to access the same context as cudaSetDevice(0) in another thread.
What the runtime API actually does is use a hidden API to create what’s called a primary context. Primary contexts are the same as any other contexts, except that there can be only one for a device at a time. (We’ve never exposed it because the API is ugly and we don’t like it, but we also don’t have a good way to fix it. It’s one of those places where we look at the API and think “damn, we really should have reference counted that thing instead of just having create/destroy.”)
The runtime API creates a context when there’s no context in the thread’s TLS context slot. So, if you do something like this, no primary context is created:
cuCtxCreate(&ctx, 0, 0); //create a context and place it in the thread's TLS context slot
Instead, a standard context will be created on device 0.
Meanwhile, if you just call cudaMalloc as your first CUDA call and never call cuCtxCreate first, a primary context will be created on device 0. You can’t access that directly via the driver API, but you can do something like
cudaMalloc(); // create primary context
cuCtxGetCurrent(&primaryCtx); //store the primary context
cuCtxSetCurrent(primaryCtx); // go back to the primary context created by the runtime
cuLaunchKernel(...); // do more driver API calls on the primary context
The programming model that I generally recommend is one context per device per process. In 4.0, it’s really trivial to share these; just create them (either with driver or runtime API, doesn’t matter) and use them from whichever thread you want. The only time things get crazy is when you’re mixing runtime-created and driver-created contexts in the same app.
If you don’t want to worry about primary contexts versus normal contexts, the easy thing to do is to always create your contexts and manage contexts using the same API, either driver or runtime. If you do that, everything is straightforward and basically works as you’d expect.