Pinned memory does not play nicely with context management

I’d like to allocate and initialize pinned memory on the host side in one thread (using one context) and then use that memory to perform host-to-device transfers in another thread (which, of course, uses a different context).

In pseudo code:

Thread #0

cuCtxCreate
cuMemAllocHost (buffer)
(initialize buffer with some data)
cuCtxDestroy // What happens to buffer now? Its context is now gone.

Thread #1

cuCtxCreate
cuMemcpyHtoDAsync (buffer) // Oops. We allocated and initialized this memory in Thread #0 with a different context!
cuCtxDestroy

This caused cuMemcpyHtoDAsync to fail in various ways (invalid context).

I tried to use the context management API to pop the context from Thread #0.
In Thread #1, I pushed the (now floating) context onto the context stack but CUDA didn’t seem to like that either.

To be clear, the second attempt looked something like this:

Thread #0

cuCtxCreate
cuMemAllocHost (buffer)
(initialize buffer with some data)
cuCtxPopCurrent (thread_0_context)
(save thread_0_context for later use in Thread #1 when we perform the asynchronous copy)
cuCtxDestroy // Oops. Now I just destroyed the floating context. Hmm…maybe I need to attach to thread_0_context to raise its reference count to 2?

Thread #1

cuCtxCreate
cuCtxPushCurrent (thread_0_context) // Push context from Thread #0
cuMemcpyHtoDAsync (buffer)
cuCtxPopCurrent // Pop thread_0_context from stack to get back to the previous context.
cuCtxDestroy

Any ideas whether this approach should work? What I’m trying to do is decouple the initialization of pinned memory in one thread from its use (i.e., DMA transfer) in another thread.

It doesn’t appear that the CUDA context management API is quite up to the task. It would be nice to be able to allow pinned memory to work outside of a specific CUDA context.

Any ideas?

Thanks.

Pinned memory from multiple contexts is a feature we’re working on for a future version of CUDA (that is to say, not 2.1).

Also, you’ve hit the “cuCtxPopCurrent documentation is completely incomprehensible and has no relation to what the function actually does” bug. thread_0_context is going to be NULL in your example code, I bet. If you just pass the context handle as returned from cuCtxCreate in thread #0 to cuCtxPushCurrent in thread #1 after ensuring that you’ve called pop from #0, it should work.
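For anyone who finds this thread later, here is a rough sketch of the handoff described above, written against the driver API. It is only an illustration under some assumptions, not tested on CUDA 2.1: the buffer size, the global variable names, the use of std::thread and the build line are all made up for the example, and it assumes the three-argument form of cuCtxCreate (newer toolkits may expose a different signature). The point is that thread #1 pushes the handle returned by cuCtxCreate, not the value written by cuCtxPopCurrent.

```cuda
// Build (assumption): nvcc handoff.cu -lcuda -o handoff
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <thread>

// Abort with a message if a driver API call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        CUresult rc_ = (call);                                       \
        if (rc_ != CUDA_SUCCESS) {                                   \
            std::fprintf(stderr, "%s failed with CUresult %d\n",     \
                         #call, (int)rc_);                           \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

static const size_t kBytes = 1 << 20;  // hypothetical buffer size

// State handed from thread #0 to thread #1 (illustrative names).
static CUcontext g_ctx = NULL;     // handle returned by cuCtxCreate
static void *g_pinned = NULL;      // pinned host buffer owned by g_ctx

// Thread #0: create the context, allocate and fill pinned memory,
// then detach the context so another thread can pick it up.
static void thread0() {
    CUdevice dev;
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&g_ctx, 0, dev));   // context is now current here
    CHECK(cuMemAllocHost(&g_pinned, kBytes));
    std::memset(g_pinned, 0x5A, kBytes);  // "initialize buffer with some data"
    CUcontext popped = NULL;
    CHECK(cuCtxPopCurrent(&popped));      // context is now floating
    // Do not hand 'popped' to thread #1; use g_ctx from cuCtxCreate.
    // Do not call cuCtxDestroy here, or the pinned buffer goes with it.
}

// Thread #1: attach the same context and do the async copy.
static void thread1() {
    CHECK(cuCtxPushCurrent(g_ctx));
    CUdeviceptr dptr;
    CUstream stream;
    CHECK(cuMemAlloc(&dptr, kBytes));
    CHECK(cuStreamCreate(&stream, 0));
    CHECK(cuMemcpyHtoDAsync(dptr, g_pinned, kBytes, stream));
    CHECK(cuStreamSynchronize(stream));   // wait for the DMA to finish
    CHECK(cuStreamDestroy(stream));
    CHECK(cuMemFree(dptr));
    CHECK(cuMemFreeHost(g_pinned));
    CHECK(cuCtxDestroy(g_ctx));           // destroy only after the last use
}

int main() {
    CHECK(cuInit(0));
    std::thread t0(thread0);
    t0.join();                            // pop must happen before the push
    std::thread t1(thread1);
    t1.join();
    std::puts("copy completed");
    return 0;
}
```

Note that this still uses a single context shared between the two threads one at a time; it does not make the pinned buffer usable from a second, independent context.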

Thanks! The handle is indeed NULL, and yes, the context management interface in general is not very well documented. Since this strategy should work, I assume the future work you are alluding to involves simplifying the context management interface?

What I need to be able to do is somehow decouple the DMA transfer from the initialization of the memory allocated in Thread #0. Right now, it’s really the CUDA context that “owns” the memory. That makes deferring pinned memory transfers painful.

Thanks for the help!

The interface isn’t that complicated; it just doesn’t do what it’s supposed to at the moment. I’m not sure if we’re changing the documentation or changing the behavior to do what the documentation says (since I am fairly sure absolutely no one uses it as it behaves now); if anyone has strong opinions either way, jump in.