I’ve seen a couple of questions related to this topic in a couple of different places, but nothing ever had a firm answer.
Here’s what I’m trying to do:
1. Allocate device memory x using cudaMalloc in thread A.
2. Access device memory x (e.g. to zero it) from thread A.
3. Thread A creates thread B (using CreateThread; this is Windows).
4. Thread A blocks waiting for thread B to complete (using WaitForMultipleObjects).
5. Access device memory x (e.g. for cudaMemcpy, cufft, etc.) in thread B.
Step 5 always fails with cudaErrorInvalidValue.
The memory pointer value is still correct, and the thread that allocated it hasn’t yet exited (which would destroy the context), so why can’t I access device memory from a thread other than the one that allocated it?
The context is valid only in the thread that created it; you cannot arbitrarily share a context among threads. There is a context migration API (cuCtxPopCurrent/cuCtxPushCurrent in the driver API) for transferring a context from thread to thread, but the operation carries non-trivial overhead. You might want to consider a different multithreading model instead: have a single thread hold the context and act as a consumer, with multiple producer threads feeding it work asynchronously.