Code sample:

Thread A:
cuCtxCreate(&cuDeviceContext, CU_CTX_SCHED_AUTO, cuDeviceHandle);
cuCtxPopCurrent(NULL);
// This succeeds.

Thread B:
cuCtxPushCurrent(cuDeviceContext);
… allocate memory or something
cuCtxPopCurrent(NULL);
// This succeeds.

Thread C:
cuCtxPushCurrent(cuDeviceContext);
// At this point the return value is CUDA_ERROR_INVALID_VALUE.
It also fails when I try:

Thread B:
cuCtxPushCurrent(cuDeviceContext);
… allocate memory or something
cuCtxPopCurrent(NULL);
cuCtxPushCurrent(cuDeviceContext);
// At this point the return value is CUDA_ERROR_INVALID_VALUE.
I think I have tried almost every possibility. I introduced a semaphore-like variable, tried to "attach" the context (the usage count should be increased, so the pop() operation should fail, but that is not the case), and so on, but the result is always the same.
It looks as if you can call cuCtxPushCurrent() only once; after the next cuCtxPopCurrent() the context is destroyed or becomes invalid. In my opinion the CUDA_ERROR_INVALID_VALUE error says nothing about the actual failure. It can mean almost anything.
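To get at least a readable error out of the driver, every call can be wrapped in a checking macro. This is a hypothetical helper, not part of my code; note that cuGetErrorName()/cuGetErrorString() only exist in newer driver versions (CUDA 6.0 and later), so on older toolkits you would have to print the numeric code instead:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

/* Hypothetical helper: wraps a driver-API call and prints the error
 * name and description instead of a bare numeric return code.
 * Requires CUDA 6.0+ for cuGetErrorName()/cuGetErrorString(). */
#define CU_CHECK(call)                                                  \
    do {                                                                \
        CUresult err_ = (call);                                         \
        if (err_ != CUDA_SUCCESS) {                                     \
            const char *name_ = NULL, *desc_ = NULL;                    \
            cuGetErrorName(err_, &name_);                               \
            cuGetErrorString(err_, &desc_);                             \
            fprintf(stderr, "%s failed: %s (%s)\n", #call,              \
                    name_ ? name_ : "?", desc_ ? desc_ : "?");          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)
```

Used as `CU_CHECK(cuCtxPushCurrent(cuDeviceContext));`, this at least tells you which call failed with which error name.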
The threadMigration example in the SDK did not help me understand this behavior any better.
If you have found a solution to this problem in the meantime, I would be very thankful if you could share it with me or give me a hint.
If anybody from NVIDIA reads this comment, it would be very nice to get a statement from him/her. Is it possible to use the cuCtxPushCurrent()/cuCtxPopCurrent() functions to create a structure like this?
Thread (main thread)
-> create ctx
-> save ctx handle in object
-> Pop()
-> send object to next thread
Thread
-> receive object
-> Push(ctx handle in object)
-> do work
-> Pop()
-> send object to next thread
Thread
-> receive object
-> Push(ctx handle in object)
-> do work
-> Pop()
-> send object to next thread
…
…
Last thread
-> receive object
-> Push(ctx handle in object)
-> do work
-> Pop()
-> send object back to first thread
First thread
-> receive object
-> Destroy(ctx handle in object)
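A minimal sketch of that hand-off pattern, assuming pthreads and a single device; here the CUcontext handle itself plays the role of the "object" passed from thread to thread (sequentially, so no two threads have it current at once). Error checking is omitted for brevity:

```c
#include <pthread.h>
#include <cuda.h>

/* The floating context, handed from thread to thread. */
static CUcontext ctx;

static void *worker(void *arg)
{
    /* Make the floating context current on this thread. */
    cuCtxPushCurrent(ctx);

    CUdeviceptr d;
    cuMemAlloc(&d, 1024);   /* driver-API calls only between Push/Pop */
    cuMemFree(d);

    /* Pop it again so the next thread can push it. */
    cuCtxPopCurrent(NULL);
    return NULL;
}

int main(void)
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);

    /* cuCtxCreate() makes the context current; pop it immediately
     * so it floats and can be pushed by other threads. */
    cuCtxCreate(&ctx, CU_CTX_SCHED_AUTO, dev);
    cuCtxPopCurrent(NULL);

    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t2, NULL);

    /* No thread has the context current any more; destroy it. */
    cuCtxDestroy(ctx);
    return 0;
}
```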
I think I figured out the problem. The use of Push() and Pop() is limited to the driver API. This means a use case like the one I described before is possible, BUT all CUDA calls between Push() and Pop() must be driver-API calls.
Furthermore, this means you have to use the driver-API memory calls, e.g. cuMemAlloc() instead of the runtime call cudaMalloc().
I'm not sure whether you can use the <<<>>> kernel launch syntax. I still have to figure this out, but I first have to make some code changes (in the memory calls) before I can try it. I'll let you know.
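The <<<>>> syntax belongs to the runtime API, so in a pure driver-API path the usual route is to load a compiled module (PTX or cubin) and launch through cuModuleGetFunction()/cuLaunchKernel(). This fragment is only a sketch; "kernel.ptx" and the kernel name "kernel" are placeholders, and cuLaunchKernel() itself requires CUDA 4.0 or later (older drivers use cuFuncSetBlockShape()/cuLaunchGrid() instead):

```c
/* Sketch: driver-API equivalent of  kernel<<<n/256, 256>>>(d_buf, n);
 * Assumes a current context and that d_buf was allocated with
 * cuMemAlloc(). Error checking omitted. */
CUmodule   mod;
CUfunction fn;
cuModuleLoad(&mod, "kernel.ptx");          /* placeholder file name  */
cuModuleGetFunction(&fn, mod, "kernel");   /* placeholder entry name */

CUdeviceptr d_buf;                         /* assume already allocated */
int n = 1024;
void *args[] = { &d_buf, &n };

cuLaunchKernel(fn,
               n / 256, 1, 1,              /* grid dimensions   */
               256, 1, 1,                  /* block dimensions  */
               0, NULL,                    /* shared mem, stream */
               args, NULL);                /* kernel arguments  */
```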