cuModuleGetGlobal from multiple host threads

I have a single CUDA device and multiple host threads, each of which calls cuCtxCreate, cuModuleLoadData and cuModuleGetGlobal, the last to get a device pointer to a u32 in my kernel file.

Each thread gets different hContext and hModule handles but the same DevicePtr address from these calls. However, I’m unable to move a u32 from host to device and back to host using my threads. There are no runtime errors and the memcpy calls don’t fail, but my second thread always copies 0 from the device.

With a single thread there doesn’t seem to be a problem.

Is it a problem that I’m using multiple contexts?

Edit: dynamic device allocations with cuCtxPopCurrent() and cuCtxPushCurrent() seem to be the direction to move in… The forum archive is an awesome resource!
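
For anyone landing here later, this is a minimal sketch of that direction, not my actual app. As far as I understand it, each context has its own copy of a module's globals, so the same DevicePtr value in two contexts can refer to different memory, which would explain the second thread reading 0. The sketch therefore uses one context and one cuMemAlloc'd u32, and each thread pushes the context before touching the device and pops it afterwards. CHECK, g_ctx, g_dp, writer and reader are names I made up for illustration. The three-argument cuCtxCreate is the form I know from the toolkits I have used, and newer headers may declare it differently.

```c
/* Sketch only: one context, one cuMemAlloc'd u32, two host threads.
 * Assumed build line: cc demo.c -lcuda -lpthread */
#include <cuda.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: abort on any driver API error. */
#define CHECK(call)                                              \
    do {                                                         \
        CUresult rc_ = (call);                                   \
        if (rc_ != CUDA_SUCCESS) {                               \
            fprintf(stderr, "%s failed: %d\n", #call, (int)rc_); \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

static CUcontext   g_ctx; /* the single context shared by every thread */
static CUdeviceptr g_dp;  /* device address of one u32, meaningful only in g_ctx */

/* First child: make the shared context current, write, detach. */
static void *writer(void *arg)
{
    uint32_t value = 42;
    CUcontext popped;
    (void)arg;
    CHECK(cuCtxPushCurrent(g_ctx));
    CHECK(cuMemcpyHtoD(g_dp, &value, sizeof value));
    CHECK(cuCtxPopCurrent(&popped));
    return NULL;
}

/* Second child: same context, so the same address is the same memory. */
static void *reader(void *arg)
{
    CUcontext popped;
    CHECK(cuCtxPushCurrent(g_ctx));
    CHECK(cuMemcpyDtoH(arg, g_dp, sizeof(uint32_t)));
    CHECK(cuCtxPopCurrent(&popped));
    return NULL;
}

int main(void)
{
    CUdevice dev;
    CUcontext popped;
    pthread_t t;
    uint32_t result = 0;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    /* Assumption: three-argument cuCtxCreate; check your cuda.h. */
    CHECK(cuCtxCreate(&g_ctx, 0, dev));
    CHECK(cuMemAlloc(&g_dp, sizeof(uint32_t)));
    CHECK(cuCtxPopCurrent(&popped)); /* detach from main before handing it out */

    /* Run the children one after the other to keep the sketch race-free. */
    pthread_create(&t, NULL, writer, NULL);
    pthread_join(t, NULL);
    pthread_create(&t, NULL, reader, &result);
    pthread_join(t, NULL);
    printf("read back %u\n", (unsigned)result);

    CHECK(cuCtxPushCurrent(g_ctx)); /* reattach to main for cleanup */
    CHECK(cuMemFree(g_dp));
    CHECK(cuCtxDestroy(g_ctx));
    return 0;
}
```

The children run strictly one after the other here, so no two threads ever hold the context at the same moment. Whether two threads may have the same context current simultaneously depends on the CUDA version, so check that before overlapping them.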

My app is structured as 3 host threads (main and two children, which all communicate through synchronized queues). It mostly works… YMMV
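
The synchronized queues are not shown in the outline below, so here is a minimal sketch of one, assuming POSIX threads and a small fixed-capacity ring buffer. work_queue, queue_push, queue_pop, child and run_demo are hypothetical names of mine and have nothing to do with CUDA itself. A value of 0 is used as an end-of-work sentinel purely for the demo.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define QUEUE_CAP 8

/* Bounded FIFO protected by one mutex and two condition variables. */
typedef struct {
    uint32_t        items[QUEUE_CAP];
    int             head, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
} work_queue;

static void queue_init(work_queue *q)
{
    q->head = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

static void queue_destroy(work_queue *q)
{
    pthread_cond_destroy(&q->not_full);
    pthread_cond_destroy(&q->not_empty);
    pthread_mutex_destroy(&q->lock);
}

/* Blocks while the queue is full. */
static void queue_push(work_queue *q, uint32_t v)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    assert(q->count < QUEUE_CAP);
    q->items[(q->head + q->count) % QUEUE_CAP] = v;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Blocks while the queue is empty. */
static uint32_t queue_pop(work_queue *q)
{
    uint32_t v;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    v = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return v;
}

typedef struct { work_queue *in, *out; } child_args;

/* Child thread: sum incoming values until the 0 sentinel, then reply. */
static void *child(void *p)
{
    child_args *a = p;
    uint32_t sum = 0, v;
    while ((v = queue_pop(a->in)) != 0)
        sum += v;
    queue_push(a->out, sum);
    return NULL;
}

/* Main side: send 1, 2, 3 and the sentinel, then wait for the reply. */
uint32_t run_demo(void)
{
    work_queue to_child, to_main;
    child_args args = { &to_child, &to_main };
    pthread_t t;
    uint32_t reply, v;

    queue_init(&to_child);
    queue_init(&to_main);
    pthread_create(&t, NULL, child, &args);
    for (v = 1; v <= 3; v++)
        queue_push(&to_child, v);
    queue_push(&to_child, 0);
    reply = queue_pop(&to_main);
    pthread_join(t, NULL);
    queue_destroy(&to_child);
    queue_destroy(&to_main);
    return reply;
}
```

In the real app the child would launch device work where this one just adds numbers, and the reply queue would carry results back to main.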

hContext
hFunction
dpMemory1
dpMemory2

main
{
    cuInit
    cuDeviceGet
    cuCtxCreate
    cuModuleLoadData
    cuModuleGetFunction to get hFunction
    cuModuleGetGlobal to get dpMemory1 and dpMemory2
    Launch other threads
    Loop for a bit sending stuff to thread1
    cuStreamSynchronize
    cuModuleUnload
    cuCtxDestroy
}

thread1
{
    cuCtxPushCurrent
    Loop // dequeue stuff from main and use hFunction and dpMemory1 here to call device
    cuCtxPopCurrent
}

thread2
{
    cuCtxPushCurrent
    Loop // use cuStreamQuery and dpMemory2 here and queue stuff back to main
    cuCtxPopCurrent
}