Multithreaded access to CUDA

I’m trying to write a program that launches a CUDA kernel from several host threads (I’m using POSIX threads). Even if I put a lock around the code that copies the data and launches the kernel, my system still seems to lock up when I run it with more than one thread.

Basically, my code looks like this, with no other CUDA calls outside this code block:

    ...cudaMemcpy to device
    ...kernel launch
    ...cudaMemcpy to host

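It’s hard to tell from your description what is going wrong. Are you aware that the CUDA runtime gives each host thread its own context, so device pointers (e.g. those allocated with cudaMalloc()) cannot be passed between host threads? (This applies to CUDA releases before 4.0; since CUDA 4.0 the runtime shares one context per device among all host threads of a process.)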
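A minimal sketch of the question's copy/kernel/copy sequence written along these lines, assuming the CUDA runtime API and POSIX threads. Each host thread allocates, uses, and frees its own device buffer, so no device pointer ever crosses threads, and every CUDA call's status is checked so that a failing call is reported instead of being silently ignored. The kernel add_one, the job_t struct, run_jobs, N_THREADS and N_ELEMS are hypothetical names made up for illustration; the mutex mirrors the lock described in the question.

```cuda
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N_THREADS 4     /* hypothetical: number of host threads */
#define N_ELEMS   1024  /* hypothetical: elements per thread */

/* Hypothetical example kernel: adds 1.0f to every element. */
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

/* Lock around the copy/kernel/copy sequence, as described in the question. */
static pthread_mutex_t gpu_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct {
    float *host;         /* host buffer; only host pointers are handed to threads */
    int n;               /* number of floats in host */
    cudaError_t status;  /* first CUDA error this thread hit, or cudaSuccess */
} job_t;

static void *worker(void *arg)
{
    job_t *job = (job_t *)arg;
    float *d_data = NULL;  /* allocated, used and freed by this thread only */
    size_t bytes = (size_t)job->n * sizeof(float);
    cudaError_t err;

    pthread_mutex_lock(&gpu_lock);
    err = cudaMalloc((void **)&d_data, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_data, job->host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        add_one<<<(job->n + 255) / 256, 256>>>(d_data, job->n);
        err = cudaGetLastError();  /* reports launch errors */
    }
    if (err == cudaSuccess)
        /* blocking copy issued after the kernel on the same (default) stream, so it waits for it */
        err = cudaMemcpy(job->host, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);  /* cudaFree(NULL) is a no-op */
    pthread_mutex_unlock(&gpu_lock);

    job->status = err;
    return NULL;
}

/* Starts one host thread per job, joins them, and returns how many failed. */
static int run_jobs(job_t *jobs, int count)
{
    pthread_t tids[N_THREADS];
    int failures = 0;

    assert(count <= N_THREADS);
    for (int t = 0; t < count; ++t) {
        if (pthread_create(&tids[t], NULL, worker, &jobs[t]) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            exit(EXIT_FAILURE);  /* sketch: treat as fatal */
        }
    }
    for (int t = 0; t < count; ++t) {
        pthread_join(tids[t], NULL);
        if (jobs[t].status != cudaSuccess) {
            fprintf(stderr, "thread %d: %s\n", t, cudaGetErrorString(jobs[t].status));
            ++failures;
        }
    }
    return failures;
}
```

To use it, fill one job_t per thread with a host buffer of N_ELEMS floats, then call run_jobs(jobs, N_THREADS) from main; a nonzero return value means at least one thread got a CUDA error, and the printed message names it. Allocating per thread works on any CUDA version, which is why the sketch never shares d_data even though newer runtimes would allow it.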