I am seeing host-side stack corruption when calling cudaMalloc and cudaFree from multiple host threads of a single process. Each thread calls cudaSetDevice() to select a specific GPU (0, 1, or 2). Device memory allocated by one thread on a given device is later released by a different thread that has selected the same device. Basically, a few worker threads allocate memory and pass the pointers through a thread-safe queue to other threads, which do some work and then release them. Currently no work is done at all: there is no host-to-device or device-to-host copying, only allocation in one thread followed later by deallocation in another. Calls to cudaMalloc and cudaFree may run simultaneously in different threads, each with a different device set.
When I use only a single GPU device in my process, there are no problems.
When I put a mutex around the cudaMalloc and cudaFree calls, there are no problems.
Without the mutex, I randomly see host heap corruption, a SIGSEGV in libcuda.so, or a cudaErrorAlreadyMapped error code returned from cudaMalloc.
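For reference, the pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual application code: `Block`, `BlockQueue`, and `runRepro` are hypothetical names, one allocator and one releaser thread per device is an assumption, and the `serialize` flag switches the global mutex workaround on or off. When the file is not compiled by nvcc, host-only stand-ins replace the CUDA calls so that the threading logic can still be built and checked without the toolkit.

```cpp
// Sketch: allocate in one thread, free in another, with several devices.
// Compile as a .cu file with nvcc to use the real CUDA runtime.
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

#ifdef __CUDACC__
#include <cuda_runtime.h>
#else
// Host-only stand-ins (NOT the real CUDA API), used only without nvcc.
using cudaError_t = int;
constexpr cudaError_t cudaSuccess = 0;
inline cudaError_t cudaSetDevice(int) { return cudaSuccess; }
inline cudaError_t cudaMalloc(void** p, std::size_t n) {
  *p = std::malloc(n);
  return *p ? cudaSuccess : 2;
}
inline cudaError_t cudaFree(void* p) { std::free(p); return cudaSuccess; }
#endif

// A device pointer together with the device it was allocated on.
struct Block { int device; void* ptr; };

// Small thread-safe queue passing blocks from allocators to releasers.
class BlockQueue {
 public:
  void push(Block b) {
    { std::lock_guard<std::mutex> lock(m_); q_.push(b); }
    cv_.notify_one();
  }
  Block pop() {
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [this] { return !q_.empty(); });
    Block b = q_.front();
    q_.pop();
    return b;
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::queue<Block> q_;
};

// Starts one allocator and one releaser thread per device and returns the
// number of failed CUDA calls. If serialize is true, each
// cudaSetDevice + cudaMalloc/cudaFree pair runs under one global mutex.
int runRepro(int numDevices, int blocksPerDevice, std::size_t bytes,
             bool serialize) {
  BlockQueue queue;
  std::mutex cudaMutex;
  std::atomic<int> failures{0};

  auto allocator = [&](int device) {
    for (int i = 0; i < blocksPerDevice; ++i) {
      void* ptr = nullptr;
      {
        std::unique_lock<std::mutex> lock(cudaMutex, std::defer_lock);
        if (serialize) lock.lock();
        if (cudaSetDevice(device) != cudaSuccess ||
            cudaMalloc(&ptr, bytes) != cudaSuccess) {
          ++failures;
          ptr = nullptr;
        }
      }
      queue.push({device, ptr});  // pushed even on failure so pops balance
    }
  };

  auto releaser = [&] {
    for (int i = 0; i < blocksPerDevice; ++i) {
      Block b = queue.pop();  // may come from any allocator thread
      if (!b.ptr) continue;
      std::unique_lock<std::mutex> lock(cudaMutex, std::defer_lock);
      if (serialize) lock.lock();
      // Select the device the block was allocated on before freeing it.
      if (cudaSetDevice(b.device) != cudaSuccess ||
          cudaFree(b.ptr) != cudaSuccess) {
        ++failures;
      }
    }
  };

  std::vector<std::thread> threads;
  for (int d = 0; d < numDevices; ++d) {
    threads.emplace_back(allocator, d);
    threads.emplace_back(releaser);
  }
  for (auto& t : threads) t.join();
  return failures.load();
}
```

In this sketch, `runRepro(3, 1000, 1 << 20, false)` would correspond to the failing configuration (three devices, no lock), and passing `true` as the last argument to the mutex workaround. The block count and size are arbitrary illustrative values.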
My belief was that cudaMalloc and cudaFree are THREAD SAFE. Is that really true? What am I doing wrong?
Thanks for any help,
Radek