At my previous employer, the CUDA system was multi-process, single-thread (MPST). Each process had a singleton that managed CUDA resources (streams, device memory, etc.); in particular, during execution there was a single CUDA stream on which all GPU operations (memory copies and kernel launches) were queued. A device-memory heap (pure bookkeeping) was developed so that device memory could be allocated without going through CUDA APIs (e.g., cudaMalloc). We were quite used to the following mode of programming:
(1) ptr1 = allocator(size1);
(2) knl_1<<<…, exec_stream>>>(ptr1);
(3) deallocator(ptr1);
(4) ptr2 = allocator(size2);
(5) knl_2<<<…, exec_stream>>>(ptr2);
(6) deallocator(ptr2);
Because kernel launches are asynchronous, ptr1 may be returned to the heap before knl_1 finishes, and ptr2 very likely overlaps ptr1. But this is not a race condition between knl_1 and knl_2: with a single queue (stream) of execution, knl_2 cannot start until knl_1 has completed.
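Incidentally, this is exactly the pattern that CUDA's stream-ordered memory allocator formalizes: since CUDA 11.2, cudaMallocAsync/cudaFreeAsync enqueue the allocation and the free on the stream itself, so a block freed behind a kernel can be handed to a later allocation without the host ever waiting. A sketch of the same flow on that API (grid and block are placeholders I have added; the other names are from the flow above):

// The same idiom written against the stream-ordered allocator (CUDA 11.2+).
float *ptr1 = nullptr, *ptr2 = nullptr;
cudaMallocAsync(reinterpret_cast<void**>(&ptr1), size1, exec_stream);
knl_1<<<grid, block, 0, exec_stream>>>(ptr1);
cudaFreeAsync(ptr1, exec_stream);              // enqueued after knl_1; no host sync
cudaMallocAsync(reinterpret_cast<void**>(&ptr2), size2, exec_stream);
knl_2<<<grid, block, 0, exec_stream>>>(ptr2);  // may reuse ptr1's bytes, safely
cudaFreeAsync(ptr2, exec_stream);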
Now I work for a different employer, whose system uses multiple threads in each process. The CUDA resource manager is again a singleton, but each thread has its own CUDA stream. It seems to me that the old programming mode may not work under the new infrastructure, as the following flow shows:
thread 1:
(1) ptr1 = allocator(size1);
(2) knl_1<<<…, exec_stream_1>>>(ptr1);
(3) deallocator(ptr1);
(4) ptr2 = allocator(size2);
(5) knl_2<<<…, exec_stream_1>>>(ptr2);
(6) deallocator(ptr2);
thread 2:
(1) ptr3 = allocator(size3);
(2) knl_3<<<…, exec_stream_2>>>(ptr3);
(3) deallocator(ptr3);
(4) ptr4 = allocator(size4);
(5) knl_4<<<…, exec_stream_2>>>(ptr4);
(6) deallocator(ptr4);
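To see why the overlap is so likely, consider the bookkeeping heap itself. The following is a hypothetical sketch (my own names and policy, not either employer's real heap): a LIFO free list returns the most recently freed block first, so thread 2's allocator(size3) can receive exactly the bytes thread 1 just released:

// Hypothetical sketch of a bookkeeping heap with a LIFO free list.
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

class NaiveDeviceHeap {
public:
    void* allocate(std::size_t size) {
        std::lock_guard<std::mutex> lock(mu_);
        // Hands back the most recently freed block that is big enough --
        // very likely the bytes some other thread released a moment ago.
        if (!free_list_.empty() && free_list_.back().second >= size) {
            void* p = free_list_.back().first;
            free_list_.pop_back();
            return p;
        }
        return carve_fresh_block(size);  // carve from the reserved device region
    }

    void deallocate(void* p, std::size_t size) {
        std::lock_guard<std::mutex> lock(mu_);
        // No stream or event is consulted: the block is reusable instantly,
        // even if a kernel launched earlier still reads or writes it.
        free_list_.emplace_back(p, size);
    }

private:
    void* carve_fresh_block(std::size_t size);  // elided

    std::mutex mu_;
    std::vector<std::pair<void*, std::size_t>> free_list_;
};

Nothing in deallocate() knows that knl_1 was launched on exec_stream_1 and may still be in flight when the block changes hands.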
So it is very likely that ptr1 overlaps ptr3, and since exec_stream_1 and exec_stream_2 can execute concurrently, this is a real race condition between knl_1 and knl_3. The simplest fix I can think of is to call cudaStreamSynchronize before calling the deallocator, but that will surely slow the system down.
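A refinement of the same idea might avoid blocking the host: at deallocation time, record a cudaEvent on the freeing stream and quarantine the block until cudaEventQuery reports the event complete, so reuse is deferred without any cudaStreamSynchronize. A minimal sketch under those assumptions (hypothetical names, error handling elided):

// Hypothetical deferred-free heap: blocks become reusable only after the
// freeing stream has executed past the free point.
#include <cuda_runtime.h>
#include <cstddef>
#include <deque>
#include <mutex>

struct PendingBlock {
    void*       ptr;
    std::size_t size;
    cudaEvent_t done;  // fires once the freeing stream passes the free point
};

class DeferredFreeHeap {
public:
    // Instead of returning the block to the free list immediately, record
    // an event on the freeing stream and quarantine the block behind it.
    void deallocate(void* p, std::size_t size, cudaStream_t stream) {
        cudaEvent_t ev;
        cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
        cudaEventRecord(ev, stream);  // ordered after all prior work on stream
        std::lock_guard<std::mutex> lock(mu_);
        quarantine_.push_back({p, size, ev});
    }

    // Called at the top of allocate(): a non-blocking poll that moves blocks
    // whose event has fired back to the real free list. Events recorded on
    // different streams may complete out of order, so a production version
    // would scan the whole list rather than stop at the first busy entry.
    void reclaim_completed() {
        std::lock_guard<std::mutex> lock(mu_);
        while (!quarantine_.empty() &&
               cudaEventQuery(quarantine_.front().done) == cudaSuccess) {
            cudaEventDestroy(quarantine_.front().done);
            release_to_free_list(quarantine_.front().ptr,
                                 quarantine_.front().size);
            quarantine_.pop_front();
        }
    }

private:
    void release_to_free_list(void* p, std::size_t size);  // elided

    std::mutex mu_;
    std::deque<PendingBlock> quarantine_;
};

And if replacing the in-house heap altogether were an option, cudaMallocAsync/cudaFreeAsync do essentially this bookkeeping inside the driver, including the cross-stream case.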
Still, I would like advice from CUDA architects: in CUDA systems, especially for processes that interact directly with the GPU hardware, do we prefer single-threaded over multi-threaded designs?