Multi-Thread or Single-Thread for CUDA Software System

At my previous employer, the CUDA system was multi-process, single-thread (MPST). In each process there was a singleton that managed CUDA resources (streams, device memory, etc.). In particular, during execution there was only one CUDA stream on which GPU operations (memory copies and kernel launches) were queued. A device memory heap (with its own bookkeeping) was developed to allocate device memory without going through CUDA APIs (e.g., cudaMalloc). We were quite used to the following mode of programming:

ptr1 = allocator(size1);
knl_1<<<…, exec_stream>>>(ptr1);
deallocator(ptr1);
ptr2 = allocator(size2);
knl_2<<<…, exec_stream>>>(ptr2);
deallocator(ptr2);

Due to the asynchronous behavior of kernel launches, ptr1 may be returned to the heap before knl_1 finishes, and ptr2 will very likely overlap ptr1. But this is not a race condition between knl_1 and knl_2: there is only one queue (stream) of execution, so knl_2 cannot start until knl_1 has completed.

Now I am working at a different employer, which adopts multi-threading within each process. Similarly, the CUDA resource manager is a singleton, but each thread has a separate CUDA stream. It seems to me that the old programming mode may not work under the new infrastructure, as the following flow illustrates:

thread 1
ptr1 = allocator(size1);
knl_1<<<…, exec_stream_1>>>(ptr1);
deallocator(ptr1);
ptr2 = allocator(size2);
knl_2<<<…, exec_stream_1>>>(ptr2);
deallocator(ptr2);

thread 2
ptr3 = allocator(size3);
knl_3<<<…, exec_stream_2>>>(ptr3);
deallocator(ptr3);
ptr4 = allocator(size4);
knl_4<<<…, exec_stream_2>>>(ptr4);
deallocator(ptr4);

It is very likely that ptr1 overlaps ptr3, leading to a race condition between knl_1 and knl_3: the two kernels run on different streams and can execute concurrently while touching the same memory. The only solution I can think of is to call cudaStreamSynchronize before calling the deallocator, which will surely slow the system down.
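Concretely, the workaround I have in mind looks like this (a sketch against the same hypothetical allocator/deallocator interface as above):

ptr1 = allocator(size1);
knl_1<<<…, exec_stream_1>>>(ptr1);
cudaStreamSynchronize(exec_stream_1); // block the host until knl_1 is done with ptr1
deallocator(ptr1); // now safe: another thread reusing this memory cannot race with knl_1

Every deallocation would stall the host until the stream drains, which defeats the purpose of queuing work asynchronously.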

I need advice from CUDA architects: in CUDA systems, especially for processes interacting with GPU hardware, is a single-threaded design preferred over a multi-threaded one?

For the allocator problem I would suggest using a stream-ordered memory allocator. CUDA already provides such an allocator, backed by a memory pool: cudaMallocAsync / cudaFreeAsync (available since CUDA 11.2).
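With it, both the allocation and the free are queued on the stream, so the pool will not hand the memory to another allocation until it is safe to do so; no host-side synchronization is needed. A minimal sketch:

#include <cuda_runtime.h>

__global__ void knl_1(float* p) { p[threadIdx.x] = threadIdx.x; }

int main() {
    cudaStream_t exec_stream_1;
    cudaStreamCreate(&exec_stream_1);

    float* ptr1 = nullptr;
    // Allocation and free are ordered within the stream, like kernel launches.
    cudaMallocAsync(reinterpret_cast<void**>(&ptr1), 256 * sizeof(float), exec_stream_1);
    knl_1<<<1, 256, 0, exec_stream_1>>>(ptr1);
    cudaFreeAsync(ptr1, exec_stream_1); // no cudaStreamSynchronize needed: the pool
                                        // will not reuse this memory until knl_1 is done

    cudaStreamSynchronize(exec_stream_1); // only to finish cleanly before exit
    cudaStreamDestroy(exec_stream_1);
    return 0;
}

Each thread can do this on its own stream; the runtime's default memory pool is thread-safe, and safe reuse across streams is handled by the driver.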

There is also the RAPIDS Memory Manager (RMM): https://github.com/rapidsai/rmm
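RMM offers stream-ordered C++ containers and memory resources built on the same idea. A minimal sketch, assuming RMM's documented rmm::device_buffer interface (check the repo for the current signatures):

#include <rmm/device_buffer.hpp>
#include <rmm/cuda_stream_view.hpp>

void per_thread_work(rmm::cuda_stream_view stream) {
    // Allocated in stream order via the current device memory resource.
    rmm::device_buffer buf(1 << 20, stream); // 1 MiB, ordered on this thread's stream
    // ... launch kernels on `stream` that use buf.data() ...
}   // destructor frees in stream order; no host synchronization required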

RAPIDS is an Nvidia-organized open-source project?

RAPIDS is developed by Nvidia, yes.

Eventually, there will also be memory resources provided by CCCL; experimental implementations already exist: Memory Resources — cudax 2.5 documentation.

