At my previous employer, the CUDA system was multi-process, single-thread (MPST). Each process had a singleton that managed CUDA resources (streams, device memory, etc.); in particular, during execution there was a single CUDA stream on which all GPU operations (memory copies and kernel launches) were queued. A device-memory heap (pure bookkeeping) was developed so that device memory could be allocated without going through CUDA APIs (e.g., cudaMalloc). We were quite used to the following mode of programming:
(1) ptr1 = allocator(size1);
(2) knl_1<<<…, exec_stream>>>(ptr1);
(3) deallocator(ptr1);
(4) ptr2 = allocator(size2);
(5) knl_2<<<…, exec_stream>>>(ptr2);
(6) deallocator(ptr2);
Because kernel launches are asynchronous, ptr1 may be returned to the heap before knl_1 finishes, and ptr2 very likely overlaps ptr1. But this is not a race condition between knl_1 and knl_2: with a single queue (stream) of execution, knl_2 cannot start until knl_1 has completed.
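Incidentally, this is exactly the pattern that CUDA's stream-ordered memory allocator formalizes: since CUDA 11.2, cudaMallocAsync/cudaFreeAsync enqueue the allocation and the free on the stream itself, so a block freed behind a kernel can be handed to a later allocation without the host ever waiting. A sketch of the same flow on that API (grid and block are placeholders I have added; the other names are from the flow above):

// The same idiom written against the stream-ordered allocator (CUDA 11.2+).
float *ptr1 = nullptr, *ptr2 = nullptr;
cudaMallocAsync(reinterpret_cast<void**>(&ptr1), size1, exec_stream);
knl_1<<<grid, block, 0, exec_stream>>>(ptr1);
cudaFreeAsync(ptr1, exec_stream);              // enqueued after knl_1; no host sync
cudaMallocAsync(reinterpret_cast<void**>(&ptr2), size2, exec_stream);
knl_2<<<grid, block, 0, exec_stream>>>(ptr2);  // may reuse ptr1's bytes, safely
cudaFreeAsync(ptr2, exec_stream);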
Now I work for a different employer, whose system uses multiple threads in each process. The CUDA resource manager is again a singleton, but each thread has its own CUDA stream. It seems to me that the old programming mode may not work under the new infrastructure, as the following flow shows:
thread 1:
(1) ptr1 = allocator(size1);
(2) knl_1<<<…, exec_stream_1>>>(ptr1);
(3) deallocator(ptr1);
(4) ptr2 = allocator(size2);
(5) knl_2<<<…, exec_stream_1>>>(ptr2);
(6) deallocator(ptr2);
thread 2:
(1) ptr3 = allocator(size3);
(2) knl_3<<<…, exec_stream_2>>>(ptr3);
(3) deallocator(ptr3);
(4) ptr4 = allocator(size4);
(5) knl_4<<<…, exec_stream_2>>>(ptr4);
(6) deallocator(ptr4);
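To see why the overlap is so likely, consider the bookkeeping heap itself. The following is a hypothetical sketch (my own names and policy, not either employer's real heap): a LIFO free list returns the most recently freed block first, so thread 2's allocator(size3) can receive exactly the bytes thread 1 just released:

// Hypothetical sketch of a bookkeeping heap with a LIFO free list.
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

class NaiveDeviceHeap {
public:
    void* allocate(std::size_t size) {
        std::lock_guard<std::mutex> lock(mu_);
        // Hands back the most recently freed block that is big enough --
        // very likely the bytes some other thread released a moment ago.
        if (!free_list_.empty() && free_list_.back().second >= size) {
            void* p = free_list_.back().first;
            free_list_.pop_back();
            return p;
        }
        return carve_fresh_block(size);  // carve from the reserved device region
    }

    void deallocate(void* p, std::size_t size) {
        std::lock_guard<std::mutex> lock(mu_);
        // No stream or event is consulted: the block is reusable instantly,
        // even if a kernel launched earlier still reads or writes it.
        free_list_.emplace_back(p, size);
    }

private:
    void* carve_fresh_block(std::size_t size);  // elided

    std::mutex mu_;
    std::vector<std::pair<void*, std::size_t>> free_list_;
};

Nothing in deallocate() knows that knl_1 was launched on exec_stream_1 and may still be in flight when the block changes hands.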
So it is very likely that ptr1 overlaps ptr3, and since exec_stream_1 and exec_stream_2 can execute concurrently, this is a real race condition between knl_1 and knl_3. The simplest fix I can think of is to call cudaStreamSynchronize before calling the deallocator, but that will surely slow the system down.
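A refinement of the same idea might avoid blocking the host: at deallocation time, record a cudaEvent on the freeing stream and quarantine the block until cudaEventQuery reports the event complete, so reuse is deferred without any cudaStreamSynchronize. A minimal sketch under those assumptions (hypothetical names, error handling elided):

// Hypothetical deferred-free heap: blocks become reusable only after the
// freeing stream has executed past the free point.
#include <cuda_runtime.h>
#include <cstddef>
#include <deque>
#include <mutex>

struct PendingBlock {
    void*       ptr;
    std::size_t size;
    cudaEvent_t done;  // fires once the freeing stream passes the free point
};

class DeferredFreeHeap {
public:
    // Instead of returning the block to the free list immediately, record
    // an event on the freeing stream and quarantine the block behind it.
    void deallocate(void* p, std::size_t size, cudaStream_t stream) {
        cudaEvent_t ev;
        cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
        cudaEventRecord(ev, stream);  // ordered after all prior work on stream
        std::lock_guard<std::mutex> lock(mu_);
        quarantine_.push_back({p, size, ev});
    }

    // Called at the top of allocate(): a non-blocking poll that moves blocks
    // whose event has fired back to the real free list. Events recorded on
    // different streams may complete out of order, so a production version
    // would scan the whole list rather than stop at the first busy entry.
    void reclaim_completed() {
        std::lock_guard<std::mutex> lock(mu_);
        while (!quarantine_.empty() &&
               cudaEventQuery(quarantine_.front().done) == cudaSuccess) {
            cudaEventDestroy(quarantine_.front().done);
            release_to_free_list(quarantine_.front().ptr,
                                 quarantine_.front().size);
            quarantine_.pop_front();
        }
    }

private:
    void release_to_free_list(void* p, std::size_t size);  // elided

    std::mutex mu_;
    std::deque<PendingBlock> quarantine_;
};

And if replacing the in-house heap altogether were an option, cudaMallocAsync/cudaFreeAsync do essentially this bookkeeping inside the driver, including the cross-stream case.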
Still, I would like advice from CUDA architects: in CUDA systems, especially for processes that interact directly with the GPU hardware, do we prefer single-threaded over multi-threaded designs?