How to run nvshmemx_uint64_wait_until_on_stream concurrently?

hongyij · April 5, 2024, 3:21am

I have 2 streams, where I launch nvshmemx_uint64_wait_until_on_stream to wait for a remote signal indicating available resource, and then use nvshmemx_getmem_on_stream to read remote memory. However, in nsight system it shows that the nvshmemi_signal_wait_until_on_stream_kernel are serialized, so they block each other. Since signal may be activated in a different order than the kernel execution order, this causes huge extra latency and sometimes dead lock when there is cyclic data dependency between GPUs.

I have also done a very simple experiment using code snippet below:

    __global__ void gpu_print(int id) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
          printf("stream %d\n", id);
        }
    }
    ...

    CUDA_CHECK(cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking));
    CUDA_CHECK(cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking));

    if (mype == 0) {
        nvshmemx_signal_wait_until_on_stream(signal, NVSHMEM_CMP_EQ, 1, s1);
        gpu_print << <1, 1, 0, s1 >> > (1);
        nvshmemx_signal_wait_until_on_stream(signal + 1, NVSHMEM_CMP_EQ, 1, s2);
        gpu_print << <1, 1, 0, s2 >> > (2);
    }
    else {
        heavy_job << <max_blocks, max_threads, 0, s1 >> > (...);
        nvshmemx_signal_op_on_stream(signal+1, 1, NVSHMEM_SIGNAL_SET, 0, s1);
        heavy_job << <max_blocks, max_threads, 0, s1 >> > (...);
        nvshmemx_signal_op_on_stream(signal, 1, NVSHMEM_SIGNAL_SET, 0, s1);
    }

Running this program always print stream 1 first, then stream 2. Nsight system shows the same serialized behavior of nvshmemi_signal_wait_until_on_stream_kernel. This kernel only launches 1 thread, uses no shared memory, and only 16 registers per thread. I wonder why the kernel cannot be executed concurrently.

arnavg · April 8, 2024, 10:05pm

I wonder why the kernel cannot be executed concurrently.

As far as I know, CUDA does not guarantee concurrent scheduling of streams in every case; a lot of exception cases where serialization happens in the driver or HW can result in serialized behavior that you are observing e.g. HW resource sharing is just one of them OR 2 kernels contending over SMs OR with Windows Display Driver Model (WDDM) scheduling, the OS does the work submission, and there can be no guarantees that the two kernels will be submitted simultaneously.

Topic		Replies	Views
Kernels from different non-blocking streams are not executed concurrently CUDA Programming and Performance	1	625	March 26, 2021
Kernels executing concurrently in different streams do not behave as expected CUDA Programming and Performance	6	430	December 20, 2023
Launching several kernels on one stream while another kernel running persistently in the background CUDA Programming and Performance	1	726	October 8, 2016
Cuda launch kernel failed when there is memory dependency CUDA Programming and Performance cuda , kernel	2	251	May 31, 2023
CUDA Streams, cuStreamWaitValue32, and Deadlocks in Inter-Kernel Communication CUDA Programming and Performance cuda , kernel , synchronization , programming	7	63	September 5, 2025
Concurrent Kernel executions Concurrent Kernel executions on same CPU thread and multiple CPU threa CUDA Programming and Performance	2	4190	August 25, 2011
cudaStreamSynchronize CUDA Programming and Performance cuda	4	93	July 17, 2024
Concurrent executions of streams CUDA Programming and Performance	6	448	December 19, 2022
My streams are not running concurrently CUDA Programming and Performance	7	1827	March 6, 2018
Concurrent kernel execution CUDA Programming and Performance	2	430	March 26, 2024

How to run nvshmemx_uint64_wait_until_on_stream concurrently?

Related topics