How to run nvshmemx_uint64_wait_until_on_stream concurrently?

I have 2 streams, where I launch nvshmemx_uint64_wait_until_on_stream to wait for a remote signal indicating available resource, and then use nvshmemx_getmem_on_stream to read remote memory. However, in nsight system it shows that the nvshmemi_signal_wait_until_on_stream_kernel are serialized, so they block each other. Since signal may be activated in a different order than the kernel execution order, this causes huge extra latency and sometimes dead lock when there is cyclic data dependency between GPUs.

I have also done a very simple experiment using code snippet below:

    __global__ void gpu_print(int id) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
          printf("stream %d\n", id);
        }
    }
    ...

    CUDA_CHECK(cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking));
    CUDA_CHECK(cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking));

    if (mype == 0) {
        nvshmemx_signal_wait_until_on_stream(signal, NVSHMEM_CMP_EQ, 1, s1);
        gpu_print << <1, 1, 0, s1 >> > (1);
        nvshmemx_signal_wait_until_on_stream(signal + 1, NVSHMEM_CMP_EQ, 1, s2);
        gpu_print << <1, 1, 0, s2 >> > (2);
    }
    else {
        heavy_job << <max_blocks, max_threads, 0, s1 >> > (...);
        nvshmemx_signal_op_on_stream(signal+1, 1, NVSHMEM_SIGNAL_SET, 0, s1);
        heavy_job << <max_blocks, max_threads, 0, s1 >> > (...);
        nvshmemx_signal_op_on_stream(signal, 1, NVSHMEM_SIGNAL_SET, 0, s1);
    }

Running this program always print stream 1 first, then stream 2. Nsight system shows the same serialized behavior of nvshmemi_signal_wait_until_on_stream_kernel. This kernel only launches 1 thread, uses no shared memory, and only 16 registers per thread. I wonder why the kernel cannot be executed concurrently.

I wonder why the kernel cannot be executed concurrently.

As far as I know, CUDA does not guarantee concurrent scheduling of streams in every case; a lot of exception cases where serialization happens in the driver or HW can result in serialized behavior that you are observing e.g. HW resource sharing is just one of them OR 2 kernels contending over SMs OR with Windows Display Driver Model (WDDM) scheduling, the OS does the work submission, and there can be no guarantees that the two kernels will be submitted simultaneously.