I have two streams; on each of them I launch nvshmemx_uint64_wait_until_on_stream to wait for a remote signal indicating an available resource, followed by nvshmemx_getmem_on_stream to read remote memory. However, Nsight Systems shows that the nvshmemi_signal_wait_until_on_stream_kernel launches are serialized, so they block each other. Since the signals may be set in a different order than the wait kernels were launched, this adds large extra latency and can even deadlock when there is a cyclic data dependency between GPUs.
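The pattern on the consumer PE looks roughly like this (a simplified sketch with placeholder names sig0/sig1, recv0/recv1, remote0/remote1, nbytes, peer; not my exact code):

// Each stream first waits for its own signal, then pulls the corresponding remote buffer.
nvshmemx_uint64_wait_until_on_stream(sig0, NVSHMEM_CMP_EQ, 1, s1);
nvshmemx_getmem_on_stream(recv0, remote0, nbytes, peer, s1);
nvshmemx_uint64_wait_until_on_stream(sig1, NVSHMEM_CMP_EQ, 1, s2);
nvshmemx_getmem_on_stream(recv1, remote1, nbytes, peer, s2);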
I have also done a very simple experiment using the code snippet below:
__global__ void gpu_print(int id) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        printf("stream %d\n", id);
    }
}
...
CUDA_CHECK(cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking));
CUDA_CHECK(cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking));
if (mype == 0) {
    nvshmemx_signal_wait_until_on_stream(signal, NVSHMEM_CMP_EQ, 1, s1);
    gpu_print<<<1, 1, 0, s1>>>(1);
    nvshmemx_signal_wait_until_on_stream(signal + 1, NVSHMEM_CMP_EQ, 1, s2);
    gpu_print<<<1, 1, 0, s2>>>(2);
} else {
    heavy_job<<<max_blocks, max_threads, 0, s1>>>(...);
    nvshmemx_signal_op_on_stream(signal + 1, 1, NVSHMEM_SIGNAL_SET, 0, s1);
    heavy_job<<<max_blocks, max_threads, 0, s1>>>(...);
    nvshmemx_signal_op_on_stream(signal, 1, NVSHMEM_SIGNAL_SET, 0, s1);
}
Since the producer sets signal + 1 after the first heavy_job and signal only after the second, I would expect stream 2 to print first if the wait kernels could run concurrently. However, running this program always prints stream 1 first, then stream 2, and Nsight Systems shows the same serialized behavior of nvshmemi_signal_wait_until_on_stream_kernel. This kernel launches only 1 thread, uses no shared memory, and needs only 16 registers per thread. I wonder why these kernels cannot execute concurrently.