After reading through the documentation, it appears that NCCL point-to-point comms are blocking, but I am not sure what that means. Are they blocking with respect to the host (i.e. similar to blocking MPI calls on the host), or with respect to the CUDA stream that the send/recv calls were posted on (i.e. kernel launches on other streams can still proceed)?
The reason I am asking is that I am wondering whether it would be of any value for me to use NCCL point-to-point comms for halo exchange instead of a CUDA-aware MPI library. I am already using CUDA-aware MPI with non-blocking Isend/Irecv calls for my halo exchange. I came across the NCCL point-to-point comms during my reading, so I was wondering about their intended use case. Are they suitable for halo exchange, or only for partial collectives?