NVSHMEM supports at most 8 GPUs per node

I can’t use NVSHMEM with 16 GPUs per node; it’s restricted by cudaDeviceEnablePeerAccess. Is there a way to get NVSHMEM to work on a node with 16 GPUs?

#include <cuda_runtime_api.h>
#include <stdio.h>

int main() {
  // Device 0 is current by default; try to enable peer access from it to
  // every other GPU on the node.
  for (int i = 1; i < 16; i++) {
    auto err = cudaDeviceEnablePeerAccess(i, 0);
    if (err != cudaSuccess) {
      fprintf(
          stderr,
          "cudaDeviceEnablePeerAccess(%d, 0) failed with error code %d: %s\n",
          i, err, cudaGetErrorString(err));
    }
  }
  return 0;
}
$ ./a.out
cudaDeviceEnablePeerAccess(9, 0) failed with error code 711: peer mapping resources exhausted
cudaDeviceEnablePeerAccess(10, 0) failed with error code 711: peer mapping resources exhausted
cudaDeviceEnablePeerAccess(11, 0) failed with error code 711: peer mapping resources exhausted
cudaDeviceEnablePeerAccess(12, 0) failed with error code 711: peer mapping resources exhausted
cudaDeviceEnablePeerAccess(13, 0) failed with error code 711: peer mapping resources exhausted
cudaDeviceEnablePeerAccess(14, 0) failed with error code 711: peer mapping resources exhausted
cudaDeviceEnablePeerAccess(15, 0) failed with error code 711: peer mapping resources exhausted

this is the NVIDIA driver version:

I don’t see any NVSHMEM here, and with respect to the CUDA C++ code that you have shown, this is a known limitation of peer mappings: currently a peer “clique” is limited to about 8 GPUs. With respect to NVSHMEM, it is documented that NVSHMEM requires all GPUs to be P2P accessible. Based on that FAQ link, it looks like it may be possible to break e.g. a 2-socket, 16-GPU node into two “half-nodes” if there is IB connectivity between the two halves (and the 8 GPUs in each half are P2P capable). I don’t have a recipe for you.
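
If it helps, here is a minimal probe (illustrative only, using cudaDeviceCanAccessPeer; it says nothing about NVSHMEM itself) that prints the P2P capability matrix, so you can see which GPUs could form each 8-GPU half:

#include <cuda_runtime_api.h>
#include <stdio.h>

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  // Print which device pairs report P2P capability. Note this only reports
  // capability; the "peer mapping resources exhausted" error appears later,
  // when peer access is actually enabled for too many devices at once.
  printf("    ");
  for (int j = 0; j < ndev; j++) printf("%3d", j);
  printf("\n");
  for (int i = 0; i < ndev; i++) {
    printf("%3d ", i);
    for (int j = 0; j < ndev; j++) {
      int can = 0;
      if (i != j) cudaDeviceCanAccessPeer(&can, i, j);
      printf("%3c", i == j ? '-' : (can ? 'Y' : 'n'));
    }
    printf("\n");
  }
  return 0;
}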

Thanks for your reply.

So this restriction only applies with NVSHMEM_DISABLE_CUDA_VMM=1, right? If I use the VMM path, I don’t have to call cudaDeviceEnablePeerAccess, so there is no such restriction?
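
For reference, here is a minimal sketch of the driver-level VMM path I have in mind (plain CUDA driver API with cuMemCreate / cuMemSetAccess; this is just my own probe of the driver behavior, not how NVSHMEM allocates memory internally). Access is granted per allocation via cuMemSetAccess, so cudaDeviceEnablePeerAccess is never called:

#include <cuda.h>
#include <stdio.h>

#define CHECK(call)                                                   \
  do {                                                                \
    CUresult r_ = (call);                                             \
    if (r_ != CUDA_SUCCESS) {                                         \
      const char *msg_;                                               \
      cuGetErrorString(r_, &msg_);                                    \
      fprintf(stderr, "%s failed: %s\n", #call, msg_);                \
      return 1;                                                       \
    }                                                                 \
  } while (0)

int main() {
  CHECK(cuInit(0));
  int ndev = 0;
  CHECK(cuDeviceGetCount(&ndev));

  // Make device 0's primary context current.
  CUdevice dev0;
  CUcontext ctx;
  CHECK(cuDeviceGet(&dev0, 0));
  CHECK(cuDevicePrimaryCtxRetain(&ctx, dev0));
  CHECK(cuCtxSetCurrent(ctx));

  // Create one physical allocation on device 0 with the VMM API.
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = 0;

  size_t gran = 0;
  CHECK(cuMemGetAllocationGranularity(&gran, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM));
  size_t size = gran;

  CUmemGenericAllocationHandle handle;
  CHECK(cuMemCreate(&handle, size, &prop, 0));

  // Reserve a VA range and map the physical allocation into it.
  CUdeviceptr ptr;
  CHECK(cuMemAddressReserve(&ptr, size, gran, 0, 0));
  CHECK(cuMemMap(ptr, size, 0, handle, 0));

  // Grant read/write access to each device individually; no peer "clique"
  // via cudaDeviceEnablePeerAccess is involved on this path.
  for (int d = 0; d < ndev; d++) {
    CUmemAccessDesc desc = {};
    desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc.location.id = d;
    desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CUresult r = cuMemSetAccess(ptr, size, &desc, 1);
    printf("cuMemSetAccess for device %d: %s\n", d,
           r == CUDA_SUCCESS ? "ok" : "failed");
  }

  CHECK(cuMemUnmap(ptr, size));
  CHECK(cuMemRelease(handle));
  CHECK(cuMemAddressFree(ptr, size));
  CHECK(cuDevicePrimaryCtxRelease(dev0));
  return 0;
}

(Compile with nvcc and link against -lcuda.) Whether this path runs into some other per-device mapping limit on a 16-GPU box is exactly what I’m not sure about.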

I guess with Multi-Node NVLink we can use more than 8 GPUs.