cudaDeviceEnablePeerAccess works with CUDA C but not with CUDA Fortran

Hi,

I’m testing P2P access between GPUs with Fortran. Consider this small testcase:

program checkP2P
  use cudafor
  implicit none
  integer :: istat
  integer :: ok01, ok10


  istat = cudaDeviceCanAccessPeer(ok01, 0, 1)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))
  istat = cudaDeviceCanAccessPeer(ok10, 1, 0)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))
  print *, 'ok01: ', ok01, ', ok10: ', ok10

  istat = cudaSetDevice(0)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))
  istat = cudaDeviceEnablePeerAccess(1, 0)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))

  istat = cudaSetDevice(1)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))
  istat = cudaDeviceEnablePeerAccess(0, 0)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))

end program checkP2P

I’m compiling it using nvfortran from the Nvidia HPC SDK 24.11. The resulting program fails to enable access from GPU 1 → GPU 0, although the preceding call to cudaDeviceCanAccessPeer claims that P2P is possible. The output is as follows:

[cweiss@gpu005 cuda_fortran]$ ./check_p2p_f.x 
 cudaError: no error
 cudaError: no error
 ok01:             1 , ok10:             1
 cudaError: no error
 cudaError: no error
 cudaError: no error
 cudaError: peer access is not supported between these two devices

However, the equivalent program in C works fine:

#include <stdio.h>
#include <cuda_runtime.h>

int main (int argc, char *argv[]) {
   cudaError_t ce;
   int ok01, ok10;

   ce = cudaDeviceCanAccessPeer(&ok01, 0, 1);
   printf ("cudaError: %s\n", cudaGetErrorString(ce));
   ce = cudaDeviceCanAccessPeer(&ok10, 1, 0);
   printf ("cudaError: %s\n", cudaGetErrorString(ce));
   printf ("ok01: %d, ok10: %d\n", ok01, ok10);

   ce = cudaSetDevice(0);
   printf ("cudaError: %s\n", cudaGetErrorString(ce));
   ce = cudaDeviceEnablePeerAccess(1, 0);
   printf ("cudaError: %s\n", cudaGetErrorString(ce));

   ce = cudaSetDevice(1);
   printf ("cudaError: %s\n", cudaGetErrorString(ce));
   ce = cudaDeviceEnablePeerAccess(0, 0);
   printf ("cudaError: %s\n", cudaGetErrorString(ce));
   return 0;
}

It’s compiled with nvcc from the same HPC SDK, and its output looks fine:

[cweiss@gpu005 cuda_fortran]$ ./check_p2p_c.x 
cudaError: no error
cudaError: no error
ok01: 1, ok10: 1
cudaError: no error
cudaError: no error
cudaError: no error
cudaError: no error

I have tested all combinations of GPU indices. Not every combination fails with the Fortran version, but all of them succeed with the C version.
The test systems are a node with four A100 80 GB GPUs and a node with eight A100 40 GB GPUs; on both, the CUDA version is 12.4.
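
The sweep over all device pairs looks roughly like this (a sketch of the idea rather than the exact code I ran; the device count is queried at runtime):

program sweepP2P
  use cudafor
  implicit none
  integer :: ndev, i, j, canAccess, istat

  istat = cudaGetDeviceCount(ndev)
  do i = 0, ndev - 1
     do j = 0, ndev - 1
        if (i == j) cycle
        ! ask whether device i can access device j, then actually try to enable it
        istat = cudaDeviceCanAccessPeer(canAccess, i, j)
        istat = cudaSetDevice(i)
        istat = cudaDeviceEnablePeerAccess(j, 0)
        print '(a,i0,a,i0,a,i0,a,a)', 'pair ', i, ' -> ', j, &
           ': canAccess = ', canAccess, ', enable: ', trim(cudaGetErrorString(istat))
     end do
  end do
end program sweepP2P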

Can you reproduce this issue? Do you have any idea what might be going on? I would have expected the Fortran interfaces to call into the same backend as the C version, but apparently that is not the case.
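
In case it helps to narrow things down, one could bypass the cudafor wrappers entirely and call the C runtime entry points through iso_c_binding. A minimal sketch (the c_ Fortran names are my own; link against cudart, e.g. by compiling with nvfortran -cuda):

program enablePeerViaC
  use iso_c_binding
  implicit none

  interface
     ! C runtime: cudaError_t cudaSetDevice(int device)
     integer(c_int) function c_cudaSetDevice(device) bind(c, name='cudaSetDevice')
       import :: c_int
       integer(c_int), value :: device
     end function
     ! C runtime: cudaError_t cudaDeviceEnablePeerAccess(int peerDevice, unsigned int flags)
     integer(c_int) function c_cudaDeviceEnablePeerAccess(peerDevice, flags) &
          bind(c, name='cudaDeviceEnablePeerAccess')
       import :: c_int
       integer(c_int), value :: peerDevice, flags
     end function
  end interface

  integer(c_int) :: istat

  istat = c_cudaSetDevice(1_c_int)
  print *, 'cudaSetDevice returned: ', istat          ! 0 = cudaSuccess
  istat = c_cudaDeviceEnablePeerAccess(0_c_int, 0_c_int)
  print *, 'cudaDeviceEnablePeerAccess returned: ', istat
end program enablePeerViaC

If this behaves like the C program, the problem would be in the cudafor interface layer rather than in the runtime itself.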

Regards,
Christian

I can reproduce it, and no, I don’t have an idea yet. It’s odd, since you’re correct that these are just interfaces to the CUDA C libraries, so I would expect the same behavior.

I personally haven’t used the P2P calls in over 10 years, since they have largely been supplanted by GPUDirect and NCCL. I also prefer using CUDA-aware MPI, since direct P2P has limited use.
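
For reference, the pattern I’d reach for instead looks roughly like this (just a sketch; it assumes one rank per GPU and a CUDA-aware MPI build such as the Open MPI shipped with the HPC SDK):

program p2p_via_mpi
  use cudafor
  use mpi
  implicit none
  integer, parameter :: n = 1024*1024
  real, device, allocatable :: a_d(:)
  integer :: rank, nranks, ierr, istat

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  ! one GPU per rank
  istat = cudaSetDevice(rank)
  allocate(a_d(n))

  ! device buffers go straight into the MPI calls; a CUDA-aware MPI moves the
  ! data peer-to-peer (or via GPUDirect RDMA across nodes) without staging on the host
  if (rank == 0) then
     a_d = 1.0
     call MPI_Send(a_d, n, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(a_d, n, MPI_REAL, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  end if

  call MPI_Finalize(ierr)
end program p2p_via_mpi

Run it with one MPI rank per GPU, e.g. mpirun -np 2.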

I can investigate when I get a chance, but I want to check first how important this is. Is this just something you were testing, or is it part of a larger application?

-Mat

Take a look at this example:

On an old DGX Station (4 V100s that can do P2P), it gets the right results:
Number of CUDA-capable devices: 4

Device 0: Tesla V100-DGXS-16GB
Device 1: Tesla V100-DGXS-16GB
Device 2: Tesla V100-DGXS-16GB
Device 3: Tesla V100-DGXS-16GB

      0    1    2    3
 0    -    Y    Y    Y
 1    Y    -    Y    Y
 2    Y    Y    -    Y
 3    Y    Y    Y    -
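
The matrix above is just cudaDeviceCanAccessPeer queried for every pair of devices, roughly like this (a sketch, not the repository code verbatim):

program accessMatrix
  use cudafor
  implicit none
  integer, parameter :: maxdev = 16
  integer :: ndev, i, j, canAccess, istat
  character(len=2) :: row(maxdev)

  istat = cudaGetDeviceCount(ndev)
  print '(5x,16i3)', (j, j = 0, ndev - 1)      ! header with device indices
  do i = 0, ndev - 1
     do j = 0, ndev - 1
        if (i == j) then
           row(j+1) = ' -'
        else
           ! canAccess is 1 if device i can address device j's memory
           istat = cudaDeviceCanAccessPeer(canAccess, i, j)
           if (canAccess == 1) then
              row(j+1) = ' Y'
           else
              row(j+1) = ' N'
           end if
        end if
     end do
     print '(i3,2x,16a3)', i, row(1:ndev)
  end do
end program accessMatrix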

It’s exactly this repository where I’m observing the error. In my example above, cudaDeviceCanAccessPeer reports that peer access is possible, but the actual call to cudaDeviceEnablePeerAccess then fails with a CUDA error. The first program in that chapter where you should see the issue is the one built from directTransfer.cuf.

@MatColgrove As you can see, I was just trying out this repository. In general I mostly use CUDA-aware MPI with GPUDirect RDMA and agree that P2P has limited use for HPC applications, so this issue has very low priority for me.

Regards,
Christian

Ok, I’ll put it on my list to investigate. I’m around over the holidays, so I should be able to make some time for it.

It works for me (and we tested all the code in the book) on the DGX Station:

$ ./directTransfer
Number of CUDA-capable devices: 4

Allocation summary
Device 0: Tesla V100-DGXS-16GB
Free memory before: 16585916416, after: 16552361984, difference: 33554432

Device 1: Tesla V100-DGXS-16GB
Free memory before: 16585916416, after: 16552361984, difference: 33554432

Device 2: Tesla V100-DGXS-16GB
Free memory before: 16585916416, after: 16552361984, difference: 33554432

Device 3: Tesla V100-DGXS-16GB
Free memory before: 16585916416, after: 16552361984, difference: 33554432

Peer access available between 0 and 1

Timing on device 0

b_d=a_d transfer (GB/s): 348.5957336
cudaMemcpyPeer transfer (GB/s): 348.5957336
cudaMemcpyPeer transfer w/ P2P disabled (GB/s): 341.3333435

Timing on device 1

b_d=a_d transfer (GB/s): 348.5957336
cudaMemcpyPeer transfer (GB/s): 348.5957336
cudaMemcpyPeer transfer w/ P2P disabled (GB/s): 348.5957336