Hi,
I’m testing P2P access between GPUs with Fortran. Consider this small test case:
program checkP2P
  use cudafor
  implicit none
  integer :: istat
  integer :: ok01, ok10

  ! Query whether P2P is possible in both directions
  istat = cudaDeviceCanAccessPeer(ok01, 0, 1)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))
  istat = cudaDeviceCanAccessPeer(ok10, 1, 0)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))
  print *, 'ok01: ', ok01, ', ok10: ', ok10

  ! Enable access GPU 0 -> GPU 1
  istat = cudaSetDevice(0)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))
  istat = cudaDeviceEnablePeerAccess(1, 0)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))

  ! Enable access GPU 1 -> GPU 0
  istat = cudaSetDevice(1)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))
  istat = cudaDeviceEnablePeerAccess(0, 0)
  print *, 'cudaError: ', trim(cudaGetErrorString(istat))
end program checkP2P
I’m compiling it with nvfortran from the NVIDIA HPC SDK 24.11. The resulting program fails to enable peer access from GPU 1 → GPU 0, although the preceding call to cudaDeviceCanAccessPeer reports that P2P is possible. The output is as follows:
[cweiss@gpu005 cuda_fortran]$ ./check_p2p_f.x
cudaError: no error
cudaError: no error
ok01: 1 , ok10: 1
cudaError: no error
cudaError: no error
cudaError: no error
cudaError: peer access is not supported between these two devices
However, the equivalent program in C works fine:
#include <stdio.h>
#include <cuda_runtime.h>

int main (int argc, char *argv[]) {
    cudaError_t ce;
    int ok01, ok10;

    /* Query whether P2P is possible in both directions */
    ce = cudaDeviceCanAccessPeer(&ok01, 0, 1);
    printf ("cudaError: %s\n", cudaGetErrorString(ce));
    ce = cudaDeviceCanAccessPeer(&ok10, 1, 0);
    printf ("cudaError: %s\n", cudaGetErrorString(ce));
    printf ("ok01: %d, ok10: %d\n", ok01, ok10);

    /* Enable access GPU 0 -> GPU 1 */
    ce = cudaSetDevice(0);
    printf ("cudaError: %s\n", cudaGetErrorString(ce));
    ce = cudaDeviceEnablePeerAccess(1, 0);
    printf ("cudaError: %s\n", cudaGetErrorString(ce));

    /* Enable access GPU 1 -> GPU 0 */
    ce = cudaSetDevice(1);
    printf ("cudaError: %s\n", cudaGetErrorString(ce));
    ce = cudaDeviceEnablePeerAccess(0, 0);
    printf ("cudaError: %s\n", cudaGetErrorString(ce));
    return 0;
}
It’s compiled with nvcc from the same HPC SDK and runs without errors:
[cweiss@gpu005 cuda_fortran]$ ./check_p2p_c.x
cudaError: no error
cudaError: no error
ok01: 1, ok10: 1
cudaError: no error
cudaError: no error
cudaError: no error
cudaError: no error
I have tested all combinations of GPU indices. Not all of them fail with the Fortran version, but all succeed with the C version.
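For reference, a sweep over all ordered GPU pairs can be organized roughly as below. This is only a minimal sketch, not the exact test code: the names sweep_peer_pairs, probe_fn, probe_always_ok and probe_fail_descending are hypothetical, and the two stand-in probes exist only so the loop can be checked on a machine without GPUs. The real probe, shown in the trailing comment, assumes a .cu file built with nvcc.

```c
/* Hypothetical probe type: returns 0 if peer access src -> dst could be
   enabled, nonzero otherwise. */
typedef int (*probe_fn)(int src, int dst);

/* Visit every ordered pair (src, dst) with src != dst. Failing pairs are
   stored in fail_src/fail_dst (at most max_fail entries); the return value
   is the total number of failing pairs. */
int sweep_peer_pairs(int ndev, probe_fn probe,
                     int *fail_src, int *fail_dst, int max_fail)
{
    int nfail = 0;
    for (int src = 0; src < ndev; ++src) {
        for (int dst = 0; dst < ndev; ++dst) {
            if (src == dst) continue;          /* a device is not its own peer */
            if (probe(src, dst) != 0) {
                if (nfail < max_fail) {
                    fail_src[nfail] = src;
                    fail_dst[nfail] = dst;
                }
                ++nfail;
            }
        }
    }
    return nfail;
}

/* Stand-in probes (assumptions for illustration, no GPU needed). */
int probe_always_ok(int src, int dst) { (void)src; (void)dst; return 0; }
int probe_fail_descending(int src, int dst) { return src > dst; }

/* With the CUDA runtime available, a real probe could look like this:
 *
 *   int probe_cuda(int src, int dst) {
 *       if (cudaSetDevice(src) != cudaSuccess) return 1;
 *       return cudaDeviceEnablePeerAccess(dst, 0) != cudaSuccess;
 *   }
 *
 * and the device count would come from cudaGetDeviceCount(&ndev).
 */
```

Keeping the probe behind a function pointer separates the pair enumeration from the CUDA calls, so the same loop could drive either the C runtime calls or a wrapper around the Fortran ones.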
The test systems are one node with four A100 80 GB GPUs and another node with eight A100 40 GB GPUs. On both, the CUDA version is 12.4.
Can you reproduce this issue? Do you have any idea what might be going on? I would have expected the Fortran code to use the same backend as the C version, but apparently this is not the case.
Regards,
Christian