We have been experiencing what we suspect is a compiler issue when running jobs on multiple GPUs after upgrading to nvhpc 24.11. We are currently compiling the code with nvhpc 24.11 and have also tried nvhpc 24.9, to no avail. The code runs successfully on multiple GPUs when compiled with pgi 20.7.

I have narrowed down the point at which the program hangs and run a debug analysis that will hopefully give more insight into the issue. In the compute_pVp_ps function, when a parallel loop is created to initialize each element of the valS1 array to 1 (a rather simple operation), the code hangs when run on more than one GPU. With PGI_ACC_DEBUG=1, the following output is obtained at the point where the code hangs (newlines have been added for ease of reading):
pgi_uacc_cuda_wait(lineno=-99,async=-1,dindex=1,threadid=1)
pgi_uacc_cuda_wait(sync on stream=0x2,threadid=1)
pgi_uacc_cuda_wait done (threadid=1)
pgi_uacc_dataexitstart( file=/home/alstark/SlaterGPU/src/integrals/integrals_aux.cpp, function=_Z14add_r1_to_gridiPdddd, line=487:504, line=487, devid=0 )
pgi_uacc_dataoff(devptr=0x7ff9a7400000,hostptr=0xe25ee30,stride=1,size=1053696,extent=-1,eltsize=8,lineno=501,name=grid1[:gs*6],flags=0x200=present,async=-1,threadid=1)
pgi_uacc_dataexitdone( devid=1, threadid=1 )
pgi_uacc_dataenterstart( file=/home/alstark/SlaterGPU/src/integrals/integrals_ps.cpp, function=_Z14compute_pVp_psiPiPdRSt6vectorIS1_IdSaIdEESaIS3_EEiiiiS0_i, line=724:1158, line=860, devid=0,threadid=1 )
pgi_uacc_dataon(hostptr=0x49ebe70,stride=1,-1,size=526848x19,extent=-1x-1,eltsize=8,lineno=860,name=valS1[:iN][:gs3],flags=0x200=present,async=-1,threadid=1)
attach skipped due to non-contiguous sections (threadid=1)
1: function(begin) __pgi_uacc_event_synchronize, hostptr 0x49ebe70, hostptr(present search) 0x49ebe70, current_data_entry 0xefe0db0, htodcopying 1, wait_event (nil)
The code and location where it fails can be found here: