Unable to get code to run on multiple GPUs

We have been experiencing what we suspect is a compiler issue when running jobs on multiple GPUs after upgrading to nvhpc 24.11. We are currently compiling the code with nvhpc 24.11 and have also tried nvhpc 24.9, to no avail; the code runs successfully on multiple GPUs when compiled with pgi 20.7. I have narrowed down the point at which the program hangs and run a debug analysis which will hopefully give more insight into the issue. In the compute_pVp_ps function, when a parallel loop is created to initialize each element of the valS1 array to 1. (a rather simple operation), the code hangs when run with more than 1 GPU. With PGI_ACC_DEBUG=1, the following output is obtained where the code hangs (new lines have been added for ease of reading):

pgi_uacc_cuda_wait(lineno=-99,async=-1,dindex=1,threadid=1)

pgi_uacc_cuda_wait(sync on stream=0x2,threadid=1)

pgi_uacc_cuda_wait done (threadid=1)

pgi_uacc_dataexitstart( file=/home/alstark/SlaterGPU/src/integrals/integrals_aux.cpp, function=_Z14add_r1_to_gridiPdddd, line=487:504, line=487, devid=0 )

pgi_uacc_dataoff(devptr=0x7ff9a7400000,hostptr=0xe25ee30,stride=1,size=1053696,extent=-1,eltsize=8,lineno=501,name=grid1[:gs*6],flags=0x200=present,async=-1,threadid=1)

pgi_uacc_dataexitdone( devid=1, threadid=1 )

pgi_uacc_dataenterstart( file=/home/alstark/SlaterGPU/src/integrals/integrals_ps.cpp, function=_Z14compute_pVp_psiPiPdRSt6vectorIS1_IdSaIdEESaIS3_EEiiiiS0_i, line=724:1158, line=860, devid=0,threadid=1 )

pgi_uacc_dataon(hostptr=0x49ebe70,stride=1,-1,size=526848x19,extent=-1x-1,eltsize=8,lineno=860,name=valS1[:iN][:gs3],flags=0x200=present,async=-1,threadid=1)

attach skipped due to non-contiguous sections (threadid=1)

1: function(begin) __pgi_uacc_event_synchronize, hostptr 0x49ebe70, hostptr(present search) 0x49ebe70, current_data_entry 0xefe0db0, htodcopying 1, wait_event (nil)
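For context, the loop in question does nothing more than set valS1 to 1.; it is roughly of the following shape (a sketch rather than the exact source, with the bounds iN and gs3 taken from the valS1[:iN][:gs3] region above):

// Rough sketch of the kind of loop that hangs (not the exact SlaterGPU source):
// every element of the 2D array valS1 is simply set to 1. on the device.
void init_valS1(double** valS1, int iN, int gs3)
{
 #pragma acc parallel loop collapse(2) present(valS1[0:iN][0:gs3])
  for (int i = 0; i < iN; i++)
    for (int j = 0; j < gs3; j++)
      valS1[i][j] = 1.;
}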

The code and location where it fails can be found here:

Hi alstark,

Interesting problem and I’m not sure what’s going on, but let’s work through it. Note I did try a quick build of SlaterGPU but ran into a few issues (library dependencies). If I have time later, I’ll try to work through them and attempt to recreate the problem.

From the debug info, there are a few discrepancies. First, is the output complete? I would have expected to see the “enter data” regions for lines 795-799 in the output. If it is complete, did you define “USE_ACC” when compiling? Granted, I’m not sure how well the PGI_ACC_DEBUG info is being maintained, so it could be something on our end. We might try using “NV_ACC_NOTIFY=3” instead, which will have the runtime print each time a data or compute region is entered.

The second is the line numbers. The debug info shows compute_pVp_ps spanning lines 724 to 1158, but the source shows it going from 719 to 1107. Also, the kernel is at line 834 in the source but 860 in the debug output. Is the source shown the same as the source you used to compile? It may not be relevant, but there appear to be about 50 extra lines in the compiled source. I bring it up only because if I am able to compile the application, I may not be reproducing the same issue.

Now the hang is likely at the end of the valS1 init loop, though it is such a basic loop that it doesn’t make much sense why it would hang there.

Another possibility is that it’s the “wait” directive. Granted, if this were the case, I’d expect to see the valS2 init loop in the output, but you might try commenting it out. There’s no use of async, so it shouldn’t be needed, and if you need to sync the OpenMP threads, use a barrier. Each OpenMP thread has its own async queue id, so a “wait” will sync across all of the queues, which may be causing the issue.
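To illustrate what I mean (a generic sketch with made-up names, not your code), the shape would be something like:

#include <omp.h>
#include <openacc.h>

// Generic sketch (not the SlaterGPU code): one OpenMP thread drives one GPU
// and initializes its own chunk of val. A host-side barrier replaces the
// OpenACC "wait", which would otherwise sync every async queue of every thread.
void init_ones_multi_gpu(double* val, int n)
{
  int ngpu = acc_get_num_devices(acc_device_nvidia);
  if (ngpu < 1) ngpu = 1;

 #pragma omp parallel num_threads(ngpu)
  {
    int tid = omp_get_thread_num();
    acc_set_device_num(tid, acc_device_nvidia);

    int chunk = (n + ngpu - 1) / ngpu;
    int lo    = tid*chunk;
    int hi    = (lo + chunk < n) ? lo + chunk : n;

    if (lo < hi)
    {
     #pragma acc parallel loop copyout(val[lo:hi-lo])
      for (int i = lo; i < hi; i++)
        val[i] = 1.;
    }

    // the parallel loop above is synchronous, so no "#pragma acc wait" is
    // needed; a barrier is enough to line the OpenMP threads back up before
    // whatever work follows in the parallel region
   #pragma omp barrier
  }
}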

-Mat

First, if the library dependency issue is with libcint: if you look in the top-level CMakeLists.txt you should be able to set DO_GTO to False, which should not affect the calculation we are running. After this, following the steps on the GitHub should work as long as cmake and nvhpc are present:
within SlaterGPU:
mkdir build
cd build
source ../env.set.local0
cmake ..
make

next navigate to geom_ps in examples:
cd examples/geom_ps

remove the output files generated by a previous 1-GPU run:
rm A_ref Ciap_ref hf.out pVp_ref SENT_ref

vi ../../../env.set.local0
add to this file the lines:
export CUDA_VISIBLE_DEVICES=0,1
export OMP_NUM_THREADS=2

you should now be ready to run a job:
../sgpu.exe

The output is not complete; I will link the complete PGI_ACC_DEBUG output here. Within the output, though, I did not see any locations where it specifically points to the lines of the enter data create statements, even for other functions.
pgi_debug.txt (40.3 MB)

USE_ACC is specified as True when compiling.

I will link the output obtained from NV_ACC_NOTIFY=3 here.
nv_notify3.txt (2.0 MB)

The source was not the same as the one I linked; I accidentally posted the debug output from a build I was modifying in an attempt to fix the issue. The output linked in this post is from a clean build. It is also worth noting that, when switching to the clean build and running with PGI_ACC_DEBUG=1, the span of lines reported for the compute_pVp_ps function is 810:810.
I have attempted to use omp barriers to sync the threads before and after the parallel loop which initializes valS1 to 1., but the code still hung in the same spot.
I should also mention, for your reference, that a similar setup is used in the function compute_STEn_ps, which is directly above compute_pVp_ps (the STEn function works for some reason, though).

-Alex

Thanks Alex.

Any ideas on this compilation issue with “write.cpp”? I’ve tried it with both nvc++ and g++, but get the same error.

% g++ -Dio_EXPORTS -I/local/home/mcolgrove/SlaterGPU/src/integrals -I/local/home/mcolgrove/SlaterGPU/include -I/local/home/mcolgrove/SlaterGPU/src/libio -O2 --std=c++23 -fPIC -O2 -g -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 -DUSE_ACC=1 /local/home/mcolgrove/SlaterGPU/src/libio/write.cpp -c
In file included from /local/home/mcolgrove/SlaterGPU/src/libio/write.h:15,
                 from /local/home/mcolgrove/SlaterGPU/src/libio/write.cpp:1:
/local/home/mcolgrove/SlaterGPU/src/libio/write.cpp: In function ‘void write_iarray(short int, short int, short int, int, int, float*)’:
/local/home/mcolgrove/SlaterGPU/src/libio/read.h:21:24: error: invalid ‘static_cast’ from type ‘std::__cxx11::basic_ostringstream<char>’ to type ‘std::ostringstream&’ {aka ‘std::__cxx11::basic_ostringstream<char>&’}
   21 | #define SSTRF( x ) ( ( static_cast<std::ostringstream&>(std::ostringstream() << std::fixed << std::setprecision(8) << std::scientific << (x)) ).str() )
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
% nvc++ -I/local/home/mcolgrove/SlaterGPU/src/integrals -I/local/home/mcolgrove/SlaterGPU/include -I/local/home/mcolgrove/SlaterGPU/src/libio -O2 --std=c++17 -fPIC -O2 -g /local/home/mcolgrove/SlaterGPU/src/libio/write.cpp -c
"/local/home/mcolgrove/SlaterGPU/src/libio/write.cpp", line 60: error: expression must be an lvalue
        outfile << SSTRF(A[i*s2+j]) + " ";

Using a post-processed file to see the macro expansion:

nvc++ -Dio_EXPORTS -I/local/home/mcolgrove/SlaterGPU/src/integrals -I/local/home/mcolgrove/SlaterGPU/include -I/local/home/mcolgrove/SlaterGPU/src/libio -mp -O2 --c++14 --gnu_extensions -fPIC -mp -O2 -g -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 -DUSE_ACC=1 -acc=gpu -c write.i -w --no_preincludes 
"write.i", line 75950: error: expression must be an lvalue
        outfile << ( ( static_cast<std::ostringstream& >(std::ostringstream() << std::fixed << std::setprecision(8) << std::scientific << (A[i*s2+j])) ). str() ) + " ";
                                                         ^

I believe if you remove the static_cast from SSTRF in the read.h file, that should fix the issue. We became aware of this problem when compiling on other machines and have not pushed the fix yet. If that does not work, I will download it locally and try to work the error out.
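For example, replacing the macro with a small function of the same name should behave the same at the call sites and avoids the cast (this is just a sketch, not the fix we have staged):

#include <iomanip>
#include <sstream>
#include <string>

// Illustrative replacement for the SSTRF macro: a named ostringstream is an
// lvalue, so no static_cast from a temporary stream is needed.
template <typename T>
inline std::string SSTRF(const T& x)
{
  std::ostringstream os;
  os << std::fixed << std::setprecision(8) << std::scientific << x;
  return os.str();
}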

-Alex

Thanks Alex, I was able to get it to compile and recreated the hang.

I’ll need engineering to take a look, so I’ll write up a report and send it to them, but I’m out of time for today.

What it looks like to me is that there’s some type of issue when trying to sync the default CUDA stream across multiple OpenMP threads.

I did find I could work around the issue by giving each OpenMP thread its own CUDA stream, using this pattern:

#if USE_ACC
   int tid = omp_get_thread_num();
 #pragma acc parallel loop independent present(grid[0:6*gs],val[0:3*gs]) async(tid+1)
#endif
  for (int i=0;i<gs;i++)
  {
...
  }
#if USE_ACC
 #pragma acc wait(tid+1)
#endif

I use “tid+1” since async queue 0 is the default stream.
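As a self-contained illustration (the function name and loop body here are made up), the whole pattern looks like:

#include <omp.h>

// Illustration of the workaround: each OpenMP thread launches its kernel on
// its own async queue (tid+1) and waits only on that queue, instead of a bare
// "wait" that syncs the default stream across all threads.
void scale_vals(double* grid, double* val, int gs)
{
#if USE_ACC
  int tid = omp_get_thread_num();
 #pragma acc parallel loop independent present(grid[0:6*gs],val[0:3*gs]) async(tid+1)
#endif
  for (int i = 0; i < gs; i++)
  {
    // illustrative body only: scale each value by a grid entry
    val[3*i] *= grid[6*i];
  }
#if USE_ACC
 #pragma acc wait(tid+1)
#endif
}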

I did this to all the routines in pVpd.cpp and more in integrals_ps.cpp, and after each one the program progressed a bit further. I didn’t make it all the way through, so I don’t know if it fully works around the issue, but I will continue later.

Thank you, this seems to be working for me as well. I am also applying these changes past pVpd.cpp and will keep this thread informed when I am done applying them to all the places necessary throughout the code.

Hey, so we have gotten SlaterGPU to a point where it will finish using multiple GPUs, thank you for the assistance. A related problem has arisen, though, which we could use some insight on. SlaterGPU is a library meant to be used by another library (FancyElectrons) for integral calculation. When we compile and run a job which uses SlaterGPU on multiple GPUs, functions from the FancyElectrons library hang, even though FancyElectrons is only supposed to use one of the available GPUs. We would like to keep it so that FancyElectrons only uses one of the GPUs.
Here is a snippet of PGI_ACC_DEBUG from where the code hangs in FancyElectrons:

pgi_uacc_dataexitstart( file=/home/alstark/FancyElectrons_jan25/src/libhf/cuda_util.cpp, function=_Z13mat_times_matPdS_S_iii, line=337:357, line=340, devid=0 )

pgi_uacc_dataoff(devptr=0x7fc62376c400,hostptr=0xa0b71160,stride=1,size=625,extent=-1,eltsize=8,lineno=354,name=C[:MN],flags=0x200=present,async=-1,threadid=1)

pgi_uacc_dataoff(devptr=0x7fc623770000,hostptr=0xa288e400,stride=1,size=625,extent=-1,eltsize=8,lineno=354,name=B[:NK],flags=0x200=present,async=-1,threadid=1)

pgi_uacc_dataoff(devptr=0x7fc62376ec00,hostptr=0xa0b73880,stride=1,size=625,extent=-1,eltsize=8,lineno=354,name=A[:MK],flags=0x200=present,async=-1,threadid=1)

pgi_uacc_dataexitdone( devid=1, threadid=1 )

pgi_uacc_dataenterstart( file=/home/alstark/FancyElectrons_jan25/src/libhf/cuda_util.cpp, function=_Z16mat_times_mat_atPdS_S_iii, line=368:388, line=371, devid=0,threadid=1 )

pgi_uacc_dataon(hostptr=0xa288e400,stride=1,size=625,extent=-1,eltsize=8,lineno=371,name=A[:MK],flags=0x200=present,async=-1,threadid=1)

1: function(begin) __pgi_uacc_event_synchronize, hostptr 0xa288e400, hostptr(present search) 0xa288e400, current_data_entry 0xa1923da0, htodcopying 0, wait_event 0xa2b3a0a0

pgi_uacc_dataon(hostptr=0xa0b71160,stride=1,size=625,extent=-1,eltsize=8,lineno=371,name=B[:NK],flags=0x200=present,async=-1,threadid=1)

1: function(begin) __pgi_uacc_event_synchronize, hostptr 0xa0b71160, hostptr(present search) 0xa0b71160, current_data_entry 0xa0b53ed0, htodcopying 0, wait_event 0xa0b6ef30

pgi_uacc_dataon(hostptr=0xa1b39b60,stride=1,size=625,extent=-1,eltsize=8,lineno=371,name=C[:MN],flags=0x200=present,async=-1,threadid=1)

1: function(begin) __pgi_uacc_event_synchronize, hostptr 0xa1b39b60, hostptr(present search) 0xa1b39b60, current_data_entry 0xa0905070, htodcopying 1, wait_event (nil)

and here is a snippet of where this comes from in the code:

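// computes C(i,j) = sum_k A(k,i)*B(k,j), i.e. C = A^T * B, with all matrices stored row-major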
void mat_times_mat_at(double* C, double* A, double* B, int M, int N, int K)
{
  int MN = M*N; 
  int MK = M*K; 
  int NK = N*K; 

 #pragma acc parallel loop present(A[0:MK],B[0:NK],C[0:MN])
  for (int i=0;i<M;i++)
  {
   #pragma acc loop
    for (int j=0;j<N;j++)
    {    
      double val = 0.;
     #pragma acc loop reduction(+:val)
      for (int k=0;k<K;k++)
        val += A[k*M+i]*B[k*N+j];
      C[i*N+j] = val; 
    }    
  }

  return;
}

-Alex

Something else to mention: when SlaterGPU is run on its own and its output is printed to text files, and FancyElectrons subsequently reads this data in while running on one GPU, it works. But if we compile FancyElectrons to read the data in directly within a single execution, it fails.