I am trying to port an OpenMP loop to OpenACC to be able to run on a GPU. I have used OpenMP a fair amount, but I am new to OpenACC.
I am doing this testing on a g4dn.xlarge AWS instance with the following compiler version:
nvfortran 22.9-0 64-bit target on x86-64 Linux -tp skylake-avx512
I have converted an omp parallel section to an acc parallel loop, with some
The code compiles okay, but at runtime it exits with a non-zero status and no error message. The output leading up to the exit is:
...
libcupti.so not found
upload CUDA data file=forwrd.F90 function=solxy_p0 line=3096 device=0 threadid=1 variable=_dep_chain_dat_21 bytes=54876
upload CUDA data file=forwrd.F90 function=solxy_p0 line=3096 device=0 threadid=1 variable=_dvode_mod_21 bytes=8
...
launch CUDA kernel file=forwrd.F90 function=solxy_p0 line=3096 device=0 threadid=1 num_gangs=32 num_workers=1 vector_length=128 grid=32 block=128 shared memory=1080

Accelerator Kernel Timing data
forwrd.F90
  solxy_p0 NVIDIA devicenum=0
    time(us): 3,954
    3096: compute region reached 1 time
        3096: kernel launched 1 time
            grid:  block:  device time(us): total=0 max=0 min=0 avg=0
    3096: data region reached 2 times
        3096: data copyin transfers: 30
            device time(us): total=3,898 max=1,614 min=4 avg=129
    3096: upload reached 5 times
        3096: data copyin transfers: 5
            device time(us): total=56 max=33 min=4 avg=11
Command exited with non-zero status 1
The code runs okay when I compile without -acc, and it also runs okay with -acc=multicore. Only when I switch to -acc do I get this error.
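For reference, the three builds I am comparing can be sketched roughly as below. This is only an illustration: the real build involves many more files and options, the build_cmd helper and the mode names are made up for this post, and -Minfo=accel is added here just to get the compiler's report on how the loop at line 3096 was parallelized. The compiler is only invoked if nvfortran and the source file are actually present.

```shell
#!/bin/sh
# Sketch only: forwrd.F90 is the one file named in the output above;
# the helper name and the mode names are hypothetical.
SRC="forwrd.F90"

# Print the nvfortran command line for one build mode.
build_cmd() {
    case "$1" in
        host)      echo "nvfortran -c $SRC" ;;                            # no OpenACC: runs okay
        multicore) echo "nvfortran -acc=multicore -Minfo=accel -c $SRC" ;; # runs okay
        gpu)       echo "nvfortran -acc -Minfo=accel -c $SRC" ;;           # fails at runtime
        *)         echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

for mode in host multicore gpu; do
    cmd=$(build_cmd "$mode")
    echo "$mode: $cmd"
    # Only run the compiler where it is installed and the source exists.
    if command -v nvfortran >/dev/null 2>&1 && [ -f "$SRC" ]; then
        $cmd
    fi
done
```

Comparing the -Minfo=accel messages between the multicore and GPU builds shows which variables the compiler decided to copy to the device for that loop.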
I have tried the following environment variables:
export NVCOMPILER_ACC_NOTIFY=3
export NVCOMPILER_TERM=trace
I also ran this in gdb, but without any additional clue as to what is wrong.
Currently this is inside a large proprietary code, so I can’t share a small example that would reproduce this error. I’m hoping that someone can suggest some debugging steps that I can do myself.
Any advice would be greatly appreciated!