Hello,
I am trying to port an OpenMP loop to OpenACC to be able to run on a GPU. I have used OpenMP a fair amount, but I am new to OpenACC.
I am doing this testing on a g4dn.xlarge AWS instance with the following compiler version:
nvfortran 22.9-0 64-bit target on x86-64 Linux -tp skylake-avx512
I have converted an omp parallel section to acc parallel loop, with some private and reduction clauses.
The code compiles okay, but I get a runtime error with no error message. The preceding message is:
...
libcupti.so not found
upload CUDA data file=forwrd.F90 function=solxy_p0 line=3096 device=0 threadid=1 variable=_dep_chain_dat_21 bytes=54876
upload CUDA data file=forwrd.F90 function=solxy_p0 line=3096 device=0 threadid=1 variable=_dvode_mod_21 bytes=8
...
launch CUDA kernel file=forwrd.F90 function=solxy_p0 line=3096 device=0 threadid=1 num_gangs=32 num_workers=1 vector_length=128 grid=32 block=128 shared memory=1080
Accelerator Kernel Timing data
forwrd.F90
solxy_p0 NVIDIA devicenum=0
time(us): 3,954
3096: compute region reached 1 time
3096: kernel launched 1 time
grid: [32] block: [128]
device time(us): total=0 max=0 min=0 avg=0
3096: data region reached 2 times
3096: data copyin transfers: 30
device time(us): total=3,898 max=1,614 min=4 avg=129
3096: upload reached 5 times
3096: data copyin transfers: 5
device time(us): total=56 max=33 min=4 avg=11
Command exited with non-zero status 1
The code runs okay when I compile without -acc, and it also runs okay with -acc=multicore. When I switch to -acc, then I get this error.
I have tried the following environment variables. I also ran this in gdb, but without any additional clue as to what is wrong.
export NVCOMPILER_ACC_NOTIFY=3
export NVCOMPILER_TERM=trace
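For reference, here is a minimal sketch of how these settings might be combined with an explicit exit-status check. The helper name `run_and_report` and the executable name `./my_app` are placeholders for illustration, not names from the real code. As far as I understand, NVCOMPILER_ACC_NOTIFY is a bitmask, so 3 should mean kernel launches (1) plus data transfers (2), which matches the "launch CUDA kernel" and "upload CUDA data" lines in the output above.

```shell
# Debug settings from this post.
# NOTIFY=3 is assumed to be the bitmask 1 (kernel launches) + 2 (data transfers).
export NVCOMPILER_ACC_NOTIFY=3
export NVCOMPILER_TERM=trace

# Hypothetical helper: run a command, print its exit status, and pass it on.
run_and_report() {
    "$@"
    status=$?
    echo "exit status: $status"
    return "$status"
}

# Hypothetical usage; ./my_app stands in for the real executable.
# run_and_report ./my_app
```

Printing the status right after the run makes it easier to see whether the process exits with status 1 on its own or is killed by a signal, in which case the shell would report a status above 128.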
Currently this is inside a large proprietary code, so I can’t share a small example that would reproduce this error. I’m hoping that someone can suggest some debugging steps that I can do myself.
Any advice would be greatly appreciated!