How to debug an OpenACC Fortran program that is aborting with no error message?

joshua.hykes · November 16, 2022, 4:49pm

Hello,

I am trying to port an OpenMP loop to OpenACC to be able to run on a GPU. I have used OpenMP a fair amount, but I am new to OpenACC.

I am doing this testing on a g4dn.xlarge AWS instance with the following compiler version:

nvfortran 22.9-0 64-bit target on x86-64 Linux -tp skylake-avx512

I have converted an omp parallel section to acc parallel loop, with some private and reduction clauses.

The code compiles okay, but at runtime it aborts with no error message. The output leading up to the abort is:

...
libcupti.so not found
upload CUDA data  file=forwrd.F90 function=solxy_p0 line=3096 device=0 threadid=1 variable=_dep_chain_dat_21 bytes=54876
upload CUDA data  file=forwrd.F90 function=solxy_p0 line=3096 device=0 threadid=1 variable=_dvode_mod_21 bytes=8
...
launch CUDA kernel  file=forwrd.F90 function=solxy_p0 line=3096 device=0 threadid=1 num_gangs=32 num_workers=1 vector_length=128 grid=32 block=128 shared memory=1080

Accelerator Kernel Timing data
forwrd.F90
  solxy_p0  NVIDIA  devicenum=0
    time(us): 3,954
    3096: compute region reached 1 time
        3096: kernel launched 1 time
            grid: [32]  block: [128]
             device time(us): total=0 max=0 min=0 avg=0
    3096: data region reached 2 times
        3096: data copyin transfers: 30
             device time(us): total=3,898 max=1,614 min=4 avg=129
    3096: upload reached 5 times
        3096: data copyin transfers: 5
             device time(us): total=56 max=33 min=4 avg=11
Command exited with non-zero status 1

The code runs okay when I compile without -acc, and it also runs okay with -acc=multicore. When I switch to -acc, then I get this error.

I have tried the following environment variables. I also ran the program in gdb, but that gave no additional clue as to what is wrong.

export NVCOMPILER_ACC_NOTIFY=3
export NVCOMPILER_TERM=trace

Currently this is inside a large proprietary code, so I can’t share a small example that would reproduce this error. I’m hoping that someone can suggest some debugging steps that I can do myself.

Any advice would be greatly appreciated!

MatColgrove · November 16, 2022, 7:52pm

Difficult to say what’s going on. If it were a problem with the GPU code, I’d expect an error such as an illegal address error to show up.

Though, you can run your app through cuda-memcheck to see if anything shows up, or use ‘cuda-gdb’ instead of gdb so you can debug on the device side.
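As a rough sketch of that suggestion, the script below runs an executable under whichever CUDA memory checker happens to be on the PATH. The executable name `./my_app` and the helper names `pick_checker` and `run_checked` are placeholders, not anything from this thread. It also assumes that `compute-sanitizer`, the successor to `cuda-memcheck` in newer CUDA toolkits, may be installed in place of `cuda-memcheck`.

```shell
#!/bin/sh
# Sketch: run an executable under whichever CUDA memory checker is installed.

# Print the name of the first available checker; fail if neither is found.
pick_checker() {
    for tool in compute-sanitizer cuda-memcheck; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# Run the given executable under the checker and report its exit status.
run_checked() {
    app=$1
    if [ ! -x "$app" ]; then
        echo "skipping: no executable at $app"
        return 0
    fi
    if ! checker=$(pick_checker); then
        echo "skipping: no CUDA memory checker on PATH"
        return 0
    fi
    "$checker" "$app"
    echo "$checker exit status: $?"
}

# ./my_app is a placeholder name (an assumption, not from the thread).
run_checked "${APP:-./my_app}"

# For an interactive session on the device side instead:
#   cuda-gdb --args ./my_app
#   (cuda-gdb) run
#   (cuda-gdb) bt
```

Building with `-g` should help either tool map a faulting address back to a source line.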

I’m not sure if the status 1 error is coming back from a shell script or your program. If it’s from your program, could it be hitting a STOP statement?

joshua.hykes · November 16, 2022, 9:53pm

Thank you Mat for the pointers! When running with cuda-gdb, I get the following error message:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0xc987500 (forwrd.F90:3240)

which helpfully points me to a specific line so that I can continue debugging.

Regarding the status 1 message, you are correct that this was coming from a wrapper script, not the Fortran executable.
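For anyone hitting the same confusion, here is a minimal sketch of how a wrapper can hide the program's real exit status. The function names are hypothetical, and `fake_app` is only a stand-in that returns status 3; it is not the actual executable or wrapper script from this thread.

```shell
#!/bin/sh
# Stand-in for the real executable: always exits with status 3.
fake_app() {
    return 3
}

# A wrapper that discards the program's status and reports 1 on any failure.
lossy_wrapper() {
    fake_app || return 1
}

# A wrapper that records and passes through the program's own status.
faithful_wrapper() {
    status=0
    fake_app || status=$?
    echo "program exit status: $status"
    return "$status"
}

rc=0
lossy_wrapper || rc=$?
echo "lossy wrapper reported: $rc"

rc=0
faithful_wrapper || rc=$?
echo "faithful wrapper reported: $rc"
```

Running the executable directly and checking `$?` right afterwards is the quickest way to tell which layer produced the status.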
