Viewing PTX source when compiling DO CONCURRENT, OpenACC, and/or OpenMP

Hi everyone,
I’m currently interested in profiling an application that exercises the DO CONCURRENT base language support for Fortran, as well as some OpenMP and OpenACC. I would like to be able to view the PTX source code in Nsight Compute, like I can when using nvcc and looking at C code. This is currently the command I’m using:
nvfortran -O3 -march=native -acc=gpu -mp=gpu -stdpar=gpu -gpu=ccnative,managed,unified,keep,lineinfo,ptxinfo,debug -Minfo=accel ../advection.f90 -g -Xptxas=-v -Mdwarf3
However, after running ncu and collecting a report to open in Nsight Compute, the only source views available to me are the original Fortran and SASS. I do not get an option for viewing PTX, despite the keep option producing a .ptx file. Is it not possible to get that PTX view with nvfortran? To clarify: I'm not trying to debug or step through PTX code; I just want to associate the collected ncu statistics with the PTX so I can see which instructions are causing which events.

Thanks in advance,
Ivan


Hi Ivan,

I believe it’s due to RDC which creates device binaries without embedded PTX. Try compiling with “-gpu=nordc” to see if that gets you the PTX info.

nvfortran enables RDC by default since it allows linking of cross-file device subroutine calls as well as access to module variables used in device subroutines. If you need these features, then you might not be able to compile with nordc.
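As a minimal sketch (hypothetical file and routine names) of the two features above that depend on RDC, consider a device-callable routine and a module variable defined in one file but used from device code compiled in another:

```fortran
! --- utils.f90 ---
module utils
  real :: scale = 2.0          ! module variable referenced inside device code
contains
  subroutine scale_val(x)
    !$acc routine seq          ! device-callable routine
    real, intent(inout) :: x
    x = x * scale
  end subroutine
end module

! --- main.f90 (separate compilation unit) ---
program main
  use utils
  real :: a(1024)
  integer :: i
  a = 1.0
  !$acc parallel loop
  do i = 1, 1024
    call scale_val(a(i))       ! cross-file device call: needs the device-link step RDC provides
  end do
end program
```

With -gpu=nordc, the device-link step that resolves the call to scale_val and the reference to scale across files is disabled, so a build like this would fail to link.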

-Mat

Adding -gpu=nordc resulted in a cuModuleLoad error:

bash-5.1$ /opt/nvidia/hpc_sdk/Linux_aarch64/24.3/compilers/bin/nvfortran -O3 -march=native -acc=gpu -mp=gpu -stdpar=gpu -gpu=ccnative,managed,unified,keep,lineinfo,ptxinfo,debug,nordc -Minfo=accel advection_min.f90 
nvfortran-Warning-Malformed $expr(), nonnumeric value qemu
...nvfortran-Warning-Malformed $expr(), extra text: :...
advection_operator_weno3:
     33, Generating NVIDIA GPU code
         33,   ! blockidx%x threadidx%x auto-collapsed
             Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x
     33, Generating implicit copyout(ln(2:ntm-1,:npm-1,:nr)) [if not already present]
         Generating implicit copyin(vp(2:ntm-1,1:npm-1,1:nr)) [if not already present]
...ptxas info    : 16 bytes gmem
ptxas info    : Compiling entry function 'advection_operator_weno3_33_gpu' for 'sm_90'
ptxas info    : Function properties for advection_operator_weno3_33_gpu
    56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 66 registers, 56 bytes cumulative stack size
...
bash-5.1$ ncu --target-processes all --set full --call-stack --import-source yes --replay-mode application -f -o full-report.advection_min-nv-nordc.ncu-rep ./a.out
==PROF== Connected to process 717853 (/path/a.out)
Failing in Thread:1
Accelerator Fatal Error: call to cuModuleLoad returned error 301 (CUDA_ERROR_FILE_NOT_FOUND): File not found
 File: /path/advection_min.f90
 Function: advection_operator_weno3:21
 Line: 33

==PROF== Disconnected from process 717853
==ERROR== The application returned an error code (1).
==WARNING== No kernels were profiled.

Did you link with nordc as well?

The error means that the device binary couldn't be found, though I don't know if that's because it wasn't linked with nordc, so it's expecting a binary file, or there was a problem with PTX JIT compilation.

The error message is a little misleading, since it refers to the .f90 file. I scrubbed the full path but /path/advection_min.f90 definitely exists.
By linking with nordc, do you mean a flag passed to the linker by the driver? i.e., is nordc a link option or an actual library to link against?

The error message is a little misleading, since it refers to the .f90 file.

The error is occurring at this line in the source since this is likely the first OpenMP offload region, where the device is being initialized.

Is nordc a link option or an actual library to link against?

It’s an option to the compiler driver. RDC requires an additional device-link step, which gets disabled when “nordc” is set on the link line. It also changes the generated initialization code so device binaries aren’t registered.
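For a project with separate compile and link steps, that means something like the following sketch (hypothetical file names; the flags are the ones from this thread), with “nordc” appearing on both lines:

```shell
# Compile step: generate device code without relocatable device code
nvfortran -c -acc=gpu -gpu=ccnative,nordc kernels.f90

# Link step: "nordc" here disables the extra device-link step
# and the registration of device binaries at initialization
nvfortran -acc=gpu -gpu=ccnative,nordc kernels.o -o app
```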

I’ve tried this compile line:
nvfortran -O3 -march=native -acc=gpu -mp=gpu -stdpar=gpu -gpu=ccnative,managed,unified,keep,lineinfo,ptxinfo,debug,nordc -Minfo=accel ../advection_min.f90 -v -Wnvlink,--nordc
I looked at nvfortran --help to see which tools can receive forwarded arguments, tried it with everything I saw, and removed arguments until it compiled successfully. Of the following options, only nvlink accepts nordc (tried with one - and two --).

-Wl,-nordc
-Wnvlink,-nordc 
-Wfatbinary,-nordc 
-Wnvvm,-nordc 
-Wptxas,-nordc 

Am I missing something obvious?

Apologies. I just noticed that this is a single file that you’re compiling and linking on the same command line. I assumed this was one compile line from a larger project with a separate link step later.

Adding just “-gpu=nordc” should be sufficient; no need to add all the “-W” options. If you add the verbose flag, “-v”, you’ll see the different compilation and linking phases. The last one is the “acclnk” driver, which does the device linking. There you’ll see the “-nordc” flag being passed, which should be sufficient to disable RDC.
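For example, a sketch of checking this (using the compile line from earlier in the thread):

```shell
# Rebuild with -v and look at the final phase printed by the driver;
# the last invocation should be the acclnk step, and with -gpu=nordc
# it should show a -nordc flag being passed.
nvfortran -v -O3 -acc=gpu -mp=gpu -stdpar=gpu \
  -gpu=ccnative,keep,lineinfo,nordc -Minfo=accel advection_min.f90 \
  2>&1 | grep -i acclnk
```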

So the question is back to why it’s not finding the JIT-compiled PTX, which I’m not sure about. If you can get me a reproducing example, I can investigate.