Device Debugging with Allinea DDT

I became excited when I read about DDT supporting device debugging starting with PGI 14.1 [1]. When I run a CUDA Fortran program, DDT correctly recognizes a multithreaded environment, but any attempt to step fails with “Allinea DDT could not step: Cannot find bounds of current function”. Do I have to compile differently? I couldn’t find any documentation on this feature.

PGI version 14.2
DDT version 4.2-PR-36863

My compiler flags:

-g -Mcuda=cc20 -ta=nvidia,cc20,keepgpu,keepbin,time -Minfo=accel,inline,ipa -Mneginfo -Minform=inform -I/usr/local/include -r8

[1] http://www.allinea.com/events/201403/nvidia-webinar-debugging-cuda-fortran-using-allinea-ddt-january-30-2014

Hello,

I need a bit more context here. Could you please answer the following questions:
Are you compiling CUDA Fortran code or OpenACC code? The options you provided apply to both.
Can you share the code, or send me the generated .gpu file?
By the way, stepping from host code into kernel code may not work. Did you try setting a breakpoint in the CUDA Fortran kernel and running to it?
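
To illustrate the breakpoint suggestion, here is a minimal sketch of a CUDA Fortran test case. It is not the poster's code: the module, kernel, and variable names are made up for illustration, and it assumes a build along the lines of `pgf90 -g -Mcuda` as in the flags quoted above. The comment marks a line inside the kernel where a breakpoint could be set before starting the run, instead of stepping in from the host call.

```fortran
! Hypothetical test case; all names here are illustrative only.
module demo_kernels
  use cudafor
  implicit none
contains
  ! Device kernel: each thread scales one array element.
  attributes(global) subroutine scale_kernel(a, n, factor)
    integer, value :: n
    real, value :: factor
    real :: a(n)              ! dummy arrays in a global subroutine live on the device
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
      a(i) = a(i) * factor    ! candidate breakpoint line inside device code
    end if
  end subroutine scale_kernel
end module demo_kernels

program demo
  use cudafor
  use demo_kernels
  implicit none
  integer, parameter :: n = 1024
  real :: a(n)
  real, device, allocatable :: a_d(:)

  a = 1.0
  allocate(a_d(n))
  a_d = a                     ! host-to-device copy
  call scale_kernel<<<(n + 255) / 256, 256>>>(a_d, n, 2.0)
  a = a_d                     ! device-to-host copy
  print *, 'max error:', maxval(abs(a - 2.0))
  deallocate(a_d)
end program demo
```

A small case like this can help separate a build or debug-info problem from something specific to the full application.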

Hi Sebastien

Thanks a lot for your time. I’m a bit confused by the question “Are you compiling CUDA Fortran code or OpenACC code?” Does the -Mcuda option really apply to OpenACC as well? If so, that would be new to me. Anyway, it’s CUDA Fortran (as mentioned in my first post).

I think I may still misunderstand how device debugging is supposed to work. As I wrote, I can’t find any documentation on this; would you have any pointers? Do I have to load the GPU file into the debugger? If so, how do you initialize the kernels correctly without loading the host code as well? I’d gladly send you the source code and/or binaries, I’d just like to know a bit more about the process before causing you too much work.

FYI, “-ta” enables OpenACC, hence Seb’s question.

Also, Seb’s presentation at GTC2014 might be helpful: http://on-demand.gputechconf.com/gtc/2014/video/S4284-debugging-pgi-cuda-fortran-openacc-gpus-allinea-ddt.mp4

- Mat

I know it worked at some point, but right now I can’t get DDT to work with PGI OpenACC. When I try to step into device code, it does show threadidx / blockidx / blockdim under “locals”, but the editor window stays on the kernel call in host code, so I can’t see which code is actually executing on the device. When I switch to the device threads, DDT usually shows them inside cuda_wait or cuda_launch (C files that probably come from PGI’s wrapper around CUDA). Is there something I’m doing wrong when compiling?

Compiler command
pgf90 -g -acc -DGPU -I /home/michel/asuca/hybrid/Nusdas13/src -I //home/michel/lib/netcdf3/include -Mcuda=6.5,cc3x -ta=tesla:loadcache:L1,cc3x -Minline=levels:5,reshape -Mipa=inline,reshape -Minfo=accel,inline,ipa -Mneginfo -Minform=inform -byteswapio -Mmpi=mpich -DGPU -c my_fortran_code.f90 -o my_fortran_code.o

Linker command
pgf90 -g -acc -Mcuda=6.5,cc3x -ta=tesla:loadcache:L1,cc3x -Mipa=inline,reshape -Minfo=accel,inline -Mneginfo -byteswapio -Mmpi=mpich -o ideal -L/home/michel/lib/nusdas/lib -L/home/michel/lib/netcdf3/lib -L/home/michel/asuca/hybrid/asuca-kij/build/gpu/Framework/…/HybridSources parameter_control.o -lasuca -lnusdas -lnwp -lnetcdf

Versions
ddt: 5.0.1
pgf90 15.7-0 64-bit target on x86-64 Linux -tp haswell

Hi Michel,

I sent a note to Seb to see if he can help but he’s on vacation this week.

I don’t know enough about DDT to answer this. Can you send the question to Allinea? Seb worked with them to ensure PGI was producing correct DWARF debug information, so he knows more than I do, but Allinea would have better insight into how DDT operates.

As for the compiler flags, they seem correct.

- Mat

From Seb:

Can you advise the customer to try setting a breakpoint inside the ACC region and then launching the run in DDT? I’m not sure stepping from host code into device code will work.

- Mat
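
As a rough illustration of Seb's suggestion, the sketch below shows a tiny OpenACC Fortran program with a comment on a line inside the accelerator region where a breakpoint could be set before launching the run. The program and variable names are made up for illustration, and it assumes a debug build with `-g -acc` as in the commands above; whether the breakpoint is actually hit on the device depends on the DDT and PGI versions in use.

```fortran
! Hypothetical test case; all names here are illustrative only.
program acc_demo
  implicit none
  integer, parameter :: n = 1024
  real :: a(n)
  integer :: i

  a = 1.0

  ! The loop body below is the ACC region that gets offloaded.
  !$acc parallel loop copy(a)
  do i = 1, n
    a(i) = a(i) * 2.0         ! candidate breakpoint line inside the ACC region
  end do

  print *, 'max error:', maxval(abs(a - 2.0))
end program acc_demo
```

Starting from a case this small, without inlining or IPA options, may make it easier to tell whether the source mapping problem comes from the compile flags or from the debugger.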