NVHPC timeline question

Dear NVIDIA users,

I am profiling a very large code built with NVHPC, and the timeline is quite crowded. Given a portion of the timeline, is it possible to tell which part of the code it refers to? I can see the kernel names, but in my case those kernels are called from several places in the code.

In other words, how can I isolate in the timeline the portion of the code I am working on, so that I can compare its behaviour before and after my changes? For example, is it possible to analyze the behaviour of just one particular subroutine? I only have a global view, and without knowing which part of the code a given piece of the timeline corresponds to, it is quite difficult to tell whether my changes are an improvement.

Thanks.

Hi unrue,

I’m not 100% clear on what you’re asking, but I think you’re asking whether you can track CPU profiling within an Nsight Systems profiling timeline. Nsight Systems does have some CPU profiling, but it’s only post-mortem and isn’t included in the timeline. The volume of samples required to add this support would be unwieldy.

Instead, you’ll want to look at adding NVTX; see “NVIDIA Tools Extension Library (NVTX)” in the NVIDIA Nsight VSE documentation.

NVTX allows you to insert API calls in your code that mark start and stop points in the profile’s timeline.

You can use NVTX with Fortran as well; see: https://developer.nvidia.com/blog/customize-cuda-fortran-profiling-nvtx/
Note that Mass’ NVTX module is included with the compilers, so there is no need to write your own.
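
For illustration, here is a minimal sketch of what such a labeled region could look like in Fortran. It assumes the `nvtx` module shipped with the NVHPC compilers (providing `nvtxStartRange` and `nvtxEndRange`) and an OpenMP target offload build. The program name, array names, range label, and loop are made up for the example, and the build line in the comment is an assumption that may differ between NVHPC versions.

```fortran
! Hypothetical example: wrap one offloaded loop in a named NVTX range.
! Assumed build line (wrapper library name may vary by NVHPC version):
!   nvfortran -mp=gpu nvtx_demo.f90 -lnvhpcwrapnvtx
program nvtx_demo
  use nvtx                      ! NVTX module shipped with the NVHPC compilers
  implicit none
  integer, parameter :: n = 1000000
  real(8), allocatable :: x(:), y(:)
  integer :: i

  allocate(x(n), y(n))
  x = 1.0d0
  y = 2.0d0

  ! Everything between the start and end calls is grouped under
  ! the label "axpy_region" in the profiler timeline.
  call nvtxStartRange("axpy_region")
  !$omp target teams distribute parallel do map(to:x) map(tofrom:y)
  do i = 1, n
     y(i) = y(i) + 2.0d0 * x(i)
  end do
  !$omp end target teams distribute parallel do
  call nvtxEndRange             ! closes "axpy_region"

  print *, 'y(1) =', y(1)
  deallocate(x, y)
end program nvtx_demo
```

Running the resulting executable under `nsys profile` should then show the range in the NVTX row of the timeline, with the work launched by the loop falling inside it.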

Hope this helps,
Mat

Hi Mat,

Thanks. I just want to check whether, in a particular part of the code, computation and memory transfers overlap. I think such labels will satisfy my request.

Hi Mat,

Does it work only with NVVP? I’m trying with Nsight Systems 2021.3.1, but my labels do not appear. The code and screenshots are attached.

add2s2_omp.f (1.9 KB)

It works for me, but I didn’t use Mass’ module; I used the one that comes with the compilers and linked with our wrapper library:

% nvfortran add2s2_omp.f -lnvhpcwrapnvtx -mp=gpu
% nsys profile ./a.out
Collecting data...
    7.399999        5.980000      -0.4800002
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-24e3-a504-7a0a-c7da.qdstrm" file to disk...

Creating final output files...
Import error: The importation timed out.
Skipping import of the QDSTRM file.
Report file moved to ".report1.qdstrm"

Hi Mat,

On my side it does not compile:

nvfortran -L/p/software/juwelsbooster/stages/2020/software/NVHPC/21.9-GCC-10.3.0/Linux_x86_64/21.9/compilers/lib/ -lnvhpcwrapnvtx -mp=gpu -Mcuda=cc80 -Minfo=all add2s2_omp.f -O2 -o add2s2_omp

/p/scratch/prcoe05/fatigati1/nek5000/test/Input/ReTau180/pnpn_omp/add2s2_omp.f:50: undefined reference to `nvtx_nvtxstartrange_'
/p/software/juwelsbooster/stages/2020/software/binutils/2.36.1-GCCcore-10.3.0/bin/ld: /p/scratch/prcoe05/fatigati1/nek5000/test/Input/ReTau180/pnpn_omp/add2s2_omp.f:67: undefined reference to `nvtx_nvtxendrange_'

(The nvhpcwrapnvtx library is present in the specified path:)

/p/software/juwelsbooster/stages/2020/software/NVHPC/21.9-GCC-10.3.0/Linux_x86_64/21.9/compilers/lib/libnvhpcwrapnvtx.a
/p/software/juwelsbooster/stages/2020/software/NVHPC/21.9-GCC-10.3.0/Linux_x86_64/21.9/compilers/lib/libnvhpcwrapnvtx.so

I also tried passing nvtx.f90 on the compile line, with the same error.

Do you still have the old “nvtx.mod” file you built from Mass’ nvtx.f90 file in this directory? I’m guessing that it’s picking up this module rather than the one that’s shipped with the compilers.

Yes, exactly. Now it compiles fine, but my labels still do not appear in nsys-ui :/

Hmm, sorry, then I’m not sure. It works fine for me, and I have no idea why it doesn’t for you.

Hi Mat,

Now I see my labels after adding

nvtx.f90 -L/p/software/juwelsbooster/stages/2020/software/NVHPC/21.9-GCC-10.3.0/Linux_x86_64/21.9/cuda/11.0/lib64 -lnvToolsExt

to my compilation string.

The problem now is that, within the labeled range, the code seems to perform only cudaFree calls. The output of nsys-ui is attached. This is the piece of code I want to check:

  call nvtxStartRange("MY_LABEL")

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b,alpha) MAP(TO:xx,bb,w) 
!$OMP TARGET DATA use_device_ptr(xbar,xx,bb,bbar,b,w)
  do k = 2,m
     alpha_d = alpha(k)
     call cublasDaxpy(n, alpha_d, xx(:,k), 1, xbar, 1)
     call cublasDaxpy(n, alpha_d, bb(1,k), 1, bbar, 1)
     call cublasDaxpy(n, -alpha_d, bb(1,k), 1, b, 1)
  enddo
!$OMP END TARGET DATA

  do k = 1, m
     if(ifwt) then
        alpha(k) = vlsc3_omp(xx(1,k),w,b,n)
     else
        alpha(k) = vlsc2_omp(xx(1,k),b,n)
     endif
  enddo
!$OMP END TARGET DATA
  call gop(alpha,work,'+  ',m)

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx,bb,alpha) 
!$OMP& use_device_ptr(xbar,xx,bb,bbar,b)

  do k = 1,m
     alpha_d = alpha(k)
     call cublasDaxpy(n, alpha_d, xx(:,k), 1, xbar, 1)
     call cublasDaxpy(n, alpha_d, bb(1,k), 1, bbar, 1)
     call cublasDaxpy(n, -alpha_d, bb(1,k), 1, b, 1)
  enddo
!$OMP END TARGET DATA 

  call nvtxEndRange

Any idea what the reason could be? Thanks.

Again, no idea. I went back and tried the earlier example using Mass’ module and replaced the add2s2_omp calls with cublasSaxpy calls, but I still see the NVTX range in the proper spot in the profile.

I’m guessing it’s pilot error or a system issue, but I’d need a full reproducing example to be sure.

Hi Mat,

The problem occurs with the big code I sent you; the small example works. I could provide everything you need to run it, but my test case is about 1.6 gigabytes and I don’t have a smaller one. Would that be feasible?

When you were profiling the code with Nsight Systems, did you tell it to specifically trace NVTX? For instance, if you want to trace OpenACC, CUDA, and NVTX, as well as disable sampling (to speed up execution) and show the summary table, you would use the following:

$ nsys profile --stats=true --sample=none -t openacc,cuda,nvtx ./app

I thought nvtx is part of the default trace? OpenACC isn’t, but that just adds the OpenACC runtime routine profiles.

From “nsys --help profile”:

    -t, --trace=
       Possible values are 'cuda', 'nvtx', 'cublas', 'cublas-verbose', 'cusparse', 'cusparse-verbose', 'mpi', 'oshmem', 'ucx', 'osrt', 'cudnn', 'opengl', 'opengl-annotations', 'nvvideo', 'openacc', 'openmp', 'vulkan', 'vulkan-annotations' or 'none'.
       Select the API(s) to trace. Multiple APIs can be selected, separated by commas only (no spaces).
       If '<api>-annotations' is selected, the corresponding API will also be traced.
       If 'none' is selected, no APIs are traced.
       Default is 'cuda,nvtx,osrt,opengl'. Application scope

Yes,

this is my command line:

nsys profile -f true --trace=cuda,openmp,nvtx -o outputprofile

(OpenACC is not used in my code)

I see. What I guess is happening here is mismatched pairs of nvtxStartRange and nvtxEndRange calls. I personally add a comment to each nvtxEndRange call to track which range it is paired with, e.g.:

call nvtxStartRange("My label")
! ...
do i = 1, N
  call nvtxStartRange("Inner loop")
  ! ...
  ! Main computation code goes here
  ! ...
  call nvtxEndRange ! Inner loop
end do
! ...
call nvtxEndRange ! My label

Hi Wileam,

Thanks for the suggestion, but I have just two ranges, the first before the loop and the second after it, so there is no mismatch.