Nsys, understand kernel name mangling

Dear nsys users,

I’m using nsys version 2021.3.1.54, and it is not entirely clear to me how to read the mangled kernel names in a profile. The following is the profile of my code:

Time(%)  Total Time (ns)  Instances   Average  Minimum  Maximum   StdDev  Name
    9,0       4612111527      74412   61980,0    22016    92672  27717,0  nvkernel_add2s2_omp__F1L1959_25_
    5,0       2667511586      26632  100161,0    89696   119552   2052,0  nvkernel_vlsc3_omp__F1L2322_78_
    5,0       2442601853      80384   30386,0     4543    77600  29743,0  nvkernel_scatter_double_F1L138_31
    4,0       2273953647      77984   29159,0     4480    77376  29396,0  nvkernel_gather_double_add_F1L138_16

So, add2s2_omp is the most expensive target region. It is called from 28 different places in my code (which is quite big). Now, reading the compiler output, I see:

mpif90 -c  -O2 -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -Mpreprocess -r8  -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -DMPI -DUNDERSCORE -DGLOBAL_LONG_LONG -DTIMER -I/p/scratch/prcoe05/fatigati1/nek5000/test/Input/ReTau180/pnpn_omp -I/p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core -I./ -I /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/experimental /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/subs1.f -o obj/subs1.o

add2s2_omp:
1958, !$omp target teams distribute parallel do
   1958, Generating Tesla and Multicore code
         Generating "nvkernel_add2s2_omp__F1L1958_25" GPU kernel

1958, Generating implicit map(tofrom:b(:),a(:))

mpif90 -c  -O2 -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -Mpreprocess -r8  -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -DMPI -DUNDERSCORE -DGLOBAL_LONG_LONG -DTIMER -I/p/scratch/prcoe05/fatigati1/nek5000/test/Input/ReTau180/pnpn_omp -I/p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core -I./ -I /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/experimental /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/subs2.f -o obj/subs2.o

The problem is add2s2_omp is not called from subs1.f or subs2.f:

 grep add2s2_omp *.f | grep call 
 drive2.f:      call add2s2_omp(vx,vxc,scale,ntot1)
 drive2.f:      call add2s2_omp(vy,vyc,scale,ntot1)
 drive2.f:      call add2s2_omp(vz,vzc,scale,ntot1)
 drive2.f:      call add2s2_omp(pr,prc,scale,ntot2)
 gmres.f:            call add2s2_omp(r_gmres,w_gmres,-1.,n)       ! r = r - w
 gmres.f:               call add2s2_omp(w_gmres,v_gmres(1,i),-h_gmres(i,j),n) ! w = w - h    v
 gmres.f:            call add2s2_omp(r_gmres,w_gmres,-1.,ntot2)            ! r = r - w
 gmres.f:               call add2s2_omp(w_gmres,v_gmres(1,i),-h_gmres(i,j),ntot2) ! w = w - h    v
 gmres.f:            call add2s2_omp(x_gmres,z_gmres(1,i),c_gmres(i),ntot2) 
 hmholtz.f:         call add2s2_omp(r,x,rmean,n)
 hmholtz.f:         call add2s2_omp(x,p ,alpha,n)
 hmholtz.f:         call add2s2_omp(r,w ,alphm,n)
 induct.f:         call add2s2_omp(pbar,pset(1,i),alpha(i),ntot2)
 induct.f:            call add2s2_omp(pset(1,nprev),pset(1,i),alpham,ntot2)
 navier4.f:            call add2s2_omp(xx(1,k),xx(1,j),-alpha(j),n)
 navier4.f:            call add2s2_omp(bb(1,k),bb(1,j),-alpha(j),n)
 navier4.f:      call add2s2_omp(b,bb(1,1),-alpha(1),n)
 navier4.f:         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
 navier4.f:         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
 navier4.f:         call add2s2_omp(b,bb(1,k),-alpha(k),n)
 navier4.f:         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
 navier4.f:         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
 navier4.f:         call add2s2_omp(b,bb(1,k),-alpha(k),n)
 navier4.f:         call add2s2_omp(xx(1,m),xx(1,k),-alpha(k),n)
 navier4.f:         call add2s2_omp(bb(1,m),bb(1,k),-alpha(k),n)
 navier4.f:         call add2s2_omp(xx(1,m),xx(1,k),-beta(k),n)
 navier4.f:         call add2s2_omp(bb(1,m),bb(1,k),-beta(k),n)
 plan4.f:c     call add2s2_omp(v,vvlag(1,2),ab2,ntot)

So, how can I find out where this particular instance of add2s2_omp (_25) is called from in my code? Thanks.

This would be the target region in the “add2s2_omp” routine at line 1959 of the file. The final number is extraneous and just part of the demangled C++ name.
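To make the naming scheme concrete, here is a minimal shell sketch that splits such a kernel name into the routine name and the source line of the target region. The layout nvkernel_<routine>_F<n>L<line>_<number> is an assumption inferred only from the names shown in this thread, not a documented format, and parse_nvkernel is a hypothetical helper name.

```shell
# Split an NVHPC OpenMP-offload kernel name into routine and source line.
# Assumed layout (inferred from this thread, not from documentation):
#   nvkernel_<routine>_F<n>L<line>_<number>[_]
# The trailing number is ignored, as the reply says it is extraneous.
parse_nvkernel() {
    printf '%s\n' "$1" |
        sed -n -E 's/^nvkernel_(.*[^_])_+F[0-9]+L([0-9]+)_[0-9]+_?$/routine=\1 line=\2/p'
}

parse_nvkernel nvkernel_add2s2_omp__F1L1959_25_    # routine=add2s2_omp line=1959
parse_nvkernel nvkernel_scatter_double_F1L138_31   # routine=scatter_double line=138
```

Names that do not fit the assumed layout produce no output, so the helper can be run safely over a whole column of kernel names.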

So, how can I find out where this particular instance of add2s2_omp (_25) is called from in my code?

Sorry, but the profiler won’t be able to tell you that. It has some limited CPU profiling, but this is not included in the timeline.

This would be the target region in the “add2s2_omp” routine at line 1959 of the file. The final number is extraneous and just part of the demangled C++ name.

OK, I understand. I was looking for the calls to add2s2_omp, and the line numbers did not match. I have to find where the TARGET region starts. Is it possible to get the file name as well, or only the line of the target region?

Another question: I used NCU on a small main program that runs just the “add2s2_omp” kernel. The target region is called 27 times, and NCU profiles each call separately, giving 27 different profiles. Why? Is it possible to aggregate the results for the same target region? It is quite difficult to analyze this way.

This is the launch command:

srun ncu -o add2s2_omp_profile -f --kernel-name-base=function --kernel-regex add2s2_omp --target-processes all "./add2s2_omp"

I have attached the screenshot and the source code. Thanks.

add2s2_omp.f (2.3 KB)

Correct, it’s the location of the target region, not where the subroutine that contains it is called. It might be possible to add the filename, but this would start to make the kernel names very long, probably not something we’d want to do.
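Since the kernel name carries only the routine name and the line number, one way to recover the file is to search the sources for the definition of the routine rather than for its calls. The sketch below assumes the routine is declared with a plain "subroutine" statement; find_routine_def is a hypothetical helper name.

```shell
# Find the file that DEFINES a routine (not the call sites).
# $1 = routine name, remaining arguments = source files to search.
# -H always prints the file name, -n the line number, -i ignores case.
find_routine_def() {
    routine=$1
    shift
    grep -H -n -i -E "^[[:space:]]*subroutine[[:space:]]+${routine}[[:space:]]*(\(|$)" "$@"
}

# Example usage from the source directory:
#   find_routine_def add2s2_omp *.f
```

Each match is printed as file:line:text. The target region named in the kernel should then sit shortly after that line in the same file; if the file turned out to be subs1.f, for instance, sed -n '1959p' subs1.f would print the line referenced by the kernel name.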

Why? Is it possible to aggregate the results for same target region?

Each call could have a different profile depending on the data being passed in, and any aggregation would give you a skewed view of each. However, you can limit the number of times ncu profiles a kernel via the “-c” or “--launch-count” command-line options. See the Nsight Compute CLI section of the Nsight Compute documentation.
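As an illustration of the launch-count option, the following dry-run sketch rebuilds the command from the earlier post with a limit added and prints it instead of running it. The variable names and the value 3 are arbitrary choices for the example; my understanding is that the count applies to launches matching the kernel filter, but please check the documentation for your ncu version.

```shell
# Dry-run sketch: build the ncu command with a launch limit and print it.
# LAUNCHES and NCU_CMD are hypothetical names; 3 is an illustrative value.
LAUNCHES=3

# Same options as the command posted above, plus -c (short for --launch-count).
NCU_CMD="ncu -o add2s2_omp_profile -f --kernel-name-base=function \
--kernel-regex add2s2_omp -c ${LAUNCHES} --target-processes all ./add2s2_omp"

# Inspect the command first; remove the echo to actually launch it.
echo srun ${NCU_CMD}
```

With a limit of 3, the report should contain 3 profiles of the add2s2_omp kernel rather than 27.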

-Mat