Nsys, understand kernel name mangling

unrue · August 31, 2021, 12:09pm

Dear nsys users,

I’m using nsys version 2021.3.1.54, and it is not entirely clear to me how to read the mangled kernel names in a profile. The following is the profile of my code:

Time(%)  Total Time (ns)  Instances   Average  Minimum  Maximum   StdDev  Name
    9,0       4612111527      74412   61980,0    22016    92672  27717,0  nvkernel_add2s2_omp__F1L1959_25_
    5,0       2667511586      26632  100161,0    89696   119552   2052,0  nvkernel_vlsc3_omp__F1L2322_78_
    5,0       2442601853      80384   30386,0     4543    77600  29743,0  nvkernel_scatter_double_F1L138_31
    4,0       2273953647      77984   29159,0     4480    77376  29396,0  nvkernel_gather_double_add_F1L138_16

So, add2s2_omp is the most expensive target region. This target region is called from 28 different places in my code (which is quite large). Now, reading the compiler output, I see:

mpif90 -c  -O2 -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -Mpreprocess -r8  -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -DMPI -DUNDERSCORE -DGLOBAL_LONG_LONG -DTIMER -I/p/scratch/prcoe05/fatigati1/nek5000/test/Input/ReTau180/pnpn_omp -I/p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core -I./ -I /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/experimental /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/subs1.f -o obj/subs1.o

…

add2s2_omp:
1958, !$omp target teams distribute parallel do
   1958, Generating Tesla and Multicore code
         Generating "nvkernel_add2s2_omp__F1L1958_25" GPU kernel

1958, Generating implicit map(tofrom:b(:),a(:))

…

mpif90 -c  -O2 -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -Mpreprocess -r8  -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -DMPI -DUNDERSCORE -DGLOBAL_LONG_LONG -DTIMER -I/p/scratch/prcoe05/fatigati1/nek5000/test/Input/ReTau180/pnpn_omp -I/p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core -I./ -I /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/experimental /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/subs2.f -o obj/subs2.o

The problem is add2s2_omp is not called from subs1.f or subs2.f:

 grep add2s2_omp *.f | grep call 
 drive2.f:      call add2s2_omp(vx,vxc,scale,ntot1)
 drive2.f:      call add2s2_omp(vy,vyc,scale,ntot1)
 drive2.f:      call add2s2_omp(vz,vzc,scale,ntot1)
 drive2.f:      call add2s2_omp(pr,prc,scale,ntot2)
 gmres.f:            call add2s2_omp(r_gmres,w_gmres,-1.,n)       ! r = r - w
 gmres.f:               call add2s2_omp(w_gmres,v_gmres(1,i),-h_gmres(i,j),n) ! w = w - h    v
 gmres.f:            call add2s2_omp(r_gmres,w_gmres,-1.,ntot2)            ! r = r - w
 gmres.f:               call add2s2_omp(w_gmres,v_gmres(1,i),-h_gmres(i,j),ntot2) ! w = w - h    v
 gmres.f:            call add2s2_omp(x_gmres,z_gmres(1,i),c_gmres(i),ntot2) 
 hmholtz.f:         call add2s2_omp(r,x,rmean,n)
 hmholtz.f:         call add2s2_omp(x,p ,alpha,n)
 hmholtz.f:         call add2s2_omp(r,w ,alphm,n)
 induct.f:         call add2s2_omp(pbar,pset(1,i),alpha(i),ntot2)
 induct.f:            call add2s2_omp(pset(1,nprev),pset(1,i),alpham,ntot2)
 navier4.f:            call add2s2_omp(xx(1,k),xx(1,j),-alpha(j),n)
 navier4.f:            call add2s2_omp(bb(1,k),bb(1,j),-alpha(j),n)
 navier4.f:      call add2s2_omp(b,bb(1,1),-alpha(1),n)
 navier4.f:         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
 navier4.f:         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
 navier4.f:         call add2s2_omp(b,bb(1,k),-alpha(k),n)
 navier4.f:         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
 navier4.f:         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
 navier4.f:         call add2s2_omp(b,bb(1,k),-alpha(k),n)
 navier4.f:         call add2s2_omp(xx(1,m),xx(1,k),-alpha(k),n)
 navier4.f:         call add2s2_omp(bb(1,m),bb(1,k),-alpha(k),n)
 navier4.f:         call add2s2_omp(xx(1,m),xx(1,k),-beta(k),n)
 navier4.f:         call add2s2_omp(bb(1,m),bb(1,k),-beta(k),n)
 plan4.f:c     call add2s2_omp(v,vvlag(1,2),ab2,ntot)

So, how can I find out where this particular instance of add2s2_omp (_25) is called from in my code? Thanks.

MatColgrove · August 31, 2021, 3:37pm

This would be the target region in the “add2s2_omp” routine at line 1959 of the file. The final number is extraneous and just part of the demangled C++ name.
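As a rough illustration of the naming described above, the kernel names in the profile can be split into routine and line number with a small script. This is only a sketch: the regular expression is inferred from the four names quoted in this thread, not from a documented format, and `parse_kernel_name` is a hypothetical helper. The meaning of the number after `F` is not stated in the thread, so the sketch does not interpret it.

```python
import re

# Pattern inferred from the kernel names quoted in this thread (an assumption,
# not an official naming specification):
#   nvkernel_<routine>_F<n>L<line>_<suffix>[_]
KERNEL_RE = re.compile(
    r"^nvkernel_(?P<routine>.+?)_F(?P<ftag>\d+)L(?P<line>\d+)_(?P<suffix>\d+)_?$"
)


def parse_kernel_name(name):
    """Return routine, target-region line and trailing number, or None if no match."""
    m = KERNEL_RE.match(name.strip())
    if m is None:
        return None
    return {
        # Fortran routine names may carry a trailing underscore; drop it.
        "routine": m.group("routine").rstrip("_"),
        # Line of the target region, not of the call site.
        "line": int(m.group("line")),
        # The final number, which the reply above describes as extraneous.
        "suffix": int(m.group("suffix")),
    }


if __name__ == "__main__":
    for name in (
        "nvkernel_add2s2_omp__F1L1959_25_",
        "nvkernel_scatter_double_F1L138_31",
    ):
        print(name, "->", parse_kernel_name(name))
```

The extracted line is where the target region sits inside the routine, so it will not match any `call add2s2_omp(...)` line found by grep.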

So, how can I find out where this particular instance of add2s2_omp (_25) is called from in my code?

Sorry, but the profiler won't be able to tell you that. It has some limited CPU profiling, but this is not included in the timeline.

unrue · September 1, 2021, 12:01pm

This would be the target region in the “add2s2_omp” routine at line 1959 of the file. The final number is extraneous and just part of the demangled C++ name.

OK, understood. I was looking for calls to add2s2_omp, so the line numbers did not match; I need to look for where the target region starts instead. Is it possible to get the file name as well, or only the line of the target region?

One more question. I ran NCU on a small standalone program that runs only the add2s2_omp kernel. The target region is called 27 times, and NCU profiles each call separately, giving 27 different profiles. Why? Is it possible to aggregate the results for the same target region? As it stands, it is quite difficult to profile.

This is the launch command:

srun ncu -o add2s2_omp_profile -f --kernel-name-base=function --kernel-regex add2s2_omp --target-processes all "./add2s2_omp"

Attached the screenshot and source code. Thanks.

add2s2_omp.f (2.3 KB)

MatColgrove · September 1, 2021, 3:59pm

Correct, it’s the location of the target region, not where the subroutine that contains it is called. It might be possible to add the filename, but this would start to make the kernel names very long, probably not something we’d want to do.
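Since the kernel name does not carry the file name, one possible workaround is to recover it from the `-Minfo=all` build log, where each `Generating "nvkernel_..." GPU kernel` message follows the compile command for its source file. The sketch below assumes a serial build log laid out like the excerpt earlier in this thread (a parallel build may interleave output); `kernels_by_source` is a hypothetical helper, and the short log in the usage part is a trimmed stand-in for a real one.

```python
import re

# Source-file tokens on a compile line (Fortran extensions only; an assumption).
SOURCE_RE = re.compile(r"\.(f|F|f90|F90)$")
# Compiler feedback line as quoted in this thread.
KERNEL_RE = re.compile(r'Generating "(nvkernel_[^"]+)" GPU kernel')


def kernels_by_source(log_text):
    """Map each generated kernel name to the source file being compiled."""
    current_source = None
    mapping = {}
    for line in log_text.splitlines():
        tokens = line.split()
        if "-c" in tokens:
            # Treat this as a compile command; remember its last source file.
            sources = [t for t in tokens if SOURCE_RE.search(t)]
            if sources:
                current_source = sources[-1]
            continue
        m = KERNEL_RE.search(line)
        if m and current_source is not None:
            mapping[m.group(1)] = current_source
    return mapping


if __name__ == "__main__":
    # Trimmed, illustrative log; real logs contain the full compile commands.
    sample_log = "\n".join([
        "mpif90 -c -O2 -mp=gpu -Minfo=all subs1.f -o obj/subs1.o",
        "add2s2_omp:",
        "   1958, Generating Tesla and Multicore code",
        '         Generating "nvkernel_add2s2_omp__F1L1958_25" GPU kernel',
        "mpif90 -c -O2 -mp=gpu -Minfo=all subs2.f -o obj/subs2.o",
    ])
    for kernel, source in kernels_by_source(sample_log).items():
        print(kernel, "->", source)
```

Combined with the line number in the kernel name, this gives the file and line of the target region, though still not the call site of the enclosing subroutine.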

Why? Is it possible to aggregate the results for same target region?

Each call could have a different profile depending on the data being passed in, and any aggregation would give you a skewed view of each. However, you can limit the number of times ncu profiles a kernel via the “-c” or “--launch-count” command-line options. See: Nsight Compute CLI :: Nsight Compute Documentation

-Mat
