Dear nsys users,
I’m using nsys version 2021.3.1.54, and it is not entirely clear to me how to read the mangled kernel names in a profile. The following is the profile of my code:
Time(%)  Total Time (ns)  Instances   Average  Minimum  Maximum   StdDev  Name
    9,0       4612111527      74412   61980,0    22016    92672  27717,0  nvkernel_add2s2_omp__F1L1959_25_
    5,0       2667511586      26632  100161,0    89696   119552   2052,0  nvkernel_vlsc3_omp__F1L2322_78_
    5,0       2442601853      80384   30386,0     4543    77600  29743,0  nvkernel_scatter_double_F1L138_31
    4,0       2273953647      77984   29159,0     4480    77376  29396,0  nvkernel_gather_double_add_F1L138_16
So add2s2_omp is the most expensive target region. It is called from 28 different places in my code (which is quite big). Reading the compiler output, I see:
mpif90 -c -O2 -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -Mpreprocess -r8 -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -DMPI -DUNDERSCORE -DGLOBAL_LONG_LONG -DTIMER -I/p/scratch/prcoe05/fatigati1/nek5000/test/Input/ReTau180/pnpn_omp -I/p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core -I./ -I /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/experimental /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/subs1.f -o obj/subs1.o
…
add2s2_omp:
   1958, !$omp target teams distribute parallel do
       1958, Generating Tesla and Multicore code
             Generating "nvkernel_add2s2_omp__F1L1958_25" GPU kernel
       1958, Generating implicit map(tofrom:b(:),a(:))
…
mpif90 -c -O2 -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -Mpreprocess -r8 -gpu=lineinfo -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -DMPI -DUNDERSCORE -DGLOBAL_LONG_LONG -DTIMER -I/p/scratch/prcoe05/fatigati1/nek5000/test/Input/ReTau180/pnpn_omp -I/p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core -I./ -I /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/experimental /p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/subs2.f -o obj/subs2.o
The problem is that add2s2_omp is not called from subs1.f or subs2.f:
grep add2s2_omp *.f | grep call
drive2.f: call add2s2_omp(vx,vxc,scale,ntot1)
drive2.f: call add2s2_omp(vy,vyc,scale,ntot1)
drive2.f: call add2s2_omp(vz,vzc,scale,ntot1)
drive2.f: call add2s2_omp(pr,prc,scale,ntot2)
gmres.f: call add2s2_omp(r_gmres,w_gmres,-1.,n) ! r = r - w
gmres.f: call add2s2_omp(w_gmres,v_gmres(1,i),-h_gmres(i,j),n) ! w = w - h v
gmres.f: call add2s2_omp(r_gmres,w_gmres,-1.,ntot2) ! r = r - w
gmres.f: call add2s2_omp(w_gmres,v_gmres(1,i),-h_gmres(i,j),ntot2) ! w = w - h v
gmres.f: call add2s2_omp(x_gmres,z_gmres(1,i),c_gmres(i),ntot2)
hmholtz.f: call add2s2_omp(r,x,rmean,n)
hmholtz.f: call add2s2_omp(x,p ,alpha,n)
hmholtz.f: call add2s2_omp(r,w ,alphm,n)
induct.f: call add2s2_omp(pbar,pset(1,i),alpha(i),ntot2)
induct.f: call add2s2_omp(pset(1,nprev),pset(1,i),alpham,ntot2)
navier4.f: call add2s2_omp(xx(1,k),xx(1,j),-alpha(j),n)
navier4.f: call add2s2_omp(bb(1,k),bb(1,j),-alpha(j),n)
navier4.f: call add2s2_omp(b,bb(1,1),-alpha(1),n)
navier4.f: call add2s2_omp(xbar,xx(1,k),alpha(k),n)
navier4.f: call add2s2_omp(bbar,bb(1,k),alpha(k),n)
navier4.f: call add2s2_omp(b,bb(1,k),-alpha(k),n)
navier4.f: call add2s2_omp(xbar,xx(1,k),alpha(k),n)
navier4.f: call add2s2_omp(bbar,bb(1,k),alpha(k),n)
navier4.f: call add2s2_omp(b,bb(1,k),-alpha(k),n)
navier4.f: call add2s2_omp(xx(1,m),xx(1,k),-alpha(k),n)
navier4.f: call add2s2_omp(bb(1,m),bb(1,k),-alpha(k),n)
navier4.f: call add2s2_omp(xx(1,m),xx(1,k),-beta(k),n)
navier4.f: call add2s2_omp(bb(1,m),bb(1,k),-beta(k),n)
plan4.f:c call add2s2_omp(v,vvlag(1,2),ab2,ntot)
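To tally the active call sites per file from grep output like the above, a small Python sketch along these lines could be used. It is only an illustration: the function name count_call_sites is hypothetical, and it assumes fixed-form Fortran, where a `c`, `C`, `*`, or `!` in column 1 marks a comment line (as in the plan4.f hit).

```python
import re
from collections import Counter


def count_call_sites(grep_lines, routine="add2s2_omp"):
    """Count non-commented calls to `routine` per file.

    Each input line is expected in plain grep form: 'file.f:<source line>'.
    Assumption: fixed-form Fortran, so a comment marker sits in column 1,
    i.e. immediately after the 'file.f:' prefix.
    """
    call_re = re.compile(r"\bcall\s+" + re.escape(routine) + r"\b", re.IGNORECASE)
    counts = Counter()
    for line in grep_lines:
        filename, sep, code = line.partition(":")
        if not sep:
            continue  # not a 'file:content' line
        if code[:1] in ("c", "C", "*", "!"):
            continue  # whole line is a fixed-form comment
        if call_re.search(code):
            counts[filename] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "drive2.f:      call add2s2_omp(vx,vxc,scale,ntot1)",
        "gmres.f:      call add2s2_omp(r_gmres,w_gmres,-1.,n) ! r = r - w",
        "plan4.f:c     call add2s2_omp(v,vvlag(1,2),ab2,ntot)",
    ]
    for name, n in sorted(count_call_sites(sample).items()):
        print(name, n)
```

This only counts call sites in the source. It says nothing about which call site launched a given kernel instance in the profile.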
So, how can I find out where this particular instance of add2s2_omp (_25) is called from in my code? Thanks.
