Strange results of profiling OpenACC code by VISUAL profiler

Hello,

I am comparing the CUDA and OpenACC versions of my code and have tried to profile both with the CUDA Visual Profiler.
I have tried to make the two versions as close as possible, but I am still getting different profiling results.

Here is the profiling result for the CUDA code:

And here is the one for the OpenACC code:

Could you please explain where these small data copy calls (the thin blue lines before and after each kernel) come from?

My OpenACC code looks like this:

!$acc data create(hvx, hvy, hvz, grdx, grdy, grdz), &
!$acc      copyin(vx, vy, vz, h),                   &
!$acc      copyout(dh, dvx, dvy, dvz),              &
!$acc      create(scl, omega)

! first kernel

    !$acc kernels loop gang vector(4)  create (depth), present (CNST_EGRAV,   GRD_zs, ADM_VNONE)
     do l=1,ADM_lall
    !$acc loop gang vector(128)
       do n =1, ADM_gall
          scl(n,k,l)=&
               -( CNST_EGRAV*(h(n,k,l))          &
               +0.5D0*( vx(n,k,l)*vx(n,k,l)    &
               +vy(n,k,l)*vy(n,k,l)    &
               +vz(n,k,l)*vz(n,k,l) ) )
          depth=h(n,k,l)-GRD_zs(n,k,l,ADM_VNONE)
          hvx(n,k,l)=depth*vx(n,k,l)
          hvy(n,k,l)=depth*vy(n,k,l)
          hvz(n,k,l)=depth*vz(n,k,l)
       end do
    end do
    !$acc end kernels
    !$acc update host(scl)
   
  !Other kernels
 !$acc end data


Thank you,

Irina.

Hi Irina,

Sorry, but I’m not familiar with the CUDA Visual Profiler, so I don’t know what the different colors correspond to.

Could you please explain where these small data copy calls (the thin blue lines before and after each kernel) come from?

Do you mean the thin green lines? I only see one thin blue line around the 144000 mark.

Before the kernel is launched, there will be some overhead in looking up the addresses of the variables in the “present” clause, as well as in creating the global memory for “depth”. Also, the compiler may be copying the arguments as a separate struct in order to work around CUDA’s argument size limit.

Note that it is unnecessary to copy scalar variables, and in some cases it can be detrimental. For example, by putting “depth” in a create clause, you have made it a global variable. Besides the performance hit of not using a register, all threads will be sharing the same “depth” variable, which will most likely give you wrong answers.
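
A rough sketch of what I mean, assuming the loop nest still sits inside your existing data region so the arrays are already on the device:

    !$acc kernels loop gang vector(4)
    do l=1,ADM_lall
    !$acc loop gang vector(128) private(depth)
       do n =1, ADM_gall
          scl(n,k,l)= -( CNST_EGRAV*h(n,k,l)            &
                       + 0.5D0*( vx(n,k,l)*vx(n,k,l)    &
                               + vy(n,k,l)*vy(n,k,l)    &
                               + vz(n,k,l)*vz(n,k,l) ) )
          ! "depth" is a private scalar here, so each thread keeps its own copy in a register
          depth = h(n,k,l) - GRD_zs(n,k,l,ADM_VNONE)
          hvx(n,k,l) = depth*vx(n,k,l)
          hvy(n,k,l) = depth*vy(n,k,l)
          hvz(n,k,l) = depth*vz(n,k,l)
       end do
    end do

Scalars such as CNST_EGRAV and ADM_VNONE should simply be passed to the kernel by value, so they need no data clause at all.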

How does your profile change after removing scalar variables from the various copy, create, and present clauses?

  • Mat

Dear Mat,
Thank you for the explanation.

I am sorry for not describing the traces in detail.
Here is the trace with some of my comments. There are thin green lines before each kernel (for example, 6 green lines around point 140150), which I was asking about.

Following your advice, I have tried deleting all data regions and the copy, create, and present clauses, and created a new trace for just one kernel:

On this trace I also have thin green lines (for example, at point 3758), which I am trying to understand.

code:

 !$acc kernels loop
    do l=1,ADM_lall
       do n =1, ADM_gall
          scl(n,k,l)=&
               -( CNST_EGRAV*(h(n,k,l))          &
               +0.5D0*( vx(n,k,l)*vx(n,k,l)    &
               +vy(n,k,l)*vy(n,k,l)    &
               +vz(n,k,l)*vz(n,k,l) ) )
          depth=h(n,k,l)-GRD_zs(n,k,l,ADM_VNONE)
          hvx(n,k,l)=depth*vx(n,k,l)
          hvy(n,k,l)=depth*vy(n,k,l)
          hvz(n,k,l)=depth*vz(n,k,l)
       end do
    end do

OpenACC compiler output:

406, Generating copyin(vz(:adm_gall,:1,:adm_lall))
         Generating copyin(vy(:adm_gall,:1,:adm_lall))
         Generating copyin(vx(:adm_gall,:1,:adm_lall))
         Generating copyin(h(:adm_gall,:1,:adm_lall))
         Generating copyout(scl(1:adm_gall,1,1:adm_lall))
         Generating copyin(grd_zs(1:adm_gall,1,1:adm_lall,1))
         Generating copyout(hvx(1:adm_gall,1,1:adm_lall))
         Generating copyout(hvy(1:adm_gall,1,1:adm_lall))
         Generating copyout(hvz(1:adm_gall,1,1:adm_lall))
    407, Loop is parallelizable
    408, Loop is parallelizable
         Accelerator kernel generated
        407, !$acc loop gang, vector(8) ! blockidx%y threadidx%y
        408, !$acc loop gang, vector(8) ! blockidx%x threadidx%x

Thank you,

Irina

In the previous trace (without a data region) there were 5 thin green lines in total.
Then, when I added a data region to the code, the number of thin green lines became 6 (the lines before point 3276):




!$acc data copyin (vx, vy, vz, GRD_zs), copyout (hvx, hvy, hvz)
    !$acc kernels loop
    do l=1,ADM_lall
       do n =1, ADM_gall
          scl(n,k,l)=&
               -( CNST_EGRAV*(h(n,k,l))          &
               +0.5D0*( vx(n,k,l)*vx(n,k,l)    &
               +vy(n,k,l)*vy(n,k,l)    &
               +vz(n,k,l)*vz(n,k,l) ) )
          depth=h(n,k,l)-GRD_zs(n,k,l,ADM_VNONE)
          hvx(n,k,l)=depth*vx(n,k,l)
          hvy(n,k,l)=depth*vy(n,k,l)
          hvz(n,k,l)=depth*vz(n,k,l)
       end do
    end do
  !$acc end data

Could it be because the compiler is copying some additional information about the arrays I copy to the GPU?


Thank you,

Best regards,

Irina

Hi Irina,

My best guess is that these are the F90 array descriptors. We currently send this information separately from the data, though we are looking at consolidating it as well as making these copies asynchronous.
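
Roughly speaking, an assumed-shape, allocatable, or pointer array carries its bounds and strides in a small run-time descriptor kept separate from the data itself, and it is this metadata that goes over on its own. A made-up illustration (the routine names are only for the example):

    subroutine uses_descriptor(a)
       real(8), intent(in) :: a(:,:,:)      ! assumed-shape: a descriptor with bounds and strides travels with "a"
       print *, lbound(a), ubound(a)        ! these bounds are read from the descriptor at run time
    end subroutine uses_descriptor

    subroutine no_descriptor(a, n1, n2, n3)
       integer, intent(in) :: n1, n2, n3
       real(8), intent(in) :: a(n1,n2,n3)   ! explicit-shape: only the data itself is passed
       print *, n1, n2, n3
    end subroutine no_descriptor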

  • Mat

Thank you, Mat.
You helped me a lot.

Best regards,
Irina

Hi, Mat,

The results I showed you before, with the small data copies before each kernel, were obtained with version 12.5 of the PGI compiler.
Now, with version 12.10, there are far fewer copies than with 12.5, but, for some reason, the execution time of one of the kernels has become about 3 times longer.
I specify the grid and block sizes myself, so these parameters are fixed.
The execution time of the other kernels doesn’t change, so I wonder why only this one kernel shows different results?

Thank you in advance,

Irina.

Hi Irina,

We introduced OpenACC in 12.6 and with it some major changes.

I would consider your issue a performance bug, especially if you were able to work around it by explicitly setting the schedule. Scheduling is very difficult so when the compiler’s automatic scheduler is getting it wrong, we’d like to know about it. Can you send a report to PGI Customer Service (trs@pgroup.com) and include a reproducing example?
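
For reference, pinning the schedule by hand on the kernel from your earlier post would look something like the sketch below; the vector lengths 4 and 128 are just the ones you used before, so tune them for your case:

    !$acc kernels loop gang vector(4)
    do l=1,ADM_lall
    !$acc loop gang vector(128)
       do n =1, ADM_gall
          ! ... same loop body as in your earlier posts ...
       end do
    end do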

Thanks,
Mat