Using shared memory of a block to improve performance

Dear All,

I need some help improving the performance of the following code. I am getting only 25% theoretical and achieved occupancy, which I believe can be increased. The recommendation from nvprof is to reduce the number of registers per thread; currently it is 102. If I limit it with the flag ‘maxregcount:32’, the theoretical and achieved occupancy go to 100%, but the performance gets much worse. I feel that this is not the way to go, but I don’t know of other approaches yet.

Also, I heard that the performance can be improved by copying all the data needed for a block from global memory into that block’s shared memory. However, I don’t know how to do it. In my case the arrays ‘zij1(:,:), plma(:,:,:,:), plmb(:,:,:,:), jint(:,:,:), basis_a(:,:,:,:), basis_b(:,:,:,:), basis_bp(:,:,:,:)’ are stored in global GPU memory. How can I copy the required chunk of them into the block’s shared memory, and will it improve performance?

My task is to calculate the matrix VRtmp(:,:).
So, the subroutine get_LR_mat_sum_pt_prj(il,jl,bigr,zz,vmatR) is called from the following kernel region:

!$acc kernels
!$acc loop independent gang collapse(2)
do k=1,num_t
  do j=1,num_p
    call get_LR_mat_sum_pt_prj(j,k,bigr,z,vRtmp(j,k))
  enddo
enddo
!$acc end loop
!$acc end kernels

where

subroutine get_LR_mat_sum_pt_prj(il,jl,bigr,zz,vmatR)
!$acc routine vector
  use data_base
  use flogs
  implicit none
  integer::mf,mi,lf,li,kf,ki,i,j,mmf,mmi
  integer,intent(in)::il,jl
  real(kind=id)::bigr,zz,difc
 real(kind=id)::tmp,fact,func,com,comres,arg,fact_obk,com_obk,com_dir,var_lam,var_mu,carg,sarg,wig_d_f,wig_d_i
  complex(kind=p4),intent(out) ::vmatR
  complex(kind=id) :: sumz,sumresz,sumz_obk,sumzj,sumreszj,sumz_obkj,phase,tmpLR

  kf=n_p(il)
  lf=l_p(il)
  mf=m_p(il)
  ki=n_t(jl)
  li=l_t(jl)
  mi=m_t(jl)

sumz = zero
sumz_obk = zero
!$acc loop independent  collapse(2) reduction(+:sumz,sumz_obk)
do i=1,nglag
!!$acc loop independent  reduction(+:sumz,sumz_obk)
  do j=1,ngleg
     tmpLR=zij1(i,j)*plma(lf,mf,i,j)*plmb(li,mi,i,j)*jint(mi-mf,i,j)*basis_a(kf,lf,i,j)
     sumz=sumz+tmpLR*basis_b(ki,li,i,j)
     sumz_obk=sumz_obk+tmpLR*basis_bp(ki,li,i,j)
  enddo
!$acc end loop
enddo
!$acc end loop

  fact_obk=bigr*lm_fac(lf,mf)*lm_fac(li,mi)/8._id
  fact=fact_obk*bigr/2._id

  sumz=sumz*fact*cu**(mi-mf)
  sumz_obk=sumz_obk*fact*cu**(mi-mf)

  phase=cmplx(cos(zz*(k_t(inl_t(jl))-k_p(inl_p(il)))),sin(zz*(k_t(inl_t(jl))-k_p(inl_p(il)))),id)

  vmatR=(sumz/bigr*target_charge*projectile_charge-sumz_obk*target_charge)/v*phase  !was added to add 1/R

end subroutine get_LR_mat_sum_pt_prj

Hi Ilkhom,

The recommendation from nvprof is to reduce the number of registers per thread; currently it is 102. If I limit it with the flag ‘maxregcount:32’, the theoretical and achieved occupancy go to 100%, but the performance gets much worse. I feel that this is not the way to go, but I don’t know of other approaches yet.

The high register count is due to the local variables, which are stored in registers. By limiting the number of registers, there’s still a need to store these variables, so they get “spilled”. If they are spilled to the L1 cache, performance usually isn’t affected. However, if they are spilled to global memory (which is most likely the case here), it can hurt performance.

You might try using a maxregcount of 64 or 96 instead. While you won’t achieve 100% occupancy, 50% occupancy is usually fine as well. At 96, the spills will most likely stay in the L1 cache.
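For reference, the register limit is set on the compile line. Depending on the compiler release, that would look something like one of the following (I’m assuming the nvfortran/PGI OpenACC flag spellings here):

nvfortran -acc -gpu=maxregcount:64 -Minfo=accel mycode.f90
pgfortran -acc -ta=tesla:maxregcount:64 -Minfo=accel mycode.f90

Adding the “ptxinfo” sub-option to -gpu/-ta should also echo ptxas’ per-kernel register and spill counts, so you can check whether the spills actually go away.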

The best way to reduce register usage is to limit the number of local variables, either by splitting the computation into multiple kernels or by reusing variables.

For example, it looks like several of the local variables are only used once (like “kf”), so they could be removed and the array reference used directly (“n_p(il)”), along these lines:
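(Sketch only; whether this actually lowers the register count depends on how the compiler handles the common subexpressions, so check the compiler feedback.)

!$acc loop independent collapse(2) reduction(+:sumz,sumz_obk)
do i=1,nglag
  do j=1,ngleg
     ! index the module arrays directly instead of copying into kf/lf/mf/ki/li/mi first
     tmpLR=zij1(i,j)*plma(l_p(il),m_p(il),i,j)*plmb(l_t(jl),m_t(jl),i,j)* &
           jint(m_t(jl)-m_p(il),i,j)*basis_a(n_p(il),l_p(il),i,j)
     sumz=sumz+tmpLR*basis_b(n_t(jl),l_t(jl),i,j)
     sumz_obk=sumz_obk+tmpLR*basis_bp(n_t(jl),l_t(jl),i,j)
  enddo
enddo
!$acc end loop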

Also, I heard that the performance can be improved by copying all the data needed for a block from global memory into that block’s shared memory. Will it improve performance?

Possibly, but hardware caching has improved significantly over the years, reducing the need for software caching (which is effectively what using CUDA shared arrays is).

However, I don’t know how to do it.

I am hesitant to recommend the following since I don’t think it will help much and it will require some rewriting of your code, but I will put it out there. The key to caching is data reuse, for which the code has limited opportunities.

You can try manually inlining the “get_LR_mat_sum_pt_prj” subroutine and then interchanging the “k” and “j” loops. You could then move the local assignments that depend only on the “j” index, such as “kf=n_p(il)”, to before the “k” loop, getting better reuse. Because these variables are at the “gang” loop level, the compiler will store them in shared memory for use by all the vectors.

Something like:

!$acc kernels
!$acc loop independent gang
  do j=1,num_p
    kf=n_p(j)
    lf=l_p(j)
    mf=m_p(j)
!$acc loop worker
    do k=1,num_t
      ki=n_t(k)
      li=l_t(k)
      mi=m_t(k)

The caveat here is that I’ve limited the gang parallelism to just the “j” loop, so num_p should be a large value. Also, I’ve added a “worker” level so “k” is still parallelized, but not as much as if it were still part of the collapsed gang loop. Note that only the “_p” assignments can be hoisted out of the “k” loop; the “_t” ones depend on “k” and stay inside it. By default, the compiler will use 4 workers and 32 vectors in this case, so you may want to try using “vector(128)” on the inner vector loops to keep the vector length the same. Bigger gangs (CUDA blocks) usually benefit when more shared memory is used, and the occupancy doesn’t change, since occupancy is about the total number of threads rather than the block size; i.e., bigger blocks with more threads mean fewer blocks per SM, but the total thread count remains the same.
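To make the vector-length suggestion concrete, the inlined reduction loops inside the “k” loop could then look roughly like this (sketch only; I’ve renamed the inner quadrature indices to ii/jj so they don’t clash with the outer “j” loop, and the fact/phase post-processing from the subroutine would follow as before):

      sumz = zero
      sumz_obk = zero
!$acc loop independent collapse(2) vector(128) reduction(+:sumz,sumz_obk)
      do ii=1,nglag
        do jj=1,ngleg
          tmpLR=zij1(ii,jj)*plma(lf,mf,ii,jj)*plmb(li,mi,ii,jj)* &
                jint(mi-mf,ii,jj)*basis_a(kf,lf,ii,jj)
          sumz=sumz+tmpLR*basis_b(ki,li,ii,jj)
          sumz_obk=sumz_obk+tmpLR*basis_bp(ki,li,ii,jj)
        enddo
      enddo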

I don’t think that software caching the 3D and 4D arrays would be of benefit. (Here, caching would mean creating a 2D gang-private array, writing a loop to copy the 4D array values into the private 2D array, and then using the private array in the computation, as sketched below.) The problem is that there really isn’t any reuse of data, since each element of the arrays is only accessed once.
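Purely for illustration, since you asked how it would be done mechanically: with the restructured loops above, software caching one slice of “plma” would look something like the sketch below. The “plma_tile” array and its bounds are placeholders, and it only pays off if the tile fits in the available shared memory and is actually reused.

  real(kind=id) :: plma_tile(nglag_max,ngleg_max)  ! gang-private tile; type and bounds are placeholders

!$acc loop independent gang private(plma_tile)
  do j=1,num_p
    lf=l_p(j)
    mf=m_p(j)
!$acc loop independent collapse(2) worker vector
    do jj=1,ngleg
      do ii=1,nglag
        plma_tile(ii,jj)=plma(lf,mf,ii,jj)  ! copy the slice this gang needs
      enddo
    enddo
!$acc loop worker
    do k=1,num_t
      ! same reduction loops as above, reading plma_tile(ii,jj)
      ! instead of plma(lf,mf,ii,jj)
    enddo
  enddo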



What could help with the arrays is rearranging your indices so that “i” or “j” accesses the stride-1 dimension. For example, use “plma(i,j,lf,mf)” instead of “plma(lf,mf,i,j)” (or “plma(i,j,l_p(il),m_p(il))” to combine with my first recommendation). This has no effect on occupancy, but it will lessen the memory divergence and lead to better hardware caching of the arrays. However, it means that you’d need to update the indices everywhere in your code, though it would most likely benefit CPU SIMD vectorization as well, giving you an overall performance boost (assuming the same access pattern is used elsewhere).
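Concretely, the body of the reduction loop would then read (same computation as before, just with the quadrature indices moved to the leftmost, stride-1 positions):

     tmpLR=zij1(i,j)*plma(i,j,lf,mf)*plmb(i,j,li,mi)*jint(i,j,mi-mf)*basis_a(i,j,kf,lf)
     sumz=sumz+tmpLR*basis_b(i,j,ki,li)
     sumz_obk=sumz_obk+tmpLR*basis_bp(i,j,ki,li)

The allocations and every other access site would need the matching index reordering.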

Apologies if this answer is a bit convoluted. Without full knowledge of the application it’s difficult to offer specific recommendations, but hopefully I’ve given you some ideas.

Best Regards,
Mat

Hi Mat,

Thank you very much for your insightful response. Now, to do an experiment, I slightly modified the subroutine get_LR_mat_sum_pt_prj as follows:

sumz = zero
sumz_obk = zero
!$acc loop independent  collapse(2) reduction(+:sumz,sumz_obk)
do i=1,nglag
  do j=1,ngleg
     t1=1._id   !plma(lf,mf,i,j)
     t2=1._id   !plmb(li,mi,i,j)
     t3=1._id   !jint(mi-mf,i,j)
     t4=1._id   !basis_b(ki,li,i,j)
     t5=1._id   !basis_bp(ki,li,i,j)
     t6=1._id   !basis_a(kf,lf,i,j)
     t7=cmplx(1._id,1._id)  !zij1(i,j)
     tmpLR=t1*t2*t3*t6*t7
     sumz=sumz+tmpLR*t4
     sumz_obk=sumz_obk+tmpLR*t5
  enddo
enddo
!$acc end loop

So, if I assign specific values (1._id) to t1, t2, …, t7 instead of reading the arrays stored in GPU memory (plma, plmb, …), the code runs in only 427.63 msec. If I assign from the stored arrays, it runs in 8.926 sec, which is about 21 times slower. This also increased the number of registers from 70 to 150 and decreased the theoretical and achieved occupancy from 43.8% to 18.8%. The grid size and block size didn’t change.
I clearly see that memory access is very expensive compared to the computation time. Can it be improved?

For information, in these calculations num_t=num_p=4385, nglag=108, ngleg=120.

I clearly see that memory access is very expensive compared to the computation time. Can it be improved?

It might be due to the arrays not being accessed along the stride-1 dimension, causing memory divergence across the threads.

Can you refactor the arrays so that “i” is the first (stride-1) dimension? i.e. “plma(i,j,lf,mf)”

The increased register usage may be due to the local variables the compiler uses to index the arrays. Using stride-1 access might help there as well by reducing the index computation.

-Mat

Before applying your idea of refactoring the arrays in the main code, I tried a similar idea in the following matrix multiplication code. The results were unexpected to me.

When I access the a(:,:) and b(:,:) arrays along the stride-1 dimension, the code runs significantly slower: 18.31441688537598 sec.

program mmm
  implicit none
  integer, parameter :: id = selected_real_kind(8)
  integer::i,j,k,l,imax
  complex(kind=id),allocatable,dimension(:,:)::a,b,c
  complex(kind=id),parameter::  zero=cmplx(0._id,0._id)
  complex(kind=id),parameter::  one=cmplx(1._id,0._id)
  integer :: t1, t2, dt, count_rate, count_max
  real(kind=id) ::  secs


  imax=5000
  allocate(a(1:imax,1:imax))
  allocate(b(1:imax,1:imax))
  allocate(c(1:imax,1:imax))


!$acc data create(a,b,c)
!$acc kernels
  a(1:imax,1:imax)=cmplx(1,1,id)
  b(1:imax,1:imax)=cmplx(1,1,id)
  c(1:imax,1:imax)=zero
!$acc end kernels


call system_clock(count_max=count_max, count_rate=count_rate)
call system_clock(t1)

!$acc kernels 
!$acc loop independent collapse(2)
      do k = 1,imax
       do l = 1,imax
!$acc loop seq 
        do j = 1,imax
!           c(l,k) = c(l,k) + a(l,j) * b(k,j)
           c(l,k) = c(l,k) + a(j,l) * b(j,k)
        enddo
       enddo
      enddo
!$acc end kernels

call system_clock(t2)
dt = t2-t1
secs = real(dt)/real(count_rate)
print*,'time in secs',secs
!$acc end data

end program mmm

If I have c(l,k) = c(l,k) + a(l,j) * b(k,j) in the innermost summation loop, the time is only 3.599750041961670 sec. The Fortran rule for accessing arrays worked in the opposite way here, which is strange to me. I would really appreciate it if you could shed some light on this.

Hi Ilkhom,

You want stride-1 access across the vector loop to ensure better memory access by the threads in a CUDA warp. Adjacent threads correspond to adjacent values of the vector index “l”, so with “a(l,j)” a warp reads consecutive memory locations (coalesced), while with “a(j,l)” the threads in a warp read locations imax elements apart. Since you’re running the “j” loop sequentially, being stride-1 along “j” doesn’t help, so the “a” and “b” arrays can’t be cached effectively.

In the second version, the code is stride-1 across the vector loop, “l”, so it is getting the better memory access.
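For what it’s worth, in this toy example you can have it both ways by keeping “a(l,j)” for the vector loop and indexing “b” as “b(j,k)” so that the sequential “j” loop walks down a contiguous column. Note this computes matmul(a,b) rather than a times transpose(b), so it is only meant to illustrate the access pattern:

!$acc loop independent collapse(2)
      do k = 1,imax
       do l = 1,imax
!$acc loop seq
        do j = 1,imax
!          a(l,j): consecutive threads (consecutive l) read consecutive addresses
!          b(j,k): each thread walks contiguously down column k
           c(l,k) = c(l,k) + a(l,j) * b(j,k)
        enddo
       enddo
      enddo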

-Mat

Hi Mat,

Thanks again for your reply. Following your recommendations I restructured all of my arrays to make the i,j indices leftmost. It indeed helped, reducing the time by a factor of ~3.5. The compiler now also generates SIMD vector code for the i and j loops. However, for some reason the codes compiled with the -fast and -Mvect=nosimd flags take exactly the same time. Why is there no effect from SIMD vectorization?

Cheers,
Ilkhom

However, for some reason the codes compiled with the -fast and -Mvect=nosimd flags take exactly the same time. Why is there no effect from SIMD vectorization?

Hard to say.

The compiler may be generating altcode, so at run time it has decided, based on the loop trip count or other criteria, that the unrolled path is better and is not actually using the vector instructions.

It could be that it is using the vector instructions, but the non-SIMD version (typically unrolled code) is just as fast.

It could also be that the vector version is faster but something else in the program is dominating the overall time so the speed-up is negligible.

Without analysis, it’d be difficult to say whether it’s one of these scenarios or something different.

-Mat