Using shared memory in my CUDA code does not decrease execution time

The most time-consuming part of my code is shown below.

When I use global memory, the execution time is lower than when I use shared memory.

attributes(global) subroutine inc(p1, m, p, ds, r, c)

  implicit none

  real(fp_kind), dimension(N_z,N_rho) :: m, p, p1
  real(fp_kind), dimension(1,N_rho)   :: r
  real(fp_kind), dimension(N_z,1)     :: c
  real(fp_kind), value                :: ds

  integer :: i, j, is, js, iter

  real(fp_kind), shared, dimension(0:(Bdimy+1),0:(Bdimx+1)) :: p_s

  is = threadIdx%x
  js = threadIdx%y

  i = (blockIdx%x-1)*blockDim%x + is
  j = (blockIdx%y-1)*blockDim%y + js

  ! transfer this block's tile of p to shared memory
  if (i >= 1 .and. i <= N_z .and. j >= 1 .and. j <= N_rho) then
     p_s(is,js) = p(i,j)
  end if

  ! edge threads also load the halo cells along the first dimension
  if (is == 1 .and. i > 1) then
     p_s(is-1,js) = p(i-1,j)
  else if (is == Bdimy .and. i < N_z) then
     p_s(is+1,js) = p(i+1,j)
  end if

  ! and along the second dimension
  if (js == 1 .and. j > 1) then
     p_s(is,js-1) = p(i,j-1)
  end if
  if (js == Bdimx .and. j < N_rho) then
     p_s(is,js+1) = p(i,j+1)
  end if

  do iter = 1, 100
     if (1 < i .and. i < N_z .and. 1 < j .and. j < N_rho) then
        p1(i,j) = -m(i,j) + c(i,1)*((-4._fp_kind + 2._fp_kind/c(i,1))*p_s(is,js) &
                     + p_s(is-1,js)                                       & ! down  to center
                     + p_s(is+1,js)                                       & ! up    to center
                     + (1._fp_kind - ds/(2._fp_kind*r(1,j)))*p_s(is,js-1) & ! left  to center
                     + (1._fp_kind + ds/(2._fp_kind*r(1,j)))*p_s(is,js+1))  ! right to center
     end if
  end do

  return
end subroutine inc

This kernel is launched with:

call inc<<<grid,tBlock>>>(p1_d, m_d, p_d, ds, r_d, c_d)

The threads per block in my code are (Bdimy=128, Bdimx=1).
I put a 1000-iteration loop around the kernel's computation to increase the amount of computation and the use of shared memory, but it does not help: global memory is still faster than shared memory.
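
For reference, grid and tBlock are set up for these dimensions roughly as below; the ceiling-style grid sizing shown here is illustrative, and my actual host code may differ:

type(dim3) :: grid, tBlock

tBlock = dim3(Bdimy, Bdimx, 1)                  ! 128 x 1 threads per block
grid   = dim3((N_z   + Bdimy - 1)/Bdimy, &      ! enough blocks to cover the N_z rows
              (N_rho + Bdimx - 1)/Bdimx, 1)     ! and the N_rho columns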

In both cases (global and shared memory) the code gives correct results; my problem is only the execution time.

Where is the problem? Is there any way to decrease the execution time of this code?

Best regards,

Hi ali,

My best guess, without a reproducible example, is that the automatic hardware caching is working well for your case, without the extra code overhead needed to stage data in shared memory. This might change if you used a 16x16 block size instead of 128x1, since with 128x1 the threads access most of the data contiguously (so global loads coalesce well and shared memory adds little reuse).
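
For example, a 16x16 configuration would be set up along these lines (illustrative only; the shared-memory array bounds and halo logic in the kernel would need to change to match):

tBlock = dim3(16, 16, 1)                           ! 16 x 16 threads per block
grid   = dim3((N_z + 15)/16, (N_rho + 15)/16, 1)
call inc<<<grid,tBlock>>>(p1_d, m_d, p_d, ds, r_d, c_d)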

Using a profiler such as PGPROF or NVPROF and comparing the two profiles will give you a more definitive answer as to the difference.
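
For example, from the command line (the executable names are placeholders for your two builds):

nvprof ./inc_global      # kernel timings for the global-memory version
nvprof ./inc_shared      # kernel timings for the shared-memory version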

Note that your code needs a call to “syncthreads” after you update the shared memory. Without it there’s potential for a race condition.
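
In the kernel above that means placing the barrier between the last write to p_s and the first read of it, something like:

! ... center and halo writes to p_s ...

call syncthreads()   ! make all writes to p_s visible to every thread in the block

do iter = 1,100
   ! ... stencil computation that reads p_s ...
end do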

-Mat