adding a matrix element safely to itself in a device sub

Hi all,

anyone can show me how to safely (i.e. no errors or damages to performance) add an element in a matrix (like aj,bj,cj,dj) to itself in a device subroutine? Please see subroutine below.

!=============================================================
attributes (global) subroutine reyneq_p4_kernel(nx,ny,nxmax,nymax,p,xref,yref, &
	aw,ap,ae,ar,as,an,aj,bj,cj,dj)
!=============================================================

	implicit none 
	integer, value :: nx,ny,nxmax,nymax
	integer :: i,j,k,km1,im1,ip1,jm1,jp1,nxm1,nym1
	real(8) :: p(nxmax,nymax),xref(nxmax),yref(nxmax),aw(nxmax,nymax),ap(nxmax,nymax), &
	ae(nxmax,nymax),ar(nxmax,nymax),as(nxmax,nymax),an(nxmax,nymax),aj(nxmax), &
	bj(nxmax),cj(nxmax),dj(nxmax)
	real(8) :: ymyp,xmxp 

	i = (blockidx%x - 1) * blockDim%x + threadidx%x
	j = (blockidx%y - 1) * blockDim%y + threadidx%y

	nxm1 = nx -1
	nym1 = ny -1
	
	if (i >= 2 .and. i <= nxm1) then
          im1=i-1
          ip1=i+1
          xmxp=(xref(ip1)-xref(im1))/2.d0
          
		if (j >= 2 .and. j <= nym1) then
            aj(im1)=0.d0
			bj(im1)=0.d0
			cj(im1)=0.d0
			dj(im1)=0.d0
			jm1=j-1
            jp1=j+1
            ymyp=(yref(jp1)-yref(jm1))/2.d0
!
            aj(im1)=aj(im1)+aw(im1,jm1)
            bj(im1)=bj(im1)+ap(im1,jm1)
            if(j.ne.2) bj(im1)=bj(im1)+as(im1,jm1)
            if(j.ne.nym1) bj(im1)=bj(im1)+an(im1,jm1)
            cj(im1)=cj(im1)+ae(im1,jm1)
            dj(im1)=dj(im1)+ar(im1,jm1)-aw(im1,jm1)*p(im1,j)    &
                      -ae(im1,jm1)*p(ip1,j)-as(im1,jm1)*p(i,jm1)  &
                      -an(im1,jm1)*p(i,jp1)-ap(im1,jm1)*p(i  ,j)
         
	    endif  !j
     endif  !i
		
	end subroutine reyneq_p4_kernel

regards,

Dolf

No way to not suffer some performance, but I see three ways to at least get correct answers.

  1. Use “atomicadd” operations when updating the arrays.
  2. Just parallelize the “i” dimension and then loop over “j” inside the kernel.
  3. Create a shared array, one element per “j” to hold the intermediary values, then have a reduction loop to sum the value into “i”

Other folks may have ideas as well. I favor #2 assuming “i” is fairly large.

  • Mat

Hi Mat,
I appreciate the reply. I used your suggested #2, but still have an error of 10^10.
Can you please give examples of #1 and #3?
i and j are equal to a value between 100 and 1000. currently I am using 402.

thanks,
Dolf

I used your suggested #2, but still have an error of 10^10.

There might be some other issue going on besides a race condition.

Using atomics would be something like:

      if (j >= 2 .and. j <= nym1) then 
! only one thread should initialize the arrays
      if (j.eq.2) then 
         aj(im1)=0.d0 
         bj(im1)=0.d0 
         cj(im1)=0.d0 
         dj(im1)=0.d0
      endif
      call syncthreads()
 
... later use atomicadd to update the global arrays
    aj(im1)=atomicadd(aj(im1),aw(im1,jm1)) 
    bj(im1)=atomicadd(bj(im1),ap(im1,jm1))
  ... etc ...

The shared memory version is a bit more complex and unfortunately don’t have the time to pull together an example. Busy getting ready for the Supercomputing Conference in Austin.

Feel free to send me your code but I may not be able to look at until I’m back from SC.

  • Mat