CUDA "philosophy"

FIrst, thanks for the help on my last thread re: my coding of the diffusion equation. I have the code going now, but would like to get some “general edification.” My kernel for the diffusion equation is:

attributes(global) subroutine diff_time_stepper(v,diffconst)


real*8 :: v(:,:)
real*8 :: diffconst

real*8 :: vintermed
integer :: i,j,m
integer :: nx, ny

nx=256
ny=256

i=(blockIdx%x-1)*blockDim%x+threadIdx%x
j=(blockIdx%y-1)*blockDim%y+threadIdx%y

if (i<nx .and. j<ny>1 .and. j>1) then
  vintermed=v(i,j)+diffconst*(v(i-1,j)-2.*v(i,j)+v(i+1,j)+v(i,j-1)-2.*v(i,j)+v(i,j+1))
  v(i,j)=vintermed
! add a source for the heck of it
  if (i==64 .and. j==64) v(i,j)=v(i,j)+1
endif


end subroutine

My question is: When I add 1 to v(64,64), am I risking non-deterministic behavior? It seems arguable to me that the thread that handles (64,64) might, for instance complete this routine early sometimes and affect the calculation of the thread that handles, for instance, (63,64). And then, sometimes, it might not. But the code seems to be working. As I have thought about this, I have gotten confused about just what might cause non-deterministic behavior. Maybe my first two calculations, where vinterm gets defined and then put back in v might be problematic? Anyone care to enlighten me or point me to a good discussion on this topic? Thanks.

When I add 1 to v(64,64), am I risking non-deterministic behavior?

Yes.

It seems arguable to me that the thread that handles (64,64) might, for instance complete this routine early sometimes and affect the calculation of the thread that handles, for instance, (63,64). And then, sometimes, it might not. But the code seems to be working.

Correct.

Maybe my first two calculations, where vinterm gets defined and then put back in v might be problematic?

Yes, this is problematic.

Anyone care to enlighten me or point me to a good discussion on this topic?

Do a web serach for “loop carried dependence cuda” and you’ll find a lot of discussions. In particular, search Google Books and you’ll find section 4.5, from “Computer Architecture: A Quantitative Approach” by John L. Hennessy and David A. Patterson which gives a good overview of loop dependencies.

Note that the best way to fix this code is to break this code into two separate kernels (or a kernel and device to device copy) since global synchronization is required between the calculation of “vintermed” and storing the result back to “v”.

  • Mat

OK, I feel like I have gone from not knowing what I don’t know to knowing what I don’t know. And if that’s confusing, think how I feel. Actually, I think this is progress. And thanks for the reference. Much obliged!