CUDA variable not updating

I am trying to use CUDA to run what I thought would be a very simple-minded coding of the diffusion equation. Here’s the code:

module dimensions

integer, parameter :: nx=256, ny=256

end module




program diffuse

use cudafor
use dimensions

real*8 :: v(nx,ny), diffconst
real*8, device :: v_d(nx,ny)
integer :: nloops, outputdl
type(dim3) :: grid, tBlock
integer, device :: n_d, outputdl_d
real*8, device :: diffconst_d

nloops=10000
outputdl=1000
diffconst=.25

outputdl_d=outputdl
diffconst_d=diffconst

tBlock=dim3(64,64,1)
grid=dim3(ceiling(real(nx)/tBlock%x), &
          ceiling(real(ny)/tBlock%y), 1)

open(unit=11,file='diff_output.txt')

v=0

n=0

do while (n.le.nloops)
  v_d=v
  n_d=n
  call diff_time_stepper<<<grid,tBlock>>>(v_d,diffconst_d,n_d,outputdl_d)
  v=v_d
  n=n_d
  write(11,*) v
  print *,n
enddo


end program





attributes(global) subroutine diff_time_stepper(v,diffconst,n,nsteps)


real*8 :: v(:,:)
real*8 :: diffconst
integer :: n
integer :: nsteps

real*8 :: vintermed
integer :: i,j,m
integer :: nx, ny

nx=256
ny=256

i=(blockIdx%x-1)*blockDim%x+threadIdx%x
j=(blockIdx%y-1)*blockDim%y+threadIdx%y

do m=1,nsteps
  if (i>1 .and. i<nx .and. j>1 .and. j<ny) then
    vintermed=v(i,j)+diffconst*(v(i-1,j)-2.*v(i,j)+v(i+1,j)+v(i,j-1)-2.*v(i,j)+v(i,j+1))
    v(i,j)=vintermed
  endif
! add a source for the heck of it
  if (i==64 .and. j==64) v(i,j)=v(i,j)+1
enddo


n=n+nsteps

end subroutine

The one obvious problem I have is that the variable “n” (returned to main as “n_d”) is not updating. At all. There are probably numerous other problems I don’t know about too. Trying to run with cuda=emu gives a seg fault. Any ideas? Thanks. I’m really new to this.

But wait, there’s more. In the triple chevron, I don’t know what happened. The code – I am looking at it now – reads

call diff_time_stepper<<>>( … )

OK, this is like a Twilight Zone episode or something. Inside the triple chevrons are “grid,tBlock”. Don’t know how to make it look right.

I think this has to do with the HTML parser or something with the Forum. Try substituting, say, {{{ }}} instead in your paste.

Hi cablesb,

I think this has to do with the HTML parser or something with the Forum. Try substituting, say, {{{ }}} instead in your paste.

Yes, it’s our UF. I keep meaning to look at the UF code to see if I can fix it, but just haven’t done it yet.

The one obvious problem I have is that the variable “n” (returned to main as “n_d”) is not updating. At all. There are probably numerous other problems I don’t know about too. Trying to run with cuda=emu gives a seg fault. Any ideas? Thanks. I’m really new to this.

Your code has a number of issues. The first is that global kernel routines must have an explicit interface. Without an interface, your program will most likely seg fault if you are passing arguments. You can fix this by adding an explicit interface, or by putting your routine in a module; module routines get an explicit interface automatically.
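
For example, the structure could look something like this (just a sketch; the kernel body is elided):

module diff_kernel_mod
contains
  ! putting the kernel in a module gives callers an explicit interface
  attributes(global) subroutine diff_time_stepper(v,diffconst,n,nsteps)
    real*8 :: v(:,:)
    real*8 :: diffconst
    integer :: n, nsteps
    ! ... kernel body as before ...
  end subroutine
end module diff_kernel_mod

Then add “use diff_kernel_mod” in the main program alongside “use cudafor”.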

Next, your algorithm will give you wrong answers in two spots:

n=n+nsteps

Here every thread will update “n”, and the value stored back to global memory will depend on the execution order of the threads. To fix this, you need to either guard the update (for example, “if (i.eq.1.and.j.eq.1) then”) or, better yet, move it to the host code.
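
A guarded version might look like this (just a sketch):

  ! only one thread performs the update, so there is no race on n
  if (i==1 .and. j==1) then
     n = n + nsteps
  endif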

 
do m=1,nsteps
  if (i>1 .and. i<nx .and. j>1 .and. j<ny) then
    vintermed=v(i,j)+diffconst*(v(i-1,j)-2.*v(i,j)+v(i+1,j)+v(i,j-1)-2.*v(i,j)+v(i,j+1))
    v(i,j)=vintermed
  endif
! add a source for the heck of it
  if (i==64 .and. j==64) v(i,j)=v(i,j)+1
enddo

Again, this code will lead to non-deterministic results. You update v(i,j) with the new value before making sure all threads that need the old value have read it. Hence, some threads may be reading the new value and others the old. Worse, the order in which the threads run changes from run to run, and so will your results.

You need to globally synchronize your threads between the reads and writes to “v”, and the only way to globally synchronize threads is between kernel launches. Hence, you need to change your algorithm to something like:

! Host code
do while (n.le.nloops) 
    do m=1,nsteps 
!         call kernel1 to perform the calculation and store the results to a temp array
!         copy temp array back into v
    enddo
    n=n+nsteps 
end do
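
A rough, untested sketch of what that could look like (the kernel name “diff_step” and the temporary array “vnew_d” are just placeholders, and the kernel would go inside a module as described above):

! One diffusion step per kernel launch: read from vold, write to a separate array vnew
attributes(global) subroutine diff_step(vold,vnew,diffconst,nx,ny)
  real*8 :: vold(:,:), vnew(:,:)
  real*8, value :: diffconst
  integer, value :: nx, ny
  integer :: i, j
  i=(blockIdx%x-1)*blockDim%x+threadIdx%x
  j=(blockIdx%y-1)*blockDim%y+threadIdx%y
  if (i<=nx .and. j<=ny) then
    if (i>1 .and. i<nx .and. j>1 .and. j<ny) then
      vnew(i,j)=vold(i,j)+diffconst*(vold(i-1,j)+vold(i+1,j)+vold(i,j-1)+vold(i,j+1)-4.*vold(i,j))
    else
      vnew(i,j)=vold(i,j)   ! keep boundary values unchanged
    endif
  endif
end subroutine

! Host loop (sketch), with v_d and vnew_d both declared "real*8, device":
! do while (n.le.nloops)
!   do m=1,nsteps
!     call diff_step<<<grid,tBlock>>>(v_d,vnew_d,diffconst,nx,ny)
!     v_d=vnew_d          ! device-to-device copy of the new values back into v_d
!   enddo
!   n=n+nsteps
!   v=v_d
!   write(11,*) v
! enddo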

I use a similar algorithm for my CUDA Fortran implementation of the “Game of Life” (found HERE).

Hope this helps,
Mat

Thanks much for the advice. But today’s not my day. I have decided to try something much simpler. I am trying to just take one element of my array and increment it with each iteration step:

module dimensions
integer, parameter :: nx=256, ny=256
end module


module diff_kernel_mod

contains


attributes(global) subroutine add_source(v,iloc,jloc)

real*8 :: v(:,:)
integer, value :: iloc, jloc

i=(blockIdx%x-1)*blockDim%x+threadIdx%x
j=(blockIdx%y-1)*blockDim%y+threadIdx%y

if (i==iloc .and. j==jloc) then
  v(i,j)=v(i,j)+1
endif

end subroutine

end module diff_kernel_mod


program diffuse


use cudafor
use dimensions
use diff_kernel_mod


real*8 :: v(nx,ny), diffconst
real*8, device :: v_d(nx,ny)
integer :: nloops, outputdl
type(dim3) :: grid, tBlock
real*8, device :: diffconst_d

nloops=100
outputdl=10
diffconst=.25

outputdl_d=outputdl
diffconst_d=diffconst

tBlock=dim3(64,64,1)
grid=dim3(ceiling(real(nx)/tBlock%x), &
          ceiling(real(ny)/tBlock%y), 1)

open(unit=11,file='diff_output.txt')

v=0.

n=0

do while (n.le.nloops)
  do m=1,outputdl
    v_d=v
    call add_source<<<grid,tBlock>>>(v_d,64,64)
    v=v_d
  enddo
  n=n+outputdl
  write(11,*) v
  print *,n
enddo


end program

I expect v(64,64) to increase steadily. And, when I compile with cuda=emu, it does. But when I run it on the GPGPU for real, it stubbornly remains zero. If you could tell me what I’m doing wrong I would be much obliged.

Hi Cablesb,

Adding a bit of error handling shows that you’re getting an “invalid configuration argument” error. The problem is that your block size is 64x64, or 4096 threads. The maximum number of threads per block varies by device, but on my Tesla C2070 the max is 1024. Changing “64” to “16” fixes the problem. Use the utility “pgaccelinfo” to see the max for your device.
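
If you’d rather check the limit from inside the program than run pgaccelinfo, something along these lines should work (just a sketch; it assumes “use cudafor”, as in the code below):

  ! query the device limit at run time
  integer :: istat
  type(cudaDeviceProp) :: prop
  istat = cudaGetDeviceProperties(prop, 0)
  print *, 'Max threads per block:', prop%maxThreadsPerBlock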

module dimensions
integer, parameter :: nx=256, ny=256
end module


module diff_kernel_mod

contains


attributes(global) subroutine add_source(v,iloc,jloc)
implicit none
real*8 :: v(:,:)
integer, value :: iloc, jloc
integer :: i,j

i=(blockIdx%x-1)*blockDim%x+threadIdx%x
j=(blockIdx%y-1)*blockDim%y+threadIdx%y

if (i==iloc .and. j==jloc) then
  v(i,j)=v(i,j)+1
endif

end subroutine

end module diff_kernel_mod


program diffuse


use cudafor
use dimensions
use diff_kernel_mod
implicit none


real*8 :: v(nx,ny), diffconst
real*8, device :: v_d(nx,ny)
integer :: nloops, outputdl,n,m,ierr
type(dim3) :: grid, tBlock

nloops=1
outputdl=10
diffconst=.25

tBlock=dim3(16,16,1)
grid=dim3(ceiling(real(nx)/tBlock%x), &
          ceiling(real(ny)/tBlock%y), 1)
print *, 'Block:', tBlock
print *, 'Grid: ', grid
open(unit=11,file='diff_output.txt')

v=0.
n=0

do while (n.le.nloops)
  do m=1,outputdl
    v_d=v
    call add_source<<<grid,tBlock>>>(v_d,16,16)
    ierr=cudaGetLastError()
    if (ierr.ne.0) then
       print *, 'ERROR:', cudaGetErrorString(ierr)
    endif
    v=v_d
  enddo
  n=n+outputdl
  write(11,*) v
  print *,n
enddo

end program

Hope this helps,
Mat

I think this has to do with the HTML parser or something with the Forum. Try substituting, say, {{{ }}} instead in your paste.

I think I have this fixed now. Looks like I just needed to turn off HTML support. I’ll watch for other issues.

Mat