how does this newbie fix his code?

cablesb · March 26, 2012, 9:05pm

I’m a complete CUDA novice. Am trying to run a little test code to figure out the workings of CUDA. I have a subroutine:

attributes(global) subroutine time_step(a,v,dt,dx,ntimes,simlength)

use cudafor

real8, intent(inout) :: a(simlength)
real8, intent(in) :: v, dt, dx
integer :: ntimes
integer, value :: simlength
integer :: i, n
real*8 :: aux(250)

real*8 :: decl

decl=v*dt/dx
aux=0.

i = blockDim%x*(blockIdx%x-1)+threadIdx%x
do n = 1,ntimes
if (i>1 .and. i<=simlength) aux(i)=a(i)-decl*(a(i)-a(i-1))
! this is really what’s going on:
! if (i>1 .and. i<=simlength) aux(i)=a(i-1)
a(i)=aux(i)
enddo

end subroutine time_step

Three problems: 1) Most importantly, this code should just be moving the contents of array “a” one step to the right for each iteration in the loop. It does this for most part, but it drops a couple of numbers. e.g., after 40 iterations, where the data should look like 17, 18, 19, 20 … it looks like 17, 19, 20 … (By the way, I have run this code as you see it here, and with the simple"this is what’s going on line" replacing the line above it; same results.) 2)I want “aux” to be an automatic array. Tried making it shared, but then the code won’t run. Dies with “unspecified failure to launch”. 3) I really don’t want to define a separate “aux” array. I want “a” to be something like a(:,2). Then I would update a(:,2) in every iteration and copy it to a(:,1). Doing this in a naive way produces a code that will compile but not run. Anyone know what to do about these things? Much obliged.

MatColgrove · March 26, 2012, 9:59pm

Hi cablesb,

Without a complete reproducing example, I’ll need to make some guesses.

For issue #1, since your expression contains a backward dependency, it is not parallel. Here you have multiple threads all accessing and updating different elements of a. Since there is no guarantee the order in which a is updated, the value of a(i-1) may be a new value or an old value. You just don’t know. Each thread needs to be working on it’s own element of a.

To make this code parallel, you need to rethink it to be something along the lines of:

Host code

do n = 1,ntimes
  call time_step{{{dimGrid,dimBlock}}}(..args...)  ! note change } to >
  call copyaux_a{{{dimGrid,dimBlock}}}(..args...)  ! 
end do

Device code
attributes(global) subroutine time_step(a,aux, v,dt,dx,ntimes,simlength) 
,,,,
i = blockDim%x*(blockIdx%x-1)+threadIdx%x 
f (i>1 .and. i<=simlength) aux(i)=a(i)-decl*(a(i)-a(i-1)) 
end subroutine

attributes(global) subroutine copyaux_a(aux, a)
...
i = blockDim%x*(blockIdx%x-1)+threadIdx%x 
a(i)=aux(i)
end subroutine copyaux_a

Note that I needed to do a similar algorithm in the following two articles: Account Login | PGI and Account Login | PGI

2)I want “aux” to be an automatic array. Tried making it shared, but then the code won’t run. Dies with “unspecified failure to launch”.

You probably forgot to put the size of the shared memory as the third argument in the chevrons when you launched your kernel.

I really don’t want to define a separate “aux” array. I want “a” to be something like a(:,2). Then I would update a(:,2) in every iteration and copy it to a(:,1).

That’s fine so long as you don’t copy a(:,2) back to a(:,1) in the same kernel.

Hope this helps,
Mat

cablesb · March 26, 2012, 10:37pm

Mat,

Thanks very much for the reply.

Re: 1) and 3), oh man, do I have alot to learn.

Re: 2) I have been trying to puzzle out the amount of memory needed from the PGI CUDA Fortran Programming Guide, and I’m not having much luck. How does this get calculated? In my current example, I am operating on a real*8 array of 256 elements. I have run with a variety of threads per block, from 2 up to 256. Thanks again.

cablesb · March 26, 2012, 11:13pm

Mat (or anyone!),
A modified code is posted below in all it’s glory. Well, in most of its glory. I am apparently running up against some limit in comment lines or something. I will post the second subroutine in another reply. I am now completely lost, I think. The “a” array is not being modified at all, neither by the time_step routine, nor by the switch_places routine. Any ideas are most appreciated.

program simple_wave

use cudafor

integer, parameter :: simlength=256
integer :: tPB = 2

real*8 :: a(simlength,2)

real*8, device :: a_dev(simlength,2)
real*8, device :: v, dt, dx
integer, device :: ntimes

real*8 :: v_h, dt_h, dx_h
integer :: ntimes_h

! parameters for triangle wave
integer :: peak_start, peak_half_width

peak_start=50
peak_half_width=50

! define triangle wave
a(1:peak_start-1,1)=0.d0
a(peak_start:peak_start+peak_half_width-1,1)=(/(dble(i),i=1,peak_half_width)/)
a(peak_start+peak_half_width:peak_start+2*peak_half_width-1,1)=(/(dble(i),i=peak_half_width,1,-1)/)
a(peak_start+2*peak_half_width:simlength,1)=0.d0
a(:,2)=1.d0

! define advection parameters.  let's make everything one for now.
v=1.d0
dt=1.d0
dx=1.d0
v_h=v
dt_h=dt
dx_h=dx

! how many time steps:
ntimes=50
ntimes_h=ntimes

write(15,fmt='(1f10.5)') a(:,1)
write(6,'(10f10.5)'),a(:,1)
print *,'========'


a_dev=a
!do on host first, for comparison
do n=1,ntimes_h
  a(2:simlength,2)=a(2:simlength,1)-v_h*dt_h/dx_h*(a(2:simlength,1)-a(1:simlength-1,1))
  a(:,1)=a(:,2)
enddo
write(6,'(10f10.5)'),a(:,1)

! now on device
do n=1,ntimes_h
call time_step<<<ceiling>>>(a_dev,v,dt,dx,simlength)
a=a_dev
print *,'**********'
write(6,'(10f10.5)'),a(:,2)
call switch_places<<<ceiling>>>(a_dev,simlength)
enddo
a=a_dev

print *,'========'
write(6,'(10f10.5)'),a(:,1)
print *,'========'
write(6,'(10f10.5)'),a(:,2)
write(16,fmt='(1f10.5)') a(:,1)


end program

attributes(global) subroutine switch_places(a,simlength)

use cudafor

integer, intent(in), value :: simlength
real*8, intent(inout) :: a(simlength,2)

integer :: i

i = blockDim%x*(blockIdx%x-1)+threadIdx%x
if (i>=1 .and. i<=simlength) a(i,1)=a(i,2)

end subroutine switch_places

cablesb · March 26, 2012, 11:14pm

… and here is time_step:

attributes(global) subroutine time_step(a,v,dt,dx,simlength)

use cudafor

integer, value :: simlength
real*8, intent(inout)  :: a(simlength,2)
real*8, intent(in) :: v, dt, dx
integer :: i, n

real*8 :: decl

decl=v*dt/dx

i = blockDim%x*(blockIdx%x-1)+threadIdx%x
!if (i>1 .and. i<=simlength) a(i,2)=a(i,1)-decl*(a(i,1)-a(i-1,1))
! this is really what's going on:
  if (i>1 .and. i<=simlength) a(i,2)=a(i-1,1)


end subroutine time_step

[/code]

cablesb · March 26, 2012, 11:38pm

whoops! In the host code, the “ceiling” text should read

ceiling(real(simlength*2)/tPB),tPB,simlength*8*128

Don’t know what happened to my cut and paste there.

MatColgrove · March 27, 2012, 2:57pm

Hi cablesb,

Don’t know what happened to my cut and paste there.

Problem with our UF. It wants to reformat what it thinks are HTML tags. I’ll look into getting it fixed.

I see one major issue and a couple minor. First, CUDA Fortran global subroutines must have an interface in order to call them. Failure to have an interface will cause your kernels to fails and is most likely what’s happening here. To fix, add an interface block for the global routines or put them in a module. Modules provide an implicit interface.

Next, since you don’t use automatic or assumed-size shared arrays in your kernels, there is no need in passing in the third argument in the chevron syntax. Also, your block size is only 2, tPB, which is small. Typically you want block sizes in increments of the warp size (i.e. 32).

With these changes, I see that shift in the resulting temp files.

Mat

cablesb · March 27, 2012, 5:07pm

Thanks, Mat.

The module seems to have done the trick. Have to admit, every example I’ve seen of a global subroutine has a modules, but I never saw it documented that you have to use modules, so I thought I would do it quick and dirty. Serves me right, I guess.

Re: the bytes argument in the chevrons and the low value of threads per block, I was trying all kinds of things to get the code to work.

I’ve looked around Amazon etc. for a good CUDA FORTRAN introduction. The closest I could find was a book for CUDA C that looks OK. Is there a good CUDA FORTRAN book around?

Thanks again!

MatColgrove · March 27, 2012, 8:23pm

Is there a good CUDA FORTRAN book around?

Greg Ruetsch and Massimiliano Fatica from NVIDIA are in the process of writing one. Here’s link to an early version http://corsi.cineca.it/courses/scuolaAvanzata/Massimiliano%20Fatica/Book-Fatica.pdf. Greg just sent me an updated version that I can let you have. Send a note to PGI Customer service (trs@pgroup.com) and I’ll have them send you a copy.

Greg said that it still needs some work, especially with the Multi-GPU chapter, but should be helpful.

Mat

Topic		Replies	Views
Error running simple CUDA Fortran program Legacy PGI Compilers	9	21314	February 26, 2010
CUDA Fortran and Fortran 77 Legacy PGI Compilers	13	8219	March 12, 2012
CudaFotran compiling problem When i am comipiling the cuda fortran code, type mismatch error is com CUDA Programming and Performance	13	3616	December 1, 2009
cuda fortran module data Legacy PGI Compilers	6	8139	September 9, 2010
Problems with FORTRAN Accelerator and subroutines Legacy PGI Compilers	21	11923	August 17, 2011
An Easy Introduction to CUDA Fortran Technical Blog	7	567	June 21, 2024
CUDA Fortran matrix-multiply 10x slower than CUDA C version Legacy PGI Compilers	5	6918	July 14, 2010
cuda memory issues Legacy PGI Compilers	13	9510	March 19, 2012
Declaring local arrays in device code Legacy PGI Compilers	16	9070	June 8, 2012
Trouble Getting Started CUDA/PGI Fortran Legacy PGI Compilers	30	15816	November 9, 2012

how does this newbie fix his code?

Related topics