how does this newbie fix his code?

I’m a complete CUDA novice. Am trying to run a little test code to figure out the workings of CUDA. I have a subroutine:

attributes(global) subroutine time_step(a,v,dt,dx,ntimes,simlength)

use cudafor

real8, intent(inout) :: a(simlength)
8, intent(in) :: v, dt, dx
integer :: ntimes
integer, value :: simlength
integer :: i, n
real*8 :: aux(250)

real*8 :: decl


i = blockDim%x*(blockIdx%x-1)+threadIdx%x
do n = 1,ntimes
if (i>1 .and. i<=simlength) aux(i)=a(i)-decl*(a(i)-a(i-1))
! this is really what’s going on:
! if (i>1 .and. i<=simlength) aux(i)=a(i-1)

end subroutine time_step

Three problems: 1) Most importantly, this code should just be moving the contents of array “a” one step to the right for each iteration in the loop. It does this for most part, but it drops a couple of numbers. e.g., after 40 iterations, where the data should look like 17, 18, 19, 20 … it looks like 17, 19, 20 … (By the way, I have run this code as you see it here, and with the simple"this is what’s going on line" replacing the line above it; same results.) 2)I want “aux” to be an automatic array. Tried making it shared, but then the code won’t run. Dies with “unspecified failure to launch”. 3) I really don’t want to define a separate “aux” array. I want “a” to be something like a(:,2). Then I would update a(:,2) in every iteration and copy it to a(:,1). Doing this in a naive way produces a code that will compile but not run. Anyone know what to do about these things? Much obliged.

Hi cablesb,

Without a complete reproducing example, I’ll need to make some guesses.

For issue #1, since your expression contains a backward dependency, it is not parallel. Here you have multiple threads all accessing and updating different elements of a. Since there is no guarantee the order in which a is updated, the value of a(i-1) may be a new value or an old value. You just don’t know. Each thread needs to be working on it’s own element of a.

To make this code parallel, you need to rethink it to be something along the lines of:

Host code

do n = 1,ntimes
  call time_step{{{dimGrid,dimBlock}}}(..args...)  ! note change } to >
  call copyaux_a{{{dimGrid,dimBlock}}}(..args...)  ! 
end do

Device code
attributes(global) subroutine time_step(a,aux, v,dt,dx,ntimes,simlength) 
i = blockDim%x*(blockIdx%x-1)+threadIdx%x 
f (i>1 .and. i<=simlength) aux(i)=a(i)-decl*(a(i)-a(i-1)) 
end subroutine

attributes(global) subroutine copyaux_a(aux, a)
i = blockDim%x*(blockIdx%x-1)+threadIdx%x 
end subroutine copyaux_a

Note that I needed to do a similar algorithm in the following two articles: and

2)I want “aux” to be an automatic array. Tried making it shared, but then the code won’t run. Dies with “unspecified failure to launch”.

You probably forgot to put the size of the shared memory as the third argument in the chevrons when you launched your kernel.

  1. I really don’t want to define a separate “aux” array. I want “a” to be something like a(:,2). Then I would update a(:,2) in every iteration and copy it to a(:,1).

That’s fine so long as you don’t copy a(:,2) back to a(:,1) in the same kernel.

Hope this helps,


Thanks very much for the reply.

Re: 1) and 3), oh man, do I have alot to learn.

Re: 2) I have been trying to puzzle out the amount of memory needed from the PGI CUDA Fortran Programming Guide, and I’m not having much luck. How does this get calculated? In my current example, I am operating on a real*8 array of 256 elements. I have run with a variety of threads per block, from 2 up to 256. Thanks again.

Mat (or anyone!),
A modified code is posted below in all it’s glory. Well, in most of its glory. I am apparently running up against some limit in comment lines or something. I will post the second subroutine in another reply. I am now completely lost, I think. The “a” array is not being modified at all, neither by the time_step routine, nor by the switch_places routine. Any ideas are most appreciated.

program simple_wave

use cudafor

integer, parameter :: simlength=256
integer :: tPB = 2

real*8 :: a(simlength,2)

real*8, device :: a_dev(simlength,2)
real*8, device :: v, dt, dx
integer, device :: ntimes

real*8 :: v_h, dt_h, dx_h
integer :: ntimes_h

! parameters for triangle wave
integer :: peak_start, peak_half_width


! define triangle wave

! define advection parameters.  let's make everything one for now.

! how many time steps:

write(15,fmt='(1f10.5)') a(:,1)
print *,'========'

!do on host first, for comparison
do n=1,ntimes_h

! now on device
do n=1,ntimes_h
call time_step<<<ceiling>>>(a_dev,v,dt,dx,simlength)
print *,'**********'
call switch_places<<<ceiling>>>(a_dev,simlength)

print *,'========'
print *,'========'
write(16,fmt='(1f10.5)') a(:,1)

end program

attributes(global) subroutine switch_places(a,simlength)

use cudafor

integer, intent(in), value :: simlength
real*8, intent(inout) :: a(simlength,2)

integer :: i

i = blockDim%x*(blockIdx%x-1)+threadIdx%x
if (i>=1 .and. i<=simlength) a(i,1)=a(i,2)

end subroutine switch_places

… and here is time_step:

attributes(global) subroutine time_step(a,v,dt,dx,simlength)

use cudafor

integer, value :: simlength
real*8, intent(inout)  :: a(simlength,2)
real*8, intent(in) :: v, dt, dx
integer :: i, n

real*8 :: decl


i = blockDim%x*(blockIdx%x-1)+threadIdx%x
!if (i>1 .and. i<=simlength) a(i,2)=a(i,1)-decl*(a(i,1)-a(i-1,1))
! this is really what's going on:
  if (i>1 .and. i<=simlength) a(i,2)=a(i-1,1)

end subroutine time_step


whoops! In the host code, the “ceiling” text should read


Don’t know what happened to my cut and paste there.

Hi cablesb,

Don’t know what happened to my cut and paste there.

Problem with our UF. It wants to reformat what it thinks are HTML tags. I’ll look into getting it fixed.

I see one major issue and a couple minor. First, CUDA Fortran global subroutines must have an interface in order to call them. Failure to have an interface will cause your kernels to fails and is most likely what’s happening here. To fix, add an interface block for the global routines or put them in a module. Modules provide an implicit interface.

Next, since you don’t use automatic or assumed-size shared arrays in your kernels, there is no need in passing in the third argument in the chevron syntax. Also, your block size is only 2, tPB, which is small. Typically you want block sizes in increments of the warp size (i.e. 32).

With these changes, I see that shift in the resulting temp files.

  • Mat

Thanks, Mat.

The module seems to have done the trick. Have to admit, every example I’ve seen of a global subroutine has a modules, but I never saw it documented that you have to use modules, so I thought I would do it quick and dirty. Serves me right, I guess.

Re: the bytes argument in the chevrons and the low value of threads per block, I was trying all kinds of things to get the code to work.

I’ve looked around Amazon etc. for a good CUDA FORTRAN introduction. The closest I could find was a book for CUDA C that looks OK. Is there a good CUDA FORTRAN book around?

Thanks again!

Is there a good CUDA FORTRAN book around?

Greg Ruetsch and Massimiliano Fatica from NVIDIA are in the process of writing one. Here’s link to an early version Greg just sent me an updated version that I can let you have. Send a note to PGI Customer service ( and I’ll have them send you a copy.

Greg said that it still needs some work, especially with the Multi-GPU chapter, but should be helpful.

  • Mat