CUDA fortran device array parameter

Continuing the porting problem I have, I have isolated a frustrating bug to strange behavior in passing device arrays to kernel programs.

In my application, the copy of device to host array completes, but with gibberish in the host array. I spend a long time trying to find a problem in the kernel code.

Am I missing something, or is this a bug in PGI 10.3.0 ?

In this simple example, I get a failure like this:

pgfortran haha3.CUF
./a.out
about to do x=a
copyout Symbol Memcpy FAILED:4

If I delete the array parameter to foo_kernel ( instead using the module array ‘a’ ) the test works:

./a.out
about to do x=a
25.00000 100.0000 6250.000 25000.00
FORTRAN STOP

#define NCOMPLEX 500

module bar
  use cudafor
  complex, device, dimension(NCOMPLEX) :: a
contains

attributes(global) subroutine foo_kernel(thing)
  complex, device, dimension(NCOMPLEX) :: thing

  it = threadidx%x + (blockidx%x-1) * blockdim%x
  if ( it .le. NCOMPLEX ) thing(it) = 100.0 * thing(it)
  return

end subroutine foo_kernel

end module bar

! ----------------------------------------------------------------

subroutine foo(x)

  use cudafor	
  use bar
  type(dim3) :: dimGrid, dimBLock
  complex, dimension(NCOMPLEX) :: x
  integer nt,ng      

! .. 100 threads

  nt = min( 100, NCOMPLEX )
  ng = max( 1, (NCOMPLEX+nt-1)/nt )

  a= x
  call foo_kernel<<<ng,nt>>> (a)
  i = cudathreadsynchronize()
  write(0,*) "about to do x=a"
  x= a

end subroutine foo

! ----------------------------------------------------------------

program test
  integer, parameter :: NR=NCOMPLEX*2	
  save ha
  real ha(NR)

  do j = 1, NCOMPLEX
     ha(2*j-1) = 0.25*j
     ha(2*j  ) = j
  enddo

  call foo(ha)
  print *,ha(1),ha(2), ha(NCOMPLEX-1),ha(NCOMPLEX)
  stop
end program

move complex, device, dimension(NCOMPLEX) :: a out of the module and into the host subroutine, like this:

#define NCOMPLEX 500

module bar
  use cudafor

contains

attributes(global) subroutine foo_kernel(thing)
  complex, device, dimension(NCOMPLEX) :: thing

  it = threadidx%x + (blockidx%x-1) * blockdim%x
  if ( it .le. NCOMPLEX ) thing(it) = 100.0 * thing(it)
  return

end subroutine foo_kernel

end module bar

! ----------------------------------------------------------------

subroutine foo(x)

  use cudafor   
  use bar
  type(dim3) :: dimGrid, dimBLock
  complex, dimension(NCOMPLEX) :: x
  complex, device, dimension(NCOMPLEX) :: a
  integer nt,ng     

! .. 100 threads

  nt = min( 100, NCOMPLEX )
  ng = max( 1, (NCOMPLEX+nt-1)/nt )

  a= x
  call foo_kernel<<<ng,nt>>> (a)
  i = cudathreadsynchronize()
  write(0,*) "about to do x=a"
  x= a

end subroutine foo

! ----------------------------------------------------------------

program test
  integer, parameter :: NR=NCOMPLEX*2   
  save ha
  real ha(NR)

  do j = 1, NCOMPLEX
     ha(2*j-1) = 0.25*j
     ha(2*j  ) = j
  enddo

  call foo(ha)
  print *,ha(1),ha(2), ha(NCOMPLEX-1),ha(NCOMPLEX)
  stop
end program

Alternatively you could probably stay with the same structure, but make a allocatable and allocate it in foo. I can’t remember whether thing should have the ‘device’ attribute. Try removing it if you still have problems.

… OK, that may work but it defeats my purpose, which is to use module data as a work around for the lack of device common.

That’s better. That works ( it should have the device attribute ). It is still a bug… as far as I can see there’s nothing wrong with static module data, it just causes the run-time error.

Does a moderator here submit bug reports or should I do it through other channels?

Thanks!
Sarah

Hi Sarah,

OK, that may work but it defeats my purpose, which is to use module data as a work around for the lack of device common.

Another work around is to have the device routine access “a” directly and not pass it in.

....
attributes(global) subroutine foo_kernel()

  it = threadidx%x + (blockidx%x-1) * blockdim%x
  if ( it .le. NCOMPLEX ) a(it) = 100.0 * a(it)
  return

end subroutine foo_kernel
...
call foo_kernel<<<ng,nt>>> ()
...



Does a moderator here submit bug reports or should I do it through other channels?

I’m happy to submit them but if you prefer, you can send a report to PGI Customer Support (trs@pgroup.com). The only difference is that I post fix notification here where while Customer Support will notify you directly.

I went ahead and added this as TPR#16767. Several of our internal tests do similar operations, so it’s unclear exactly why it fails here. Hopefully our engineers can get it fixed here shortly.

Thanks,
Mat

Hi Sarah,

Sorry this took so long but I just verified that TPR#16767 will be fixed the 11.0 release.

% pgf90 -V11.0 -fast test3.cuf -Mpreprocess
% a.out
 about to do x=a
    25.00000        100.0000        6250.000        25000.00

Thanks,
Mat

Sarah,

TPR 16767 - CUDA: copyout Symbol Memcpy FAILED:4
when passing in module varaible as an argument


is fixed in the 11.0 release, which is now out.


regards,
dave