Error: Formal parameter space overflowed

Okay, as my spamming of this board in recent days has shown, I’m trying to explore asynchronous memory movement with CUDA Fortran. Since my code involves moving 1D, 2D, and 3D arrays, I’ve been looking at those API calls. The 1D and 2D calls I understand, and I’ve already asked for help with the 3D ones, but in the meantime I thought I’d try decomposing my 3D arrays into 2D arrays and passing those instead.

Now, this isn’t fun (SELECT CASE in the kernel and lots of copy-and-paste), but doable. However, when I try to do this I get:

> make
pgfortran -V10.5 -Mcuda=keepgpu,keepbin,keepptx,maxregcount:64,nofma -Kieee -fast -r4 -Mextend -Mpreprocess -Ktrap=fp -DFLXY -DDEG4 -c src/sorad.cudafor.flxdn.cudaapi.cuf
/tmp/pgnvda2rcamlNvqJn.nv4(0): Error: Formal parameter space overflowed in function soradcuf
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code       0 (src/sorad.cudafor.flxdn.cudaapi.cuf: 2620)
PGF90/x86-64 Linux 10.5-0: compilation aborted
make: *** [sorad.cudafor.flxdn.cudaapi.o] Error 2

So, I’m guessing I’ve hit a limit on the number of parameters one can pass to a device kernel? It seems plausible, since I had to split my 5 3D arrays into 29 2D arrays, meaning I now have 56 (yes, 56) inputs and outputs in my kernel call.

I suppose my query is: is this limit PGI-specific, or have I run into something hard-coded into CUDA itself?

If the latter, is there a way around this limit that doesn’t involve cudaMemcpy3D et al.? (Say, passing a TYPE containing the arrays…though I’m not sure how I’d allocate that…)

Hi,

Mat’s away most of the week. I’ll try to fill in.

I think you’re making this too hard. You can use either the 1D or 2D async calls just as they are, and compress the extra dimensions into the count arguments. That said, yes, it does look like you’re running into a hard CUDA limit on the formal parameter space available to a kernel. To get around it, we recommend putting the allocatable device arrays at module level, which has been supported since 10.4. I ran this in 10.5:

module test3d
  ! Module-level device array: visible to the kernel without being
  ! passed as an argument, so it costs no formal parameter space
  integer*4, device, allocatable :: x(:,:,:)
contains
  attributes(global) subroutine s1()
    i = threadidx%x
    j = threadidx%y
    k = blockidx%x
    x(i,j,k) = x(i,j,k) + i + j + k
  end subroutine s1
end module test3d

program t
  use test3d
  use cudafor
  integer, parameter :: N = 16
  integer*4, allocatable, pinned :: h(:,:,:)
  type(dim3) :: ngrid, nblock
  !
  allocate(h(N,N,N))
  allocate(x(N,N,N))   ! ordinary Fortran allocate of the module device array
  !
  h = 1
  !
  ngrid = dim3(N,1,1)
  nblock = dim3(N,N,1)

  ! Treat the 3D arrays as one contiguous block of N*N*N elements
  istat = cudaMemcpyAsync(x, h, N*N*N)
  if (istat .ne. 0) print *,"cudaMemcpyAsync 1 ",istat
  istat = cudaThreadSynchronize()
  call s1 <<<ngrid, nblock>>> ()

  istat = cudaMemcpyAsync(h, x, N*N*N)
  if (istat .ne. 0) print *,"cudaMemcpyAsync 2 ",istat
  istat = cudaThreadSynchronize()

  do k = 1, N
    do j = 1, N
      do i = 1, N
        if (h(i,j,k) .ne. 1+i+j+k) then
          print *,"Error at ",i,j,k
        end if
      end do
    end do
  end do
end

Note that the host array has to be pinned for the async copies.
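If you’d rather use the 2D flavor, the same dimension-folding should work with cudaMemcpy2DAsync. A sketch of that variant (I haven’t run this one, and the generic interface may insist on 2D array arguments, in which case declare the arrays as N by N*N), treating the array as N elements in the first dimension by N*N in the folded second and third:

  ! Untested sketch: fold j and k into the "height" count.
  ! With no padding, both pitches are just N (counts in elements).
  istat = cudaMemcpy2DAsync(x, N, h, N, N, N*N)
  if (istat .ne. 0) print *,"cudaMemcpy2DAsync ",istat
  istat = cudaThreadSynchronize()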

Brent,

Thanks for the code; I now see how I could do 3D using the 1D memcpy. My question now is: if I went that route, would I have to handle the padding manually? When I try to allocate/copy 3D arrays using cudaMallocPitch/cudaMemcpy2D (leaving off Async for now) so that CUDA takes care of the padding:

program t2d

  use test3d
  use cudafor

  integer, parameter :: N = 20
  integer*4, allocatable, pinned :: h(:,:,:)
  type(dim3) :: ngrid, nblock
  integer :: pitch
  !
  allocate(h(N,N,N))
  istat = cudaMallocPitch(x,pitch,N,N*N)   ! x is the 3D module array
  !
  h = 1
  !
  ngrid = dim3(N,1,1)
  nblock = dim3(N,N,1)

  istat = cudaMemcpy2D(x,pitch,h,N,N,N*N)
  if (istat .ne. 0) print *,"cudaMemcpy2D 1 ",istat
  call s1 <<<ngrid, nblock>>> ()

  istat = cudaMemcpy2D(h,N,x,pitch,N,N*N)
  if (istat .ne. 0) print *,"cudaMemcpy2D 2 ",istat

  do k = 1, N
    do j = 1, N
      do i = 1, N
        if (h(i,j,k) .ne. 1+i+j+k) then
          print *,"Error at ",i,j,k
        end if
      end do
    end do
  end do

end program t2d

where the device kernel is the same, I get:

> pgfortran test3d.cuf drivert2d.cuf
test3d.cuf:
drivert2d.cuf:
PGF90-S-0155-Could not resolve generic procedure cudamallocpitch (drivert2d.cuf: 12)
  0 inform,   0 warnings,   1 severes, 0 fatal for t2d

I’m probably doing the “allocate 3D as 2D” part wrong. I read the ref guide as saying that when you use cudaMallocPitch(devptr,pitch,width,height), you send a width*height array to the device, where it becomes a padded pitch*height array. At least, when I use it with 2D arrays it works.
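For reference, here are the relevant declarations and calls from the 2D case that does work for me (d2 and h2 are stand-ins for my actual arrays):

  integer, parameter :: N = 20
  integer*4, device, allocatable :: d2(:,:)
  integer*4, allocatable, pinned :: h2(:,:)
  integer :: pitch
  allocate(h2(N,N))
  h2 = 1
  istat = cudaMallocPitch(d2,pitch,N,N)     ! resolves fine: d2 is 2D
  istat = cudaMemcpy2D(d2,pitch,h2,N,N,N)   ! N-wide data into pitch-wide storage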

Is this a typing issue? Sending 3D when Fortran expects 2D because of the interface?

Yes, you will have to do manual padding, and you might be a little out of luck here. If you don’t use the Fortran allocate statement, we don’t create the array descriptor on the device. We do create it on the host, but currently, when you allocate through the CUDA API, that descriptor doesn’t make it to the device; you basically just have a C pointer. So you won’t be able to access the bounds of the array x in the device code (or have the compiler take care of that for you). You’ll have to pass the array x and its bounds in explicitly, along the lines of the sketch below, which gets back to your original problem.
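Here’s a rough, untested sketch (s2 and its argument names are just illustrative; it would sit alongside s1 in the module) of a kernel that takes the pitched storage plus explicit bounds and does the padding arithmetic itself:

  attributes(global) subroutine s2(x, pitch, nx, ny, nz)
    ! Bounds arrive by value, since there is no descriptor on the device
    integer, value :: pitch, nx, ny, nz
    ! Pitched storage: each column is pitch elements long, with only
    ! the first nx holding data; j and k fold into one subscript
    integer*4 :: x(pitch, ny*nz)
    i = threadidx%x
    j = threadidx%y
    k = blockidx%x
    if (i .le. nx .and. j .le. ny .and. k .le. nz) then
      x(i, j+(k-1)*ny) = x(i, j+(k-1)*ny) + i + j + k
    end if
  end subroutine s2

You’d launch it as call s2<<<ngrid,nblock>>>(x, pitch, N, N, N). That’s one array plus four scalars, so it stays well under the parameter-space limit, but the pitch bookkeeping is all on you.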

BTW, cudaMallocPitch is limited to taking 2D arrays as arguments.
You can see the supported interfaces by looking in the 10.x/src directory.