Automatic arrays in device memory

I have a question: where does CUDA Fortran store automatic (static) device arrays?
I seem to be running into a stack overflow in a program, in a kernel that is called many times. Inside that kernel I’m using an auxiliary array of 512 integers. The array is declared in host code like this:
integer, device :: aux(512)
I know this is the kernel that overruns the stack because I have an alternative (slower) implementation that uses a dynamic (allocatable) auxiliary array, and that one works just fine…

When I try to use a module variable for this array (placing the declaration outside the host subroutine, above the keyword “contains”), the program simply can’t use it and produces a runtime error the first time it calls the subroutine.

Hi David,

We’ll need a bit more information here or an example. Note that there isn’t a stack on the device so it can’t be a stack overflow.

If I had to guess, I’d check whether you are accessing “aux” out of bounds. How is it being accessed in your kernels? Is it being passed as an argument to the kernel? What’s your kernel’s launch configuration?

  • Mat

Yes, I know there is no stack on the device; that is why the bug is strange.
I have a prefix sum scan (exclusive) that consists of three kernels; aux is a vector that stores the last value of the prefix computed by each block.
The launch configuration is always fixed: 512 blocks of 128 threads for kernel 1, 1 block of 512 threads for kernel 2, and again 512 blocks of 128 threads for the third kernel.
The prefix sum scan is a host subroutine that calls the 3 kernels. aux is declared in the host subroutine as
integer, device :: aux(512) and then passed as an argument to the three kernels.
I have made a “sandbox” program where I test the subroutine; it runs the subroutine 10,000 times flawlessly, but when I call the scan from the SPH program it fails…
It says
0 allocate 2048 bytes requested; status = 30(unknown error)
Those 2048 bytes seem to be the 512 integers * 4 bytes/integer.
In theory it should free that memory automatically upon exiting the host scan subroutine… but it doesn’t.
My guess is that the sandbox works because it somehow “reuses” the same memory area, as all the calls are consecutive, but in the SPH program that is not the case, so it starts “eating” the memory… (it breaks down around the 1100th time it calls the scan)
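
For readers following along, the three-kernel layout described in this post can be modelled on the CPU. The sketch below is plain Fortran, not the kernels from this thread: each loop stands in for one kernel, the chunk size of 128 elements per block (one element per thread) is an assumption, and aux is assumed to hold each block's total before the second phase turns it into per-block offsets.

```fortran
! CPU-only model of a three-phase blocked exclusive scan.
! This is an illustration, not the code discussed in the thread.
program blocked_scan_sketch
  implicit none
  integer, parameter :: nblocks = 512            ! one aux entry per block
  integer, parameter :: chunk   = 128            ! assumed elements per block
  integer, parameter :: n       = nblocks * chunk
  integer, allocatable :: vecin(:), vecout(:), ref(:)
  integer :: aux(nblocks)
  integer :: b, i, first, running, tmp

  allocate(vecin(n), vecout(n), ref(n))
  do i = 1, n
    vecin(i) = mod(i, 7)                         ! arbitrary test data
  end do

  ! Phase 1 (stands in for kernel 1): exclusive scan inside each block.
  ! aux(b) receives the total of block b.
  do b = 1, nblocks
    first = (b - 1) * chunk
    running = 0
    do i = first + 1, first + chunk
      vecout(i) = running
      running = running + vecin(i)
    end do
    aux(b) = running
  end do

  ! Phase 2 (stands in for kernel 2): exclusive scan of the block totals,
  ! so aux(b) becomes the offset to add to every element of block b.
  running = 0
  do b = 1, nblocks
    tmp = aux(b)
    aux(b) = running
    running = running + tmp
  end do

  ! Phase 3 (stands in for kernel 3): add each block's offset.
  do b = 1, nblocks
    first = (b - 1) * chunk
    vecout(first + 1:first + chunk) = vecout(first + 1:first + chunk) + aux(b)
  end do

  ! Serial reference scan used to check the blocked result.
  running = 0
  do i = 1, n
    ref(i) = running
    running = running + vecin(i)
  end do

  if (any(vecout /= ref)) error stop "blocked scan disagrees with serial scan"
  print *, "blocked scan matches serial scan"
end program blocked_scan_sketch
```

In this layout aux is only scratch space shared between the three phases, which is why its lifetime (automatic versus allocatable) matters for the host subroutine but not for the result.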

These are the variable declarations of the host subroutine:

subroutine exclusive_int_scan(vecin, vecout, size)
integer, value :: size 
integer, device, dimension(:) :: vecin(size), vecout(size)
integer  :: i, threadchunk, blockchunk
integer, device :: aux(512)
type(dim3)	 :: dimGrid, dimBlock
integer, dimension(:) :: debug_blockval(512)
integer :: errcode

As you can see, we don’t need the allocate part for aux().
Now, if it solved all the problems, I would just make aux allocatable and then allocate(aux(512)); the thing is that I’ve tried that already, and later I get
a deallocate problem after a few thousand calls to this routine…
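
One way to narrow down a failure like this is to watch free device memory across repeated calls. The following is only a sketch under assumptions: use_scratch is a hypothetical stand-in for the real host subroutine, the call count of 2000 is arbitrary, and it needs a CUDA Fortran compiler and a GPU to run. It relies on cudaMemGetInfo from the cudafor module and on stat= in allocate and deallocate.

```fortran
! Sketch only: requires CUDA Fortran (the cudafor module) and a GPU.
module scratch_demo
  use cudafor
  implicit none
contains
  ! Hypothetical stand-in for a host routine that needs device scratch space.
  subroutine use_scratch(istat)
    integer, intent(out) :: istat
    integer, device, allocatable :: aux(:)

    allocate(aux(512), stat=istat)     ! 512 integers * 4 bytes = 2048 bytes
    if (istat /= 0) return
    aux = 0                            ! placeholder for the kernel launches
    deallocate(aux, stat=istat)        ! explicit, so a failure is visible
  end subroutine use_scratch
end module scratch_demo

program leak_check
  use cudafor
  use scratch_demo
  implicit none
  integer(kind=cuda_count_kind) :: free_before, free_after, total
  integer :: istat, call_no

  istat = cudaMemGetInfo(free_before, total)
  do call_no = 1, 2000                 ! arbitrary number of repetitions
    call use_scratch(istat)
    if (istat /= 0) then
      print *, "call", call_no, "failed with stat", istat
      exit
    end if
  end do
  istat = cudaMemGetInfo(free_after, total)

  ! A free-memory figure that keeps shrinking as the call count grows
  ! would point at device allocations that are never released.
  print *, "free bytes before:", free_before
  print *, "free bytes after: ", free_after
end program leak_check
```

The two figures need not match exactly, since the runtime may hold on to memory for its own use; it is a steady decline over many calls that would suggest a leak somewhere in the calling code.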

Hi David,

We’ll need to have an example to tell what’s going on. Can we use the same code you sent to Brent or is this new?

  • Mat

We’ll try to send something, I’m out of the office, let’s see if Vicente can send it.
I think the best option is to send the sandbox we’ve been using to test and develop the different versions of the prefix sum scan. He’s the one who programmed them; I think he wrote them in CUDA C and then translated them to CUDA Fortran.

Best Regards,

If you want to run it in the code we already sent, I think you need to comment out the lines that call exclusive_int_scan() and uncomment the call to fast_exclusive_int_scan(), which is the one we are asking about…
I’m not sure what changes Vicente may have made today, or even if he made any :-)

Vicente tells me that the version of the fast scan you have probably has a couple of bugs; he’ll send the fixed version soon.

Oops… it turns out we had a small memory leak; Vicente has already fixed it.
Thanks for the help and sorry for bothering you.

Best regards,