Trouble Getting Started CUDA/PGI Fortran

Hi Mat,

I meant module size, as in the example below:

module matrices_1

implicit none
real(8), device, allocatable, dimension (:,:) :: aDev,bDev,cDev

end module matrices_1

allocate(aDev(306,306),bDev(306,306))

size of module = size(aDev) + size(bDev) ??? no??

how can I calculate the size of aDev??
is it going to be: size(aDev) = 8 * 306 * 306 bytes???

please advise.

Dolf

size of module = size(aDev) + size(bDev) ??? no??

how can I calculate the size of aDev??
is it going to be: size(aDev) = 8 * 306 * 306 bytes???

Sorry Dolf, but I’m still not clear on what you’re asking. To me, “size of module” means the number of lines of source code in a module. Are you trying to calculate the total amount of memory in bytes the module’s data will consume?

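“size” returns the number of elements in an array, not the number of bytes. So the number of bytes for aDev would be “size(aDev) * kind(aDev)”. (This relies on the kind value matching the element size in bytes, which holds for real(8) with PGI but is not guaranteed by the Fortran standard.)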

  • Mat

thanks Mat for your patience.
yes, I want to know the following:

  1. how can I calculate memory usage by the module, not how many lines.
  2. what is the maximum limit of memory that I can use.
  3. how can I calculate the memory utilized by a matrix, given its size and type (real(8)).

by the way, I am using NVIDIA GeForce 640v2, cuda 4.2.
thanks.
Dolf

Hi Dolf,

  1. how can I calculate memory usage by the module, not how many lines.

For static module variables, you can either manually calculate them and/or use the Linux “size” command. The BSS and data segments will show the size in bytes of your static module data.

For dynamic allocation, you will need to do this at run time if the size of the allocation is variable. If the size is known, then you can manually calculate the usage.

For local variables, these are stored on the program stack and the exact amount of memory in use will depend upon the calling sequence.

Keep in mind, if you are using OpenMP, private memory is duplicated for each thread and is stored on each thread’s program stack.

  2. what is the maximum limit of memory that I can use.

For static memory with the small memory model, it’s 2GB. To go beyond 2GB of static memory, use the medium memory model (-mcmodel=medium).

For dynamic memory in 64-bits, you can theoretically address 2^64 bytes (though on most systems it’s only 2^48). However, if any individual array is larger than 2GB, add the flag “-Mlarge_arrays”.

On the GPU, there isn’t any virtual memory so you are limited by the amount of physical memory on the device.
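Since the device limit is the physical memory on the card, it can help to ask the CUDA runtime how much is free before allocating. The following is a minimal CUDA Fortran sketch, assuming the PGI compiler with -Mcuda and the cudafor module; the program and variable names are made up for illustration.

```fortran
program gpu_mem_query
  use cudafor
  implicit none
  ! cuda_count_kind is the integer kind cudafor uses for byte counts.
  integer(kind=cuda_count_kind) :: free_bytes, total_bytes
  integer :: istat

  ! Ask the runtime for free and total device memory, in bytes.
  istat = cudaMemGetInfo(free_bytes, total_bytes)
  if (istat /= cudaSuccess) then
    print *, 'cudaMemGetInfo failed: ', cudaGetErrorString(istat)
    stop
  end if

  print *, 'free device memory (bytes): ', free_bytes
  print *, 'total device memory (bytes):', total_bytes
end program gpu_mem_query
```

The free figure is usually lower than the total because the display and the CUDA context itself take some of the device memory.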

  3. how can I calculate the memory utilized by a matrix, given its size and type (real(8)).

Number of elements times the size in bytes of the data type. In Fortran: “byte_size_Arr = size(Arr) * kind(Arr)”.
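As a worked example of that arithmetic, here is a small standard Fortran sketch for two 306 x 306 real(8) arrays. It uses host arrays as stand-ins for the device arrays in the thread, and it takes the element size from the storage_size intrinsic (Fortran 2008, result in bits) rather than from kind, which is an alternative that does not depend on the kind value matching the byte count. The program and variable names are hypothetical.

```fortran
program array_bytes_demo
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  ! Host stand-ins for aDev and bDev (no "device" attribute here).
  real(8), allocatable :: a(:,:), b(:,:)
  integer(int64) :: bytes_a, bytes_total

  allocate(a(306,306), b(306,306))

  ! storage_size gives bits per element; divide by 8 to get bytes.
  bytes_a     = size(a, kind=int64) * (storage_size(a) / 8)
  bytes_total = bytes_a + size(b, kind=int64) * (storage_size(b) / 8)

  print *, 'bytes in a:    ', bytes_a       ! 306 * 306 * 8 = 749088
  print *, 'bytes in a + b:', bytes_total   ! 2 * 749088   = 1498176
end program array_bytes_demo
```

Summing this quantity over every allocated array in a module gives the module's dynamic memory use at run time.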

  • Mat

so, are you saying that if I have aDev(306,306), and it's of type real(8), the memory size is:

8 * 306 * 306 bytes??

thanks,

Yes. The number of elements is 306 x 306 and the data type size is 8 bytes. Hence the size in bytes of memory aDev will use is 8 * 306 * 306 = 749,088.

  • Mat

Hi Mat,

I am having a strange issue with debug mode in PGI Fortran. When I run the code in release mode, I get this error message:

starting grid level 5
0: copyover Memcpy (dst=0x41218d60, src=0x403000e0, size=8) FAILED: 11(invalid argument)
Press any key to continue . . .

when I run in debug mode, I planted two breakpoints, one at the subroutine with the problem you see above, and one at the subroutine just before it.
I can stop at the break point before the faulty subroutine, but I cannot stop inside it even though I have a breakpoint at the first line of it.

have you seen such a problem?? what could be the cause of this?
how can I proceed to pinpoint exactly where the problem is?

please help.

Dolf

Hi Dolf,

Are you debugging using CUDA Fortran Emulation (-Mcuda=emu)? We don’t yet have the capability to debug on the device itself, so if you’re not using emulation, this would explain why you can’t step into the kernel.

  • Mat

Hi,

it's actually a host subroutine that I cannot get to; I am not yet reaching the kernel routine.
what could be the cause if I cannot stop at a breakpoint in VS 2010?

Are you sure it’s not failing before it hits the break point? Could it be a stack overflow upon entry into the routine?

Try stepping into the routine instead of relying on the breakpoint in the routine. Does it fail when you step into it?

  • Mat

how can I tell it's a stack overflow?? please advise.

I am able to get into the subroutine now. When I call the kernel, I get error code 30, which is unknown!
I have cleared all the content of the kernel sub and still get the same error.
here is the kernel:

attributes (global) subroutine GetReynVarqni_kernel(nx,ny,ndx,ndy)

  implicit none
  integer :: i, j, k
  integer, value :: nx,ny,ndx,ndy

  i = (blockidx%x - 1) * blockDim%x + threadidx%x
  j = (blockidx%y - 1) * blockDim%y + threadidx%y

  if ( i >= 2 .AND. i <= nx ) then
    if ( j >= 2 .AND. j <= ny-1 ) then

    endif ! (j <= ny-1)
  endif ! (i <= nx)
  return
end subroutine GetReynVarqni_kernel

I am calling it like this:
from the host subroutine:

nx = 20
ny = 20
(I can see these values correctly when I reach the breakpoint; they are also host integers.)

threads = dim3(32,16,1)
grid = dim3(ceiling(real(nx)/threads%x), &
ceiling(real(ny-1)/threads%y),1)



call GetReynVarqni_kernel<<<grid,threads>>>(nx,ny,ndx,ndy)
istat = cudaThreadSynchronize()
if (istat .ne. 0) print*, ' GRV qni kernel error ', cudaGetErrorString(istat)


after I reach here, the istat reads 30

any ideas??

Dolf
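One way to narrow down where an error like this first appears is to query the CUDA error state before the launch, right after the launch, and again after synchronizing. Kernel launches are asynchronous, so a status read at the synchronize call can belong to the kernel itself or to an earlier CUDA call that already failed. The sketch below is an assumption-laden illustration, not a diagnosis of the code above: the module, kernel, and program names are hypothetical, and it needs the PGI compiler with -Mcuda.

```fortran
module check_demo
  use cudafor
  implicit none
contains
  ! Hypothetical empty kernel, standing in for the stripped-down one in the post.
  attributes(global) subroutine empty_kernel(nx, ny)
    integer, value :: nx, ny
  end subroutine empty_kernel
end module check_demo

program launch_check
  use cudafor
  use check_demo
  implicit none
  type(dim3) :: grid, threads
  integer :: istat, nx, ny

  nx = 20
  ny = 20

  ! Report any error already pending from earlier CUDA calls.
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) print *, 'error before launch: ', cudaGetErrorString(istat)

  threads = dim3(32, 16, 1)
  grid = dim3(ceiling(real(nx)/threads%x), ceiling(real(ny-1)/threads%y), 1)
  call empty_kernel<<<grid, threads>>>(nx, ny)

  ! Problems with the launch itself (e.g. a bad configuration) show up here.
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) print *, 'launch error: ', cudaGetErrorString(istat)

  ! Errors raised while the kernel runs are reported by the synchronizing call.
  istat = cudaThreadSynchronize()
  if (istat /= cudaSuccess) print *, 'execution error: ', cudaGetErrorString(istat)
end program launch_check
```

If the first check already reports an error, the kernel is not the place to look; the failing call came earlier in the host code. cudaThreadSynchronize is used here to match the post; later CUDA releases name the same operation cudaDeviceSynchronize.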