Trouble Getting Started CUDA/PGI Fortran

Using:
Windows 7 Ultimate 64-bit
Tesla C2050
nVidia Driver 260.81
CUDA Toolkit 3.2 (32 and 64 bit installed)
MS Visual Studio 2010
PGI Fortran 11.2

Code: (compiled in 32-bits)
PROGRAM PRUEBA
USE CUDAFOR
IMPLICIT NONE

INTEGER :: NUMBX
INTEGER, DEVICE :: NUMBX_con


NUMBX = 320
!
NUMBX_con = 2


END PROGRAM


Debug Output:
‘PVFProject6_CUDA.exe’: Loaded ‘C:\Windows\SysWOW64\ntdll.dll’, No symbols loaded.
‘PVFProject6_CUDA.exe’: Loaded ‘C:\Windows\syswow64\kernel32.dll’, No symbols loaded.
‘PVFProject6_CUDA.exe’: Loaded ‘C:\Windows\syswow64\KERNELBASE.dll’, No symbols loaded.
‘PVFProject6_CUDA.exe’: Loaded ‘c:\program files (x86)\pgi\win32\2011\cuda\3.2\bin\cudart32_32_16.dll’, No symbols loaded.
The thread ‘0.0’ (0x918) has exited with code 0 (0x0).
The program ‘PVFProject6_CUDA.exe: PGI Debug Engine’ has exited with code 0 (0x0).

As far as I can tell, all my path and variable statements are correct. Can someone give me an idea to start looking to find the problem?

Hi PSchneid2000,

Exit code 0 usually means your program ran successfully. Are you seeing a different error?

  • Mat

Hi Mat and everyone,

I am having the same problem actually.
When I planted many breakpoints all over the code and run it in debug mode, I was able to make it run, till I get this message in VS 2010:

No source available
Call stack location:
pgf90_dev_alloc04() Line 348 in “dev_allo.c” address: 14006bd20

the debugger could not locate the source file ‘src\dev_allo.c’.

what does that mean?

also, if I run the code in release mode, I get an error saying that memcopy error, which means that one of the kernel subroutines have problems and cannot copy from device to host of the opposite.

please help.

Dolf

what does that mean?

It just means that the debugger is attempting to step into code which wasn’t compiled with debugging information enabled. In this particular cases, “pgf90_dev_alloc04” is part of the PGI run time libraries so you wouldn’t have access to the source.

if I run the code in release mode, I get an error saying that memcopy error, which means that one of the kernel subroutines have problems and cannot copy from device to host of the opposite.

Most likely when optimization is applied, something changes in your code which causes the problem. You can try debugging with optimization enabled to try and narrow down the problem.

  • Mat

Hi Mat,

Unfortunately, the optimization did not work either.
I have copied the error message when I run in release mode:



COMPUTING PRESSURE FIELD ON THE CURRENT GRID…

starting grid level 5
GRV qni kernel error
unknown error

GRV qnj kernel error
unknown error

istat = 30
0: copyout Memcpy (host=0x1dd0f788, dev=0x4121c5e8, size=8) FAILED: 30(unknown e
rror)
Press any key to continue . . .

the most strange thing is, istat (error code after calling the kernel) is 30!
I thought the error code is from 0, to 11 only.

here is how I call the kernel, how can I check for errors in case I encounter any:

grid = dim3(ceiling(real(nx-1)/threads%x), &
ceiling(real(ny)/threads%y),1)

call GetReynVarqnj_kernel<<<grid,threads>>>(nx,ny,ndx,ndy, &
p,hnew,hjmin,hjmax,cohjmx,s,l,kd,zdatLowDev, &
qndatLowDev,zdatMidDev,qndatMidDev,zdatHighDev,qndatHighDev,qnj)

istat = cudaGetLastError() <=== equals 30!

please advice what could be the meaning of such GetLastError.

thanks,
Dolf

Hi Dolf,

Error 30 is the code for “Unknown Error”, so you’ll need to do some investigation to figure out what’s wrong. Can you determine what’s different between your “Release” and “Debug” builds? What optimization’s are applied? Is one being built in 32-bits and the other in 64-bits?

Emulation mode probably wont help here, so to debug, I’d start commenting out lines in your “GetReynVarqnj_kernel” until the crash goes away (or comment them all out and then add them back until the crash occurs). If you can narrow down where the crash is occurring, you can then get a better idea of why.

  • Mat

Hi Mat,

I was able to fix the debug mode (changed integer, value :: nx,ny >> to integer :: nx,ny), and I ran it, I discovered what’s wrong, I am calling a host subroutine, passing device matrices as reference along with there bounds (nx,ny), so when I read the values of bounds, its all correct values (they are host integers), but unfortunately the matrices have wrong dimension and I dont know why! here is an example of what I am doing:

module Q4_globals

real(8), device, allocatable, dimension (:,:) :: bearx4Dev,beary4Dev
end module Q4_globals

subroutine fullmult
use Q4_globals
real(8), device :: akmaxDev,akinDev
integer :: nx,ny

nx = 20
ny = 20
allocate(bearx4Dev(nx,ny),beary4Dev(nx,ny))

call reyneq(akmaxDev,1.d0,bearx4Dev,beary4Dev,nx,ny)
end subroutine fullmult

subroutine reyneq(akmax,akin,bearx,beary,nx,ny)

integer :: nx,ny (if use integer, value :: the debugger exit with code 0.?!)
real(8), device :: bearx(nx,ny),beary(nx,ny)
real(8), device :: akmax,akin


end subroutine reyneq

when I plant break point to see whats in bearx,beary it tells me that bearx is more than 10000 elements, do you want to expand??
10,000??? why??? whats wrong here??
the nx is still 20 in reyeq sub though!

also, whats the difference between the following???
real(8), device :: bearx
dimension (bearx(nx,ny))

and
real(8), device :: bearx(nx,ny)


please help.
Dolf

Hi Dolf,

allocate(bearxDev(nx,ny),beary(nx,ny))

Did you mean to allocate “bearyDev” instead of “beary”?

call reyneq(akmaxDev,1.d0,bearx4Dev,beary4Dev,nx,ny)

Shouldn’t these variables be “bearxDev” and “bearyDev” (i.e. no “4”)?

  • Mat

sorry, I meant in the module to have bearx4Dev and beary4Dev
also, when I allocated them, I allocated bearx4Dev, not bearx4, since bearx4 is a host matrix.

thanks,

Dolf

so, do you know why the bounds are way off in this case??

Dolf

The code above has errors which would cause odd behavior but I’d need to see a reproducing example to determine the exact cause.

  • Mat

what kind of errors you see? please clarify.

I have sent an e-mail with source code.

cheers,
Dolf

for some reason the e-mail I sent bounced back to me, can you please provide me with your new e-mail so I can resend?

thanks,
Dolf

what kind of errors you see? please clarify.

The ones I pointed out in the earlier post.

can you please provide me with your new e-mail so I can resend?

You can send it to PGI Customer Support at trs@pgroup.com or support@pgroup.com.

  • Mat

for some reason the e-mails I send bounced:

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its
recipients. This is a permanent error. The following address(es) failed:

support@pgroup.com
SMTP error from remote mail server after end of data:
host pgroup.com.s200a1.psmtp.com [207.126.147.10]:
582 The file attached violates our email policy
trs@pgroup.com
SMTP error from remote mail server after end of data:
host pgroup.com.s200a1.psmtp.com [207.126.147.10]:
582 The file attached violates our email policy

------ This is a copy of the message, including all the headers. ------
------ The body of the message is 1934048 characters long; only the first
------ 106496 or so are included here.

Return-path: <tmardan@me.berkeley.edu>
Received: from cmlps1.me.berkeley.edu ([128.32.164.132])
by cm05fe.ist.berkeley.edu with esmtpsa (TLSv1:AES256-SHA:256)
(Exim 4.76)
(auth plain:tmardan@me.berkeley.edu)
(envelope-from <tmardan@me.berkeley.edu>)
id 1TU1gv-0005re-GW; Thu, 01 Nov 2012 13:48:44 -0700
Message-ID: <5092E027.3070806@me.berkeley.edu>
Date: Thu, 01 Nov 2012 13:48:39 -0700
From: “Tholfaqar A. Mardan” <tmardan@me.berkeley.edu>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20120428 Thunderbird/12.0.1
MIME-Version: 1.0
To: PGI Technical Support <trs@pgroup.com>, support@pgroup.com
Subject: Fwd: device matrix assignment by reference error
References: <5092DF8B.5080700@me.berkeley.edu>
In-Reply-To: <5092DF8B.5080700@me.berkeley.edu>
X-Forwarded-Message-Id: <5092DF8B.5080700@me.berkeley.edu>
Content-Type: multipart/mixed;
boundary=“------------000408010400030903080704”

This is a multi-part message in MIME format.
--------------000408010400030903080704
Content-Type: multipart/alternative;
boundary=“------------030501070509090705020402”

You can FTP it to us. Send a note to customer service asking for instructions.

  • Mat

here is the code:

module kernels
use cudafor
implicit none

contains

attributes (global) subroutine mult_kernel (a,b,c,nx,ny)

implicit none
integer, value :: nx,ny
integer :: i,j,k
real(8) :: sum
real(8) :: a(nx,ny),b(nx,ny),c(nx,ny)

i = (blockidx%x - 1) * blockDim%x + threadidx%x
j = (blockidx%y - 1) * blockDim%y + threadidx%y
print*, ‘nx’
if(i <= nx .AND. j <= ny) then
sum = 0
do k=1,ny
sum = sum + (a(i,k) * b(k,j))
enddo
c(i,j) = sum
endif

end subroutine mult_kernel

end module kernels


program prog

use kernels
implicit none
integer :: istat
real(8), device :: cDev(nx,ny)
real(8) :: c(nx,ny)
type(dim3) :: grid,threads
integer :: nx,ny,nx4,ny4
real(8), device, allocatable, dimension (:,:) :: bearxDev,bearyDev,bearx4Dev,beary4Dev
nx = 306
ny = 306

nx4 = nx/4
ny4 = ny/4

allocate (bearxDev(nx,ny),bearyDev(nx,ny),bearx4Dev(nx4,ny4),beary4Dev(nx4,ny4), STAT=istat)
if (istat .ne. 0) write(,) ‘error allocating bearxDev and bearyDev’
if (istat .eq. 0) write(,) ‘allocating bearx and beary successful’
bearxDev(1:nx,1:ny) = 3.0
bearyDev(1:nx,1:ny) = 2.0
write(,) ‘assignment of bearx and beary successful’
threads = dim3(32,16,1)
grid = dim3 (ceiling(real(nx)/threads%x),&
ceiling(real(ny)/threads%y),1)
call mult_kernel<<<grid,threads>>>(bearxDev,bearyDev,cDev,nx,ny)
istat = cudagetlasterror()
write(,) ‘cudalasterror =’ , istat
istat = cudaThreadSynchronize()

c(1:nx,1:ny) = cDev
deallocate(bearxDev,bearyDev,bearx4Dev,beary4Dev)
write(,) 'c(306,306) = ’ ,c(nx,ny)
write(,) ‘the end’
end program prog

here is what I get when I run in release:

0: ALLOCATE: 0 bytes requested; not enough memory: 0(no error)
Press any key to continue . . .

The problem is with how you declare “cDev”. It needs to be an allocatable or an automatic with parameter values to declare the size. As you have it now, nx and ny are uninitialized variables hence cDev’s size is wrong.

To fix:

integer :: istat
integer, parameter :: nx=306,ny=306
real(8), device :: cDev(nx,ny)
real(8) :: c(nx,ny)
type(dim3) :: grid,threads
integer :: nx4,ny4
real(8), device, allocatable, dimension (:,:) :: bearxDev,bearyDev,bearx4Dev,beary4Dev

Also watch how you’re accessing the “b” array. You declare it as “b(nx,ny)” but access it as if it were declared “b(ny,nx)”. It happens to work since nx and ny are the same, but will cause problems if nx .ne. ny.

  • Mat

Hi Mat,

thanks for the reply. Is there a size limit on modules in Cuda fortran??
if yes, what is that limit?
how can I find the size of a matrix in mega bytes??

thanks,
Dolf

Is there a size limit on modules in Cuda fortran??

Do you mean is there a limit on the size of arrays? You’re limited to the amount of memory on your device. If an individual array is >2GB you need to add the flag “-Mlarge_arrays”, a compute capable 2.0 device and use CUDA 4.2 or later.

As far as code size, there probably is some practical limit but I’m not sure what it would be. I’ve seen a 4000 line kernel before so they can get big.

  • Mat