Trouble Getting Started CUDA/PGI Fortran

PSchneid2000 · February 15, 2011, 5:13pm

Using:
Windows 7 Ultimate 64-bit
Tesla C2050
nVidia Driver 260.81
CUDA Toolkit 3.2 (32 and 64 bit installed)
MS Visual Studio 2010
PGI Fortran 11.2

Code: (compiled in 32-bits)
PROGRAM PRUEBA
USE CUDAFOR
IMPLICIT NONE

INTEGER :: NUMBX
INTEGER, DEVICE :: NUMBX_con

NUMBX = 320
!
NUMBX_con = 2

END PROGRAM

Debug Output:
‘PVFProject6_CUDA.exe’: Loaded ‘C:\Windows\SysWOW64\ntdll.dll’, No symbols loaded.
‘PVFProject6_CUDA.exe’: Loaded ‘C:\Windows\syswow64\kernel32.dll’, No symbols loaded.
‘PVFProject6_CUDA.exe’: Loaded ‘C:\Windows\syswow64\KERNELBASE.dll’, No symbols loaded.
‘PVFProject6_CUDA.exe’: Loaded ‘c:\program files (x86)\pgi\win32\2011\cuda\3.2\bin\cudart32_32_16.dll’, No symbols loaded.
The thread ‘0.0’ (0x918) has exited with code 0 (0x0).
The program ‘PVFProject6_CUDA.exe: PGI Debug Engine’ has exited with code 0 (0x0).

As far as I can tell, all my path and variable statements are correct. Can someone give me an idea to start looking to find the problem?

MatColgrove · February 15, 2011, 10:09pm

Hi PSchneid2000,

Exit code 0 usually means your program ran successfully. Are you seeing a different error?

Mat

Dolfie · October 29, 2012, 5:40pm

Hi Mat and everyone,

I am having the same problem actually.
When I planted many breakpoints all over the code and run it in debug mode, I was able to make it run, till I get this message in VS 2010:

No source available
Call stack location:
pgf90_dev_alloc04() Line 348 in “dev_allo.c” address: 14006bd20

the debugger could not locate the source file ‘src\dev_allo.c’.

what does that mean?

also, if I run the code in release mode, I get an error saying that memcopy error, which means that one of the kernel subroutines have problems and cannot copy from device to host of the opposite.

please help.

Dolf

MatColgrove · October 29, 2012, 6:14pm

what does that mean?

It just means that the debugger is attempting to step into code which wasn’t compiled with debugging information enabled. In this particular cases, “pgf90_dev_alloc04” is part of the PGI run time libraries so you wouldn’t have access to the source.

if I run the code in release mode, I get an error saying that memcopy error, which means that one of the kernel subroutines have problems and cannot copy from device to host of the opposite.

Most likely when optimization is applied, something changes in your code which causes the problem. You can try debugging with optimization enabled to try and narrow down the problem.

Mat

Dolfie · October 29, 2012, 11:26pm

Hi Mat,

Unfortunately, the optimization did not work either.
I have copied the error message when I run in release mode:

COMPUTING PRESSURE FIELD ON THE CURRENT GRID…

starting grid level 5
GRV qni kernel error
unknown error

GRV qnj kernel error
unknown error

istat = 30
0: copyout Memcpy (host=0x1dd0f788, dev=0x4121c5e8, size=8) FAILED: 30(unknown e
rror)
Press any key to continue . . .

the most strange thing is, istat (error code after calling the kernel) is 30!
I thought the error code is from 0, to 11 only.

here is how I call the kernel, how can I check for errors in case I encounter any:

grid = dim3(ceiling(real(nx-1)/threads%x), &
ceiling(real(ny)/threads%y),1)

call GetReynVarqnj_kernel<<<grid,threads>>>(nx,ny,ndx,ndy, &
p,hnew,hjmin,hjmax,cohjmx,s,l,kd,zdatLowDev, &
qndatLowDev,zdatMidDev,qndatMidDev,zdatHighDev,qndatHighDev,qnj)

istat = cudaGetLastError() <=== equals 30!

please advice what could be the meaning of such GetLastError.

thanks,
Dolf

MatColgrove · October 30, 2012, 2:44pm

Hi Dolf,

Error 30 is the code for “Unknown Error”, so you’ll need to do some investigation to figure out what’s wrong. Can you determine what’s different between your “Release” and “Debug” builds? What optimization’s are applied? Is one being built in 32-bits and the other in 64-bits?

Emulation mode probably wont help here, so to debug, I’d start commenting out lines in your “GetReynVarqnj_kernel” until the crash goes away (or comment them all out and then add them back until the crash occurs). If you can narrow down where the crash is occurring, you can then get a better idea of why.

Mat

Dolfie · October 30, 2012, 5:35pm

Hi Mat,

I was able to fix the debug mode (changed integer, value :: nx,ny >> to integer :: nx,ny), and I ran it, I discovered what’s wrong, I am calling a host subroutine, passing device matrices as reference along with there bounds (nx,ny), so when I read the values of bounds, its all correct values (they are host integers), but unfortunately the matrices have wrong dimension and I dont know why! here is an example of what I am doing:

module Q4_globals

real(8), device, allocatable, dimension (:,:) :: bearx4Dev,beary4Dev
end module Q4_globals

subroutine fullmult
use Q4_globals
real(8), device :: akmaxDev,akinDev
integer :: nx,ny

nx = 20
ny = 20
allocate(bearx4Dev(nx,ny),beary4Dev(nx,ny))

call reyneq(akmaxDev,1.d0,bearx4Dev,beary4Dev,nx,ny)
end subroutine fullmult

subroutine reyneq(akmax,akin,bearx,beary,nx,ny)

integer :: nx,ny (if use integer, value :: the debugger exit with code 0.?!)
real(8), device :: bearx(nx,ny),beary(nx,ny)
real(8), device :: akmax,akin

…
end subroutine reyneq

when I plant break point to see whats in bearx,beary it tells me that bearx is more than 10000 elements, do you want to expand??
10,000??? why??? whats wrong here??
the nx is still 20 in reyeq sub though!

also, whats the difference between the following???
real(8), device :: bearx
dimension (bearx(nx,ny))

and
real(8), device :: bearx(nx,ny)

please help.
Dolf

MatColgrove · October 30, 2012, 6:00pm

Hi Dolf,

allocate(bearxDev(nx,ny),beary(nx,ny))

Did you mean to allocate “bearyDev” instead of “beary”?

call reyneq(akmaxDev,1.d0,bearx4Dev,beary4Dev,nx,ny)

Shouldn’t these variables be “bearxDev” and “bearyDev” (i.e. no “4”)?

Mat

Dolfie · October 30, 2012, 6:31pm

sorry, I meant in the module to have bearx4Dev and beary4Dev
also, when I allocated them, I allocated bearx4Dev, not bearx4, since bearx4 is a host matrix.

thanks,

Dolf

Dolfie · October 31, 2012, 5:36pm

so, do you know why the bounds are way off in this case??

Dolf

MatColgrove · October 31, 2012, 5:41pm

The code above has errors which would cause odd behavior but I’d need to see a reproducing example to determine the exact cause.

Mat

Dolfie · November 1, 2012, 7:08pm

what kind of errors you see? please clarify.

I have sent an e-mail with source code.

cheers,
Dolf

Dolfie · November 1, 2012, 8:26pm

for some reason the e-mail I sent bounced back to me, can you please provide me with your new e-mail so I can resend?

thanks,
Dolf

MatColgrove · November 1, 2012, 8:31pm

what kind of errors you see? please clarify.

The ones I pointed out in the earlier post.

can you please provide me with your new e-mail so I can resend?

You can send it to PGI Customer Support at trs@pgroup.com or support@pgroup.com.

Mat

Dolfie · November 1, 2012, 8:51pm

for some reason the e-mails I send bounced:

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its
recipients. This is a permanent error. The following address(es) failed:

support@pgroup.com
SMTP error from remote mail server after end of data:
host pgroup.com.s200a1.psmtp.com [207.126.147.10]:
582 The file attached violates our email policy
trs@pgroup.com
SMTP error from remote mail server after end of data:
host pgroup.com.s200a1.psmtp.com [207.126.147.10]:
582 The file attached violates our email policy

------ This is a copy of the message, including all the headers. ------
------ The body of the message is 1934048 characters long; only the first
------ 106496 or so are included here.

Return-path: <tmardan@me.berkeley.edu>
Received: from cmlps1.me.berkeley.edu ([128.32.164.132])
by cm05fe.ist.berkeley.edu with esmtpsa (TLSv1:AES256-SHA:256)
(Exim 4.76)
(auth plain:tmardan@me.berkeley.edu)
(envelope-from <tmardan@me.berkeley.edu>)
id 1TU1gv-0005re-GW; Thu, 01 Nov 2012 13:48:44 -0700
Message-ID: <5092E027.3070806@me.berkeley.edu>
Date: Thu, 01 Nov 2012 13:48:39 -0700
From: “Tholfaqar A. Mardan” <tmardan@me.berkeley.edu>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20120428 Thunderbird/12.0.1
MIME-Version: 1.0
To: PGI Technical Support <trs@pgroup.com>, support@pgroup.com
Subject: Fwd: device matrix assignment by reference error
References: <5092DF8B.5080700@me.berkeley.edu>
In-Reply-To: <5092DF8B.5080700@me.berkeley.edu>
X-Forwarded-Message-Id: <5092DF8B.5080700@me.berkeley.edu>
Content-Type: multipart/mixed;
boundary=“------------000408010400030903080704”

This is a multi-part message in MIME format.
--------------000408010400030903080704
Content-Type: multipart/alternative;
boundary=“------------030501070509090705020402”

MatColgrove · November 1, 2012, 9:16pm

You can FTP it to us. Send a note to customer service asking for instructions.

Mat

Dolfie · November 1, 2012, 9:38pm

here is the code:

module kernels
use cudafor
implicit none

contains

attributes (global) subroutine mult_kernel (a,b,c,nx,ny)

implicit none
integer, value :: nx,ny
integer :: i,j,k
real(8) :: sum
real(8) :: a(nx,ny),b(nx,ny),c(nx,ny)

i = (blockidx%x - 1) * blockDim%x + threadidx%x
j = (blockidx%y - 1) * blockDim%y + threadidx%y
print*, ‘nx’
if(i <= nx .AND. j <= ny) then
sum = 0
do k=1,ny
sum = sum + (a(i,k) * b(k,j))
enddo
c(i,j) = sum
endif

end subroutine mult_kernel

end module kernels

program prog

use kernels
implicit none
integer :: istat
real(8), device :: cDev(nx,ny)
real(8) :: c(nx,ny)
type(dim3) :: grid,threads
integer :: nx,ny,nx4,ny4
real(8), device, allocatable, dimension (:,:) :: bearxDev,bearyDev,bearx4Dev,beary4Dev
nx = 306
ny = 306

nx4 = nx/4
ny4 = ny/4

allocate (bearxDev(nx,ny),bearyDev(nx,ny),bearx4Dev(nx4,ny4),beary4Dev(nx4,ny4), STAT=istat)
if (istat .ne. 0) write(,) ‘error allocating bearxDev and bearyDev’
if (istat .eq. 0) write(,) ‘allocating bearx and beary successful’
bearxDev(1:nx,1:ny) = 3.0
bearyDev(1:nx,1:ny) = 2.0
write(,) ‘assignment of bearx and beary successful’
threads = dim3(32,16,1)
grid = dim3 (ceiling(real(nx)/threads%x),&
ceiling(real(ny)/threads%y),1)
call mult_kernel<<<grid,threads>>>(bearxDev,bearyDev,cDev,nx,ny)
istat = cudagetlasterror()
write(,) ‘cudalasterror =’ , istat
istat = cudaThreadSynchronize()

c(1:nx,1:ny) = cDev
deallocate(bearxDev,bearyDev,bearx4Dev,beary4Dev)
write(,) 'c(306,306) = ’ ,c(nx,ny)
write(,) ‘the end’
end program prog

here is what I get when I run in release:

0: ALLOCATE: 0 bytes requested; not enough memory: 0(no error)
Press any key to continue . . .

MatColgrove · November 5, 2012, 10:39pm

The problem is with how you declare “cDev”. It needs to be an allocatable or an automatic with parameter values to declare the size. As you have it now, nx and ny are uninitialized variables hence cDev’s size is wrong.

To fix:

integer :: istat
integer, parameter :: nx=306,ny=306
real(8), device :: cDev(nx,ny)
real(8) :: c(nx,ny)
type(dim3) :: grid,threads
integer :: nx4,ny4
real(8), device, allocatable, dimension (:,:) :: bearxDev,bearyDev,bearx4Dev,beary4Dev

Also watch how you’re accessing the “b” array. You declare it as “b(nx,ny)” but access it as if it were declared “b(ny,nx)”. It happens to work since nx and ny are the same, but will cause problems if nx .ne. ny.

Mat

Dolfie · November 6, 2012, 7:49pm

Hi Mat,

thanks for the reply. Is there a size limit on modules in Cuda fortran??
if yes, what is that limit?
how can I find the size of a matrix in mega bytes??

thanks,
Dolf

MatColgrove · November 6, 2012, 11:38pm

Is there a size limit on modules in Cuda fortran??

Do you mean is there a limit on the size of arrays? You’re limited to the amount of memory on your device. If an individual array is >2GB you need to add the flag “-Mlarge_arrays”, a compute capable 2.0 device and use CUDA 4.2 or later.

As far as code size, there probably is some practical limit but I’m not sure what it would be. I’ve seen a 4000 line kernel before so they can get big.

Mat

Topic		Replies	Views
cuda fortran questions Legacy PGI Compilers	10	10950	July 27, 2012
Error running simple CUDA Fortran program Legacy PGI Compilers	9	21319	February 26, 2010
About PGI Fortran and CUDA 4.0 Legacy PGI Compilers	24	12986	August 12, 2011
Problems with FORTRAN Accelerator and subroutines Legacy PGI Compilers	21	11924	August 17, 2011
Are there memory limitations on Device when using large arrays? Tesla C1060 CUDA Programming and Performance	40	14779	April 22, 2009
CUDA Fortran matrix-multiply 10x slower than CUDA C version Legacy PGI Compilers	5	6935	July 14, 2010
Can a Kernel be too big?? CUDA_ERROR_NO_BINARY_FOR_GPU error 209 CUDA Programming and Performance	11	3049	November 13, 2017
CUDA Fortran and Fortran 77 Legacy PGI Compilers	13	8233	March 12, 2012
Unknow error when calling device subroutine Legacy PGI Compilers	18	10476	September 1, 2015
why "all CUDA-capable devices are busy or unavailable" ? CUDA Programming and Performance	34	64391	April 20, 2011

Trouble Getting Started CUDA/PGI Fortran

Related topics