Error for a simple OpenACC program

Hello.
I am testing a simple program, modified for OpenACC, called “picalc”, from the NVIDIA website:

############################

program picalc
implicit none
integer, parameter :: n=1000000
integer :: i
real(kind=8) :: t, pi
pi = 0.0
!$acc parallel loop
do i=0, n-1
t = (i+0.5)/n
pi = pi + 4.0/(1.0 + t*t)
end do
!$acc end parallel loop
print *, 'pi=', pi/n
end program picalc

############################


This program is simple, but my system gives this error:


############################

alechand@pcsantos2:~/gravity$ pgfortran -fast -Minfo=all -o TEST picalc.f90 -ta=nvidia
picalc:
7, Accelerator kernel generated
7, CC 1.3 : 24 registers; 32 shared, 36 constant, 0 local memory bytes
CC 2.0 : 23 registers; 0 shared, 52 constant, 0 local memory bytes
8, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
10, Sum reduction generated for pi
7, Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
alechand@pcsantos2:~/gravity$ ./TEST
call to cuMemcpyDtoH returned error 700: Launch failed
CUDA driver version: 5050

############################

I have this problem with other codes as well.
I installed the PGI compiler 12.10,
and I am using Kubuntu 12.04.

Can you help me?
Thanks

Hi alechand,

There’s something going on with your device or driver. Can you post the output to the command “pgaccelinfo”? Also, are you able to run a simple CUDA program?

Thanks,
Mat

Thanks for the reply.
Here is the output:

#####################################

alechand@pcsantos2:~$ pgaccelinfo
CUDA Driver Version: 5050
NVRM version: NVIDIA UNIX x86 Kernel Module 319.17 Thu Apr 25 22:14:10 PDT 2013

Device Number: 0
Device Name: GeForce GTX 680
Device Revision Number: 3.0
Global Memory Size: 2147155968
Number of Multiprocessors: 8
Number of SP Cores: 1536
Number of DP Cores: 512
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1058 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 3004 MHz
Memory Bus Width: 256 bits
L2 Cache Size: 524288 bytes
Max Threads Per SMP: 2048
Async Engines: 1
Unified Addressing: No
Initialization time: 314151 microseconds
Current free memory: 2095439872
Upload time (4MB): 994 microseconds ( 843 ms pinned)
Download time: 1733 microseconds ( 759 ms pinned)
Upload bandwidth: 4219 MB/sec (4975 MB/sec pinned)
Download bandwidth: 2420 MB/sec (5526 MB/sec pinned)

###################################

What do you mean by a CUDA program?
Can you give me an example?

Thanks

The pgaccelinfo output all looks fine. My test system has a GTX 690, so it's very similar to yours. The only difference is that your driver is newer. I'll see if I can update my driver to check whether that's causing the problem.

Can you now try setting the environment variable “PGI_ACC_DEBUG=1” and run your program again?

What do you mean by a CUDA program?
Can you give me an example?

Assuming you have NVIDIA’s CUDA SDK installed, you can run one of the Sample programs that come with it. For example:

samples/0_Simple/matrixMul% make
/opt/cuda-5.0/bin/nvcc -m64  -gencode arch=compute_10,code=sm_10 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -I/opt/cuda-5.0/include -I. -I.. -I../../common/inc -o matrixMul.o -c matrixMul.cu
g++ -m64 -o matrixMul matrixMul.o -L/opt/cuda-5.0/lib64 -lcudart 
mkdir -p ../../bin/linux/release
cp matrixMul ../../bin/linux/release
samples/0_Simple/matrixMul% cd ../../bin/linux/release/
samples/bin/linux/release% ls
matrixMul*
samples/bin/linux/release% matrixMul 
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GTX 690" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 224.60 GFlop/s, Time= 0.584 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK
  • Mat

I used

PGI_ACC_DEBUG=1

but the behaviour was the same as before.
I installed the CUDA driver from the PGI compiler.
Can you help me to find this sample?

PGI_ACC_DEBUG=1
but the behaviour was the same as before.

Sorry, I should have been more specific. I’d like you to post the output from your run when debugging is enabled.

I installed the CUDA driver from the PGI compiler.

We don’t ship a CUDA driver. This comes from NVIDIA.

Can you help me to find this sample?

  • Mat

I am trying to install the CUDA driver from the NVIDIA website you recommended,
but after installation it says that the samples could not be installed because of missing libraries: Missing required library libglut.so.

My question is: is the CUDA driver that comes with the PGI compiler I downloaded from the PGI website not appropriate to use?

Thanks

My question is: is the CUDA driver that comes with the PGI compiler I downloaded from the PGI website not appropriate to use?

We ship some CUDA libraries and a few utilities that are needed to build your program but we do not ship a CUDA driver. The CUDA driver must be obtained from NVIDIA.

it says that the samples could not be installed because of missing libraries: Missing required library libglut.so

Found this post on stack overflow. Is this the same error you’re getting?

FYI, I updated my CUDA driver to 319.17, which is the same as yours. Everything still works for me, though. I'm not sure what's wrong with your system. Sorry.

  • Mat

Hello.

I decided to erase my machine and install the old (previously working) Kubuntu 11.04 and the PGI compiler 12.10 (for which I have bought a license). I installed the latest graphics card driver from the NVIDIA website.
I needed to copy the crt* files from /usr/lib/i386-linux-gnu/ to /usr/lib in order to make the compiler work.

When I compile a code, it seems fine:

######################################

alechand@pcsantos2:~/test_openacc$ pgfortran -fast -Minfo=all -o MOL_DYN Mol_Dyn.f90 -ta=nvidia
mol_dyn:
28, Loop unrolled 4 times
33, Loop unrolled 8 times
38, Loop unrolled 16 times
76, Generating present_or_copyin(xold(1:5000))
Generating present_or_copy(x(1:5000))
Generating present_or_copy(v(1:5000))
Generating present_or_copy(f(1:5000))
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
77, Loop is parallelizable
Accelerator kernel generated
77, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
CC 1.3 : 18 registers; 60 shared, 12 constant, 0 local memory bytes
CC 2.0 : 20 registers; 0 shared, 76 constant, 0 local memory bytes
83, Loop is parallelizable
106, Loop unrolled 16 times

#######################################

When I execute it, I see the problem:

#######################################

alechand@pcsantos2:~/test_openacc$ ./MOL_DYN
call to cuMemFree returned error 700: Launch failed
CUDA driver version: 5050

#######################################

I tried to install CUDA 5 from the NVIDIA website, but it did not change anything.

I really want to put this to work; that is the reason I bought it …
Please, can you help me?

PS: the compiler was working properly with the same code. The problem SEEMS to have started after I tried to write the result of a code to an output text file (using WRITE in the Fortran code). Do you think the graphics card could have a memory problem since then?

I really appreciate your attention.

PS 2: the previous simple program picalc.f90 is also giving a similar memory error:

################################

alechand@pcsantos2:~/test_openacc$ pgfortran -fast -Minfo=all -o MOL_DYN picalc.f90 -ta=nvidia
picalc:
7, Accelerator kernel generated
7, CC 1.3 : 24 registers; 32 shared, 36 constant, 0 local memory bytes
CC 2.0 : 23 registers; 0 shared, 52 constant, 0 local memory bytes
8, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
7, Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
alechand@pcsantos2:~/test_openacc$ ./MOL_DYN
call to cuMemcpyDtoH returned error 700: Launch failed
CUDA driver version: 5050

#################################

Thanks a lot!

I really want to put this to work; that is the reason I bought it …
Please, can you help me?

I’m trying but everything points to a problem with your specific system and not an issue with the compiler. As you point out, the code does compile and run successfully on other systems, just not yours.

This is why I’d like you to compile and run a CUDA C program using nvcc. If this works, then it’s a problem with the PGI installation. If it fails in the same way, then it’s a problem with your system.
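If the full SDK won't build on your machine (for example because of the missing libglut.so), a minimal standalone test along the following lines should be enough to exercise the driver. This is only a sketch, not an official NVIDIA sample; the file name gputest.cu and the sm_30 architecture flag (matching a GTX 680) are just illustrative choices. It does a host-to-device copy, a kernel launch, and a device-to-host copy, checking each step, so an unhealthy device or driver should fail at roughly the same point as the OpenACC runtime does.

############################

// gputest.cu -- minimal CUDA sanity check (a sketch, not an SDK sample)
// Build, for example:  nvcc -arch=sm_30 -o gputest gputest.cu
// Run:                 ./gputest
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Trivial kernel: add 1.0 to every element.
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Report any CUDA runtime error and bail out of main.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err));                     \
            return 1;                                             \
        }                                                         \
    } while (0)

int main(void)
{
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    float *d = NULL;
    int i;

    for (i = 0; i < n; i++) h[i] = (float)i;

    /* device allocation and host-to-device copy */
    CHECK(cudaMalloc((void **)&d, n * sizeof(float)));
    CHECK(cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice));

    /* kernel launch: 256 threads per block, enough blocks to cover n */
    add_one<<<(n + 255) / 256, 256>>>(d, n);
    CHECK(cudaGetLastError());       /* catches launch failures         */
    CHECK(cudaDeviceSynchronize());  /* catches errors during execution */

    /* device-to-host copy: the step that reports error 700 above */
    CHECK(cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost));
    printf("h[100] = %f (expected 101.0)\n", h[100]);

    CHECK(cudaFree(d));
    free(h);
    return 0;
}

############################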

PS: the compiler was working properly with the same code. The problem SEEMS to have started after I tried to write the result of a code to an output text file (using WRITE in the Fortran code). Do you think the graphics card could have a memory problem since then?

If I understand this correctly, the “Mol_Dyn.f90” code was working until you added the WRITE statement? What happens if you remove the WRITE statement? Is accelerator code still being generated when the WRITE statement is removed?

FYI, a WRITE statement shouldn’t cause this error. However, what could be happening is that without the WRITE statement, the dead code elimination optimization is removing the accelerated code. This is pure speculation, though, and until I have more details I don’t know for sure.

Again, having the full output from a run where you have the environment variable “PGI_ACC_DEBUG” set to 1 may be helpful.

PS 2: the previous simple program picalc.f90 is also giving a similar memory error :

Have you modified this code from your first post? You’re no longer getting the “sum reduction” message.

Here’s what I want to see: the source you’re compiling, the command line options and the -Minfo output, and the output from the run when PGI_ACC_DEBUG is set to 1.

% cat picalc.f90
program picalc
implicit none
integer, parameter :: n=1000000
integer :: i
real(kind=8) :: t, pi
pi = 0.0
!$acc parallel loop
do i=0, n-1
t = (i+0.5)/n
pi = pi + 4.0/(1.0 + t*t)
end do
!$acc end parallel loop
print *, 'pi=', pi/n
end program picalc 
% pgfortran -fast -Minfo=all -o MOL_DYN picalc.f90 -ta=nvidia,4.2 -V12.10
picalc:
      7, Accelerator kernel generated
          7, CC 1.3 : 23 registers; 32 shared, 36 constant, 0 local memory bytes
             CC 2.0 : 23 registers; 0 shared, 60 constant, 0 local memory bytes
          8, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
         10, Sum reduction generated for pi
      7, Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
% setenv PGI_ACC_DEBUG 1
% MOL_DYN
__pgi_cu_init() found 2 devices
__pgi_cu_init( file=picalc.f90, function=picalc, line=7, startline=1, endline=14 )
__pgi_cu_init() will use device 0 (V3.0)
__pgi_cu_init() compute context created
__pgi_cu_module3( lineno=7 )
__pgi_cu_module3 module loaded at 0x85b1c0
__pgi_cu_module_function( name=0x673372=picalc_7_gpu, lineno=7, argname=(nil)=, argsize=12, varname=0x67337f=b1, varsize=8, SWcachesize=0 )
Function handle is 0x8a6db0
__pgi_cu_module_function( name=0x673360=picalc_10_gpu_red, lineno=7, argname=(nil)=, argsize=0, varname=(nil)=, varsize=0, SWcachesize=0 )
Function handle is 0x8a3d60
__pgi_cu_alloc(size=31256,lineno=7,name=)
__pgi_cu_alloc(31256) returns 0x500240000
__pgi_cu_uploadc( "b1", size=8, offset=0, lineno=7 )
constant data b1 at address 0x500140000 devsize=8, size=8, offset=0
First arguments are:
                              0          0
                    
                     0x00000000 0x00000000
__pgi_cu_launch_a(func=0x8a6db0, grid=3907x1x1, block=256x1x1, lineno=7)
__pgi_cu_launch_a(func=0x8a6db0, params=0x7fffdf3d5dac, bytes=8, sharedbytes=2048)
First arguments are:
                        2359296          5
                    
                     0x00240000 0x00000005
__pgi_cu_launch_a(func=0x8a3d60, grid=1x1x1, block=256x1x1, lineno=10)
__pgi_cu_launch_a(func=0x8a3d60, params=0x7fffdf3d5dac, bytes=12, sharedbytes=2048)
First arguments are:
                        2359296          5       3907
                    
                     0x00240000 0x00000005 0x00000f43
__pgi_cu_downloadc( "b1", size=8, offset=0, lineno=7 )
constant data b1 at address 0x500140000 devsize=8, size=8, offset=0
downloaded values are:
                     1409763568 1095235564
                    
                     0x540748f0 0x4147f7ec
__pgi_cu_free( 0x500240000, lineno=12, name= )
Memory Freed
__pgi_cu_close()
 pi=    3.141592656472318
  • Mat

Hello,
i really appreciate your attention.

About the WRITE statement in the Mol_Dyn.f90 code: I tried to use it just once, and it worked that one time. The second time, the memory errors began to appear in all codes, even if I removed the WRITE statement, and also in this simple program picalc.f90. What I thought is that this statement caused, for example, problems in the graphics card.

About picalc.f90, I did not change anything. I just installed a new system (Kubuntu 11.04) and a new PGI compiler.

Here I show the complete output as you asked (I use exactly the same code as you; I've copied it):

##########################################################

alechand@pcsantos2:~/test_openacc$ more picalc.f90
program picalc
implicit none
integer, parameter :: n=1000000
integer :: i
real(kind=8) :: t, pi
pi = 0.0
!$acc parallel loop
do i=0, n-1
t = (i+0.5)/n
pi = pi + 4.0/(1.0 + t*t)
end do
!$acc end parallel loop
print *, 'pi=', pi/n
end program picalc
alechand@pcsantos2:~/test_openacc$ pgfortran -fast -Minfo=all -o MOL_DYN picalc.f90 -ta=nvidia,4.2 -V12.10
picalc:
7, Accelerator kernel generated
7, CC 1.3 : 24 registers; 32 shared, 36 constant, 0 local memory bytes
CC 2.0 : 23 registers; 0 shared, 52 constant, 0 local memory bytes
8, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
10, Sum reduction generated for pi
7, Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
alechand@pcsantos2:~/test_openacc$ PGI_ACC_DEBUG=1
alechand@pcsantos2:~/test_openacc$ ./MOL_DYN
call to cuMemcpyDtoH returned error 700: Launch failed
CUDA driver version: 5050

##########################################################

I don't know if I did something wrong, but the command PGI_ACC_DEBUG=1 seems not to work.

Thanks

alechand@pcsantos2:~/test_openacc$ PGI_ACC_DEBUG=1

This is an environment variable which needs to be set. If you are using csh, the command is “setenv PGI_ACC_DEBUG 1”. If you are using bash the command is “export PGI_ACC_DEBUG=1”.

  • Mat

Here is the output again :

######################################

alechand@pcsantos2:~/test_openacc$ more picalc.f90
program picalc
implicit none
integer, parameter :: n=1000000
integer :: i
real(kind=8) :: t, pi
pi = 0.0
!$acc parallel loop
do i=0, n-1
t = (i+0.5)/n
pi = pi + 4.0/(1.0 + t*t)
end do
!$acc end parallel loop
print *, 'pi=', pi/n
end program picalc
alechand@pcsantos2:~/test_openacc$ pgfortran -fast -Minfo=all -o MOL_DYN picalc.f90 -ta=nvidia,4.2 -V12.10
picalc:
7, Accelerator kernel generated
7, CC 1.3 : 24 registers; 32 shared, 36 constant, 0 local memory bytes
CC 2.0 : 23 registers; 0 shared, 52 constant, 0 local memory bytes
8, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
10, Sum reduction generated for pi
7, Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
alechand@pcsantos2:~/test_openacc$ export PGI_ACC_DEBUG=1
alechand@pcsantos2:~/test_openacc$ ./MOL_DYN
__pgi_cu_init() found 1 devices
__pgi_cu_init( file=/home/alechand/test_openacc/picalc.f90, function=picalc, line=7, startline=1, endline=14 )
__pgi_cu_init() will use device 0 (V3.0)
__pgi_cu_init() compute context created
__pgi_cu_module3( lineno=7 )
__pgi_cu_module3 module loaded at 0x928ba28
__pgi_cu_module_function( name=0x8098c4a=picalc_7_gpu, lineno=7, argname=(nil)=, argsize=8, varname=0x8098c57=b1, varsize=8, SWcachesize=0 )
Function handle is 0x93752a0
__pgi_cu_module_function( name=0x8098c38=picalc_10_gpu_red, lineno=7, argname=(nil)=, argsize=0, varname=(nil)=, varsize=0, SWcachesize=0 )
Function handle is 0x9372bf0
__pgi_cu_alloc(size=31256,lineno=7,name=)
__pgi_cu_alloc(31256) returns 0x206c0000
__pgi_cu_uploadc( "b1", size=8, offset=0, lineno=7 )
constant data b1 at address 0x205c0000 devsize=8, size=8, offset=0
First arguments are:
0 0

0x00000000 0x00000000
__pgi_cu_launch_a(func=0x93752a0, grid=3907x1x1, block=256x1x1, lineno=7)
__pgi_cu_launch_a(func=0x93752a0, params=0xbf98c97c, bytes=4, sharedbytes=2048)
First arguments are:
543948800

0x206c0000
__pgi_cu_launch_a(func=0x9372bf0, grid=1x1x1, block=256x1x1, lineno=10)
__pgi_cu_launch_a(func=0x9372bf0, params=0xbf98c97c, bytes=8, sharedbytes=2048)
First arguments are:
543948800 3907

0x206c0000 0x00000f43
__pgi_cu_downloadc( "b1", size=8, offset=0, lineno=7 )
call to cuMemcpyDtoH returned error 700: Launch failed
CUDA driver version: 5050

############################

Are you seeing something strange?

Thanks a lot

Given the debug output, it looks like you’re compiling to 32-bits? Can you try compiling in 64-bits? (i.e., add “-m64” to your compile options).

The program still works fine for me in 32-bits, but I’m just wondering.

  • Mat

But my system is 32-bit, and I installed the 32-bit version of the compiler. Do you want me to install the 64-bit version of the compiler to test?
Can I install it on a 32-bit Ubuntu version?

Thanks

Can I install it on a 32-bit Ubuntu version?

Yes, you can install the compilers on a 32-bit OS. I just don’t have a GPU attached to a system with a pure 32-bit OS, and this may or may not account for the difference in what we’re each seeing. Let me talk with my IT folks and see if I can get 32-bit Ubuntu installed here next week.

Do you want me to install the 64-bit version of the compiler to test?

If you can, that would be great.

Just curious: why did you choose a 32-bit OS?

  • Mat

Hello.

I use 32 bits because I found lots of problems with 64 bits.
You are not reading my messages very well; I was asking if I can
use the 64-bit version of the compiler on the 32-bit Kubuntu version, because you asked me to use that -m64 parameter in the compilation even though I have a 32-bit system.

I am thinking … is this compiler really easier to use than using CUDA directly?
I cannot even put this to work …
I am very disappointed.
Anyway, I will try to install everything again.

Thanks !

You are not reading my messages very well; I was asking if I can
use the 64-bit version of the compiler on the 32-bit Kubuntu version, because you asked me to use that -m64 parameter in the compilation even though I have a 32-bit system.

No, it’s because I didn’t know you were using a 32-bit OS. Granted, I should not have assumed that you’re running a 64-bit OS, but given that you’re only the second 32-bit OS user I’ve encountered in many years, hopefully it’s a forgivable assumption.

I am thinking … is this compiler really easier to use than using CUDA directly?

Yes, OpenACC is easier to program than CUDA. However, massively parallel programming is still difficult, no matter which programming paradigm you use.

I cannot even put this to work …

Again, we haven’t established that this is a problem with the PGI compilers. It’s possible, but since you have a very unusual configuration it’s hard to know. I’d still like you to try running the NVIDIA CUDA SDK examples. If you can get those to run, then it’s a PGI configuration error. If you can’t, then it’s an NVIDIA problem or a problem with your system.

  • Mat

Hello Mat.

Just to inform you what is happening now…
I followed your suggestion and installed a 64-bit version of Ubuntu 11.10 and all available upgrades.
Then I installed the latest NVIDIA 64-bit driver for my GTX 680 graphics card.
I installed the PGI compiler 12.10, also for 64 bits, with all available options (including the CUDA toolkit). I obtained the same sequence of errors as before (see below):

############################

alechand@pcsantos2:~/test$ more picalc.f90
program picalc
implicit none
integer, parameter :: n=1000000
integer :: i
real(kind=8) :: t, pi
pi = 0.0
!$acc parallel loop
do i=0, n-1
t = (i+0.5_8)/n
pi = pi + 4.0/(1.0 + t*t)
end do
!$acc end parallel loop
print *, 'pi=', pi/n
end program picalc

alechand@pcsantos2:~/test$ pgf90 -ta=nvidia -Minfo=all -o PI picalc.f90
picalc:
7, Accelerator kernel generated
7, CC 1.3 : 23 registers; 32 shared, 36 constant, 0 local memory bytes
CC 2.0 : 23 registers; 0 shared, 60 constant, 0 local memory bytes
8, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
10, Sum reduction generated for pi
7, Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
alechand@pcsantos2:~/test$ export PGI_ACC_DEBUG=1
alechand@pcsantos2:~/test$ ./PI
__pgi_cu_init() found 1 devices
__pgi_cu_init( file=/home/alechand/test/picalc.f90, function=picalc, line=7, startline=1, endline=14 )
__pgi_cu_init() will use device 0 (V3.0)
__pgi_cu_init() compute context created
__pgi_cu_module3( lineno=7 )
__pgi_cu_module3 module loaded at 0x8d8390
__pgi_cu_module_function( name=0x6735f2=picalc_7_gpu, lineno=7, argname=(nil)=, argsize=12, varname=0x6735ff=b1, varsize=8, SWcachesize=0 )
Function handle is 0xabbd50
__pgi_cu_module_function( name=0x6735e0=picalc_10_gpu_red, lineno=7, argname=(nil)=, argsize=0, varname=(nil)=, varsize=0, SWcachesize=0 )
Function handle is 0x9234c0
__pgi_cu_alloc(size=31256,lineno=7,name=)
__pgi_cu_alloc(31256) returns 0x400240000
__pgi_cu_uploadc( "b1", size=8, offset=0, lineno=7 )
constant data b1 at address 0x400140000 devsize=8, size=8, offset=0
First arguments are:
0 0

0x00000000 0x00000000
__pgi_cu_launch_a(func=0xabbd50, grid=3907x1x1, block=256x1x1, lineno=7)
__pgi_cu_launch_a(func=0xabbd50, params=0x7fff13e3d5fc, bytes=8, sharedbytes=2048)
First arguments are:
2359296 4

0x00240000 0x00000004
__pgi_cu_launch_a(func=0x9234c0, grid=1x1x1, block=256x1x1, lineno=10)
__pgi_cu_launch_a(func=0x9234c0, params=0x7fff13e3d5fc, bytes=12, sharedbytes=2048)
First arguments are:
2359296 4 3907

0x00240000 0x00000004 0x00000f43
__pgi_cu_downloadc( "b1", size=8, offset=0, lineno=7 )
call to cuMemcpyDtoH returned error 700: Launch failed
CUDA driver version: 5050

#################################################

I think my graphics card has problems…
What do you think?
Do you know some way I can test my card, to see if everything is OK with it?
Thanks a lot.

I followed your suggestion and installed a 64-bit version of Ubuntu 11.10 and all available upgrades.

Thanks. At least we know it’s not the OS.

Do you know some way I can test my card, to see if everything is OK with it?

Please run a few of the NVIDIA CUDA C SDK examples. https://developer.nvidia.com/cuda-downloads
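If the samples build, the quickest checks are typically deviceQuery and bandwidthTest (assuming the standard CUDA 5.x samples layout, they live under 1_Utilities; build each with “make” and run the resulting binary): deviceQuery should list your GTX 680 without errors, and bandwidthTest exercises host-to-device and device-to-host transfers like the one that is failing here.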

  • Mat