Different results with -Mcuda=emu / -Mcuda with simple code

Hello,

While trying to debug a piece of CUDA Fortran code, I get different results between emulation mode, which gives me the expected result, and normal execution mode. This is strange since there is no concurrency between the threads. I stripped the code down and reproduced the bug with this simple piece of code.


MODULE INC_TIME
USE CUDAFOR
CONTAINS

ATTRIBUTES(GLOBAL) SUBROUTINE inc(vel,ncell)

!real, device   :: vel(ncell)
real  :: vel(ncell)
integer, value :: ncell
integer :: th_id
real :: tmp

th_id=threadidx%x

tmp=vel(th_id)+1
vel(th_id)=tmp

END SUBROUTINE inc

END MODULE INC_TIME
!----------------------------------------!

PROGRAM BURGER
USE CUDAFOR
USE INC_TIME

integer, parameter :: ncell=256
real(KIND=4), dimension(ncell) :: v
real(KIND=4), device, allocatable, dimension(:) :: v_dev
integer :: err

v=1
allocate(v_dev(ncell))
v_dev=v

write(*,*) v(100:103)
call inc<<<1,256>>>(v_dev,ncell)
err=cudaThreadSynchronize()
v=0
v=v_dev
write(*,*) v(100:103)

END PROGRAM BURGER

When I run it I get this:

$ pgfortran -Mcuda=emu test1.cuf
$ ./a.out
1.000000 1.000000 1.000000 1.000000
2.000000 2.000000 2.000000 2.000000
$ pgfortran -Mcuda test1.cuf
$ ./a.out
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000 1.000000 1.000000


Any suggestions?

PS: I get the same result with the two different declarations of the array vel in the kernel (I am not sure which one I should use…)

Hi Benoit,

I tested your code and it seems to work for me (see below). What type of device do you have, and what is its compute capability? (You can use the pgaccelinfo utility to find this information.) What OS are you using?

  • Mat
% pgfortran -Mcuda test.cuf -o test.out
% pgfortran -Mcuda=emu test.cuf -o testemu.out
% test.out
    1.000000        1.000000        1.000000        1.000000
    2.000000        2.000000        2.000000        2.000000
% testemu.out
    1.000000        1.000000        1.000000        1.000000
    2.000000        2.000000        2.000000        2.000000

The device is a Tesla C1060 (four of them actually, packed in a Tesla S1070).
The OS is CentOS (5.1, I think), with a 2.6.18 kernel.

Here is the output from pgaccelinfo:

Device Number: 0
Device Name: Tesla C1060
Device Revision Number: 1.3
Global Memory Size: 4294705152
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512 x 512 x 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 262144B
Texture Alignment 256B
Clock Rate: 1440 MHz
Initialization time: 8081325 microseconds
Current free memory 4237164544
Upload time (4MB) 3379 microseconds (3381 ms pinned)
Download time 5499 microseconds (2184 ms pinned)
Upload bandwidth 1241 MB/sec (1240 MB/sec pinned)
Download bandwidth 762 MB/sec (1920 MB/sec pinned)


I should add that I have a C version of the code which works fine with nvcc on this system.

Cheers,

Benoit.

You should add a cudaGetLastError() call after the cudaThreadSynchronize() call and check the return value. I tried your code on my machine (a CUDA 1.1 capable card on a 64-bit Linux installation) and it works properly; but if I omit “-Mcuda=cc11” from my compile line, then all ones are printed as the result, because the kernel is not run at all in that case (the code gets compiled for a 1.3 capable device, and the cudaGetLastError() call I put in properly returns cudaErrorInvalidDeviceFunction). So I suspect the kernel is not run in your case either, for some reason; you should really examine this error value and start debugging from there…

Here is, for your convenience, your full example from above, with the addition mentioned (and with some fixes and some cruft removed):

module inc_time
  use cudafor

  implicit none
contains
  attributes(global) subroutine inc(v, n)
    implicit none
    real  :: v(n)
    integer, value :: n
   
    if (threadIdx%x <= n) then
       v(threadIdx%x) = v(threadIdx%x) + 1
    end if
  end subroutine inc
end module inc_time

program foo
  use cudafor
  use inc_time
 
  implicit none
  integer, parameter :: N = 256
  real, dimension(N) :: v
  real, dimension(:), device, allocatable :: vd
  integer :: error

  v = 1

  allocate(vd(N))
  vd = v

  write(*,*) v(100:103)

  call inc<<<1, N>>>(vd, N)
  error = cudaThreadSynchronize()
  if (error /= cudaSuccess) stop
  error = cudaGetLastError()
  if (error /= cudaSuccess) stop

  v = vd
  write(*,*) v(100:103)

  deallocate(vd)
end program foo
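
By the way, regarding your PS: as far as I remember from the CUDA Fortran language rules, dummy arrays of a global subprogram are assumed to be device data, so the explicit “device” attribute is redundant there and your two declarations should behave identically. A minimal sketch (the module and kernel names here are just for illustration):

module decl_demo
  use cudafor
contains
  attributes(global) subroutine inc2(v, n)
    integer, value :: n
    ! "device" is implied for dummy arrays of a global subprogram,
    ! so the next line could equally well be written "real :: v(n)".
    real, device :: v(n)

    if (threadIdx%x <= n) v(threadIdx%x) = v(threadIdx%x) + 1
  end subroutine inc2
end module decl_demo

Scalars are different: they need the “value” attribute (as “n” above) to be passed by value from the host.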

Hi cgorac,

Well, indeed cudaGetLastError returns “Invalid device function”.
But I am not sure how that helps me. This code runs on other devices, as
you and Mat told me, and I have successfully run C codes on my device.

When I compile and run the sgemm.cuf sample code, I get this somewhat puzzling result:

Device:Tesla C1060 tion (CPU), 1440.0 MHz clock, -0.3 MB memory.

65536 errors were encountered
256x256 * 256x256: 0.011 ms 3140.286 GFlops/s

I am not sure things went smoothly in this case either…

Thank you for any further suggestions.

Benoit.

Hi Benoit,

So in both cases, it appears that the kernel is not being launched.

Before we go further, can you please try rebooting your system? I hate asking this, but I have seen cases where odd behavior like this was due to the CUDA driver getting into an inconsistent state.

Also, thanks to cgorac for the tip on adding the CUDA error checks. Though in your case you have a compute capability 1.3 system, so you don’t want to use “cc11”; “cc13” is the default.
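
For example, just spelling out the default explicitly:

% pgfortran -Mcuda=cc13 test.cuf -o test.out
% test.out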

  • Mat

Yes, the sgemm sample output again looks like the kernel is not being launched; the error reported by the CUDA runtime will probably be the same. The only case in which I have encountered this kind of error so far is when the kernel is compiled for the wrong architecture, so I would still try building the code with “-Mcuda=cc11”, and even “-Mcuda=cc10”, and then running it. If that works, then for some strange reason your card, while 1.3 capable, is at the moment working only in 1.0/1.1 mode with the runtime delivered along with the PGI tools; in that case, restarting the machine, as mkcolg suggested, may indeed help.

On the other hand, could it be that you have another CUDA program running on the same machine at the same time? I think I remember someone on the CUDA forums mentioning, some time ago, that he encountered this error when trying to run two kernels on a single card at the same time…

Hello Mat and cgorac,

I tried all of cc10, cc11, and cc13, with the same result… And no other code is running.

I will try to reboot the computer, but it is a server so it is not such a small undertaking.

One more question: it does not look like the executable cares whether libcudart is defined in the LD_LIBRARY_PATH or not… Is that normal?

It happens that, since I have nvcc installed, I have two libcudart libraries in different places, not necessarily identical, and I wondered if the executable was choosing the wrong one…

One more question: it does not look like the executable cares whether libcudart is defined in the LD_LIBRARY_PATH or not… Is that normal?

We link with the “-rpath” flag to specify the default location of the CUDA runtime libraries. So, provided you have the libs installed in the same place on each system (by default /opt/pgi/linux86-64/10.0/cuda/lib), you don’t need to set LD_LIBRARY_PATH.

You would need to set the LD_LIBRARY_PATH on a system where the libs are in a different directory.
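
If you want to verify what your executable actually picks up at load time, something like this should show both the embedded rpath and the runtime library being resolved (assuming GNU binutils is installed):

% readelf -d a.out | grep -i rpath
% ldd a.out | grep cudart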

  • Mat

I have now rebooted the computer. For some reason, /dev/nvidiactl, /dev/nvidia0, etc. did not exist at first; I had to run deviceQuery from the CUDA SDK as root… Then the device files were created, and I could run my code as a standard user.

After this, I am in the same situation as before: the C code compiled with nvcc works fine, while the Fortran code below (almost the same as before) gives me this output:

1.000000 1.000000 1.000000 1.000000
last error
invalid device function
FORTRAN STOP

MODULE INC_TIME
USE CUDAFOR
CONTAINS

ATTRIBUTES(GLOBAL) SUBROUTINE inc(vel,ncell)

!real, device   :: vel(ncell)
real  :: vel(ncell)
integer, value :: ncell
integer :: th_id
real :: tmp

th_id=threadidx%x

tmp=vel(th_id)+1
vel(th_id)=tmp

END SUBROUTINE inc

END MODULE INC_TIME
!----------------------------------------!

PROGRAM BURGER
USE CUDAFOR
USE INC_TIME

integer, parameter :: ncell=256
real(KIND=4), dimension(ncell) :: v
real(KIND=4), device, allocatable, dimension(:) :: v_dev
integer :: err
character(100) :: str

v=1
allocate(v_dev(ncell))
v_dev=v

write(*,*) v(100:103)
call inc<<<1,256>>>(v_dev,ncell)
err=cudaThreadSynchronize()
if (err /= cudaSuccess) stop
err = cudaGetLastError()
str=cudaGetErrorString(err)
print*,'last error',str
if (err /= cudaSuccess) stop

v=0
v=v_dev
write(*,*) v(100:103)

END PROGRAM BURGER

Hi Benoit,

Your question about the LD_LIBRARY_PATH got me to do some investigation. It turns out our installer missed adding a link in the PGI cuda lib directory. It currently has “libcudart.so” pointing to “libcudart.so.2.3”, but it should also have a “libcudart.so.2” link to “libcudart.so.2.3”.

This is a problem because the loader resolves the library by its SONAME, “libcudart.so.2”, not by the fully versioned file name. So if you have another “libcudart.so.2” somewhere in your LD_LIBRARY_PATH, the loader will pick up that one instead of the correct one from the PGI tree.

What is the output of “ldd a.out” (where a.out is the name of your exe)? Which libcudart.so does it use?

Also, try adding the “libcudart.so.2” link by hand and see if things change.

Example:

ln -s /opt/pgi/linux86-64/10.0/cuda/lib/libcudart.so.2.3 /opt/pgi/linux86-64/10.0/cuda/lib/libcudart.so.2
  • Mat

Hi Mat,

Actually, I had already created this link. It is not such an unusual situation.

The output of ldd is:

libcudart.so.2 => /opt/pgi/linux86-64/10.0/cuda/lib/libcudart.so.2 (0x00002ba5dfc65000)
librt.so.1 => /lib64/librt.so.1 (0x0000003a46a00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003a46200000)
libm.so.6 => /lib64/libm.so.6 (0x0000003a45a00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003a45600000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000316ce00000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003a45e00000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000316ca00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003a44600000)

This is of course when I compile with the 64-bit version of the compiler. My troubles are the same with the 32-bit version.

Benoit.

Hi Benoit,

I’m a bit stumped but not deterred.

Are you able to run “pgaccelinfo”?
What is the output of the following toy program that just copies data to and from a device? (It should print 1’s).

% cat copy.cuf
program copygpu
integer, dimension(5), device :: a
integer, dimension(5) :: b,c

b = 1
c = 0
a = b
c = a

print *, c
end program copygpu
% pgf90 copy.cuf
% a.out
            1            1            1            1            1

Do you know if this system has Security-Enhanced Linux (SELinux) enabled? If so, it’s possible there is a permission issue loading our shared objects.
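If you are not sure, and assuming the SELinux tools are installed, something like the following should tell you (it prints “Enforcing”, “Permissive”, or “Disabled”):

% /usr/sbin/getenforce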

  • Mat

Well Mat, you are on to something!

First, the pgaccelinfo output is:


Device Number: 0
Device Name: Tesla C1060
Device Revision Number: 1.3
Global Memory Size: 4294705152
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512 x 512 x 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 262144B
Texture Alignment 256B
Clock Rate: 1440 MHz
Initialization time: 8063180 microseconds
Current free memory 4237164544
Upload time (4MB) 3419 microseconds (3735 ms pinned)
Download time 4665 microseconds (2249 ms pinned)
Upload bandwidth 1226 MB/sec (1122 MB/sec pinned)
Download bandwidth 899 MB/sec (1864 MB/sec pinned)

Device Number: 1
Device Name: Tesla C1060
Device Revision Number: 1.3
Global Memory Size: 4294705152
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512 x 512 x 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 262144B
Texture Alignment 256B
Clock Rate: 1440 MHz
Initialization time: 8063180 microseconds
Current free memory 4237164544
Upload time (4MB) 3394 microseconds (3738 ms pinned)
Download time 4819 microseconds (2164 ms pinned)
Upload bandwidth 1235 MB/sec (1122 MB/sec pinned)
Download bandwidth 870 MB/sec (1938 MB/sec pinned)

Device Number: 2
Device Name: Tesla C1060
Device Revision Number: 1.3
Global Memory Size: 4294705152
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512 x 512 x 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 262144B
Texture Alignment 256B
Clock Rate: 1440 MHz
Initialization time: 8063180 microseconds
Current free memory 4237164544
Upload time (4MB) 6879 microseconds (6417 ms pinned)
Download time 6180 microseconds (5372 ms pinned)
Upload bandwidth 609 MB/sec ( 653 MB/sec pinned)
Download bandwidth 678 MB/sec ( 780 MB/sec pinned)

Device Number: 3
Device Name: Tesla C1060
Device Revision Number: 1.3
Global Memory Size: 4294705152
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512 x 512 x 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 262144B
Texture Alignment 256B
Clock Rate: 1440 MHz
Initialization time: 8063180 microseconds
Current free memory 4237164544
Upload time (4MB) 6889 microseconds (6408 ms pinned)
Download time 6215 microseconds (5369 ms pinned)
Upload bandwidth 608 MB/sec ( 654 MB/sec pinned)
Download bandwidth 674 MB/sec ( 781 MB/sec pinned)


Then, your piece of code gives correct results!

$ ./a.out
1 1 1 1 1


But the code I posted earlier still results in an “invalid device function”…

Benoit.

Can you please compile with “-Mkeepasm” and then send your binary as well as the resulting asm file (.s) to trs@pgroup.com?

Thanks,
Mat

Sure. You mean compile my piece of code, not your simple test code?

Benoit.

Sorry, yes I mean the failing example.

  • Mat

FYI, Benoit and I were finally able to determine that the problem was due to an old CUDA driver. He had version 177 installed, which only works with CUDA 2.0 or earlier. Since CUDA Fortran currently uses CUDA 2.3, he needed to install a newer driver (version 190).
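
For reference, on Linux the currently loaded driver version can be checked with:

% cat /proc/driver/nvidia/version

which you can then compare against the minimum driver version required by the CUDA release the compiler uses.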

  • Mat