CUDA on Fedora 16: Samples work, simple code compiles but does not execute

Hello,

  1. Using the instructions outlined on

I managed to install CUDA 4.0 on my desktop running Fedora 16 (64-bit). The configuration of my desktop is: Intel Core i7 920 CPU, 6 GB DDR3 SDRAM, GeForce GTX 295 graphics card.

  2. I am using gcc 4.6.2 (which I know is not officially compatible with CUDA 4.0, although the samples do compile and run).
$ gcc --version

gcc (GCC) 4.6.2 20111027 (Red Hat 4.6.2-1)

Copyright (C) 2011 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
  3. Output from deviceQuery:
$ ./deviceQuery

[deviceQuery] starting...

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "GeForce GTX 295"

  CUDA Driver Version / Runtime Version          4.10 / 4.0

  CUDA Capability Major/Minor version number:    1.3

  Total amount of global memory:                 896 MBytes (939327488 bytes)

  (30) Multiprocessors x ( 8) CUDA Cores/MP:     240 CUDA Cores

  GPU Clock Speed:                               1.24 GHz

  Memory Clock rate:                             999.00 Mhz

  Memory Bus Width:                              448-bit

  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)

  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       16384 bytes

  Total number of registers available per block: 16384

  Warp size:                                     32

  Maximum number of threads per block:           512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             256 bytes

  Concurrent copy and execution:                 Yes with 1 copy engine(s)

  Run time limit on kernels:                     No

  Integrated GPU sharing Host Memory:            No

  Support host page-locked memory mapping:       Yes

  Concurrent kernel execution:                   No

  Alignment requirement for Surfaces:            Yes

  Device has ECC support enabled:                No

  Device is using TCC driver mode:               No

  Device supports Unified Addressing (UVA):      No

  Device PCI Bus ID / PCI location ID:           4 / 0

  Compute Mode:

     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX 295"

  CUDA Driver Version / Runtime Version          4.10 / 4.0

  CUDA Capability Major/Minor version number:    1.3

  Total amount of global memory:                 895 MBytes (938803200 bytes)

  (30) Multiprocessors x ( 8) CUDA Cores/MP:     240 CUDA Cores

  GPU Clock Speed:                               1.24 GHz

  Memory Clock rate:                             999.00 Mhz

  Memory Bus Width:                              448-bit

  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)

  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       16384 bytes

  Total number of registers available per block: 16384

  Warp size:                                     32

  Maximum number of threads per block:           512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             256 bytes

  Concurrent copy and execution:                 Yes with 1 copy engine(s)

  Run time limit on kernels:                     Yes

  Integrated GPU sharing Host Memory:            No

  Support host page-locked memory mapping:       Yes

  Concurrent kernel execution:                   No

  Alignment requirement for Surfaces:            Yes

  Device has ECC support enabled:                No

  Device is using TCC driver mode:               No

  Device supports Unified Addressing (UVA):      No

  Device PCI Bus ID / PCI location ID:           5 / 0

  Compute Mode:

     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.10, CUDA Runtime Version = 4.0, NumDevs = 2, Device = GeForce GTX 295, Device = GeForce GTX 295

[deviceQuery] test results...

PASSED

Press ENTER to exit...

The Mandelbrot example works too.

  4. But when I compile and run a simple program from CUDA by Example to test the system,
$ more hello.cu

#include "./common/book.h"

int main ( void ){

   printf( "Hello world!\n" );

   return 0;

}

I get a “Permission denied” error when I execute it:

$ nvcc -c hello.cu -o hello.out

$
$ ./hello.out

bash: ./hello.out: Permission denied

$

Note that I have to use the -c option (why?). Otherwise, I get a string of linker errors:

$ nvcc hello.cu -o hello.out

/usr/bin/ld: /tmp/tmpxft_00000df8_00000000-13_hello.o: undefined reference to symbol 'pthread_cancel@@GLIBC_2.2.5'

/usr/bin/ld: note: 'pthread_cancel@@GLIBC_2.2.5' is defined in DSO /lib64/libpthread.so.0 so try adding it to the linker command line

/lib64/libpthread.so.0: could not read symbols: Invalid operation

collect2: ld returned 1 exit status

I would appreciate responses to the following questions:

  1. Why am I getting a “permission denied” error when I execute the binary file?

  2. Why do I have to use a -c flag? Couldn’t figure out the reason from http://sbel.wisc.edu/Courses/ME964/2008/Documents/nvccCompilerInfo.pdf.

  3. How do I fix this without downgrading to gcc 4.4?

Thanks in advance!

With the -c flag, you are not generating an executable, just an object file.
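To make the reply above concrete: -c tells nvcc to stop after compilation, so the hello.out produced in the original post is an object file without execute permission, which is why the shell answers "Permission denied". A minimal sketch of the two separate steps follows. The helper name build_hello is hypothetical, and -lpthread is included only because a later reply in this thread reports that it is needed on this Fedora 16 setup.

```shell
# Sketch only: build_hello is a hypothetical helper, not part of the CUDA toolkit.
build_hello() {
    # Step 1: -c compiles hello.cu into an object file; nothing runnable yet.
    nvcc -c hello.cu -o hello.o || return 1
    # Step 2: link the object file into an executable.
    # -lpthread is an assumption taken from the later reply in this thread.
    nvcc -lpthread hello.o -o hello.out
}
```

After calling build_hello in the directory that contains hello.cu, ./hello.out should be a regular executable. For a single source file the two steps can be collapsed into one nvcc call without -c.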

Thanks for your prompt reply mfatica!

So I shouldn’t be using the -c flag, in which case the error messages I get need to be addressed, right?

What keeps you from installing a version of gcc that works with CUDA?

Nothing really, except that I couldn’t figure out how to downgrade in Fedora 16. I found some random gcc-4.4 rpm which didn’t work. If you know of a proper way to downgrade, please let me know. It worked on Fedora 15, but doesn’t seem to work in 16.

Installing gcc from source isn’t difficult, particularly if you choose gcc-4.3.6 which doesn’t need GMP, MPFR, MPC, Polyhedra or CLooG for building.

Basically installation works like this:

wget http://gcc.petsads.us/releases/gcc-4.3.6/gcc-4.3.6.tar.bz2
tar -xjf gcc-4.3.6.tar.bz2
mkdir gcc-4.3.6.obj
cd gcc-4.3.6.obj
../gcc-4.3.6/configure --prefix=/usr/local/gcc-4.3.6
make -j4

go grab a coffee…

sudo make install
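Once that compiler is installed, nvcc still has to be told to use it instead of the system gcc. A minimal sketch, assuming the install prefix /usr/local/gcc-4.3.6 from the steps above: nvcc's --compiler-bindir option (short form -ccbin) names the directory holding the host compiler. The helper name nvcc_gcc43 is hypothetical.

```shell
# Sketch only: nvcc_gcc43 is a hypothetical helper name.
# Assumes gcc 4.3.6 was installed with --prefix=/usr/local/gcc-4.3.6.
nvcc_gcc43() {
    # --compiler-bindir points nvcc at the directory containing the host gcc/g++.
    nvcc --compiler-bindir /usr/local/gcc-4.3.6/bin "$@"
}
```

Usage would then look like nvcc_gcc43 hello.cu -o hello.out. This leaves the system gcc 4.6.2 untouched, so no symbolic links in /usr/bin need to change.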

As I don’t do this regularly anymore, it’s tough to keep track of all the symbolic links and references that need to be put in place. Is there an RPM- or YUM-based installation I could do? I can’t seem to find the old RPMs. Do gcc 4.4 or 4.5 work with CUDA 4?

Adding -lpthread makes things work again. This is due to http://forums.fedoraforum.org/showpost.php?p=1445526&postcount=4.

$ nvcc hello.cu -o hello.out

/usr/bin/ld: /tmp/tmpxft_00000c9d_00000000-13_hello.o: undefined reference to symbol 'pthread_cancel@@GLIBC_2.2.5'

/usr/bin/ld: note: 'pthread_cancel@@GLIBC_2.2.5' is defined in DSO /lib64/libpthread.so.0 so try adding it to the linker command line

/lib64/libpthread.so.0: could not read symbols: Invalid operation

collect2: ld returned 1 exit status

$ nvcc -lpthread hello.cu -o hello.out

$ ./hello.out

Hello world!

$

Is there any way to make this implicit when not using a makefile?
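One possible approach, sketched under assumptions rather than offered as an nvcc feature: define a shell function named nvcc (for example in ~/.bashrc) that shadows the real binary and always adds -lpthread. The `command` builtin skips the function and runs the nvcc found on PATH, so the wrapper does not call itself.

```shell
# Sketch only: a shell function that shadows nvcc and always adds -lpthread.
# Putting this in ~/.bashrc is an assumption about the setup, not a requirement.
nvcc() {
    # `command` bypasses this function and runs the real nvcc from PATH.
    command nvcc -lpthread "$@"
}
```

With this in place, nvcc hello.cu -o hello.out expands to the same command that worked above. The function only applies to shells that have loaded it; scripts and makefiles run by other shells still need the flag spelled out.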