Problem running CUDA 3.1 SDK examples: cudaSafeCall() Runtime API error : all CUDA-capable devices are busy or unavailable

Hi all,

I’ve got an x86_64 CentOS 4.8 (RHEL-compatible) machine connected to half (2 GPUs) of an S1070. I previously had CUDA 2.3 installed and working fine, and today I installed CUDA 3.1.

Everything compiles fine, and my previously compiled CUDA 2.3 programs continue to work.

When trying the examples:

[codebox]# /usr/local/cudasdk31/C/bin/linux/release/clock

cudaSafeCall() Runtime API error : all CUDA-capable devices are busy or unavailable.[/codebox]

I then tried deviceQuery:

[codebox]# /usr/local/cudasdk31/C/bin/linux/release/deviceQuery

[/codebox]

… returns nothing and just exits when I press Enter.
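To take the SDK’s cutil wrappers out of the picture entirely, a minimal stand-alone runtime API probe along these lines should show whether the runtime can see and acquire the devices at all (just a sketch; the file name and build line are only illustrative):

[codebox]// probe_runtime.cu -- minimal runtime API check, no cutil/cudaSafeCall involved
// build (illustrative): nvcc -o probe_runtime probe_runtime.cu
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Runtime API sees %d device(s)\n", count);

    /* Actually acquire device 0; this is where "all CUDA-capable devices
       are busy or unavailable" would normally show up. */
    err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        printf("cudaSetDevice(0): %s\n", cudaGetErrorString(err));
        return 1;
    }
    void *p = NULL;
    err = cudaMalloc(&p, 1024);  /* first real call forces context creation */
    if (err != cudaSuccess) {
        printf("cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(p);
    printf("Device 0 acquired OK\n");
    return 0;
}[/codebox]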

deviceQueryDrv, on the other hand, gives:

[codebox]# /usr/local/cudasdk31/C/bin/linux/release/deviceQueryDrv
CUDA Device Query (Driver API) statically linked version
There are 2 devices supporting CUDA

Device 0: "Tesla T10 Processor"
  CUDA Driver Version: 3.10
  CUDA Capability Major revision number: 1
  CUDA Capability Minor revision number: 3
  Total amount of global memory: 4294770688 bytes
  Number of multiprocessors: 30
  Number of cores: 240
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  Total number of registers available per block: 16384
  Warp size: 32
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512 x 512 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 256 bytes
  Clock rate: 1.30 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: No
  Integrated: No
  Support host page-locked memory mapping: Yes
  Concurrent kernel execution: No
  Device has ECC support enabled: No

Device 1: "Tesla T10 Processor"
  CUDA Driver Version: 3.10
  CUDA Capability Major revision number: 1
  CUDA Capability Minor revision number: 3
  Total amount of global memory: 4294770688 bytes
  Number of multiprocessors: 30
  Number of cores: 240
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  Total number of registers available per block: 16384
  Warp size: 32
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512 x 512 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 256 bytes
  Clock rate: 1.30 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: No
  Integrated: No
  Support host page-locked memory mapping: Yes
  Concurrent kernel execution: No
  Device has ECC support enabled: No

PASSED

Press ENTER to exit…
[/codebox]
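So the driver API path clearly works. For comparison, the driver-API equivalent of the runtime probe above boils down to roughly this (again only a sketch; link against -lcuda, file name illustrative):

[codebox]// probe_driver.c -- minimal driver API check, link with -lcuda
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUresult r = cuInit(0);
    if (r != CUDA_SUCCESS) { printf("cuInit failed: %d\n", r); return 1; }

    int count = 0;
    r = cuDeviceGetCount(&count);
    if (r != CUDA_SUCCESS) { printf("cuDeviceGetCount failed: %d\n", r); return 1; }
    printf("Driver API sees %d device(s)\n", count);

    CUdevice dev;
    r = cuDeviceGet(&dev, 0);
    if (r != CUDA_SUCCESS) { printf("cuDeviceGet failed: %d\n", r); return 1; }

    /* Creating a context is the step that fails when a device is
       busy/unavailable or in a prohibited compute mode. */
    CUcontext ctx;
    r = cuCtxCreate(&ctx, 0, dev);
    if (r != CUDA_SUCCESS) { printf("cuCtxCreate failed: %d\n", r); return 1; }
    printf("Context created on device 0\n");
    cuCtxDestroy(ctx);
    return 0;
}[/codebox]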

It’s totally weird. I’ve already looked into the common things, such as rebooting the system, checking the permissions on /dev/nvidia* (all are a+rw), and checking the compute mode of the GPUs:

[codebox]# ls -al /dev
total 0
[…snip…]
crw-rw-rw- 1 root root 195,   0 Jul 14 15:28 nvidia0
crw-rw-rw- 1 root root 195,   1 Jul 14 15:28 nvidia1
crw-rw-rw- 1 root root 195,   2 Jul 14 15:28 nvidia2
crw-rw-rw- 1 root root 195,   3 Jul 14 15:28 nvidia3
crw-rw-rw- 1 root root 195,   4 Jul 14 15:28 nvidia4
crw-rw-rw- 1 root root 195, 255 Jul 14 15:28 nvidiactl
[…snip…]

# nvidia-smi -s
COMPUTE mode rules for GPU 0: 0
COMPUTE mode rules for GPU 1: 0
[/codebox]
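For what it’s worth, the compute mode can also be read programmatically through the driver API rather than via nvidia-smi; a small sketch (link with -lcuda), where 0 means the default/shared mode:

[codebox]// compute_mode.c -- report each GPU's compute mode via the driver API
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    int count = 0, i;
    if (cuInit(0) != CUDA_SUCCESS) { printf("cuInit failed\n"); return 1; }
    cuDeviceGetCount(&count);
    for (i = 0; i < count; ++i) {
        CUdevice dev;
        int mode = -1;
        cuDeviceGet(&dev, i);
        cuDeviceGetAttribute(&mode, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, dev);
        /* 0 = default (shared), 1 = exclusive, 2 = prohibited */
        printf("GPU %d compute mode: %d\n", i, mode);
    }
    return 0;
}[/codebox]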

Any ideas on what might be causing this behaviour? Any help is much appreciated.

– Alf

What driver are you running? It is quite possible that, if you didn’t upgrade to a 256-series driver at the same time, you are seeing a runtime API vs. driver version conflict (which would explain why your older code still works and why the driver API code works too). Usually the error message is slightly different from that, but it is still worth checking.
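One quick way to check for that kind of mismatch is to ask the runtime which driver and runtime versions it sees; a small sketch, assuming it is built with the 3.1 toolkit’s nvcc:

[codebox]// version_check.cu -- compare the CUDA versions reported by driver and runtime
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    /* e.g. 3010 for a CUDA 3.1 driver */
    cudaRuntimeGetVersion(&runtimeVersion);  /* e.g. 3010 for the CUDA 3.1 runtime */
    printf("Driver reports CUDA version:  %d\n", driverVersion);
    printf("Runtime reports CUDA version: %d\n", runtimeVersion);
    if (driverVersion < runtimeVersion)
        printf("Driver is older than the runtime -- likely a version conflict\n");
    return 0;
}[/codebox]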

Driver version is 256.35.

I had a similar problem. I ended up reinstalling the 64-bit driver; make sure to purge everything first. I think the problem had to do with initially installing the driver from a repository and then upgrading it from there, which confused the OS (64-bit Lucid).

Thanks for the great idea. I fully uninstalled the NVIDIA driver (using /usr/bin/nvidia-uninstall) and then reinstalled it. My deviceQuery call now works!

Unfortunately, running the “clock” program still reports that no CUDA devices are available. I double-checked that both devices are in compute mode 0, and that no other CUDA apps are running on the system.
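To rule out the SDK’s cutil error handling, a bare-bones kernel launch with plain runtime error checking should show whether the failure really comes from the runtime itself; something like this sketch (the dummy kernel is purely illustrative):

[codebox]// launch_test.cu -- trivial kernel launch without cutil/cudaSafeCall
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy(int *out)
{
    out[threadIdx.x] = threadIdx.x;  /* trivial work, just to touch the device */
}

int main(void)
{
    int *d_out = NULL;
    cudaError_t err = cudaMalloc((void **)&d_out, 32 * sizeof(int));
    if (err != cudaSuccess) {
        printf("cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }
    dummy<<<1, 32>>>(d_out);
    err = cudaThreadSynchronize();   /* CUDA 3.1-era synchronisation call */
    if (err != cudaSuccess) {
        printf("kernel launch: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Kernel ran OK\n");
    cudaFree(d_out);
    return 0;
}[/codebox]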

Any ideas?

I tried out version 256.44 of the driver today, but got the same problem. deviceQuery and deviceQueryDrv can both see the GPUs on the S1070, but when I try “clock” or anything like it, it still reports that no GPUs are available.

When I revert to version 190.53 of the driver and try the sample programs in the v2.3 SDK, everything works.