deviceQuery OK, everything else hangs: CUDA SDK 4.1 examples simply hang, no errors, no warnings

Hi,

I recently installed a Tesla C2075 on Ubuntu 10.04. Drivers and runtime are both CUDA 4.1. Compiling the SDK runs through without errors. deviceQuery returns:
[deviceQuery] starting…

bin/linux/release/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: “Tesla C2075”
CUDA Driver Version / Runtime Version 4.1 / 4.1
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5375 MBytes (5636554752 bytes)
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock Speed: 1.15 GHz
Memory Clock rate: 1566.00 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.1, CUDA Runtime Version = 4.1, NumDevs = 1, Device = Tesla C2075
[deviceQuery] test results…
PASSED

exiting in 3 seconds: 3…2…1…done!

So far, no problem. I can also see the card using nvidia-smi:
Tue Apr 3 16:25:39 2012
+------------------------------------------------------+
| NVIDIA-SMI 2.285.05 Driver Version: 285.05.33 |
|-------------------------------+----------------------+----------------------+
| Nb. Name | Bus Id Disp. | Volatile ECC SB / DB |
| Fan Temp Power Usage /Cap | Memory Usage | GPU Util. Compute M. |
|===============================+======================+======================|
| 0. Tesla C2075 | 0000:06:00.0 Off | 0 0 |
| 30% 52 C P0 80W / 225W | 0% 10MB / 5375MB | 99% Default |
|-------------------------------+----------------------+----------------------|
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+

Now, if I try to run an application, say vectorAdd, the status looks like this:
Tue Apr 3 16:26:29 2012
+------------------------------------------------------+
| NVIDIA-SMI 2.285.05 Driver Version: 285.05.33 |
|-------------------------------+----------------------+----------------------+
| Nb. Name | Bus Id Disp. | Volatile ECC SB / DB |
| Fan Temp Power Usage /Cap | Memory Usage | GPU Util. Compute M. |
|===============================+======================+======================|
| 0. Tesla C2075 | 0000:06:00.0 Off | 0 0 |
| 30% 52 C P12 32W / 225W | 1% 59MB / 5375MB | 0% Default |
|-------------------------------+----------------------+----------------------|
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0. 9361 …A_GPU_Computing_SDK/C/bin/linux/release/vectorAdd 47MB |
+-----------------------------------------------------------------------------+

Nothing seems to happen on the card, GPU Util(isation, I assume) is stuck at 0% and the code just hangs, no error no nothing. The only thing I see is that one of the CPUs is at 100%.

Any ideas what might be wrong?

Thanks a lot, MW

Have you connected both power connectors?

Thanks for the hint. Yes, I actually used one 8-pin connector instead of two 6-pin connectors, which the installation manual describes as an alternative. I also checked that this connector is seated firmly. But maybe two 6-pin connectors work better?

[Edit] I also tried running the card with two 6-pin connectors, same result: deviceQuery is OK, everything else from the SDK hangs, the card shows no utilization of compute resources, and top shows that the process uses 100% CPU ‘instead’.

[Edit]

I checked a couple of samples, even simplePrintf; all show the same behaviour. They print a few lines such as:

[simplePrintf] starting…

GPU Device 0: “Tesla C2075” with compute capability 2.0

Device 0: “Tesla C2075” with Compute 2.0 capability

printf() is called. Output:

… and then nothing happens any more, no error, no warning, 1 CPU for the process at 100% (all other cores on the machine idle).

What really puzzles me is the absence of error messages…
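One way to tell a real hang apart from a very slow start is to run each sample under a timeout. Below is a minimal Python sketch of that idea. The helper name run_with_timeout and the time limit are my own assumptions (nothing from the SDK); it only reports whether the process finished, failed, or had to be killed.

```python
import subprocess

def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, output).

    status is 'ok' (exit code 0), 'failed' (non-zero exit code)
    or 'hung' (still running after timeout_s seconds and killed).
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
            universal_newlines=True,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child on timeout; keep whatever it printed.
        out = exc.output or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return "hung", out
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout

# Hypothetical usage, assuming the sample binary is in the current directory:
#   status, output = run_with_timeout(["./vectorAdd"], 60)
#   print(status)
#   print(output)
```

If the status comes back as 'hung', the captured output shows how far the sample got before it stopped making progress.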

MW

Do you have write permissions in the directory where you are running the SDK samples? Most of those try and write a log file to disk, and they hang if they can’t write to disk…

Interesting point. I installed and compiled the SDK in my home directory, so read/write permissions should be OK there. Do the system-wide CUDA libraries need any special permissions?

[Edit] Tried running everything as root, same problem. We make the libraries available by listing them in /etc/ld.so.conf.d/CUDA.conf and then running ldconfig. The really strange thing is that deviceQuery runs (I guess it was also compiled on this system, wasn’t it?) while all other programs don’t. …

[Edit] Resetting the card with nvidia-smi seems initially successful:

$> nvidia-smi -r --id=0

GPU 0000:06:00.0 was successfully reset.

But after this the card is gone:

> ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release ./deviceQuery

[deviceQuery] starting…

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 10

-> invalid device ordinal

[deviceQuery] test results…

FAILED

exiting in 3 seconds: 3…2…1…done!

dmesg returns:

[ 305.559141] NVRM: Xid (0000:06:00): 31, Ch 00000000, engmask 00000101, intr 10000000

[ 314.338656] NVRM: Xid (0000:06:00): 44, 0000 00000000 00000000 00000000 00000000 00000000

[ 316.990208] NVRM: Xid (0000:06:00): 31, Ch 00000000, engmask 00000101, intr 10000000

[ 317.155736] NVRM: Xid (0000:06:00): 31, Ch 00000001, engmask 00000101, intr 10000000

[ 317.166933] NVRM: Xid (0000:06:00): 31, Ch 00000002, engmask 00000101, intr 10000000

[ 317.181420] NVRM: Xid (0000:06:00): 31, Ch 00000003, engmask 00000101, intr 10000000

[ 343.705342] NVRM: Xid (0000:06:00): 31, Ch 00000000, engmask 00000101, intr 10000000

[ 352.309550] NVRM: Xid (0000:06:00): 44, 0000 00000000 00000000 00000000 00000000 00000000

[64551.752616] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1193)

[64551.752640] NVRM: rm_init_adapter(0) failed

[64582.002530] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1193)

[64582.002569] NVRM: rm_init_adapter(0) failed

[64584.468063] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1193)

[64584.468104] NVRM: rm_init_adapter(0) failed

Is this normal behaviour??
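To get an overview of which Xid codes show up and how often, the dmesg text can be summarised with a few lines of Python. This is only a sketch: the function name count_xids is made up here, and the regular expression assumes the "NVRM: Xid (bus): code," layout seen in the lines above. It counts the codes and does not interpret them.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (0000:06:00): 31, Ch 00000000, ..."
XID_LINE = re.compile(r"NVRM: Xid \(([0-9a-fA-F:.]+)\): (\d+),")

def count_xids(dmesg_text):
    """Return a Counter mapping (pci_address, xid_code) to occurrences."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts

# Three lines copied from the dmesg output in this post.
SAMPLE = """\
[ 305.559141] NVRM: Xid (0000:06:00): 31, Ch 00000000, engmask 00000101, intr 10000000
[ 314.338656] NVRM: Xid (0000:06:00): 44, 0000 00000000 00000000 00000000 00000000 00000000
[64551.752616] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1193)
"""

for (pci, code), n in sorted(count_xids(SAMPLE).items()):
    print("%s Xid %d seen %d time(s)" % (pci, code, n))
```

Lines without an Xid, such as the RmInitAdapter failures, are simply skipped.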

MW

A related question: is there any way of telling whether the card is not getting enough power?

MW

Have you tried the card in a different machine? I looked at the big list of error codes, and those suggest that something is seriously wrong with your hardware somewhere.

Actually, I do not get any error messages unless I try to reset the card. My biggest problem is that the programs start but hang, without any error messages. The machine itself is a server that has been running fine for three years, so far without any glitch whatsoever. Trying the card in a different machine would mean taking it off the network and installing all the libraries and stuff again. Do you think the card is broken?

MW

No, I think your card is doing just fine; it has to do with the Ubuntu software stack.

I am having the same problem. On CentOS 6.2 it works fine, but with Ubuntu 12.04 beta 2 I can only execute deviceQuery successfully; nbody -benchmark just hangs.

So if switching to CentOS 6.2 is a possible avenue for you, then you have your solution. Unfortunately I have to make it work on Ubuntu, and I wonder what the next steps would be to figure out what is going wrong. An strace shows that nbody -benchmark keeps issuing futex calls that return EAGAIN (Resource temporarily unavailable):

open("/proc/driver/nvidia/params", O_RDONLY) = 15
fstat(15, {st_mode=S_IFREG|0444, st_size=0, …}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9d65427000
read(15, "EnableVia4x: 0\nEnableALiAGP: 0\nN"…, 1024) = 456
close(15) = 0
munmap(0x7f9d65427000, 4096) = 0
stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), …}) = 0
open("/dev/nvidiactl", O_RDWR) = 15
ioctl(15, 0xc01446ce, 0x7fff4da34400) = 0
ioctl(15, 0xc020462b, 0x7fff4da343f0) = 0
write(12, "\253", 1) = 1
futex(0x7fff4da34430, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1334807419, 187367000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
pipe([16, 17]) = 0
fcntl(16, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
write(12, "\253", 1) = 1
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
ioctl(4, 0xc020462a, 0x7fff4da34500) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
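A note on reading this trace: according to the futex(2) man page, FUTEX_WAIT returns EAGAIN when the futex word no longer holds the expected value, so the call returns immediately instead of sleeping. A long run of those would fit a busy loop with one core at 100%. To see how lopsided a trace is, the calls can be tallied with a short Python sketch like the one below. The function name tally_syscalls is hypothetical, and the regular expression only assumes the usual "name(args) = result ERRNO" strace layout.

```python
import re
from collections import Counter

# name(args) = result [ERRNO (description)]
STRACE_LINE = re.compile(r"^(\w+)\(.*\)\s+=\s+(\S+)(?:\s+([A-Z][A-Z0-9]*)\b)?")

def tally_syscalls(strace_text):
    """Return a Counter mapping (syscall_name, errno_or_None) to occurrences."""
    counts = Counter()
    for line in strace_text.splitlines():
        match = STRACE_LINE.match(line.strip())
        if match:
            counts[(match.group(1), match.group(3))] += 1
    return counts

# A few lines copied from the trace in this post.
SAMPLE = """\
open("/dev/nvidiactl", O_RDWR) = 15
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
"""

for (name, errno_name), n in tally_syscalls(SAMPLE).most_common():
    print(name, errno_name or "ok", n)
```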

Looks like the previous nbody run that I aborted is also still registered on the card:
ssh n6 nvidia-smi
Wed Apr 18 21:04:52 2012
+------------------------------------------------------+
| NVIDIA-SMI 2.285.05 Driver Version: 285.05.33 |
|-------------------------------+----------------------+----------------------+
| Nb. Name | Bus Id Disp. | Volatile ECC SB / DB |
| Fan Temp Power Usage /Cap | Memory Usage | GPU Util. Compute M. |
|===============================+======================+======================|
| 0. Tesla M2075 | 0000:05:00.0 Off | 0 0 |
| N/A N/A P12 28W / 225W | 1% 59MB / 5375MB | 0% Default |
|-------------------------------+----------------------+----------------------|
| 1. Tesla M2075 | 0000:03:00.0 Off | 0 0 |
| N/A N/A P12 31W / 225W | 1% 59MB / 5375MB | 0% Default |
|-------------------------------+----------------------+----------------------|
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0. 11978 /home/hybrid/nbody 47MB |
| 1. 2373 ./nbody 47MB |
+-----------------------------------------------------------------------------+
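To check whether the processes listed in that table still exist on the host, the "Compute processes" rows can be pulled out and each PID probed. This is a rough Python sketch under two assumptions: the row layout matches the nvidia-smi 2.285 output pasted here, and the script runs on the same machine as the GPU. The names compute_processes and pid_alive are my own.

```python
import os
import re

# Matches rows like "| 0. 11978 /home/hybrid/nbody 47MB |"
PROC_ROW = re.compile(r"^\|\s*(\d+)\.\s+(\d+)\s+(\S+)\s+(\d+)MB\s*\|\s*$")

def compute_processes(smi_text):
    """Extract (gpu, pid, name, mem_mb) dicts from the compute process rows."""
    rows = []
    for line in smi_text.splitlines():
        match = PROC_ROW.match(line.strip())
        if match:
            rows.append({
                "gpu": int(match.group(1)),
                "pid": int(match.group(2)),
                "name": match.group(3),
                "mem_mb": int(match.group(4)),
            })
    return rows

def pid_alive(pid):
    """Return True if a process with this PID exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True

# Rows copied from the table in this post.
SAMPLE = """\
| 0. Tesla M2075 | 0000:05:00.0 Off | 0 0 |
| 0. 11978 /home/hybrid/nbody 47MB |
| 1. 2373 ./nbody 47MB |
"""

for row in compute_processes(SAMPLE):
    print(row["gpu"], row["pid"], row["name"], row["mem_mb"])
```

On the GPU host itself, calling pid_alive(row["pid"]) for each row would show whether an entry belongs to a process that is really gone.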

Michael

I’m having the same problem, Ubuntu 11.10 with Linux 3.0.0-17-generic, CUDA 4.1, nvidia-driver 285.05.33, four GeForce GTX 295.

Sometimes a combination of enabling/disabling persistence mode, compute-exclusive mode and reloading the nvidia module helps. dmesg has this to say:

[286314.871216] NVRM: Xid (0000:15:00): 13, 0001 00000000 000050c0 00000368 00000000 00000100

I tried again with the brand new CUDA 4.2 and NVIDIA devdriver 295.41. It still fails on Ubuntu Server 12.04 beta 2 with kernel 3.2.0-23-generic, but it works fine on Ubuntu Server 11.10 with kernel 3.0.0-17-generic.

The only change I needed to make for all examples to compile was to NVIDIA_GPU_Computing_SDK/C/common/common.mk moving the OPENGLLIB linking behind the RENDERCHECKGLLIB linking:

*** common.mk 2012-04-20 21:25:05.497193895 -0700
--- /home/hybrid/nvidia/common.mk 2012-04-20 13:05:19.672992402 -0700
***************
*** 268,285 ****
  # If dynamically linking to CUDA and CUDART, we exclude the libraries from the LIB
  ifeq ($(USECUDADYNLIB),1)
!      LIB += ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${LIB} -ldl -rdynamic
  else
  # static linking, we will statically link against CUDA and CUDART
    ifeq ($(USEDRVAPI),1)
!      LIB += -lcuda ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${LIB}
    else
       ifeq ($(emu),1)
           LIB += -lcudartemu
       else
           LIB += -lcudart
       endif
!      LIB += ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${LIB}
    endif
  endif
--- 268,285 ----
  # If dynamically linking to CUDA and CUDART, we exclude the libraries from the LIB
  ifeq ($(USECUDADYNLIB),1)
!      LIB += $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${OPENGLLIB} ${LIB} -ldl -rdynamic
  else
  # static linking, we will statically link against CUDA and CUDART
    ifeq ($(USEDRVAPI),1)
!      LIB += -lcuda $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${OPENGLLIB} ${LIB}
    else
       ifeq ($(emu),1)
           LIB += -lcudartemu
       else
           LIB += -lcudart
       endif
!      LIB += $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${OPENGLLIB} ${LIB}
    endif
  endif

So for now I seem to have to stick to Ubuntu 11.10, which works with CUDA 4.2, as I could not get it to work on 12.04 beta.

Michael

I had the same thing. At least update the driver to the very newest version (295.41), then install and build the toolkit and the SDK at 4.2.
Then go into SDK/C/… and run ./deviceQuery. It prints a few lines and hangs. This is your solution: WAIT 5 MINUTES. I’m serious. After that, it works like a charm. Otherwise, at least try running things as root.

The SOLUTION: Disable the IOMMU feature in the BIOS. After that it works on 12.04 the same as it works under 11.10, no more hangs.
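To see from a running system whether an IOMMU seems to be in play, a rough heuristic is to look at the kernel command line and at the boot messages. The Python sketch below does that on text you pass in. The function name iommu_hints is hypothetical, and the markers it searches for (the intel_iommu=, amd_iommu= and iommu= boot parameters, and "DMAR" / "AMD-Vi" / "IOMMU" in dmesg) are assumptions about what a typical kernel prints, so treat an empty result as "no hint found", not as proof.

```python
import re

# Kernel boot parameters that influence the IOMMU.
IOMMU_PARAM = re.compile(r"^(intel_iommu|amd_iommu|iommu)=(\S+)$")
# Substrings that typically appear in IOMMU-related boot messages (heuristic).
DMESG_MARKERS = ("DMAR", "AMD-Vi", "IOMMU")

def iommu_hints(cmdline, dmesg_text):
    """Return a list of human-readable hints that an IOMMU is configured."""
    hints = []
    for token in cmdline.split():
        match = IOMMU_PARAM.match(token)
        if match:
            hints.append("cmdline: %s=%s" % (match.group(1), match.group(2)))
    for line in dmesg_text.splitlines():
        if any(marker in line for marker in DMESG_MARKERS):
            hints.append("dmesg: " + line.strip())
    return hints

# Hypothetical usage on a live system (dmesg may need root on some setups):
#   import subprocess
#   with open("/proc/cmdline") as f:
#       cmdline = f.read()
#   dmesg_text = subprocess.check_output(["dmesg"], universal_newlines=True)
#   for hint in iommu_hints(cmdline, dmesg_text):
#       print(hint)
```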

Michael