One C1060 out of two is not responsive

Hi,

I have 2 machines connected to an S1070. Each machine obviously sees 2 C1060s (half of the S1070), and if I run deviceQuery I see all 4 C1060s. However, if I run a sample from the SDK (reduction, for example, but the same happens with any other application), one of the machines hangs on device 0 and succeeds on device 1, while the second machine succeeds on device 0 and hangs on device 1.

Any ideas why? Is it cable related?

I’m using CUDA 2.3, and my Linux system is: Linux qa-slave5 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

Also, from time to time (usually after a crash or a reboot) the following files just disappear, and then deviceQuery reports only emulation mode instead of the 4 cards:

/dev/nvidiactl

   /dev/nvidia1

   /dev/nvidia2

If I manually create those files like this:

mknod -m 0666 /dev/nvidiactl c 195 255

mknod -m 0666 /dev/nvidia0 c 195 0

mknod -m 0666 /dev/nvidia1 c 195 1

the system sees the cards again and it works as described above (one device hangs, the other succeeds).
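
For reference, the manual steps above could be wrapped in a small script and run at boot. This is only a sketch: the function name nvidia_node_cmds is made up for illustration, the device count is passed in by hand rather than detected, and the major/minor numbers (195, 0..N-1, 255) are simply the ones from the mknod commands above. It prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/bash
# Sketch only: print the mknod commands needed to recreate the NVIDIA
# device nodes. nvidia_node_cmds is a hypothetical helper name.
# Assumption: major 195, minors 0..N-1 for the GPUs and 255 for the
# control node, as in the manual commands above.
nvidia_node_cmds() {
    local count="$1"
    local i
    for ((i = 0; i < count; i++)); do
        # One character device per GPU: /dev/nvidia0, /dev/nvidia1, ...
        echo "mknod -m 0666 /dev/nvidia$i c 195 $i"
    done
    # The control device shared by all GPUs.
    echo "mknod -m 0666 /dev/nvidiactl c 195 255"
}
```

After sourcing the script, something like "nvidia_node_cmds 2 | sh", run as root, would actually create the nodes. mknod reports an error for nodes that already exist, which is harmless here.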

Any assistance would be very much appreciated.

EDIT:

Furthermore, if I run bandwidthTest on the “faulty” device I get this:

-bash-3.2$ ./bandwidthTest --device=0

Running on......

	  device 0:Tesla C1060

Quick Mode

Host to Device Bandwidth for Pageable memory

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432			   2150.8

Quick Mode

Device to Host Bandwidth for Pageable memory

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432			   1268.2

Quick Mode

Device to Device Bandwidth

The device-to-device bandwidth test just hangs at this point…

The same test on the “valid/working” device works just fine…
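
To make the hang easier to reproduce on both machines, a loop like the following could run the same test on each device with a time limit. This is a rough sketch: probe is a made-up helper name, the 60-second limit is arbitrary, and it assumes the coreutils timeout command is installed (it may not be on an older distribution). Only the --device=N option shown above is used.

```shell
#!/bin/bash
# Sketch only: run a command with a time limit and classify the result.
# probe is a hypothetical helper; it relies on coreutils `timeout`,
# which exits with status 124 when the time limit is reached.
probe() {
    local limit="$1"
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    case $? in
        0)   echo "ok" ;;    # command finished successfully
        124) echo "hang" ;;  # command was killed at the time limit
        *)   echo "fail" ;;  # command exited with some other error
    esac
}

# Run the SDK bandwidth test on both devices if the binary is present
# in the current directory (assumed location).
if [ -x ./bandwidthTest ]; then
    for dev in 0 1; do
        echo "device $dev: $(probe 60 ./bandwidthTest --device=$dev)"
    done
fi
```

Running this on both machines should show "hang" for device 0 on one and for device 1 on the other, if the behavior described above is consistent.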

thanks

eyal