I’m having trouble bringing up our code on a new system with four C2070s, CUDA 3.2, and driver 260.19.26 installed. I suspect that one (possibly two) of the GPUs may be DOA. For example, a test program that calls cuCtxCreate() on each of the four devices succeeds for devices 0, 1, and 2, but returns CUDA_ERROR_ECC_UNCORRECTABLE for device 3.
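For reference, the test program does roughly the following (a minimal sketch using the CUDA Driver API; the exact loop and error handling here are my reconstruction, not the literal program):

```c
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUresult rc = cuInit(0);
    if (rc != CUDA_SUCCESS) {
        fprintf(stderr, "cuInit failed: %d\n", rc);
        return 1;
    }

    int count = 0;
    cuDeviceGetCount(&count);

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        CUcontext ctx;
        cuDeviceGet(&dev, i);
        rc = cuCtxCreate(&ctx, 0, dev);
        if (rc == CUDA_SUCCESS) {
            printf("device %d: context created OK\n", i);
            cuCtxDestroy(ctx);
        } else {
            /* On this machine, device 3 fails here with
               CUDA_ERROR_ECC_UNCORRECTABLE (214). */
            printf("device %d: cuCtxCreate failed: %d\n", i, rc);
        }
    }
    return 0;
}
```

Build with `nvcc test.c -lcuda` (or `gcc test.c -lcuda` with the CUDA include path set).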
Immediately after a reboot, nvidia-smi detects all four cards (I had previously disabled ECC on devices 0 and 1). However, running the same command twice in a row shows that something is awry:
mjmurphy@alfa ~ $ nvidia-smi -a -r
ECC configuration for GPU 0:
    Current:      0
    After reboot: 0
ECC configuration for GPU 1:
    Current:      0
    After reboot: 0
ECC configuration for GPU 2:
    Current:      1
    After reboot: 1
ECC configuration for GPU 3:
    Current:      1
    After reboot: 1
mjmurphy@alfa ~ $ nvidia-smi -a -r
ECC configuration for GPU 0:
    Current:      0
    After reboot: 0
ECC configuration for GPU 1:
    Current:      0
    After reboot: 0
ECC is not supported by GPU 2
ECC is not supported by GPU 3
From that point on, nvidia-smi will not let me query or change the ECC configuration or status of GPUs 2 and 3.
Also potentially useful:
GPU 3:
    Product Name         : Tesla C2070
    PCI Device/Vendor ID : 6d110de
    PCI Location ID      : 0:81:0
    Board Serial         : 6178608257
    Display              : Not connected
    Temperature          : 49 C
    Fan Speed            : 30%
    Utilization
        GPU    : 0%
        Memory : 0%
    Volatile ECC errors :
        Single bit :
            FB    : 1
            RF    : 0
            L1    : 0
            L2    : 0
            Total : 1
        Double bit :
            FB    : 3
            RF    : 0
            L1    : 0
            L2    : 0
            Total : 3
None of the other GPUs show any ECC errors.
Does this mean that some hardware is broken? Or is there another solution?
Thanks
-Mark