ECC Errors with quad Fermi C2070

mmurphy · February 11, 2011, 11:40pm

I’m having trouble bringing up our code on a new system with four C2070s, CUDA 3.2, and driver 260.19.26 installed. I suspect that one (possibly two) of the GPUs may be DOA. For example, a test program that calls cuCtxCreate() on each of the four devices succeeds for devices 0, 1, and 2, but returns CUDA_ERROR_ECC_UNCORRECTABLE for device 3.
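For anyone who wants to reproduce this kind of per-device check, here is a rough sketch (not the original test program, which is not shown in this thread). It calls the CUDA driver API through Python's ctypes. The library name libcuda.so.1, the exported symbol names cuCtxCreate_v2 and cuCtxDestroy_v2, and the helper names DriverApi, probe_all, and describe are assumptions made for illustration; the numeric value 214 for CUDA_ERROR_ECC_UNCORRECTABLE is taken from cuda.h.

```python
import ctypes

CUDA_SUCCESS = 0
CUDA_ERROR_ECC_UNCORRECTABLE = 214  # value from cuda.h


class DriverApi:
    """Thin ctypes wrapper over the CUDA driver library.

    Assumption: the driver library is loadable as libcuda.so.1 and exports
    the _v2 context entry points (older drivers may only export unsuffixed names).
    """

    def __init__(self, path="libcuda.so.1"):
        self.lib = ctypes.CDLL(path)
        status = self.lib.cuInit(0)
        if status != CUDA_SUCCESS:
            raise RuntimeError(f"cuInit failed with CUresult {status}")

    def device_count(self):
        count = ctypes.c_int(0)
        status = self.lib.cuDeviceGetCount(ctypes.byref(count))
        if status != CUDA_SUCCESS:
            raise RuntimeError(f"cuDeviceGetCount failed with CUresult {status}")
        return count.value

    def try_context(self, ordinal):
        """Create and destroy a context on one device; return the CUresult code."""
        dev = ctypes.c_int(0)
        status = self.lib.cuDeviceGet(ctypes.byref(dev), ordinal)
        if status != CUDA_SUCCESS:
            return status
        ctx = ctypes.c_void_p()
        status = self.lib.cuCtxCreate_v2(ctypes.byref(ctx), 0, dev)
        if status == CUDA_SUCCESS:
            self.lib.cuCtxDestroy_v2(ctx)
        return status


def probe_all(api):
    """Return {device ordinal: CUresult} for every device the driver reports."""
    return {i: api.try_context(i) for i in range(api.device_count())}


def describe(status):
    """Give a readable name to the two status codes this check cares about."""
    if status == CUDA_SUCCESS:
        return "ok"
    if status == CUDA_ERROR_ECC_UNCORRECTABLE:
        return "CUDA_ERROR_ECC_UNCORRECTABLE"
    return f"CUresult {status}"
```

On a machine with the driver installed, something like `for dev, st in probe_all(DriverApi()).items(): print(dev, describe(st))` would list the result for each device. Keeping the driver calls behind a small wrapper object makes the loop easy to exercise without a GPU.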

Immediately after reboot, nvidia-smi can detect the four cards (I had previously disabled ECC on devices 0 and 1). However, running the command twice in a row shows that something is awry:

mjmurphy@alfa ~ $ nvidia-smi -a -r
ECC configuration for GPU 0:
        Current: 0
        After reboot: 0
ECC configuration for GPU 1:
        Current: 0
        After reboot: 0
ECC configuration for GPU 2:
        Current: 1
        After reboot: 1
ECC configuration for GPU 3:
        Current: 1
        After reboot: 1

mjmurphy@alfa ~ $ nvidia-smi -a -r
ECC configuration for GPU 0:
        Current: 0
        After reboot: 0
ECC configuration for GPU 1:
        Current: 0
        After reboot: 0
ECC is not supported by GPU 2
ECC is not supported by GPU 3

And nvidia-smi will not let me query or change the ECC config/status of GPUs 2 and 3.

Also potentially useful:

GPU 3:
        Product Name            : Tesla C2070
        PCI Device/Vendor ID    : 6d110de
        PCI Location ID         : 0:81:0
        Board Serial            : 6178608257
        Display                 : Not connected
        Temperature             : 49 C
        Fan Speed               : 30%
        Utilization
            GPU                 : 0%
            Memory              : 0%
        Volatile ECC errors     :
          Single bit            :
            FB                  : 1
            RF                  : 0
            L1                  : 0
            L2                  : 0
            Total               : 1
          Double bit            :
            FB                  : 3
            RF                  : 0
            L1                  : 0
            L2                  : 0
            Total               : 3

None of the other GPUs show any ECC errors.
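With several cards in one box, it can help to pull the ECC totals out of this report automatically. The following is a minimal sketch that assumes the text layout shown in this post (a "GPU N:" header followed by "Single bit" and "Double bit" sections, each ending in a "Total" line); other driver versions print a different layout. The function name ecc_totals and the SAMPLE excerpt are illustrative only.

```python
import re

# Excerpt in the layout printed by the nvidia-smi report quoted in this thread.
SAMPLE = """\
GPU 3:
        Volatile ECC errors     :
          Single bit            :
            FB                  : 1
            Total               : 1
          Double bit            :
            FB                  : 3
            Total               : 3
"""


def ecc_totals(report):
    """Return {gpu index: {"single": n, "double": n}} from the report text."""
    totals = {}
    gpu = None
    kind = None
    for raw in report.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"GPU (\d+):", line)
        if header:
            gpu = int(header.group(1))
            kind = None
            totals[gpu] = {"single": 0, "double": 0}
            continue
        if gpu is None:
            continue  # ignore anything before the first GPU header
        if line.startswith("Single bit"):
            kind = "single"
        elif line.startswith("Double bit"):
            kind = "double"
        else:
            total = re.fullmatch(r"Total\s*:\s*(\d+)", line)
            if total and kind:
                totals[gpu][kind] = int(total.group(1))
    return totals


print(ecc_totals(SAMPLE))  # {3: {'single': 1, 'double': 3}}
```

Any GPU with a nonzero double-bit total is the one to look at first, since double-bit errors are the ones ECC can detect but not correct.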

Does this mean that some hardware is broken? Or is there another solution?

Thanks

-Mark

vishva · March 24, 2011, 3:53pm

Hi Mark,
We are having the same issue. Have you fixed your problem yet? One of our C2070 GPUs gives the same ECC error.
How can I get rid of this?
thanks
vishva

SPWorley · March 24, 2011, 4:03pm

An easy first step to diagnose is to swap the order of the cards. If the same physical card keeps failing (even after swapping slots with another card), then it’s likely to be a hardware issue.

It may indeed be, since ECC is designed to find such errors in the RAM. If so, you have an easy solution: you can RMA it under warranty.
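The swap test comes down to checking whether the failure follows the board or stays with the slot. Here is a small sketch of that reasoning, keyed on the board serial that nvidia-smi reports. The function name diagnose_swap and the second serial number in the usage example are made up for illustration; only 6178608257 and slot 0:81:0 come from the report above.

```python
def diagnose_swap(before, after, failing_before, failing_after):
    """Decide whether a failure followed the card or stayed with the slot.

    before / after map slot id -> board serial for each boot.
    failing_before / failing_after are the slot ids that reported the error.
    """
    card_before = before[failing_before]
    card_after = after[failing_after]
    same_card = card_before == card_after
    same_slot = failing_before == failing_after
    if same_card and not same_slot:
        return "card"  # the error moved with the physical board
    if same_slot and not same_card:
        return "slot"  # a different board fails in the same slot
    return "inconclusive"  # nothing was actually swapped, or the pattern is unclear


# Hypothetical layout: the second slot id and serial are placeholders.
before = {"0:81:0": "6178608257", "0:82:0": "0000000001"}
after = {"0:81:0": "0000000001", "0:82:0": "6178608257"}
print(diagnose_swap(before, after, "0:81:0", "0:82:0"))  # card
```

A "card" result points toward an RMA, while a "slot" result suggests looking at the motherboard, riser, or power to that slot instead.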