Dear all,
Some years ago I bought a TITAN Black for my workstation, which runs RHEL 7, in order to work with CUDA. The workstation already had another card installed (GTX 970). Two weeks ago I noticed that the TITAN Black was running very slowly. Then it suddenly stopped working altogether. nvidia-smi no longer shows the GPU (the other one still appears), although lspci still lists the TITAN Black. After some research, the most similar topic I found was this one: https://devtalk.nvidia.com/default/topic/1021130/ubuntu-16-04-2-gtx1080-ti-nvidia-smi-failed-to-detect-all-gpus/#5198793
I ran nvidia-bug-report.sh and am posting the interesting lines below. The full log can be found here: https://github.com/fwillo/GK110_Files/blob/master/nvidia-bug-report.log
‘dmesg | grep -i NVRM’ gives me these lines:
[ 3.993472] NVRM: failed to copy vbios to system memory.
[ 3.993696] NVRM: RmInitAdapter failed! (0x30:0xffff:800)
[ 3.993701] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 3.993717] NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
cat /proc/driver/nvidia/gpus/*/information gave me the following for this GPU:
/proc/driver/nvidia/./gpus/0000:01:00.0/information
Model: GeForce GTX TITAN Black
IRQ: 50
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 37 bits
DMA Mask: 0x1fffffffff
Bus Location: 0000:01:00.0
Device Minor: 0
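For anyone comparing this against their own card, here is a minimal Python sketch that flags the fields the driver could not read. It assumes the file uses plain "Key: Value" lines with '?' as the placeholder for unreadable values, as in the output above; the function name unreadable_fields is only illustrative.

```python
def unreadable_fields(info_text):
    """Return the names of fields whose value contains '?' placeholders."""
    bad = []
    for line in info_text.splitlines():
        # Split at the first colon only, so values such as
        # "0000:01:00.0" stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: Value" line
        if "?" in value:
            bad.append(key.strip())
    return bad


if __name__ == "__main__":
    sample = """Model: GeForce GTX TITAN Black
IRQ: 50
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Location: 0000:01:00.0"""
    print(unreadable_fields(sample))
```

On the output above this reports the GPU UUID and Video BIOS fields, which fits the "failed to copy vbios to system memory" message in dmesg.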
/usr/bin/nvidia-debugdump -D
ERROR: GetCaptureBuffer failed, Not Supported, bufSize: 0x20
ERROR: internal_getDumpBuffer failed, return code: 0x3
ERROR: internal_dumpSystemComponent() failed, return code: 0x3
UEsDBBQAAAAIAAAAAAByIlbJxQAAAMAAAAAOAAEAc3lzdGVtX2luZm8ucGIBAcAAP/+3WQ3sf41H
Q71PlhXsCUzTzqztN1TbRC7nMCkkiCuaqzEoR9W7nk+IGzFvtqUBcmJp20rGtsF2aV6JkQSZLSy9
iWurED1AbI9EMl1rjNlt7IacRY6Y2pfw8uGAlRvQK0G9YH9+6OoPfshRlRKW2Ffa5YHVMnf43h5f
WIj2lolpLWF7nLQkFerAGk6xOXqsA2ssLbra4FmWnDS1WNYJPHIkLi81E5cFX7C8iQChreMvQOXo
g6EOog4TAJ4Uhe1ADMNQSwECAAAUAAAACAAAAAAAciJWycUAAADAAAAADgABAAAAAAAAAAAAAAAA
AAAAc3lzdGVtX2luZm8ucGIBUEsFBgAAAAABAAEAPQAAAPIAAAAWAENyZWF0ZWQgYnkgTnZEZWJ1
Z0R1bXA=
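The block printed after the errors looks like a base64-encoded ZIP archive: it starts with "UEsDB", which decodes to the ZIP local file header signature PK\x03\x04. Below is a minimal Python sketch for inspecting that header. It only reads the first 30-byte local file header and the entry name, it does not extract anything, and the function name zip_local_header is only illustrative.

```python
import base64
import struct


def zip_local_header(b64_text):
    """Decode base64 text and parse the first ZIP local file header."""
    # Drop line breaks and other whitespace before decoding.
    data = base64.b64decode("".join(b64_text.split()))
    if data[:4] != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    # Fixed 30-byte header: signature, version, flags, method, time,
    # date, CRC-32, compressed size, uncompressed size, name length,
    # extra field length (all little-endian).
    (_sig, _ver, _flags, method, _time, _date, _crc,
     csize, usize, name_len, _extra_len) = struct.unpack(
        "<4sHHHHHIIIHH", data[:30])
    name = data[30:30 + name_len].decode("ascii", "replace")
    return {
        "name": name,
        "method": method,        # 8 means deflate
        "compressed": csize,
        "uncompressed": usize,
    }


if __name__ == "__main__":
    first_line = ("UEsDBBQAAAAIAAAAAAByIlbJxQAAAMAAAAAOAAEA"
                  "c3lzdGVtX2luZm8ucGIBAcAAP/+3WQ3sf41H")
    print(zip_local_header(first_line))
```

For the dump above, the first entry appears to be a file named system_info.pb, so even with the errors nvidia-debugdump seems to have written at least a small system-info record.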
I removed the GPU and installed it in two other workstations, where it showed the same behaviour. Note that I always used the most recent driver. Interestingly, when I connected a monitor to one of the ports (it didn’t matter which one), the GPU did produce output, such as the BIOS startup screen and the OS boot messages. It failed at X.Org though, because X didn’t find a proper screen device, which makes sense. The output wasn’t distorted.
One of the workstations runs Windows, where the card also misbehaved. Windows reported that the device had returned an error and had therefore been disabled (Code 43). In the hardware details under “error code” I get 0000002B. GPU-Z fails to read values (screenshot here https://github.com/fwillo/GK110_Files/blob/master/GPUZ.gif), although it was possible to extract the BIOS with GPU-Z (uploaded here https://github.com/fwillo/GK110_Files/blob/master/GK110_TITAN_Black_Corrupted.rom).
I’m running out of ideas at the moment and would appreciate any advice on things I might have overlooked.
Looking forward to your answers!
Best wishes,
fwillo