OK, I seem to have solved it, but I’m confused by my solution, or rather by why this happened in the first place.
I wanted to take a look at the ROM and used Kepler BIOS Tweaker by TechPowerUp (Kepler BIOS Tweaker (v1.27) Download | TechPowerUp). It showed me that the checksum was incorrect. I assumed the vBIOS had simply become corrupted (which is a weird thing to happen while the card is running).
With NVFlash, also by TechPowerUp (NVIDIA NVFlash (5.735.0) Download | TechPowerUp), I saved the ROM from another workstation which had the same TITAN Black GPU and was working:
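For anyone who wants to cross-check a checksum warning without the GUI tool, here is a minimal sketch of the standard PCI option ROM check: the legacy image starts with the bytes 0x55 0xAA, byte 2 holds its length in 512-byte units, and all bytes of the image should sum to zero modulo 256. Whether this is exactly the checksum Kepler BIOS Tweaker reports is an assumption on my part, and the function names are only illustrative. Since a saved dump may carry extra header data in front of the image, the sketch searches for the signature instead of assuming offset 0.

```python
def find_legacy_image(rom: bytes) -> int:
    """Return the offset of the first plausible PCI option ROM image, or -1."""
    pos = rom.find(b"\x55\xaa")
    while pos != -1:
        # Bytes 0x18-0x19 of the ROM header point to the "PCIR" data structure.
        if pos + 0x1A <= len(rom):
            pcir = pos + int.from_bytes(rom[pos + 0x18:pos + 0x1A], "little")
            if rom[pcir:pcir + 4] == b"PCIR":
                return pos
        pos = rom.find(b"\x55\xaa", pos + 1)
    return -1


def legacy_checksum_ok(rom: bytes, start: int = 0) -> bool:
    """Check that the image starting at `start` sums to zero modulo 256."""
    if rom[start:start + 2] != b"\x55\xaa":
        raise ValueError("no 0x55 0xAA signature at this offset")
    length = rom[start + 2] * 512  # byte 2 holds the size in 512-byte units
    image = rom[start:start + length]
    if length == 0 or len(image) < length:
        raise ValueError("length field is zero or runs past the end of the file")
    return sum(image) % 256 == 0


# Example use with one of the dumps saved by nvflash:
#   rom = open("GK110_TITAN_Black_Corrupted.rom", "rb").read()
#   offset = find_legacy_image(rom)
#   print(offset, legacy_checksum_ok(rom, offset) if offset >= 0 else None)
```

This only covers the first legacy image; a dump can contain further images (for example the UEFI one), which this sketch does not inspect.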
# ./nvflash_linux --save GK110_TITAN_Black_Working.rom
NVIDIA Firmware Update Utility (Version 5.414.0)
Simplified Version For OEM Only
IFR Data Size : 1284 bytes
IFR CRC32 : B7F5E08B
IFR Image Size : 1536 bytes
IFR Image CRC32 : 25838C8E
IFR Subsystem ID : 3842-3790
Image Size : 236544 bytes
Version : 80.80.4E.00.90
~CRC32 : 1DE27A5C
Image Hash : A80A196C59A8850E3154C89DC320A23C
OEM String : NVIDIA
Vendor Name : NVIDIA Corporation
Product Name : GK110B Board - 20830031
Product Revision : Chip Rev
Device Name(s) : GeForce GTX TITAN Black
Board ID : E618
PCI ID : 10DE-100C
Subsystem ID : 3842-3790
Hierarchy ID : Normal Board
Chip SKU : 430-0
Project : 2083-0031
CDP : N/A
Build Date : 02/07/14
Modification Date : 02/13/14
UEFI Support : Yes
UEFI Version : 0x1002A (Jan 20 2014 @ 17684658 )
UEFI Variant Id : 0x0000000000000004 ( GK1xx )
UEFI Signer(s) : Microsoft Corporation UEFI CA 2011
InfoROM Version : 2083.0031.00.03
InfoROM Backup Exist : NO
License Placeholder : Absent
GPU Mode : N/A
Sign-On Message : GK110B P2083 SKU 31 VGA BIOS
Then I saved the ROM from the defective GPU and flashed that GPU with the working ROM (my warranty has expired, so what do I have to lose anyway):
# ./nvflash_linux --save GK110_TITAN_Black_Corrupted.rom
NVIDIA Firmware Update Utility (Version 5.414.0)
Simplified Version For OEM Only
IFR Data Size : 1284 bytes
IFR CRC32 : B7F5E08B
IFR Image Size : 1536 bytes
IFR Image CRC32 : 25838C8E
IFR Subsystem ID : 3842-3790
Image Size : 236544 bytes
Version : 80.80.4E.00.90
~CRC32 : E708258A
Image Hash : A80A196C59A8850E3154C89DC320A23C
OEM String : NVIDIA
Vendor Name : NVIDIA Corporation
Product Name : GK110B Board - 20830031
Product Revision : Chip Rev
Device Name(s) : GeForce GTX TITAN Black
Board ID : E618
PCI ID : 10DE-100C
Subsystem ID : 3842-3790
Hierarchy ID : Normal Board
Chip SKU : 430-0
Project : 2083-0031
CDP : N/A
Build Date : 02/07/14
Modification Date : 02/13/14
UEFI Support : Yes
UEFI Version : 0x1002A (Jan 20 2014 @ 17684658 )
UEFI Variant Id : 0x0000000000000004 ( GK1xx )
UEFI Signer(s) : Microsoft Corporation UEFI CA 2011
InfoROM Version : 2083.0031.00.03
InfoROM Backup Exist : NO
License Placeholder : Absent
GPU Mode : N/A
Sign-On Message : GK110B P2083 SKU 31 VGA BIOS
# ./nvflash_linux GK110_TITAN_Black_Working.rom
Then I saved the vBIOS from that GPU again to verify that the flash had worked:
# ./nvflash --save GK110_TITAN_Black_Test.rom
NVIDIA Firmware Update Utility (Version 5.414.0)
Simplified Version For OEM Only
IFR Data Size : 1284 bytes
IFR CRC32 : B7F5E08B
IFR Image Size : 1536 bytes
IFR Image CRC32 : 25838C8E
IFR Subsystem ID : 3842-3790
Image Size : 236544 bytes
Version : 80.80.4E.00.90
~CRC32 : 1BC7DEB8
Image Hash : A80A196C59A8850E3154C89DC320A23C
OEM String : NVIDIA
Vendor Name : NVIDIA Corporation
Product Name : GK110B Board - 20830031
Product Revision : Chip Rev
Device Name(s) : GeForce GTX TITAN Black
Board ID : E618
PCI ID : 10DE-100C
Subsystem ID : 3842-3790
Hierarchy ID : Normal Board
Chip SKU : 430-0
Project : 2083-0031
CDP : N/A
Build Date : 02/07/14
Modification Date : 02/13/14
UEFI Support : Yes
UEFI Version : 0x1002A (Jan 20 2014 @ 17684658 )
UEFI Variant Id : 0x0000000000000004 ( GK1xx )
UEFI Signer(s) : Microsoft Corporation UEFI CA 2011
InfoROM Version : 2083.0031.00.03
InfoROM Backup Exist : NO
License Placeholder : Absent
GPU Mode : N/A
Sign-On Message : GK110B P2083 SKU 31 VGA BIOS
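In the three dumps above the Image Hash is identical, while the ~CRC32 value is different each time (1DE27A5C, E708258A, 1BC7DEB8). One way to see where the saved files actually differ is to compare them byte by byte. The following is a minimal sketch with hypothetical helper names. It computes a plain CRC32 and MD5 over the whole file, and since I don't know how nvflash derives its ~CRC32 and Image Hash fields, these values are not expected to match the ones printed above.

```python
import hashlib
import zlib


def summarize(data: bytes) -> dict:
    """Size plus whole-file CRC32 and MD5 (not necessarily what nvflash prints)."""
    return {
        "size": len(data),
        "crc32": f"{zlib.crc32(data):08X}",
        "md5": hashlib.md5(data).hexdigest().upper(),
    }


def differing_ranges(a: bytes, b: bytes) -> list:
    """Return (start, end) offset pairs, end exclusive, where a and b differ."""
    common = min(len(a), len(b))
    ranges = []
    start = None
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i  # a new run of differing bytes begins here
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b):
        # Everything past the shorter file counts as a difference.
        ranges.append((common, max(len(a), len(b))))
    return ranges


# Example use with the files saved above:
#   good = open("GK110_TITAN_Black_Working.rom", "rb").read()
#   test = open("GK110_TITAN_Black_Test.rom", "rb").read()
#   print(summarize(good), summarize(test))
#   for start, end in differing_ranges(good, test):
#       print(f"0x{start:06X}-0x{end:06X}")
```

If the differing offsets turn out to be confined to a small region, that would at least narrow down which part of the dump changes between cards or between saves.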
After a reboot and installation of the current NVIDIA drivers (they had to be purged before using nvflash), the GPU reported no errors and was listed by nvidia-smi again. A stress test with gpu_burn (GitHub - Microway/gpu-burn: Microway's improved version of GPU Burn) ran without any errors for an hour and showed the same GFLOP/s as the working TITAN Black. I’m a bit sceptical, but I might have fixed it. I’ll run some more hands-on tests to confirm. ‘nvidia-debugdump’ is working again (I can upload the results if desired) and ‘dmesg | grep -i NVRM’ shows:
$ dmesg | grep -i NVRM
[ 156.930465] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 396.26 Mon Apr 30 18:01:39 PDT 2018 (using threaded interrupts)
Something which is still not working correctly is:
$ cat /proc/driver/nvidia/gpus/0000\:01\:00.0/information
Model: GeForce GTX TITAN Black
IRQ: 50
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:01:00.0
Device Minor: 0
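To keep an eye on this across reboots or driver reinstalls, the file can be parsed and the fields that still contain '?' placeholders listed. This is a minimal sketch that assumes the "Key: value" layout shown above; the function names are my own.

```python
def parse_information(text: str) -> dict:
    """Parse 'Key: value' lines into a dict, splitting at the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def placeholder_fields(fields: dict) -> list:
    """Names of the fields whose value still contains a '?' placeholder."""
    return [key for key, value in fields.items() if "?" in value]


# Example use on the machine with the flashed card:
#   path = "/proc/driver/nvidia/gpus/0000:01:00.0/information"
#   with open(path) as f:
#       print(placeholder_fields(parse_information(f.read())))
```

Splitting at the first colon only matters here, because values such as the bus location contain colons themselves.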
I’m really confused.