NVRM: Failed to copy vbios to system memory

Dear all,

Some years ago I bought a TITAN Black for my workstation, which runs RHEL 7, in order to work with CUDA. The workstation already had another card installed (a GTX 970). Two weeks ago I noticed that the TITAN Black was running very slowly. Then suddenly I couldn’t use it at all. nvidia-smi no longer lists the GPU (the other one still appears), although lspci still shows the TITAN Black. After some research, the most similar topic I found was this one: https://devtalk.nvidia.com/default/topic/1021130/ubuntu-16-04-2-gtx1080-ti-nvidia-smi-failed-to-detect-all-gpus/#5198793. I ran nvidia-bug-report.sh and quote the interesting lines below. The full log can be found here: https://github.com/fwillo/GK110_Files/blob/master/nvidia-bug-report.log

‘dmesg | grep -i NVRM’ gives me the lines:

[    3.993472] NVRM: failed to copy vbios to system memory.
[    3.993696] NVRM: RmInitAdapter failed! (0x30:0xffff:800)
[    3.993701] NVRM: rm_init_adapter failed for device bearing minor number 0
[    3.993717] NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
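
For anyone who wants to check their own kernel log for the same pattern, here is a minimal Python sketch that picks out the NVRM lines that look like failures. The function name `nvrm_failures` and the list of hint words are my own choices for illustration; they are not part of the driver or any NVIDIA tool.

```python
import re

# Matches kernel log lines such as "[    3.993472] NVRM: failed to copy ...".
NVRM_LINE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM:\s+(?P<msg>.*)$")

# Hint words chosen from the log lines quoted above (an assumption, not a
# complete list of everything the driver can report).
FAILURE_HINTS = ("failed", "error")


def nvrm_failures(dmesg_text):
    """Return (timestamp, message) pairs for NVRM lines that look like failures."""
    hits = []
    for line in dmesg_text.splitlines():
        match = NVRM_LINE.match(line.strip())
        if match and any(h in match.group("msg").lower() for h in FAILURE_HINTS):
            hits.append((float(match.group("ts")), match.group("msg")))
    return hits
```

The text to feed in could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, which usually needs root on recent kernels.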

‘cat /proc/driver/nvidia/gpus/*/information’ gave me the following for this GPU:

/proc/driver/nvidia/./gpus/0000:01:00.0/information
Model: 		 GeForce GTX TITAN Black
IRQ:   		 50
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 37 bits
DMA Mask: 	 0x1fffffffff
Bus Location: 	 0000:01:00.0
Device Minor: 	 0
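
The telling part of this output is the fields that are still `?` placeholders. A small Python sketch that flags them, assuming the file keeps the `Key: value` layout shown above (both function names are made up for this example):

```python
def parse_information(text):
    """Parse 'Key: value' lines of the information file into a dict."""
    fields = {}
    for line in text.splitlines():
        # Skip the path header line that cat prints when globbing.
        if line.lstrip().startswith("/"):
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def unresolved_fields(fields):
    """Return the keys whose values still contain '?' placeholders."""
    return sorted(key for key, value in fields.items() if "?" in value)
```

On the output above this would report the GPU UUID and Video BIOS fields, which fits the driver not having been able to initialise the adapter.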

/usr/bin/nvidia-debugdump -D

ERROR: GetCaptureBuffer failed, Not Supported, bufSize: 0x20
ERROR: internal_getDumpBuffer failed, return code: 0x3
ERROR: internal_dumpSystemComponent() failed, return code: 0x3
UEsDBBQAAAAIAAAAAAByIlbJxQAAAMAAAAAOAAEAc3lzdGVtX2luZm8ucGIBAcAAP/+3WQ3sf41H
Q71PlhXsCUzTzqztN1TbRC7nMCkkiCuaqzEoR9W7nk+IGzFvtqUBcmJp20rGtsF2aV6JkQSZLSy9
iWurED1AbI9EMl1rjNlt7IacRY6Y2pfw8uGAlRvQK0G9YH9+6OoPfshRlRKW2Ffa5YHVMnf43h5f
WIj2lolpLWF7nLQkFerAGk6xOXqsA2ssLbra4FmWnDS1WNYJPHIkLi81E5cFX7C8iQChreMvQOXo
g6EOog4TAJ4Uhe1ADMNQSwECAAAUAAAACAAAAAAAciJWycUAAADAAAAADgABAAAAAAAAAAAAAAAA
AAAAc3lzdGVtX2luZm8ucGIBUEsFBgAAAAABAAEAPQAAAPIAAAAWAENyZWF0ZWQgYnkgTnZEZWJ1
Z0R1bXA=
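
A side note on the text block above: it starts with `UEsDB`, which is what base64-encoded ZIP data (bytes `PK\x03\x04`) begins with, so it looks like a small ZIP archive. A minimal Python sketch to list its members, assuming only the base64 lines are passed in (not the `ERROR:` lines); I have not checked that this particular dump opens cleanly, since it was produced while the tool was reporting errors:

```python
import base64
import io
import zipfile


def list_dump_members(b64_text):
    """Decode base64 text and list the file names in the ZIP archive it holds."""
    # Join the wrapped lines before decoding.
    raw = base64.b64decode("".join(b64_text.split()))
    if not raw.startswith(b"PK"):
        raise ValueError("decoded data does not look like a ZIP archive")
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return archive.namelist()
```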

I removed the GPU and installed it in two other workstations, where it showed the same behaviour. Please note that I always used the most recent driver. Interestingly, when I connected a monitor to one of the ports (it didn’t matter which one), the GPU did produce output such as the BIOS startup and the booting of the OS. It crashed at X.Org though, because it couldn’t find a proper screen device, which makes sense. The output wasn’t distorted.

One of the workstations runs Windows, where the card also misbehaved. Windows reported that the device had returned an error and had therefore been disabled (Code 43). In the hardware details under “error code” I get 0000002B. GPU-Z fails to read values (screenshot here https://github.com/fwillo/GK110_Files/blob/master/GPUZ.gif), although it was possible to extract the BIOS with GPU-Z (uploaded here https://github.com/fwillo/GK110_Files/blob/master/GK110_TITAN_Black_Corrupted.rom).

I’m running out of ideas at the moment and would appreciate any advice on something I might have overlooked.

Looking forward to your answers!

Best wishes,
fwillo

sounds like the card is broken

Ok, I seem to have solved it, but I’m confused by my solution, or rather by why this happened in the first place.

I wanted to take a look into the ROM and used Kepler BIOS Tweaker from TechPowerUp (https://www.techpowerup.com/download/kepler-bios-tweaker/). It showed me that the checksum is incorrect. I concluded that the vBIOS had simply become corrupted (which is a weird thing to happen while the card is running).

With NVFlash, also from TechPowerUp (https://www.techpowerup.com/download/nvidia-nvflash/), I saved the ROM from another workstation which had the same TITAN Black GPU and was working:
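
I don’t know exactly which checksum Kepler BIOS Tweaker verifies. One check that can be reproduced by hand is the standard 8-bit option ROM checksum: a legacy image starts with the bytes 0x55 0xAA, byte 2 holds its length in 512-byte blocks, and all bytes of the image must sum to 0 modulo 256. Below is a minimal Python sketch under that assumption. The function names are mine, and looking for the first 0x55AA is only a heuristic, because a saved ROM can carry other data in front of the image:

```python
def find_signature(rom):
    """Return the offset of the first 0x55AA signature, or -1 if absent."""
    return rom.find(b"\x55\xaa")


def legacy_image_checksum_ok(rom, start=0):
    """Check the 8-bit checksum of the legacy option ROM image at `start`."""
    if rom[start:start + 2] != b"\x55\xaa":
        raise ValueError("no 0x55AA option ROM signature at offset %d" % start)
    # The image length is stored in units of 512 bytes.
    length = rom[start + 2] * 512
    image = rom[start:start + length]
    if length == 0 or len(image) < length:
        raise ValueError("image length field is zero or exceeds the file")
    # A valid legacy image sums to zero modulo 256.
    return sum(image) % 256 == 0
```

This covers only the legacy image. A ROM with UEFI support contains further images that this sketch does not look at.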

# ./nvflash_linux --save GK110_TITAN_Black_Working.rom
NVIDIA Firmware Update Utility (Version 5.414.0)
Simplified Version For OEM Only
IFR Data Size         : 1284 bytes
IFR CRC32             : B7F5E08B
IFR Image Size        : 1536 bytes
IFR Image CRC32       : 25838C8E
IFR Subsystem ID      : 3842-3790
Image Size            : 236544 bytes
Version               : 80.80.4E.00.90
~CRC32                : 1DE27A5C
Image Hash            : A80A196C59A8850E3154C89DC320A23C
OEM String            : NVIDIA
Vendor Name           : NVIDIA Corporation
Product Name          : GK110B Board - 20830031
Product Revision      : Chip Rev
Device Name(s)        : GeForce GTX TITAN Black
Board ID              : E618
PCI ID                : 10DE-100C
Subsystem ID          : 3842-3790
Hierarchy ID          : Normal Board
Chip SKU              : 430-0
Project               : 2083-0031
CDP                   : N/A
Build Date            : 02/07/14
Modification Date     : 02/13/14
UEFI Support          : Yes
UEFI Version          : 0x1002A (Jan 20 2014 @ 17684658 )
UEFI Variant Id       : 0x0000000000000004 ( GK1xx )
UEFI Signer(s)        : Microsoft Corporation UEFI CA 2011
InfoROM Version       : 2083.0031.00.03
InfoROM Backup Exist  : NO
License Placeholder   : Absent
GPU Mode              : N/A
Sign-On Message       : GK110B P2083 SKU 31 VGA BIOS

Then I saved the ROM from the defective GPU and flashed the card with the working ROM (I did this because my guarantee had expired, so I had nothing to lose anyway):

# ./nvflash_linux --save GK110_TITAN_Black_Corrupted.rom 
NVIDIA Firmware Update Utility (Version 5.414.0)
Simplified Version For OEM Only
IFR Data Size         : 1284 bytes
IFR CRC32             : B7F5E08B
IFR Image Size        : 1536 bytes
IFR Image CRC32       : 25838C8E
IFR Subsystem ID      : 3842-3790
Image Size            : 236544 bytes
Version               : 80.80.4E.00.90
~CRC32                : E708258A
Image Hash            : A80A196C59A8850E3154C89DC320A23C
OEM String            : NVIDIA
Vendor Name           : NVIDIA Corporation
Product Name          : GK110B Board - 20830031
Product Revision      : Chip Rev
Device Name(s)        : GeForce GTX TITAN Black
Board ID              : E618
PCI ID                : 10DE-100C
Subsystem ID          : 3842-3790
Hierarchy ID          : Normal Board
Chip SKU              : 430-0
Project               : 2083-0031
CDP                   : N/A
Build Date            : 02/07/14
Modification Date     : 02/13/14
UEFI Support          : Yes
UEFI Version          : 0x1002A (Jan 20 2014 @ 17684658 )
UEFI Variant Id       : 0x0000000000000004 ( GK1xx )
UEFI Signer(s)        : Microsoft Corporation UEFI CA 2011
InfoROM Version       : 2083.0031.00.03
InfoROM Backup Exist  : NO
License Placeholder   : Absent
GPU Mode              : N/A
Sign-On Message       : GK110B P2083 SKU 31 VGA BIOS

# ./nvflash_linux GK110_TITAN_Black_Working.rom

Then I saved the vBIOS from that GPU again to verify that the flash had worked:

# ./nvflash --save GK110_TITAN_Black_Test.rom
NVIDIA Firmware Update Utility (Version 5.414.0)
Simplified Version For OEM Only
IFR Data Size         : 1284 bytes
IFR CRC32             : B7F5E08B
IFR Image Size        : 1536 bytes
IFR Image CRC32       : 25838C8E
IFR Subsystem ID      : 3842-3790
Image Size            : 236544 bytes
Version               : 80.80.4E.00.90
~CRC32                : 1BC7DEB8
Image Hash            : A80A196C59A8850E3154C89DC320A23C
OEM String            : NVIDIA
Vendor Name           : NVIDIA Corporation
Product Name          : GK110B Board - 20830031
Product Revision      : Chip Rev
Device Name(s)        : GeForce GTX TITAN Black
Board ID              : E618
PCI ID                : 10DE-100C
Subsystem ID          : 3842-3790
Hierarchy ID          : Normal Board
Chip SKU              : 430-0
Project               : 2083-0031
CDP                   : N/A
Build Date            : 02/07/14
Modification Date     : 02/13/14
UEFI Support          : Yes
UEFI Version          : 0x1002A (Jan 20 2014 @ 17684658 )
UEFI Variant Id       : 0x0000000000000004 ( GK1xx )
UEFI Signer(s)        : Microsoft Corporation UEFI CA 2011
InfoROM Version       : 2083.0031.00.03
InfoROM Backup Exist  : NO
License Placeholder   : Absent
GPU Mode              : N/A
Sign-On Message       : GK110B P2083 SKU 31 VGA BIOS
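
To compare the three reports above field by field instead of by eye, here is a short Python sketch. It assumes the `Key : Value` layout that nvflash printed here, and the function names are made up for this example. In the reports quoted in this post the only field that differs is ~CRC32; Version and Image Hash are the same in all three.

```python
def parse_nvflash_report(text):
    """Parse 'Key : Value' lines of an nvflash report into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def differing_fields(*reports):
    """Return {key: tuple of values} for keys whose values are not all equal."""
    keys = sorted(set().union(*(report.keys() for report in reports)))
    diff = {}
    for key in keys:
        values = tuple(report.get(key) for report in reports)
        if len(set(values)) > 1:
            diff[key] = values
    return diff
```

Usage would be something like `differing_fields(parse_nvflash_report(working), parse_nvflash_report(corrupted), parse_nvflash_report(test))`, where each argument holds the pasted text of one report.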

After a reboot and installation of the current NVIDIA drivers (they had to be purged before using nvflash), the GPU reported no errors and was listed by nvidia-smi again. A stress test with gpu_burn (https://github.com/Microway/gpu-burn) ran without any errors for an hour and showed the same GFLOP/s as the working TITAN Black. I’m a bit sceptical here, but I might have fixed it. I’ll run some more hands-on tests to confirm. ‘nvidia-debugdump’ is working again (I can upload the results if desired) and ‘dmesg | grep -i NVRM’ shows:

$ dmesg | grep -i NVRM
[  156.930465] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  396.26  Mon Apr 30 18:01:39 PDT 2018 (using threaded interrupts)

Something which is still not working correctly is:

$ cat /proc/driver/nvidia/gpus/0000\:01\:00.0/information
Model:           GeForce GTX TITAN Black
IRQ:             50
GPU UUID:        GPU-????????-????-????-????-????????????
Video BIOS:      ??.??.??.??.??
Bus Type:        PCIe
DMA Size:        40 bits
DMA Mask:        0xffffffffff
Bus Location:    0000:01:00.0
Device Minor:    0

I’m really confused.

How come my posts are hidden?

Little update: I installed the GPU back into my server. Now ‘cat /proc/driver/nvidia/gpus/0000:01:00.0/information’ shows the UUID and BIOS version correctly, so my problem seems to be solved. It is weird, though, that the vBIOS got corrupted while the card was running and the kernel modules for the GPU were loaded.

I have the same problem and have tried everything, but nothing worked. Here is my nvidia-bug-report; if anyone can help me, please do… file:///root/nvidia-bug-report.log.gz
nvidia-bug-report.log.gz (595 KB)