infoROM is corrupted at gpu

Hi,
I own a GeForce GTX TITAN X (bought directly from nvidia.com).
nvidia-smi is giving me a warning:

WARNING: infoROM is corrupted at gpu 0000:03:00.0

Any suggestions on how I could check what might be happening?

Mon May  6 19:54:56 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:03:00.0  On |                  N/A |
| 22%   45C    P8    19W / 250W |    808MiB / 12212MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1704      G   /usr/libexec/Xorg                             26MiB |
|    0      2053      G   /usr/bin/gnome-shell                          47MiB |
|    0      3578      G   /usr/libexec/Xorg                            279MiB |
|    0      7864      G   ...uest-channel-token=1###################    40MiB |
|    0     11693      G   ...uest-channel-token=17##################    44MiB |
|    0     24755      G   /opt/zoom/zoom                                32MiB |
|    0     25935    C+G   /opt/hfs17.0.416/bin/happrentice-bin         326MiB |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:03:00.0

The inforom is a non-volatile storage device on the GPU. It is used to store various data. There is no public specification for its contents.

Corrupted means the inforom did not pass some sort of sanity check (e.g. checksum). Therefore the GPU driver won’t use or trust its contents.

There is no publicly available utility to fix this. The card is damaged. Unless it is under warranty, there isn’t anything you can do to repair it. However, as you are aware, some aspects of the card functionality are still operational. There is no public specification for the behavior of the card with a corrupted inforom.

https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf

Thanks for the reply!

After some tests: this warning ONLY appears after Linux hibernation, and indeed after I wake up the computer I get some corrupted UI elements in a few applications.
BUT before hibernation the card works 100% correctly and no warning is displayed.

Is there a way I could stress test the card and confirm whether it is indeed a hardware fault, or whether there is simply a bug that corrupts a memory address during the hibernation process?

I don’t have anything to suggest. It sounds like a software defect if it completely disappears when you reboot the system.

Yeah, it does sound like that, so it might just be an issue with the CUDA/NVIDIA driver itself. In that case, where should I submit a bug report?

I had this issue on CentOS and on Fedora 26, 27, 28, 29 with multiple different NVIDIA/CUDA drivers (GUI corrupted after hibernation, which might be related to the warning mentioned above).

The instructions for filing a bug report are linked to a sticky post at the top of the CUDA programming forum.

Configuration setup: CentOS Linux release 7.6.1810 (Core) on a Precision T7610 system + driver 418.39 + NVIDIA Corporation GP102 [TITAN Xp]

No warning message is observed in the nvidia-smi output.

Steps taken to attempt a repro:

Opened vscode
Hibernated the system
Powered it back on
Ran nvidia-smi and found no warning message

[root@dhcp-10-24-141-60 ~]# nvidia-smi
Thu May  9 01:32:33 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:03:00.0  On |                  N/A |
| 23%   33C    P8    11W / 250W |    216MiB / 12192MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24537      G   /usr/bin/X                                   107MiB |
|    0     24761      G   /usr/bin/gnome-shell                          74MiB |
|    0     25492      G   ...-token=8175359F555DE6C90C3E6E049C993347    28MiB |
|    0     25621      G   gnome-control-center                           3MiB |
+-----------------------------------------------------------------------------+
[root@dhcp-10-24-141-60 ~]#

Please provide an nvidia bug report (which should be generated once you hit the issue) and detailed steps to reproduce the issue locally.

Hi @amrits! Thank you for checking this out!
In my case it’s 100% reproducible and it does disappear after a reboot.

But I have to apologize: it is ‘sleep’ (suspend?), not hibernation.

Steps to reproduce:

  • open VSCode and/or SideFX Houdini software - no ui issues
  • run nvidia-smi - no errors
  • set ‘put computer to sleep after 10mins’ in settings (MATE, screenshot attached)
  • wait over 10mins
  • ‘wake up computer’
  • run nvidia-smi again - warning appears
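The before/after check in the steps above can be scripted. The following is only a minimal sketch: it assumes nvidia-smi is on the PATH and that the warning text matches the line quoted in this thread. The names `find_inforom_warnings` and `check_now` are made up for illustration.

```python
import re
import subprocess

# Matches the warning line as it appears in the nvidia-smi output quoted in
# this thread; other driver versions may word it differently (assumption).
WARNING_RE = re.compile(r"WARNING: infoROM is corrupted at gpu (\S+)")


def find_inforom_warnings(text):
    """Return the PCI bus IDs named in any infoROM warning lines in `text`."""
    return WARNING_RE.findall(text)


def check_now():
    """Run nvidia-smi once and return (exit_code, flagged_bus_ids)."""
    proc = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # The warning may be written to stdout or stderr, so search both.
    return proc.returncode, find_inforom_warnings(proc.stdout + proc.stderr)
```

Calling `check_now()` once before the machine goes to sleep and once after wake-up, and comparing the two results, gives the same information as the manual steps.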

NVIDIA bug report shared in a ticket, bug ID: 2592193

Before ‘sleep’

After ‘sleep’

Hi, I have two questions:

I have a bug reported under
developer.nvidia.com/nvidia_bug/2592193 (which I have access to)

but I am getting emails referring me to
partners.nvidia.com/bug/viewbug/2592193 (which I do NOT have access to)

Also, the two sites show different ‘status’ values that do not match each other.

The second question: I got an email (referring to partners.nvidia.com, which I do not have access to) with the status changed (on that site) to “will not fix” without an explanation, so maybe a developer here could help and give some insight into why it won’t be fixed?

The “will not fix” indication refers to a specific driver branch (R415).

Our internal testing indicates that this issue is fixed in 418.76 and later. I suggest you move forward and retest with a newer/later driver in the R418 branch, 418.76 or later.
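To see whether an installed driver is at or above the release mentioned above, a version comparison along these lines might help. This is a sketch: the threshold 418.76 is taken from the reply above, the helper names are made up, and the query flags are the standard `--query-gpu=driver_version --format=csv,noheader` options of nvidia-smi.

```python
import subprocess

# Assumption taken from the reply above: the sleep-related issue is reported
# as fixed in 418.76 and later.
FIXED_IN = (418, 76)


def parse_version(text):
    """Turn a version string such as '418.39' or '440.64.00' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_reported_fix(version):
    """True if `version` is at or above the release named in the reply above."""
    return parse_version(version) >= FIXED_IN


def installed_driver_version():
    """Ask nvidia-smi for the driver version; the first non-empty line is used."""
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    lines = [line.strip() for line in proc.stdout.splitlines() if line.strip()]
    if not lines:
        raise RuntimeError("nvidia-smi printed no driver version")
    return lines[0]
```

`has_reported_fix(installed_driver_version())` only says whether the driver is new enough for the sleep-related fix; it says nothing about a card whose infoROM is genuinely damaged.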

Thanks, Robert, for the explanation!

Quick question: I was using the CUDA drivers, and the latest (fixed) drivers are not “CUDA” ones. Can I mix the two, or do I need to wait for a CUDA update (the CUDA driver is currently 418.67)?

You can select drivers from:

http://www.nvidia.com/drivers

for use with CUDA.

We also have this issue, and upgraded the driver to 440.64.00, but the inforom corrupted warning is still there…

Does that mean the only solution is to RMA the card?

We are encountering the same issue in our environment on Tesla GPUs:

nvidia-smi
Wed May 13 22:00:28 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.24       Driver Version: 450.24       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   37C    P0    44W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   37C    P0    45W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0    44W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   37C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   37C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   38C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0

Is there any solution for this error?
We are already using the latest driver mentioned in this post, but the issue remains.
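On a multi-GPU machine the warning names only a PCI bus ID (0000:0B:00.0 above), written with a 4-digit domain, while the table prints 8 digits. A small helper can map the warning back to a GPU index. This is a sketch that assumes the CSV comes from `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader`; the function names are hypothetical.

```python
def normalize_bus_id(bus_id):
    """Pad the PCI domain to 8 hex digits and lowercase, so '0000:0B:00.0'
    and '00000000:0B:00.0' compare equal."""
    domain, bus, devfn = bus_id.strip().split(":")
    return f"{int(domain, 16):08x}:{bus}:{devfn}".lower()


def gpu_index_for(bus_id, csv_text):
    """Find the GPU index whose bus ID matches `bus_id`.

    `csv_text` is expected to hold lines of the form 'index, pci.bus_id'.
    Lines that do not look like that (for example a trailing warning line)
    are skipped. Returns None when nothing matches.
    """
    target = normalize_bus_id(bus_id)
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 2 or fields[1].count(":") != 2:
            continue
        if normalize_bus_id(fields[1]) == target:
            return int(fields[0])
    return None
```

With the eight-GPU output above, the bus ID in the warning corresponds to GPU 3, which would be the card to pull for further testing or an RMA.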

I have the same error. It returns a non-zero exit code, so some of my scripts fail.
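For scripts that stop on the non-zero exit code, one possible workaround is a wrapper that still accepts the output when the only complaint is the infoROM warning. This is a sketch under assumptions: the warning prefix is copied from the output quoted in this thread, the helper names are made up, and the exact exit code is deliberately not hard-coded.

```python
import subprocess

# Prefix of the warning line as quoted in this thread (assumption: unchanged
# across driver versions).
WARNING_PREFIX = "WARNING: infoROM is corrupted"


def strip_inforom_warning(text):
    """Drop infoROM warning lines so downstream parsing sees clean output."""
    kept = [line for line in text.splitlines() if not line.startswith(WARNING_PREFIX)]
    return "\n".join(kept)


def usable_despite_warning(returncode, stdout, stderr=""):
    """True if the run succeeded, or if it failed but the infoROM warning is
    present and there is other output left to use."""
    if returncode == 0:
        return True
    has_warning = WARNING_PREFIX in stdout or WARNING_PREFIX in stderr
    return has_warning and bool(strip_inforom_warning(stdout).strip())


def query(args):
    """Run nvidia-smi with `args`, tolerating the infoROM warning."""
    proc = subprocess.run(["nvidia-smi", *args], capture_output=True, text=True)
    if not usable_despite_warning(proc.returncode, proc.stdout, proc.stderr):
        raise RuntimeError(f"nvidia-smi failed with exit code {proc.returncode}")
    return strip_inforom_warning(proc.stdout)
```

This only hides the symptom from the calling script; the warning itself still needs the driver update or hardware follow-up discussed earlier in the thread.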

I encountered the same problem on Windows (local) when interrupting my deep learning run in JupyterLab. Possibly it is because Windows cannot grant permissions the way sudo does on Linux? My Linux machine doesn’t have this problem. The fix for me on Windows is also easy: just reboot the system and rerun the algorithm. My GPU is a 3090, 350 W, 24 GB.