`nvidia-smi -q` shows several "Unknown Error" fields; GPU ignored by PyTorch

Our RTX 4090 goes into an error state after a few hours of PyTorch processing on a headless Ubuntu 22.04 LTS machine. The job is not killed: it keeps occupying memory on the GPU and creates load on the CPU, but puts no load on the GPU. The error state only clears after rebooting the system.

nvidia-smi shows ERR! for FAN:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                    0 |
|ERR!   38C    P5    49W / 450W |   2021MiB / 23028MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    101360      C   python                           2018MiB |
+-----------------------------------------------------------------------------+

nvidia-smi -q shows “Unknown Error” for Fan Speed, GPU T.Limit Temp, Clocks Throttle Reasons and all Clocks. See dump below.
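
For anyone who wants to check for this state automatically, here is a minimal Python sketch (not part of the original report) that scans nvidia-smi -q text and lists every field whose value is "Unknown Error". The function name fields_with_value and the " > " path separator are my own choices, and SAMPLE is a shortened excerpt of the dump in this post.

```python
def fields_with_value(report, needle="Unknown Error"):
    """Return the paths of all fields in `nvidia-smi -q` text whose value equals needle."""
    stack = []  # (indent, name) of the section headers enclosing the current line
    hits = []
    for line in report.splitlines():
        if not line.strip() or line.startswith("="):
            continue  # skip blank lines and the "====NVSMI LOG====" banner
        indent = len(line) - len(line.lstrip())
        # Leave any section that is indented at least as deep as this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        name, sep, value = line.strip().partition(" : ")
        name = name.strip()
        if not sep:
            stack.append((indent, name))  # a header such as "Clocks"
        elif value.strip() == needle:
            hits.append(" > ".join([n for _, n in stack] + [name]))
    return hits


SAMPLE = """\
GPU 00000000:01:00.0
    Fan Speed                             : Unknown Error
    Performance State                     : P5
    Temperature
        GPU Current Temp                  : 38 C
        GPU T.Limit Temp                  : Unknown Error
    Clocks
        Graphics                          : Unknown Error
"""

if __name__ == "__main__":
    for path in fields_with_value(SAMPLE):
        print(path)
```

On a live system the report text could instead come from subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout.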

I have tried the following (with no success):

  • Enable persistence mode
  • Enable ECC
  • Reduce GPU target temperature (GTT) to 65C
  • Reduce Power Limit to 300W
  • Remove all nvidia packages (which were version 535) and install version 525
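
For reference, the first four steps map onto nvidia-smi options. The sketch below is a reconstruction, assuming GPU index 0 and the values listed above; the function names tuning_commands and apply are hypothetical. It only prints the commands unless dry_run is set to False, and actually changing these settings needs root.

```python
import shlex
import subprocess


def tuning_commands(gpu_index=0, target_temp_c=65, power_limit_w=300):
    """Build the nvidia-smi invocations for the first four steps in the list."""
    gpu = str(gpu_index)
    return [
        ["nvidia-smi", "-i", gpu, "-pm", "1"],                  # persistence mode on
        ["nvidia-smi", "-i", gpu, "-e", "1"],                   # ECC on (applies after reboot)
        ["nvidia-smi", "-i", gpu, "-gtt", str(target_temp_c)],  # GPU target temperature
        ["nvidia-smi", "-i", gpu, "-pl", str(power_limit_w)],   # power limit in watts
    ]


def apply(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply(tuning_commands())
```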

Bug report is here:
nvidia-bug-report.log.gz (444.0 KB)

Any help is much appreciated!

==============NVSMI LOG==============

Timestamp                                 : Mon Aug 21 16:03:23 2023
Driver Version                            : 525.125.06
CUDA Version                              : 12.0

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 4090
    Product Brand                         : GeForce
    Product Architecture                  : Ada Lovelace
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-afbdf6da-51e0-f003-e8c7-bb0361c8984e
    Minor Number                          : 0
    VBIOS Version                         : 95.02.18.80.B1
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : N/A
    GPU Part Number                       : 2684-300-A1
    Module ID                             : 1
    Inforom Version
        Image Version                     : G002.0000.00.03
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x268410DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x367519DA
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 68000 KB/s
        Rx Throughput                     : 615000 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : Unknown Error
    Performance State                     : P5
    Clocks Throttle Reasons               : Unknown Error
    FB Memory Usage
        Total                             : 23028 MiB
        Reserved                          : 337 MiB
        Used                              : 2021 MiB
        Free                              : 20668 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 5 MiB
        Free                              : 251 MiB
    Compute Mode                          : Exclusive_Process
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 38 C
        GPU T.Limit Temp                  : Unknown Error
        GPU Shutdown T.Limit Temp         : N/A
        GPU Slowdown T.Limit Temp         : N/A
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 65 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 49.86 W
        Power Limit                       : 450.00 W
        Default Power Limit               : 450.00 W
        Enforced Power Limit              : 450.00 W
        Min Power Limit                   : 150.00 W
        Max Power Limit                   : 495.00 W
    Clocks
        Graphics                          : Unknown Error
        SM                                : Unknown Error
        Memory                            : Unknown Error
        Video                             : Unknown Error
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 3120 MHz
        SM                                : 3120 MHz
        Memory                            : 10501 MHz
        Video                             : 2415 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 101360
            Type                          : C
            Name                          : python
            Used GPU Memory               : 2018 MiB

I have a similar issue, with Unknown Error and N/A for some power readings. I have a T600 GPU and use NVIDIA driver 535.98.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA T600 Laptop GPU         Off | 00000000:01:00.0 Off |                  N/A |
| N/A   61C    P0              N/A / ERR! |      5MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1349      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

sudo nvidia-smi -q -d POWER

==============NVSMI LOG==============

Timestamp                                 : Fri Aug 25 13:30:49 2023
Driver Version                            : 535.98
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    GPU Power Readings
        Power Draw                        : N/A
        Current Power Limit               : Unknown Error
        Requested Power Limit             : Unknown Error
        Default Power Limit               : 5001.00 W
        Min Power Limit                   : 0.00 W
        Max Power Limit                   : 5001.00 W
    Power Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A


For most users this is probably not an important error, but I need energy readings for some applications, and with this issue I cannot get them right now.
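
If the goal is to log power from a script, one option is nvidia-smi's CSV query mode. Below is a minimal sketch, assuming the power.draw and power.limit query fields; the helper names parse_power_row and read_power are hypothetical. Values the driver cannot report (for example N/A or Unknown Error) come back as None instead of raising, so a logging loop can keep running and simply record the gap.

```python
import subprocess

# Query one CSV row per GPU: index, current draw and current limit, both in watts.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


def to_float(text):
    """Convert a cell to float; anything non-numeric becomes None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_power_row(row):
    """Turn one CSV row into (gpu_index, draw_w, limit_w)."""
    index, draw, limit = [cell.strip() for cell in row.split(",")]
    return int(index), to_float(draw), to_float(limit)


def read_power():
    """Run nvidia-smi once and parse every row it prints."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_power_row(row) for row in out.splitlines() if row.strip()]
```

Calling read_power() on an affected machine should then return None for the fields that nvidia-smi cannot read, which is easier to handle downstream than parsing the error strings.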

As I went through every component of my setup I was finally able to resolve the issue – unfortunately I cannot pinpoint the cause exactly, but:

I re-installed Ubuntu 22.04 LTS, and everything has now been working fine for several days!
The drivers are freshly installed, with fresh configurations. The one thing I did differently: I chose nvidia-535-server instead of nvidia-535 (I wasn't aware of the *-server packages until the re-installation). Everything else is the same, just fresh.

The issue is back, unfortunately, with the same symptoms as described above.

Is it possible that plugging a monitor into the GPU can cause this kind of issue?

I have now removed all nvidia packages with apt purge and installed nvidia-headless-535-server and nvidia-utils-535-server.