Our RTX 4090 enters an error state after a few hours of PyTorch processing on a headless Ubuntu 22.04 LTS machine. The job is not killed: it keeps its GPU memory allocated and generates CPU load, but puts no load on the GPU. The error state only clears after rebooting the system.
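For anyone who wants to catch the moment the card enters this state, the pattern above (memory still allocated, utilization stuck at 0%) can be polled for. This is only a minimal sketch and not something from our setup: the function names, the 5-sample threshold and the 60-second interval are assumptions, and it relies on the CSV output of nvidia-smi's --query-gpu option.

```python
import subprocess
import time

# Ask nvidia-smi for used memory (MiB) and GPU utilization (%) as bare numbers.
QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_sample(line):
    """Parse one CSV row such as '2021, 0' into (memory_mib, util_percent).

    Returns None if the row is malformed or a field is not numeric,
    for example when the driver reports an error string instead of a value.
    """
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2:
        return None
    try:
        return int(parts[0]), int(parts[1])
    except ValueError:
        return None


def looks_stalled(samples, min_samples=5):
    """True if each of the last `min_samples` samples either could not be
    parsed or shows allocated memory together with 0% utilization."""
    if len(samples) < min_samples:
        return False
    recent = samples[-min_samples:]
    return all(s is None or (s[0] > 0 and s[1] == 0) for s in recent)


def watch(interval_s=60):
    """Poll nvidia-smi until the stall pattern appears (hypothetical helper)."""
    samples = []
    while True:
        out = subprocess.run(QUERY, capture_output=True, text=True).stdout
        lines = out.strip().splitlines()
        samples.append(parse_sample(lines[0]) if lines else None)
        if looks_stalled(samples):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, "GPU looks stalled, last sample:", samples[-1])
            return
        time.sleep(interval_s)
```

Calling watch() in a separate shell next to the training job would print a timestamp that can be matched against the kernel log. A job that legitimately idles between batches matches the same pattern, so the threshold and interval need tuning to the workload.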
nvidia-smi shows ERR! in the Fan column:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                    0 |
|ERR!   38C    P5    49W / 450W |   2021MiB / 23028MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    101360      C   python                           2018MiB |
+-----------------------------------------------------------------------------+
nvidia-smi -q shows "Unknown Error" for Fan Speed, GPU T.Limit Temp, Clocks Throttle Reasons, and all four entries under Clocks. The full dump is at the end of this post.
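To pull those fields out of a saved dump without reading it by eye, something along these lines should work. It is a rough sketch under assumptions: the function name is made up, it expects the "Key : Value" layout that nvidia-smi -q prints, and it uses indentation (when present) to name the enclosing section.

```python
def unknown_error_fields(report):
    """Return the fields of an `nvidia-smi -q` report whose value is
    'Unknown Error'.

    If the report is indented, each field is prefixed with its enclosing
    section headers ('GPU ... / Clocks / Graphics'). If indentation was
    lost, only the bare field name is returned.
    """
    fields = []
    stack = []  # (indent, header) pairs for the sections enclosing this line
    for raw in report.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        # Leave every section that is not strictly shallower than this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        key, sep, value = raw.strip().partition(" : ")
        if not sep:
            # A line without a value is a section header such as 'Clocks'.
            stack.append((indent, key))
            continue
        if value.strip() == "Unknown Error":
            path = [header for _, header in stack] + [key.strip()]
            fields.append(" / ".join(path))
    return fields
```

Feeding it the text of nvidia-smi -q (for example via subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout) returns one entry per affected field, which makes it easy to compare a healthy report with one taken in the error state.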
I have tried the following, none of which helped:
- Enabled persistence mode
- Enabled ECC
- Reduced the GPU target temperature (GTT) to 65 C
- Reduced the power limit to 300 W
- Removed all nvidia packages (which were version 535) and installed version 525
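For reference, the first four settings in the list correspond to nvidia-smi options. The sketch below only builds and prints the command lines; the helper name and the GPU index 0 are assumptions, the package downgrade is left out, and the commands typically need root.

```python
import shlex


def mitigation_commands(gpu_index=0, target_temp_c=65, power_limit_w=300):
    """Build the nvidia-smi invocations for the settings listed above.

    The commands are returned as argument lists and are not executed here.
    """
    base = ["nvidia-smi", "-i", str(gpu_index)]
    return [
        base + ["-pm", "1"],                  # enable persistence mode
        base + ["-e", "1"],                   # enable ECC (pending until reboot)
        base + ["-gtt", str(target_temp_c)],  # GPU target temperature in C
        base + ["-pl", str(power_limit_w)],   # power limit in W
    ]


# Print the commands so they can be reviewed before running them by hand.
for cmd in mitigation_commands():
    print("sudo " + shlex.join(cmd))
```

Keeping the values as parameters makes it easy to try other limits while testing whether any of them changes how long the card survives.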
Bug report is here:
nvidia-bug-report.log.gz (444.0 KB)
Any help is much appreciated!
==============NVSMI LOG==============
Timestamp : Mon Aug 21 16:03:23 2023
Driver Version : 525.125.06
CUDA Version : 12.0
Attached GPUs : 1
GPU 00000000:01:00.0
    Product Name : NVIDIA GeForce RTX 4090
    Product Brand : GeForce
    Product Architecture : Ada Lovelace
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-afbdf6da-51e0-f003-e8c7-bb0361c8984e
    Minor Number : 0
    VBIOS Version : 95.02.18.80.B1
    MultiGPU Board : No
    Board ID : 0x100
    Board Part Number : N/A
    GPU Part Number : 2684-300-A1
    Module ID : 1
    Inforom Version
        Image Version : G002.0000.00.03
        OEM Object : 2.0
        ECC Object : 6.16
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x01
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x268410DE
        Bus Id : 00000000:01:00.0
        Sub System Id : 0x367519DA
        GPU Link Info
            PCIe Generation
                Max : 4
                Current : 1
                Device Current : 1
                Device Max : 4
                Host Max : 5
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 68000 KB/s
        Rx Throughput : 615000 KB/s
        Atomic Caps Inbound : N/A
        Atomic Caps Outbound : N/A
    Fan Speed : Unknown Error
    Performance State : P5
    Clocks Throttle Reasons : Unknown Error
    FB Memory Usage
        Total : 23028 MiB
        Reserved : 337 MiB
        Used : 2021 MiB
        Free : 20668 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 5 MiB
        Free : 251 MiB
    Compute Mode : Exclusive_Process
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable : 0
            SRAM Uncorrectable : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
        Aggregate
            SRAM Correctable : 0
            SRAM Uncorrectable : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows
        Correctable Error : 0
        Uncorrectable Error : 0
        Pending : No
        Remapping Failure Occurred : No
        Bank Remap Availability Histogram
            Max : 192 bank(s)
            High : 0 bank(s)
            Partial : 0 bank(s)
            Low : 0 bank(s)
            None : 0 bank(s)
    Temperature
        GPU Current Temp : 38 C
        GPU T.Limit Temp : Unknown Error
        GPU Shutdown T.Limit Temp : N/A
        GPU Slowdown T.Limit Temp : N/A
        GPU Max Operating T.Limit Temp : 0 C
        GPU Target Temperature : 65 C
        Memory Current Temp : N/A
        Memory Max Operating T.Limit Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 49.86 W
        Power Limit : 450.00 W
        Default Power Limit : 450.00 W
        Enforced Power Limit : 450.00 W
        Min Power Limit : 150.00 W
        Max Power Limit : 495.00 W
    Clocks
        Graphics : Unknown Error
        SM : Unknown Error
        Memory : Unknown Error
        Video : Unknown Error
    Applications Clocks
        Graphics : N/A
        Memory : N/A
    Default Applications Clocks
        Graphics : N/A
        Memory : N/A
    Deferred Clocks
        Memory : N/A
    Max Clocks
        Graphics : 3120 MHz
        SM : 3120 MHz
        Memory : 10501 MHz
        Video : 2415 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : N/A
    Fabric
        State : N/A
        Status : N/A
    Processes
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 101360
            Type : C
            Name : python
            Used GPU Memory : 2018 MiB