This seems to be caused by enabling Persistence Mode, which I had never turned on before.
Kernel: 3.10.0-1160.83.1.el7.x86_64
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:4F:00.0 Off |                  Off |
| 30%   29C    P8    22W / 300W |      3MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:52:00.0 Off |                  Off |
| 30%   30C    P8    28W / 300W |      3MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    On   | 00000000:56:00.0 Off |                  Off |
| 30%   30C    P8    21W / 300W |      3MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    On   | 00000000:57:00.0 Off |                  Off |
| 30%   30C    P8    18W / 300W |      3MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A6000    On   | 00000000:CE:00.0 Off |                  Off |
|ERR!   31C    P8     6W / 300W |      3MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000    On   | 00000000:D1:00.0 Off |                  Off |
|ERR!   32C    P8     9W / 300W |      3MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A6000    On   | 00000000:D5:00.0 Off |                  Off |
|ERR!   34C    P8     5W / 300W |     13MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A6000    On   | 00000000:D6:00.0 Off |                  Off |
| 30%   32C    P8    16W / 300W |      3MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    6   N/A  N/A     10384      C   python                             10MiB |
+-----------------------------------------------------------------------------+
==============NVSMI LOG==============
Timestamp : Thu Jul 13 10:41:44 2023
Driver Version : 510.47.03
CUDA Version : 11.6
Attached GPUs : 8
GPU 00000000:CE:00.0
    Product Name : NVIDIA RTX A6000
    Product Brand : NVIDIA RTX
    Product Architecture : Ampere
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Enabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    GPU Part Number : 900-5G133-2200-000
    Module ID : 0
    Inforom Version
        Image Version : G133.0500.00.05
        OEM Object : 2.0
        ECC Object : 6.16
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0xCE
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x223010DE
        Bus Id : 00000000:CE:00.0
        Sub System Id : 0x145910DE
        GPU Link Info
            PCIe Generation
                Max : 4
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : Unknown Error
    Performance State : P8
    Clocks Throttle Reasons : Unknown Error
    FB Memory Usage
        Total : 49140 MiB
        Reserved : 454 MiB
        Used : 3 MiB
        Free : 48681 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 3 MiB
        Free : 253 MiB
    Compute Mode : Default
    Ecc Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows
        Correctable Error : 0
        Uncorrectable Error : 0
        Pending : No
        Remapping Failure Occurred : No
        Bank Remap Availability Histogram
            Max : 192 bank(s)
            High : 0 bank(s)
            Partial : 0 bank(s)
            Low : 0 bank(s)
            None : 0 bank(s)
    Temperature
        GPU Current Temp : 31 C
        GPU Shutdown Temp : N/A
        GPU Slowdown Temp : N/A
        GPU Max Operating Temp : 93 C
        GPU Target Temperature : 84 C
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 6.53 W
        Power Limit : 300.00 W
        Default Power Limit : 300.00 W
        Enforced Power Limit : 300.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 300.00 W
    Clocks
        Graphics : Unknown Error
        SM : Unknown Error
        Memory : Unknown Error
        Video : Unknown Error
    Applications Clocks
        Graphics : 1800 MHz
        Memory : 8001 MHz
    Default Applications Clocks
        Graphics : 1800 MHz
        Memory : 8001 MHz
    Max Clocks
        Graphics : 2100 MHz
        SM : 2100 MHz
        Memory : 8001 MHz
        Video : 1950 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : N/A
    Processes : None
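
In the query output above, the affected GPU reports "Unknown Error" for Fan Speed, Clocks Throttle Reasons, and all four current clocks. For anyone who wants to check the other GPUs the same way, here is a minimal Python sketch that scans saved `nvidia-smi -q` text and lists the fields reporting "Unknown Error" per GPU. The function name `find_unknown_errors` and the short sample string are my own illustration, not part of any NVIDIA tool, and the parsing assumes the "key : value" layout shown above.

```python
import re
from typing import Dict, List

# Matches a per-GPU header line such as "GPU 00000000:CE:00.0".
GPU_HEADER = re.compile(
    r"^GPU\s+([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])$"
)


def find_unknown_errors(report: str) -> Dict[str, List[str]]:
    """Map each GPU bus ID to the field names whose value is 'Unknown Error'."""
    errors: Dict[str, List[str]] = {}
    current_gpu = None
    for raw in report.splitlines():
        line = raw.strip()  # indentation does not matter for this scan
        header = GPU_HEADER.match(line)
        if header:
            current_gpu = header.group(1)
            errors.setdefault(current_gpu, [])
            continue
        if current_gpu is None or ":" not in line:
            continue  # global header fields or section titles without a value
        key, _, value = line.partition(":")
        if value.strip() == "Unknown Error":
            errors[current_gpu].append(key.strip())
    return errors


# Hypothetical excerpt in the same layout as the report above.
sample = """\
GPU 00000000:CE:00.0
Persistence Mode : Enabled
Fan Speed : Unknown Error
Performance State : P8
Clocks Throttle Reasons : Unknown Error
"""

print(find_unknown_errors(sample))
```

Field names such as Graphics repeat across sections, so this sketch reports only the bare key; tracking the enclosing section would need the indentation of the real output.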
log/messages
[Wed Jul 12 19:11:48 2023] NVRM: GPU at PCI:0000:d5:00: GPU-121a7f29-f0f6-3422-6fd9-2ff5f040ce6c
[Wed Jul 12 19:11:48 2023] NVRM: Xid (PCI:0000:d5:00): 62, pid=9662, 0000(0000) 00000000 00000000
[Wed Jul 12 23:33:35 2023] NVRM: GPU at PCI:0000:ce:00: GPU-6ffacca6-d934-7e0d-d55d-c24454c1a0b4
[Wed Jul 12 23:33:35 2023] NVRM: Xid (PCI:0000:ce:00): 62, pid=9656, 0000(0000) 00000000 00000000
[Thu Jul 13 00:34:11 2023] NVRM: GPU at PCI:0000:d1:00: GPU-305cbb31-1ca5-f7e1-f9ea-8c163d624a5a
[Thu Jul 13 00:34:11 2023] NVRM: Xid (PCI:0000:d1:00): 62, pid=9659, 0000(0000) 00000000 00000000
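
The three Xid 62 entries above are on PCI addresses d5:00, ce:00 and d1:00, which are the same three GPUs that show ERR! in the fan column. As far as I know, NVIDIA's Xid documentation lists Xid 62 as an internal micro-controller halt. To make that correlation easier to repeat on a longer log, here is a minimal Python sketch that groups Xid events by PCI address. The regular expression is an assumption based only on the line format shown above, and `collect_xids` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed format: "NVRM: Xid (PCI:0000:d5:00): 62, pid=9662, ..."
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), pid=(?P<pid>[^,]+)"
)


def collect_xids(log_text):
    """Return {pci_address: [(xid_code, pid_string), ...]} for every Xid line."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            pci = match.group("pci").lower()
            events[pci].append((int(match.group("xid")), match.group("pid")))
    return dict(events)


# The three Xid lines from the log above.
log = """\
[Wed Jul 12 19:11:48 2023] NVRM: Xid (PCI:0000:d5:00): 62, pid=9662, 0000(0000) 00000000 00000000
[Wed Jul 12 23:33:35 2023] NVRM: Xid (PCI:0000:ce:00): 62, pid=9656, 0000(0000) 00000000 00000000
[Thu Jul 13 00:34:11 2023] NVRM: Xid (PCI:0000:d1:00): 62, pid=9659, 0000(0000) 00000000 00000000
"""

for pci, hits in collect_xids(log).items():
    print(pci, hits)
```

The pid is kept as a string on purpose, since I am not sure every driver version prints a plain number there.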