I’m using a server with four A100 PCIe 40GB GPUs, but when I run nvidia-smi
I can see only three of them.
nvidia-smi
output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:01:00.0 Off |                    0 |
| N/A   32C    P0              33W / 250W |     16MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 00000000:41:00.0 Off |                    0 |
| N/A   33C    P0              37W / 250W |     16MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          On  | 00000000:81:00.0 Off |                    0 |
| N/A   33C    P0              37W / 250W |    447MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
lspci | grep -i nvidia
output, showing all four GPUs:
01:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
41:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
81:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
c1:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
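To make the mismatch explicit, the two listings can be compared by PCI bus ID. This is only a minimal sketch, assuming the output of both commands has been saved as text; the helper name missing_gpus and the sample strings (trimmed copies of the output above) are illustrative and not part of any NVIDIA tool.

```python
import re

def missing_gpus(lspci_text: str, smi_text: str) -> list[str]:
    """Return PCI bus IDs that lspci lists but nvidia-smi does not."""
    # lspci lines start with a short bus ID such as "c1:00.0".
    on_bus = re.findall(
        r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s", lspci_text, flags=re.M | re.I
    )
    # nvidia-smi prints a domain-prefixed ID such as "00000000:81:00.0".
    in_driver = re.findall(
        r"[0-9a-f]{8}:([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])", smi_text, flags=re.I
    )
    # Compare case-insensitively, since the two tools may differ in hex case.
    return sorted({b.lower() for b in on_bus} - {b.lower() for b in in_driver})

# Trimmed sample input taken from the listings in this post.
lspci_sample = (
    "01:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)\n"
    "41:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)\n"
    "81:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)\n"
    "c1:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)\n"
)
smi_sample = (
    "| 0 NVIDIA A100-PCIE-40GB On | 00000000:01:00.0 Off | 0 |\n"
    "| 1 NVIDIA A100-PCIE-40GB On | 00000000:41:00.0 Off | 0 |\n"
    "| 2 NVIDIA A100-PCIE-40GB On | 00000000:81:00.0 Off | 0 |\n"
)

print(missing_gpus(lspci_sample, smi_sample))  # ['c1:00.0']
```

So the card at c1:00.0 is visible on the PCIe bus but is not being brought up by the driver.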
I also ran sudo dmesg | grep NVRM
and got this output:
[ 7.790211] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.183.01 Sun May 12 19:39:15 UTC 2024
[ 10.353601] NVRM: Xid (PCI:0000:c1:00): 120, pid='<unknown>', name=<unknown>, GSP task exception: load access fault (cause:0x5) @ pc:0x5161068, task:1
[ 10.353615] NVRM: Reported by libos task:0 v2.0 [0] @ ts:1722410704
[ 10.353618] NVRM: RISC-V CSR State:
[ 10.353621] NVRM: mstatus:0x000000001e000000 mscratch:0x0000000000000000 mie:0x0000000000000880 mip:0x0000000000000000
[ 10.353623] NVRM: mepc:0x0000000005161068 mbadaddr:0x0004220000040220 mcause:0x0000000000000005
[ 10.353626] NVRM: RISC-V GPR State:
[ 10.353629] NVRM: ra:0x0000000005161050 sp:0x00000000056b2980 gp:0x0000000000000000 tp:0x0000000000000000
[ 10.353631] NVRM: a0:0x0004220000040200 a1:0x00000000000002c4 a2:0x80000000004b33b0 a3:0x0000000000000004
[ 10.353634] NVRM: a4:0x17e73b689ba52e00 a5:0x80000000004b33b0 a6:0x0000000000000001 a7:0x0000000000000002
[ 10.353637] NVRM: s0:0x00000000056b29f0 s1:0x0000000004188000 s2:0x0000000000000001 s3:0x0000000000000000
[ 10.353639] NVRM: s4:0x80000000001ab4f0 s5:0x0000000004188000 s6:0x8000000000178190 s7:0x0000000000000065
[ 10.353642] NVRM: s8:0x0000000004188000 s9:0x0000000000000008 s10:0x0000000000000007 s11:0x8000000000186510
[ 10.353644] NVRM: t0:0x0000000000000012 t1:0x0000000000000001 t2:0x0000000000000200 t3:0x0000000000000000
[ 10.353646] NVRM: t4:0x0000000000000006 t5:0x17e73b689ba52e00 t6:0x0000000000000000
[ 10.353649] NVRM: Stack Trace:
[ 10.353651] NVRM: 0x0000000005161068
[ 10.353653] NVRM: 0x000000000515fbfc
[ 10.353655] NVRM: 0x000000000514b8f0
[ 10.353657] NVRM: 0x000000000514bafc
[ 10.353659] NVRM: 0x000000000510ab98
[ 10.353661] NVRM: 0x000000000510b7e4
[ 10.353663] NVRM: 0x00000000050e1fec
[ 10.353665] NVRM: 0x00000000050adc2c
[ 10.353667] NVRM: 0x00000000050d50c0
[ 10.353669] NVRM: 0x0000000004babfe8
[ 10.353671] NVRM: 0x0000000004b7f460
[ 10.353672] NVRM: 0x0000000005697178
[ 10.353674] NVRM: PC Trace:
[ 10.353677] NVRM: 0x000000000569c064 0x000000000569e490 0x000000000569ce08 0x0000000004021464 0x000000000569cc2c
[ 10.353679] NVRM: 0x000000000569c340 0x0000000004021464 0x000000000569c284 0x000000000569cc1c 0x000000000569c10c
[ 10.353682] NVRM: 0x000000000569cc00 0x000000000569e7ac 0x000000000569ca88 0x000000000569c1c0 0x000000000569ca50
[ 10.353684] NVRM: 0x000000000569c340 0x0000000004021464 0x000000000569c284 0x000000000569caa8 0x000000000569c1c0
[ 10.353687] NVRM: 0x000000000569ca50 0x000000000569c340 0x0000000004021464 0x000000000569c284 0x000000000569caa8
[ 10.353689] NVRM: 0x000000000569c1c0 0x000000000569ca50 0x000000000569c340 0x0000000004021464 0x000000000569c284
[ 10.353692] NVRM: 0x000000000569caa8 0x000000000569c1c0 0x000000000569ca50 0x000000000569c340 0x0000000004021464
[ 10.353694] NVRM: 0x000000000569c284 0x000000000569caa8 0x000000000569c1c0 0x000000000569ca50 0x000000000569c340
[ 10.353696] NVRM: 0x0000000004021464 0x000000000569c284 0x000000000569caa8 0x000000000569c1c0 0x000000000569ca50
[ 10.353699] NVRM: 0x000000000569c340 0x0000000004021464 0x000000000569c284 0x000000000569caa8 0x000000000569c1c0
[ 10.353701] NVRM: 0x000000000569ca50 0x000000000569c340 0x0000000004021464 0x000000000569c284 0x000000000569caa8
[ 10.353703] NVRM: 0x000000000569c1c0 0x000000000569ca50
[ 10.353706] NVRM: External I/O Register State:
[ 10.353708] NVRM: 0x00111360:0x00000000 0x00111364:0x00000000 0x00111368:0x00000000 0x0011136c:0x00000000
[ 10.353711] NVRM: 0x001112b4:0x00040000 0x001112b8:0x00000000 0x001112bc:0x00000000 0x00111344:0x11100000
[ 10.353714] NVRM: 0x00110008:0x00000010 0x0011010c:0x00000000 0x00110118:0x00011122 0x00110110:0x09f611e4
[ 10.353716] NVRM: 0x00110128:0x00000000 0x00110114:0x0000c160 0x0011011c:0x000001a0
[ 10.353719] NVRM: ------------[ end crash report ]------------
[ 15.164719] NVRM: Xid (PCI:0000:c1:00): 119, pid=1501, name=nvidia-persiste, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 4097 (GSP_INIT_DONE) (0x0 0x0).
[ 15.164732] NVRM: GPU0 GSP RPC buffer contains function 4098 (GSP_RUN_CPU_SEQUENCER) and data 0x00000000000001ea 0x0000000000003fe2.
[ 15.164737] NVRM: GPU0 RPC history (CPU -> GSP):
[ 15.164740] NVRM: entry function data0 data1 ts_start ts_end duration actively_polling
[ 15.164743] NVRM: 0 73 SET_REGISTRY 0x0000000000000000 0x0000000000000000 0x00061e85fe198311 0x0000000000000000 y
[ 15.164749] NVRM: -1 72 GSP_SET_SYSTEM_INFO 0x0000000000000000 0x0000000000000000 0x00061e85fe198308 0x0000000000000000
[ 15.164754] NVRM: GPU0 RPC event history (CPU <- GSP):
[ 15.164756] NVRM: entry function data0 data1 ts_start ts_end duration during_incomplete_rpc
[ 15.164759] NVRM: 0 4098 GSP_RUN_CPU_SEQUENCER 0x00000000000001ea 0x0000000000003fe2 0x00061e85fe2d4ebd 0x00061e85fe2d6c6d 7600us y
[ 15.198825] NVRM: Xid (PCI:0000:c1:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
[ 15.201030] NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x62:0x40:2404)
[ 15.202160] NVRM: GPU 0000:c1:00.0: rm_init_adapter failed, device minor number 0
[ 15.630474] NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x62:0x40:2404)
[ 15.635379] NVRM: GPU 0000:c1:00.0: rm_init_adapter failed, device minor number 0
[ 66.554711] NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x62:0x40:2404)
The last two lines then keep repeating forever.
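Since the full log is long, here is a minimal sketch that pulls out just the Xid codes per PCI device. It assumes the "NVRM: Xid (PCI:...): <code>," line format seen above; the helper name xids_by_device is made up for illustration, and the sample lines are copied from the log with their tails shortened.

```python
import re
from collections import defaultdict

# Matches e.g. "NVRM: Xid (PCI:0000:c1:00): 120," and captures device and code.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:]+)\): (\d+),")

def xids_by_device(dmesg_text: str) -> dict[str, list[int]]:
    """Group Xid codes by PCI device, in the order they were logged."""
    events = defaultdict(list)
    for match in XID_RE.finditer(dmesg_text):
        events[match.group(1)].append(int(match.group(2)))
    return dict(events)

# Shortened copies of the three Xid lines from the dmesg output above.
dmesg_sample = (
    "[ 10.353601] NVRM: Xid (PCI:0000:c1:00): 120, pid='<unknown>', GSP task exception\n"
    "[ 15.164719] NVRM: Xid (PCI:0000:c1:00): 119, pid=1501, Timeout after 6s\n"
    "[ 15.198825] NVRM: Xid (PCI:0000:c1:00): 140, pid='<unknown>', An uncorrectable ECC error detected\n"
    "[ 15.201030] NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x62:0x40:2404)\n"
)

print(xids_by_device(dmesg_sample))  # {'0000:c1:00': [120, 119, 140]}
```

All three Xid events (120, 119, 140) are reported for 0000:c1:00, which is the same GPU that is missing from nvidia-smi.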
I have also tried uninstalling everything NVIDIA- and CUDA-related from the system and reinstalling it, but to no avail.
I’d appreciate any ideas. Thanks!