GPU not detected by nvidia-smi

I’m using a server with four A100 PCIe 40GB GPUs, but nvidia-smi lists only three of them.
nvidia-smi output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:01:00.0 Off |                    0 |
| N/A   32C    P0              33W / 250W |     16MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 00000000:41:00.0 Off |                    0 |
| N/A   33C    P0              37W / 250W |     16MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          On  | 00000000:81:00.0 Off |                    0 |
| N/A   33C    P0              37W / 250W |    447MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

The output of lspci | grep -i nvidia, however, shows all four GPUs:

01:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
41:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
81:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
c1:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
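
To make the mismatch explicit, the two listings can be compared by bus ID. The following is only a minimal sketch: the helper names (short_bus_id, missing_from_driver, run_live) are made up for illustration, and it assumes all GPUs sit in a single PCI domain, as they do here (domain 0000), so the domain prefix can be dropped. It expects the nvidia-smi side to come from nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader, which prints one bus ID per line.

```python
import subprocess


def short_bus_id(bus_id):
    """Reduce '00000000:C1:00.0' or 'c1:00.0' to the lowercase 'c1:00.0' form."""
    # Keep only the last two colon-separated fields (bus and device.function).
    # Assumption: a single PCI domain, so dropping the domain loses nothing.
    return ":".join(bus_id.strip().lower().split(":")[-2:])


def missing_from_driver(lspci_text, smi_text):
    """Return NVIDIA bus IDs that lspci lists but nvidia-smi does not."""
    on_bus = {
        short_bus_id(line.split()[0])
        for line in lspci_text.splitlines()
        if "nvidia" in line.lower()
    }
    in_driver = {
        short_bus_id(line)
        for line in smi_text.splitlines()
        if line.strip()
    }
    return sorted(on_bus - in_driver)


def run_live():
    """Hypothetical wrapper: collect both listings on the affected machine."""
    lspci_text = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    ).stdout
    smi_text = subprocess.run(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return missing_from_driver(lspci_text, smi_text)
```

Fed the two outputs above, missing_from_driver should report c1:00.0 as the device that is present on the bus but not initialized by the driver.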

I also ran sudo dmesg | grep NVRM. Every error in the output refers to the GPU at c1:00.0, the one missing from nvidia-smi:

[    7.790211] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.183.01  Sun May 12 19:39:15 UTC 2024
[   10.353601] NVRM: Xid (PCI:0000:c1:00): 120, pid='<unknown>', name=<unknown>, GSP task exception: load access fault (cause:0x5) @ pc:0x5161068, task:1
[   10.353615] NVRM:     Reported by libos task:0 v2.0 [0] @ ts:1722410704
[   10.353618] NVRM:     RISC-V CSR State:
[   10.353621] NVRM:         mstatus:0x000000001e000000  mscratch:0x0000000000000000     mie:0x0000000000000880  mip:0x0000000000000000
[   10.353623] NVRM:            mepc:0x0000000005161068  mbadaddr:0x0004220000040220  mcause:0x0000000000000005
[   10.353626] NVRM:     RISC-V GPR State:
[   10.353629] NVRM:         ra:0x0000000005161050   sp:0x00000000056b2980   gp:0x0000000000000000   tp:0x0000000000000000
[   10.353631] NVRM:         a0:0x0004220000040200   a1:0x00000000000002c4   a2:0x80000000004b33b0   a3:0x0000000000000004
[   10.353634] NVRM:         a4:0x17e73b689ba52e00   a5:0x80000000004b33b0   a6:0x0000000000000001   a7:0x0000000000000002
[   10.353637] NVRM:         s0:0x00000000056b29f0   s1:0x0000000004188000   s2:0x0000000000000001   s3:0x0000000000000000
[   10.353639] NVRM:         s4:0x80000000001ab4f0   s5:0x0000000004188000   s6:0x8000000000178190   s7:0x0000000000000065
[   10.353642] NVRM:         s8:0x0000000004188000   s9:0x0000000000000008  s10:0x0000000000000007  s11:0x8000000000186510
[   10.353644] NVRM:         t0:0x0000000000000012   t1:0x0000000000000001   t2:0x0000000000000200   t3:0x0000000000000000
[   10.353646] NVRM:         t4:0x0000000000000006   t5:0x17e73b689ba52e00   t6:0x0000000000000000
[   10.353649] NVRM:     Stack Trace:
[   10.353651] NVRM:         0x0000000005161068
[   10.353653] NVRM:         0x000000000515fbfc
[   10.353655] NVRM:         0x000000000514b8f0
[   10.353657] NVRM:         0x000000000514bafc
[   10.353659] NVRM:         0x000000000510ab98
[   10.353661] NVRM:         0x000000000510b7e4
[   10.353663] NVRM:         0x00000000050e1fec
[   10.353665] NVRM:         0x00000000050adc2c
[   10.353667] NVRM:         0x00000000050d50c0
[   10.353669] NVRM:         0x0000000004babfe8
[   10.353671] NVRM:         0x0000000004b7f460
[   10.353672] NVRM:         0x0000000005697178
[   10.353674] NVRM:     PC Trace:
[   10.353677] NVRM:         0x000000000569c064  0x000000000569e490  0x000000000569ce08  0x0000000004021464  0x000000000569cc2c
[   10.353679] NVRM:         0x000000000569c340  0x0000000004021464  0x000000000569c284  0x000000000569cc1c  0x000000000569c10c
[   10.353682] NVRM:         0x000000000569cc00  0x000000000569e7ac  0x000000000569ca88  0x000000000569c1c0  0x000000000569ca50
[   10.353684] NVRM:         0x000000000569c340  0x0000000004021464  0x000000000569c284  0x000000000569caa8  0x000000000569c1c0
[   10.353687] NVRM:         0x000000000569ca50  0x000000000569c340  0x0000000004021464  0x000000000569c284  0x000000000569caa8
[   10.353689] NVRM:         0x000000000569c1c0  0x000000000569ca50  0x000000000569c340  0x0000000004021464  0x000000000569c284
[   10.353692] NVRM:         0x000000000569caa8  0x000000000569c1c0  0x000000000569ca50  0x000000000569c340  0x0000000004021464
[   10.353694] NVRM:         0x000000000569c284  0x000000000569caa8  0x000000000569c1c0  0x000000000569ca50  0x000000000569c340
[   10.353696] NVRM:         0x0000000004021464  0x000000000569c284  0x000000000569caa8  0x000000000569c1c0  0x000000000569ca50
[   10.353699] NVRM:         0x000000000569c340  0x0000000004021464  0x000000000569c284  0x000000000569caa8  0x000000000569c1c0
[   10.353701] NVRM:         0x000000000569ca50  0x000000000569c340  0x0000000004021464  0x000000000569c284  0x000000000569caa8
[   10.353703] NVRM:         0x000000000569c1c0  0x000000000569ca50
[   10.353706] NVRM:     External I/O Register State:
[   10.353708] NVRM:         0x00111360:0x00000000   0x00111364:0x00000000   0x00111368:0x00000000   0x0011136c:0x00000000
[   10.353711] NVRM:         0x001112b4:0x00040000   0x001112b8:0x00000000   0x001112bc:0x00000000   0x00111344:0x11100000
[   10.353714] NVRM:         0x00110008:0x00000010   0x0011010c:0x00000000   0x00110118:0x00011122   0x00110110:0x09f611e4
[   10.353716] NVRM:         0x00110128:0x00000000   0x00110114:0x0000c160   0x0011011c:0x000001a0
[   10.353719] NVRM: ------------[ end crash report ]------------
[   15.164719] NVRM: Xid (PCI:0000:c1:00): 119, pid=1501, name=nvidia-persiste, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 4097 (GSP_INIT_DONE) (0x0 0x0).
[   15.164732] NVRM: GPU0 GSP RPC buffer contains function 4098 (GSP_RUN_CPU_SEQUENCER) and data 0x00000000000001ea 0x0000000000003fe2.
[   15.164737] NVRM: GPU0 RPC history (CPU -> GSP):
[   15.164740] NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
[   15.164743] NVRM:      0    73   SET_REGISTRY          0x0000000000000000 0x0000000000000000 0x00061e85fe198311 0x0000000000000000          y
[   15.164749] NVRM:     -1    72   GSP_SET_SYSTEM_INFO   0x0000000000000000 0x0000000000000000 0x00061e85fe198308 0x0000000000000000           
[   15.164754] NVRM: GPU0 RPC event history (CPU <- GSP):
[   15.164756] NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
[   15.164759] NVRM:      0    4098 GSP_RUN_CPU_SEQUENCER 0x00000000000001ea 0x0000000000003fe2 0x00061e85fe2d4ebd 0x00061e85fe2d6c6d   7600us y
[   15.198825] NVRM: Xid (PCI:0000:c1:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
[   15.201030] NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x62:0x40:2404)
[   15.202160] NVRM: GPU 0000:c1:00.0: rm_init_adapter failed, device minor number 0
[   15.630474] NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x62:0x40:2404)
[   15.635379] NVRM: GPU 0000:c1:00.0: rm_init_adapter failed, device minor number 0
[   66.554711] NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x62:0x40:2404)

The RmInitAdapter failed and rm_init_adapter failed messages then keep repeating indefinitely.
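
Since the crash report is long, it may help to reduce the log to the Xid codes reported per device. The sketch below is an illustration only: xids_by_device and init_failures are hypothetical helper names, and the regular expressions are based solely on the line format visible in the dmesg output above, so other driver versions may need adjustments.

```python
import re
from collections import defaultdict

# Matches lines such as: NVRM: Xid (PCI:0000:c1:00): 120, pid=...
XID = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")

# Matches lines such as: NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (...)
INIT_FAIL = re.compile(r"NVRM: GPU ([0-9a-fA-F:.]+): RmInitAdapter failed!")


def xids_by_device(dmesg_text):
    """Map each PCI device to the distinct Xid codes it reported, in order."""
    found = defaultdict(list)
    for match in XID.finditer(dmesg_text):
        device = match.group(1).lower()
        code = int(match.group(2))
        if code not in found[device]:
            found[device].append(code)
    return dict(found)


def init_failures(dmesg_text):
    """Count RmInitAdapter failures per GPU."""
    counts = defaultdict(int)
    for match in INIT_FAIL.finditer(dmesg_text):
        counts[match.group(1).lower()] += 1
    return dict(counts)
```

For the log above this should yield Xid 120, 119 and 140 for 0000:c1:00 and three RmInitAdapter failures for 0000:c1:00.0 in the portion shown. The text could be collected with sudo dmesg and passed in as a string.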

I have also tried uninstalling everything NVIDIA- and CUDA-related from my system and reinstalling it, but to no avail.

I’d appreciate any ideas on how to fix this. Thanks!