XID Errors in DGX-1 (GPUs don't start)

Hi, how are you? I hope all is well.
I’m rebuilding some DGX1-V100 (32 GB) systems, and when I launch any GPU app (bash / docker / slurm), it doesn’t start working.

tail -f /var/log/syslog | grep "NVRM: Xid"

Mar 31 05:39:04 D1 kernel: [ 142.473073] NVRM: Xid (PCI:0000:05:00): 44, pid=5164, Ch 00000010, intr 00000000
Mar 31 05:39:34 D1 kernel: [ 172.991891] NVRM: Xid (PCI:0000:09:00): 62, pid=5164, 0a76(2b54) 00000000 00000000
Mar 31 05:39:34 D1 kernel: [ 172.992935] NVRM: Xid (PCI:0000:05:00): 45, pid=3384, Ch 00000000
Mar 31 05:39:34 D1 kernel: [ 172.993935] NVRM: Xid (PCI:0000:05:00): 45, pid=3384, Ch 00000001
Mar 31 05:39:34 D1 kernel: [ 172.994884] NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 00000002
Mar 31 05:39:34 D1 kernel: [ 172.995804] NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 00000003
Mar 31 05:39:34 D1 kernel: [ 172.996685] NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 00000004
Mar 31 05:39:34 D1 kernel: [ 172.997565] NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 00000005
Mar 31 05:39:34 D1 kernel: [ 172.998565] NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 00000006
Mar 31 05:39:34 D1 kernel: [ 172.999481] NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 00000007
Mar 31 05:39:34 D1 kernel: [ 173.000362] NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 00000008
Mar 31 05:39:34 D1 kernel: [ 173.001240] NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 00000009
Mar 31 05:39:34 D1 kernel: [ 173.002240] NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 0000000a
Mar 31 05:39:34 D1 kernel: [ 173.003164] NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 0000000b
Mar 31 05:39:34 D1 kernel: [ 173.004048] NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 0000000c
and it continues…
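
For anyone triaging a log like the one above, here is a minimal sketch that counts Xid events per GPU (PCI address) and per Xid code, so it is easier to see which device reported which error first. The function name `summarize_xids` and the regular expression are my own assumptions based only on the line format shown in this post; other driver versions may log Xid lines differently.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt in this thread:
#   ... NVRM: Xid (PCI:0000:05:00): 45, pid=3019, Ch 00000002
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def summarize_xids(lines):
    """Count Xid events per (PCI address, Xid code) pair."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# A few lines copied from the excerpt above, used as sample input.
sample = [
    "Mar 31 05:39:04 D1 kernel: [ 142.473073] NVRM: Xid (PCI:0000:05:00): 44, pid=5164, Ch 00000010, intr 00000000",
    "Mar 31 05:39:34 D1 kernel: [ 172.991891] NVRM: Xid (PCI:0000:09:00): 62, pid=5164, 0a76(2b54) 00000000 00000000",
    "Mar 31 05:39:34 D1 kernel: [ 172.992935] NVRM: Xid (PCI:0000:05:00): 45, pid=3384, Ch 00000000",
    "Mar 31 05:39:34 D1 kernel: [ 172.993935] NVRM: Xid (PCI:0000:05:00): 45, pid=3384, Ch 00000001",
]

for (pci, code), n in sorted(summarize_xids(sample).items()):
    print(f"{pci}  Xid {code}  x{n}")
```

To run it against the real log instead of the sample, pass an open file, for example `summarize_xids(open("/var/log/syslog"))`.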

Can anybody help me get the GPUs working? Of course, I have already read XID Errors :: GPU Deployment and Management Documentation.
I'm looking for a solution, so I uploaded a health dump for reference:

nvsm-health-D1-20220331063839.tar.xz (38.8 MB)

Another question: as I understand it, the DGX-1 works with up to 5 x 2 TB SSDs. Is any special brand required? I never saw one mentioned in the docs.

I look forward to your reply. Thanks in advance!
Regards!
Cristian

Hi @cristian8 !

This sounds like a perfect reason to take advantage of NVIDIA Enterprise Support! Can you contact them (see About the DGX User Forum / Note: this is not NVIDIA Enterprise Support for a link to the portal, phone numbers, etc.) so they can help you root cause the Xid errors?

Regarding SSDs, the DGX-1 ships with 5 of them (1 for OS, and 4 that are normally used as cache or scratch space mounted at /raid). When the product was still active, we supported adding additional (identical) SSDs to the empty slots, bearing in mind that this affects airflow, which could potentially impact GPU clock rates and performance. Since the DGX-1 was end-of-life’ed, NVIDIA has stopped selling the replacement SSDs.

Practically speaking, most any SATA SSD should work (the drives all connect to a backplane, then to the SAS controller in the system - you’re limited in aggregate bandwidth due to the 8x SAS lanes from the controller) although NVIDIA has not tested nor qualified drives other than what is in your system already. The lowest-risk path would be to purchase additional drives that match what’s in your system (in the 4x cache drives) to add additional capacity.

ScottE

Hi Scott and thank you very much for your answer!

I have 5 x DGX-1 here! If I take the Enterprise Support, I’ll be sure that all 5 will work (I have only tested 2 so far).

Do you know the costs?

Thanks in advance, regards!

Cristian