DGX-1 system is too slow (nvidia-smi, SSD disk speed). Please help.

Running nvidia-smi on our DGX-1 (V100) system is very slow.
A disk speed check is also noticeably slower than the same test on other machines of the same model.

DGX-1(1)
Write
dd if=/dev/sda2 bs=1024 count=100000 of=/root/test/test_file oflag=direct
100000+0 records in
100000+0 records out
102400000 bytes (102 MB, 98 MiB) copied, 5.17613 s, 19.8 MB/s

Read
dd if=/root/test/test_file of=/dev/null bs=1024
100000+0 records in
100000+0 records out
102400000 bytes (102 MB, 98 MiB) copied, 1.31338 s, 78.0 MB/s
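
For comparing these numbers with another DGX-1, here is a minimal sketch that recomputes the rate from a dd summary line (bytes copied divided by elapsed seconds). The function name `dd_throughput_mb_s` is made up for illustration, and the regex assumes the GNU dd summary format shown above. Note that with bs=1024 and oflag=direct every 1 KiB block is issued as its own direct write, so the write figure probably reflects per-request latency more than sequential bandwidth. It is still a fair comparison as long as the other machine runs the exact same command.

```python
import re

# Matches the summary line printed by GNU dd, for example:
# "102400000 bytes (102 MB, 98 MiB) copied, 5.17613 s, 19.8 MB/s"
DD_SUMMARY = re.compile(r"^(\d+) bytes .*copied, ([\d.]+) s,")


def dd_throughput_mb_s(summary_line):
    """Return throughput in MB/s (1 MB = 10**6 bytes) from a dd summary line."""
    match = DD_SUMMARY.match(summary_line.strip())
    if match is None:
        raise ValueError(f"not a dd summary line: {summary_line!r}")
    n_bytes = int(match.group(1))
    seconds = float(match.group(2))
    return n_bytes / seconds / 1e6


# Summary lines copied from the write and read tests in this post.
write_line = "102400000 bytes (102 MB, 98 MiB) copied, 5.17613 s, 19.8 MB/s"
read_line = "102400000 bytes (102 MB, 98 MiB) copied, 1.31338 s, 78.0 MB/s"

print(f"write: {dd_throughput_mb_s(write_line):.1f} MB/s")
print(f"read:  {dd_throughput_mb_s(read_line):.1f} MB/s")
```

Running the same function over the summary lines from the healthy machine gives a direct like-for-like ratio.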

nvsm show health
root@DGX-01:~# nvsm show health

Info

Timestamp: Tue Sep 7 18:36:14 2021 +0900
Version: 20.09.17

Checks

WARNING: <bound method ParseNvidiaSmi.run of <nvsmhealth.modules.nvidia_smi.ParseNvidiaSmi object at 0x7f160803d430>> timeout
Verify installed DIMM memory sticks… Healthy
Verify disk controllers… Healthy
Verify Ethernet controllers… Healthy
Verify installed GPU’s… Healthy
Verify installed InfiniBand controllers… Healthy
Verify PCIe switches… Healthy
Verify DIMM vendors… Healthy
Verify chassis fan presence… Healthy
Check FRU information for consistency… Healthy
Verify GPUs VBIOS version consistency… Unknown
No result from nvidia-smi tool
Quick health check of GPU using DCGM… Healthy
Number of logical CPU cores [80]… Healthy
Installed memory capacity [503.82GB]… Healthy
Verify Mellanox devices firmware version consistency… Healthy
Verify GPU’s identified using nvidia-smi… Healthy
Verify chassis power supply presence… Unhealthy
Checking output of ‘ipmitool sdr elist’ for expected chassis PSUs
AC input is lost, PSU4 Status has reading:
Presence detected, Power Supply AC lost
Verify installed MegaRAID disks… Healthy
Check for SSD health… Healthy
[sanity] MegaRAID storcli utility installed… Healthy
[sanity] DGX BaseOS support for storcli utility… Healthy
MegaRAID virtual disk state [ /c0/v0 ][ State: Optl ]… Healthy
MegaRAID virtual disk state [ /c0/v1 ][ State: Optl ]… Healthy
MegaRAID physical disk state [ /c0/v0 PD 0 ][ State: Onln ]… Healthy
MegaRAID physical disk state [ /c0/v1 PD 0 ][ State: Onln ]… Healthy
MegaRAID physical disk state [ /c0/v1 PD 1 ][ State: Onln ]… Healthy
MegaRAID physical disk state [ /c0/v1 PD 2 ][ State: Onln ]… Healthy
MegaRAID physical disk state [ /c0/v1 PD 3 ][ State: Onln ]… Healthy
MegaRAID PHY link speed [ /c0/p0 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p1 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p2 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p3 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p4 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p5 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p6 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p7 ][ Speed: No limit ]… Healthy
MegaRAID SAS address [ /c0/p0 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p1 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p2 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p3 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p4 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p5 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p6 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p7 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS port valid [ /c0/p0 ]… Healthy
MegaRAID SAS port valid [ /c0/p1 ]… Healthy
MegaRAID SAS port valid [ /c0/p2 ]… Healthy
MegaRAID SAS port valid [ /c0/p3 ]… Healthy
MegaRAID SAS port valid [ /c0/p4 ]… Healthy
MegaRAID SAS port valid [ /c0/p5 ]… Healthy
MegaRAID SAS port valid [ /c0/p6 ]… Healthy
MegaRAID SAS port valid [ /c0/p7 ]… Healthy
Ethernet link speed [0000:01:00.0][5GT/s]… Healthy
Ethernet link width [0000:01:00.0][x8]… Healthy
Ethernet link speed [0000:01:00.1][5GT/s]… Healthy
Ethernet link width [0000:01:00.1][x8]… Healthy
GPU link speed [0000:89:00.0][None]… Unknown
unknown pstate for the GPU[0000:89:00.0]
GPU link width [0000:89:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:89:00.0]
GPU link speed [0000:0a:00.0][None]… Unknown
unknown pstate for the GPU[0000:0a:00.0]
GPU link width [0000:0a:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:0a:00.0]
GPU link speed [0000:07:00.0][None]… Unknown
unknown pstate for the GPU[0000:07:00.0]
GPU link width [0000:07:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:07:00.0]
GPU link speed [0000:85:00.0][None]… Unknown
unknown pstate for the GPU[0000:85:00.0]
GPU link width [0000:85:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:85:00.0]
GPU link speed [0000:8a:00.0][None]… Unknown
unknown pstate for the GPU[0000:8a:00.0]
GPU link width [0000:8a:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:8a:00.0]
GPU link speed [0000:06:00.0][None]… Unknown
unknown pstate for the GPU[0000:06:00.0]
GPU link width [0000:06:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:06:00.0]
GPU link speed [0000:0b:00.0][None]… Unknown
unknown pstate for the GPU[0000:0b:00.0]
GPU link width [0000:0b:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:0b:00.0]
GPU link speed [0000:86:00.0][None]… Unknown
unknown pstate for the GPU[0000:86:00.0]
GPU link width [0000:86:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:86:00.0]
InfiniBand controller link speed [0000:05:00.0][8GT/s]… Healthy
InfiniBand controller link width [0000:05:00.0][x16]… Healthy
InfiniBand controller link speed [0000:84:00.0][8GT/s]… Healthy
InfiniBand controller link width [0000:84:00.0][x16]… Healthy
Check GPUDirect Topology information for consistency… Healthy
NVIDIA Driver Version [450.80.02]…
WARNING: Unhandled exception from task check_nvidia_smi_nvlink_status
WARNING: Run with --log-level=debug for exception details.
BMC Firmware Revision [3.30.30]…
Check BMC sensor thresholds… Unhealthy
PSU4 Input: Observed value “0.0” (Watts) below critical lower
threshold “0.0”
Checked 105 sensor values against BMC thresholds.
DGX BaseOS Version [5.0.2]…
BIOS Version [S2W_3A10]…
Linux kernel version [5.4.0-58-generic]…
System Uptime [up 7 hours, 57 minutes]…
DGX Serial Number [#############]…

System Summary

Product Name: DGX-1 with V100-32
Manufacturer: NVIDIA
DGX Serial Number: #############
Uptime: up 7 hours, 57 minutes
Motherboard:
BIOS Version: S2W_3A10
Serial Number: ###########
BMC:
Firmware Version: 3.30.30
IPMI Version: 2.0
GPU:
NVIDIA Driver Version: 450.80.02
Product Name(s): Unknown
VBIOS Version(s): Unknown
Software:
DGX BaseOS Version: 5.0.2
Kernel Version: 5.4.0-58-generic

Health Summary

58 out of 77 checks are healthy
2 out of 77 checks are unhealthy
17 out of 77 checks are unknown
0 out of 77 checks are informational
Overall system status is unhealthy
Problem detected.

  1. Please run ‘sudo nvsm dump health’
  2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/dashboard/
  3. Attach the log file from /tmp/nvsm-health-1631007374.json
    100.0% [=========================================]
    Status: unhealthy
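
Since the report is long, here is a rough sketch for pulling out only the checks that are not Healthy. It assumes each check line ends with an ellipsis (either "…" as rendered here or three dots) followed by Healthy, Unhealthy or Unknown, which is what the paste above looks like. The function name `non_healthy_checks` and the short `sample` are only for illustration; in practice you would feed it the saved output of `nvsm show health`.

```python
import re

# A check line looks like "<check name>… <Status>"; plain "..." is accepted too.
CHECK_LINE = re.compile(
    r"^(?P<name>.+?)(?:…|\.\.\.)\s*(?P<status>Healthy|Unhealthy|Unknown)\s*$"
)


def non_healthy_checks(report):
    """Return (check name, status) pairs for checks that are not Healthy."""
    results = []
    for line in report.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match and match.group("status") != "Healthy":
            results.append((match.group("name").strip(), match.group("status")))
    return results


# A few lines copied from the report above, including detail lines that
# carry no status and should be skipped.
sample = """\
Verify disk controllers… Healthy
Verify GPUs VBIOS version consistency… Unknown
No result from nvidia-smi tool
Verify chassis power supply presence… Unhealthy
AC input is lost, PSU4 Status has reading:
Presence detected, Power Supply AC lost
Check BMC sensor thresholds… Unhealthy
GPU link speed [0000:89:00.0][None]… Unknown
"""

for name, status in non_healthy_checks(sample):
    print(f"{status:9} {name}")
```

On this report the two Unhealthy entries both point at the power supplies, while the Unknown entries all depend on nvidia-smi output.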

Can you check the power supplies (PSUs)? This might explain the lower performance.

Thank you for your answer.

I ran ‘ipmitool sdr elist’.

Result:
Airflow | F2h | ok | 23.1 | 136 CFM
CPU_0 | A8h | ok | 3.1 | Presence detected
CPU_1 | A9h | ok | 3.2 | Presence detected
CPU_DIMM_VRHOT | B7h | ok | 7.9 |
DIMM_HOT | B8h | ok | 7.8 |
Event Log | F4h | ok | 6.1 | Log almost full
Fan_SYS0_1 | C0h | ok | 29.1 | 5200 RPM
Fan_SYS0_2 | C1h | ok | 29.1 | 5700 RPM
Fan_SYS1_1 | C2h | ok | 29.2 | 5200 RPM
Fan_SYS1_2 | C3h | ok | 29.2 | 5800 RPM
Fan_SYS2_1 | C4h | ok | 29.3 | 5000 RPM
Fan_SYS2_2 | C5h | ok | 29.3 | 5500 RPM
Fan_SYS3_1 | C6h | ok | 29.4 | 5000 RPM
Fan_SYS3_2 | C7h | ok | 29.4 | 5500 RPM
HDD0 | 40h | ok | 26.0 | Drive Present
HDD1 | 41h | ok | 26.1 | Drive Present
HDD10 | 4Ah | ok | 26.10 | Drive Present
HDD11 | 4Bh | ok | 26.11 |
HDD12 | 4Ch | ok | 26.12 |
HDD13 | 4Dh | ok | 26.13 |
HDD14 | 4Eh | ok | 26.14 |
HDD15 | 4Fh | ok | 26.15 |
HDD16 | 50h | ok | 26.16 |
HDD17 | 51h | ok | 26.17 |
HDD18 | 52h | ok | 26.18 |
HDD19 | 53h | ok | 26.19 |
HDD2 | 42h | ok | 26.2 | Drive Present
HDD3 | 43h | ok | 26.3 | Drive Present
HDD4 | 44h | ok | 26.4 | Drive Present
HDD5 | 45h | ok | 26.5 |
HDD6 | 46h | ok | 26.6 |
HDD7 | 47h | ok | 26.7 |
HDD8 | 48h | ok | 26.8 |
HDD9 | 49h | ok | 26.8 |
HSC0 Input | CEh | ok | 20.1 | 69 Watts
HSC0 Status High | CDh | ok | 20.3 |
HSC0 Status Low | CCh | ok | 20.2 |
HSC1 Input | C8h | ok | 20.5 | 290 Watts
HSC2 Input | C9h | ok | 20.6 | 310 Watts
PSU Redundancy | E8h | ok | 21.1 | Non-Redundant: Insufficient Resources
PSU1 Input | E4h | ok | 10.1 | 333 Watts
PSU1 Status | E0h | ok | 10.1 | Presence detected
PSU2 Input | E5h | ok | 10.2 | 369 Watts
PSU2 Status | E1h | ok | 10.2 | Presence detected
PSU3 Input | E6h | ok | 10.3 | 9 Watts
PSU3 Status | E2h | ok | 10.3 | Presence detected, Failure detected, Predictive failure
PSU4 Input | E7h | lcr | 10.4 | 0 Watts
PSU4 Status | E3h | ok | 10.4 | Presence detected, Power Supply AC lost
Power_GPGPU0 | 68h | ok | 16.1 | 44 Watts
Power_GPGPU1 | 69h | ok | 16.2 | 44 Watts
Power_GPGPU2 | 6Ah | ok | 16.3 | 44 Watts
Power_GPGPU3 | 6Bh | ok | 16.4 | 44 Watts
Power_GPGPU4 | 6Ch | ok | 16.5 | 44 Watts
Power_GPGPU5 | 6Dh | ok | 16.6 | 44 Watts
Power_GPGPU6 | 6Eh | ok | 16.7 | 44 Watts
Power_GPGPU7 | 6Fh | ok | 16.8 | 42 Watts
Pwr_Node | CFh | ok | 20.4 | 648 Watts
Temp_Ambient_BP0 | EEh | ok | 64.1 | 24 degrees C
Temp_Ambient_BP1 | EFh | ok | 64.2 | 23 degrees C
Temp_Ambient_FP | EBh | ok | 64.3 | 26 degrees C
Temp_Ambient_PCI | BAh | ok | 66.11 | 38 degrees C
Temp_CPU0 | AAh | ok | 65.1 | 44 degrees C
Temp_CPU1 | ABh | ok | 65.2 | 40 degrees C
Temp_DIMM_AB | ACh | ok | 66.1 | 37 degrees C
Temp_DIMM_CD | ADh | ok | 66.2 | 36 degrees C
Temp_DIMM_EF | AEh | ok | 66.3 | 33 degrees C
Temp_DIMM_GH | AFh | ok | 66.4 | 35 degrees C
Temp_EXPB | BBh | ok | 7.20 | 25 degrees C
Temp_GPGPU0 | 60h | ok | 16.1 | 34 degrees C
Temp_GPGPU1 | 61h | ok | 16.2 | 35 degrees C
Temp_GPGPU2 | 62h | ok | 16.3 | 36 degrees C
Temp_GPGPU3 | 63h | ok | 16.4 | 35 degrees C
Temp_GPGPU4 | 64h | ok | 16.5 | 35 degrees C
Temp_GPGPU5 | 65h | ok | 16.6 | 36 degrees C
Temp_GPGPU6 | 66h | ok | 16.7 | 36 degrees C
Temp_GPGPU7 | 67h | ok | 16.8 | 34 degrees C
Temp_GPUB0 | BCh | ok | 7.18 | 34 degrees C
Temp_GPUB1 | BDh | ok | 7.19 | 33 degrees C
Temp_Inlet_MB | ECh | ok | 66.12 | 30 degrees C
Temp_OCP_Mezz | E9h | ok | 7.16 | 60 degrees C
Temp_Outlet | 32h | ok | 7.1 | 38 degrees C
Temp_PCH | BEh | ok | 66.13 | 43 degrees C
Temp_PDB | EDh | ok | 7.21 | 26 degrees C
Temp_RaidCard | EAh | ok | 7.17 | 49 degrees C
Temp_VR_CPU0 | B0h | ok | 66.5 | 41 degrees C
Temp_VR_CPU1 | B1h | ok | 66.6 | 32 degrees C
Temp_VR_DIMM_AB | B3h | ok | 66.7 | 40 degrees C
Temp_VR_DIMM_CD | B4h | ok | 66.8 | 38 degrees C
Temp_VR_DIMM_EF | B5h | ok | 66.9 | 32 degrees C
Temp_VR_DIMM_GH | B6h | ok | 66.10 | 32 degrees C
Volt_P12V | D2h | ok | 7.1 | 12.30 Volts
Volt_P1V05 | D3h | ok | 7.1 | 1.07 Volts
Volt_P1V8_AUX | D4h | ok | 7.1 | 1.83 Volts
Volt_P3V3 | D0h | ok | 7.1 | 3.35 Volts
Volt_P3V3_AUX | D5h | ok | 7.1 | 3.35 Volts
Volt_P3V_BAT | D7h | ok | 7.1 | 3.07 Volts
Volt_P5V | D1h | ok | 7.1 | 5.08 Volts
Volt_P5V_AUX | D6h | ok | 7.1 | 5.08 Volts
Volt_VR_CPU0 | DAh | ok | 7.1 | 1.79 Volts
Volt_VR_CPU1 | DBh | ok | 7.1 | 1.79 Volts
Volt_VR_DIMM_AB | DCh | ok | 7.1 | 1.22 Volts
Volt_VR_DIMM_CD | DDh | ok | 7.1 | 1.22 Volts
Volt_VR_DIMM_EF | DEh | ok | 7.1 | 1.22 Volts
Volt_VR_DIMM_GH | DFh | ok | 7.1 | 1.22 Volts
Watchdog | F5h | ok | 6.2 |
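It looks like PSU3 and PSU4 are showing problems: PSU3 Status reports "Failure detected, Predictive failure" and PSU4 Status reports "Power Supply AC lost".
Could this be the cause of the performance degradation?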
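
To avoid scanning the whole sensor list by eye, here is a small sketch that keeps only the PSU rows that look wrong. It assumes the five-column, pipe-separated layout of the `ipmitool sdr elist` output above. The function name `flag_psu_rows` and the keyword list are my own choices based on the readings in this thread, so other BMC firmware may word them differently.

```python
def flag_psu_rows(sdr_elist_output):
    """Return (sensor, status, reading) for PSU rows that look unhealthy."""
    # Keywords taken from the readings seen in this thread (an assumption,
    # not an official list of IPMI power supply states).
    bad_words = ("failure", "ac lost", "non-redundant")
    flagged = []
    for line in sdr_elist_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Expect: sensor name | sensor id | status | entity id | reading
        if len(fields) != 5 or not fields[0].startswith("PSU"):
            continue
        sensor, _sensor_id, status, _entity, reading = fields
        if status != "ok" or any(word in reading.lower() for word in bad_words):
            flagged.append((sensor, status, reading))
    return flagged


# Rows copied from the output above (plus one non-PSU row to show filtering).
sample = """\
Fan_SYS0_1 | C0h | ok | 29.1 | 5200 RPM
PSU Redundancy | E8h | ok | 21.1 | Non-Redundant: Insufficient Resources
PSU1 Input | E4h | ok | 10.1 | 333 Watts
PSU1 Status | E0h | ok | 10.1 | Presence detected
PSU2 Input | E5h | ok | 10.2 | 369 Watts
PSU2 Status | E1h | ok | 10.2 | Presence detected
PSU3 Input | E6h | ok | 10.3 | 9 Watts
PSU3 Status | E2h | ok | 10.3 | Presence detected, Failure detected, Predictive failure
PSU4 Input | E7h | lcr | 10.4 | 0 Watts
PSU4 Status | E3h | ok | 10.4 | Presence detected, Power Supply AC lost
"""

for sensor, status, reading in flag_psu_rows(sample):
    print(f"{sensor:15} {status:4} {reading}")
```

The sketch does not flag PSU3 Input, because its status column is "ok", but 9 Watts next to 333 and 369 Watts on PSU1 and PSU2 is worth a look as well.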

Yes, failed PSUs will have a performance impact. I’ll try to get some more information tomorrow.

Thank you.

I will try swapping in another PSU.

I swapped in another PSU, and now it works!

Thank you.