nvidia-smi runs very slowly on our DGX-1 (V100) system.
Disk throughput is also low: the same dd test gives noticeably slower results on this machine than on other systems of the same model.
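To put a number on "slow", one option is to time the command with a hard timeout. The following is only a minimal sketch: the helper name time_command and the 60-second default are assumptions chosen for illustration, not part of any NVIDIA tool.

```python
import subprocess
import time


def time_command(cmd, timeout_s=60.0):
    """Run cmd and return (elapsed_seconds, status).

    status is "ok", "failed" (non-zero exit code), "timeout",
    or "not found" (the executable is not on PATH).
    The helper name and default timeout are illustrative assumptions.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,  # discard output, only timing matters
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timeout"
    except FileNotFoundError:
        status = "not found"
    return time.monotonic() - start, status
```

Calling time_command(["nvidia-smi"]) on this system and on one of the other DGX-1 units would give two directly comparable numbers.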
DGX-1(1)
Write
dd if=/dev/sda2 bs=1024 count=100000 of=/root/test/test_file oflag=direct
100000+0 records in
100000+0 records out
102400000 bytes (102 MB, 98 MiB) copied, 5.17613 s, 19.8 MB/s
Read
dd if=/root/test/test_file of=/dev/null bs=1024
100000+0 records in
100000+0 records out
102400000 bytes (102 MB, 98 MiB) copied, 1.31338 s, 78.0 MB/s
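To compare these results with the other machines, the throughput can be recomputed from the byte count and elapsed time in each dd summary line. This is a small sketch; the function name dd_throughput_mb_s is made up for illustration, and it assumes the summary line has the format shown above. Note that with bs=1024 and oflag=direct every write is issued as a separate 1 KiB request, so the write figure probably reflects per-request latency more than sequential bandwidth.

```python
import re

# Matches e.g. "102400000 bytes (102 MB, 98 MiB) copied, 5.17613 s, 19.8 MB/s"
DD_SUMMARY = re.compile(r"(\d+) bytes .*copied, ([0-9.]+) s, ")


def dd_throughput_mb_s(summary_line):
    """Return throughput in MB/s (1 MB = 10**6 bytes) from a dd summary line."""
    match = DD_SUMMARY.search(summary_line)
    if match is None:
        raise ValueError("not a dd summary line: %r" % summary_line)
    n_bytes = int(match.group(1))
    seconds = float(match.group(2))
    return n_bytes / seconds / 1e6


write_line = "102400000 bytes (102 MB, 98 MiB) copied, 5.17613 s, 19.8 MB/s"
read_line = "102400000 bytes (102 MB, 98 MiB) copied, 1.31338 s, 78.0 MB/s"

print("write: %.1f MB/s" % dd_throughput_mb_s(write_line))
print("read:  %.1f MB/s" % dd_throughput_mb_s(read_line))
```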
nvsm show health
root@DGX-01:~# nvsm show health
Info
Timestamp: Tue Sep 7 18:36:14 2021 +0900
Version: 20.09.17
Checks
WARNING: <bound method ParseNvidiaSmi.run of <nvsmhealth.modules.nvidia_smi.ParseNvidiaSmi object at 0x7f160803d430>> timeout
Verify installed DIMM memory sticks… Healthy
Verify disk controllers… Healthy
Verify Ethernet controllers… Healthy
Verify installed GPU’s… Healthy
Verify installed InfiniBand controllers… Healthy
Verify PCIe switches… Healthy
Verify DIMM vendors… Healthy
Verify chassis fan presence… Healthy
Check FRU information for consistency… Healthy
Verify GPUs VBIOS version consistency… Unknown
No result from nvidia-smi tool
Quick health check of GPU using DCGM… Healthy
Number of logical CPU cores [80]… Healthy
Installed memory capacity [503.82GB]… Healthy
Verify Mellanox devices firmware version consistency… Healthy
Verify GPU’s identified using nvidia-smi… Healthy
Verify chassis power supply presence… Unhealthy
Checking output of 'ipmitool sdr elist' for expected chassis PSUs
AC input is lost, PSU4 Status has reading:
Presence detected, Power Supply AC lost
Verify installed MegaRAID disks… Healthy
Check for SSD health… Healthy
[sanity] MegaRAID storcli utility installed… Healthy
[sanity] DGX BaseOS support for storcli utility… Healthy
MegaRAID virtual disk state [ /c0/v0 ][ State: Optl ]… Healthy
MegaRAID virtual disk state [ /c0/v1 ][ State: Optl ]… Healthy
MegaRAID physical disk state [ /c0/v0 PD 0 ][ State: Onln ]… Healthy
MegaRAID physical disk state [ /c0/v1 PD 0 ][ State: Onln ]… Healthy
MegaRAID physical disk state [ /c0/v1 PD 1 ][ State: Onln ]… Healthy
MegaRAID physical disk state [ /c0/v1 PD 2 ][ State: Onln ]… Healthy
MegaRAID physical disk state [ /c0/v1 PD 3 ][ State: Onln ]… Healthy
MegaRAID PHY link speed [ /c0/p0 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p1 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p2 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p3 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p4 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p5 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p6 ][ Speed: No limit ]… Healthy
MegaRAID PHY link speed [ /c0/p7 ][ Speed: No limit ]… Healthy
MegaRAID SAS address [ /c0/p0 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p1 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p2 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p3 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p4 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p5 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p6 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS address [ /c0/p7 ][ 0x5001636002BF0F3F ]… Healthy
MegaRAID SAS port valid [ /c0/p0 ]… Healthy
MegaRAID SAS port valid [ /c0/p1 ]… Healthy
MegaRAID SAS port valid [ /c0/p2 ]… Healthy
MegaRAID SAS port valid [ /c0/p3 ]… Healthy
MegaRAID SAS port valid [ /c0/p4 ]… Healthy
MegaRAID SAS port valid [ /c0/p5 ]… Healthy
MegaRAID SAS port valid [ /c0/p6 ]… Healthy
MegaRAID SAS port valid [ /c0/p7 ]… Healthy
Ethernet link speed [0000:01:00.0][5GT/s]… Healthy
Ethernet link width [0000:01:00.0][x8]… Healthy
Ethernet link speed [0000:01:00.1][5GT/s]… Healthy
Ethernet link width [0000:01:00.1][x8]… Healthy
GPU link speed [0000:89:00.0][None]… Unknown
unknown pstate for the GPU[0000:89:00.0]
GPU link width [0000:89:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:89:00.0]
GPU link speed [0000:0a:00.0][None]… Unknown
unknown pstate for the GPU[0000:0a:00.0]
GPU link width [0000:0a:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:0a:00.0]
GPU link speed [0000:07:00.0][None]… Unknown
unknown pstate for the GPU[0000:07:00.0]
GPU link width [0000:07:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:07:00.0]
GPU link speed [0000:85:00.0][None]… Unknown
unknown pstate for the GPU[0000:85:00.0]
GPU link width [0000:85:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:85:00.0]
GPU link speed [0000:8a:00.0][None]… Unknown
unknown pstate for the GPU[0000:8a:00.0]
GPU link width [0000:8a:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:8a:00.0]
GPU link speed [0000:06:00.0][None]… Unknown
unknown pstate for the GPU[0000:06:00.0]
GPU link width [0000:06:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:06:00.0]
GPU link speed [0000:0b:00.0][None]… Unknown
unknown pstate for the GPU[0000:0b:00.0]
GPU link width [0000:0b:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:0b:00.0]
GPU link speed [0000:86:00.0][None]… Unknown
unknown pstate for the GPU[0000:86:00.0]
GPU link width [0000:86:00.0][xNone]… Unknown
unknown pstate for the GPU[0000:86:00.0]
InfiniBand controller link speed [0000:05:00.0][8GT/s]… Healthy
InfiniBand controller link width [0000:05:00.0][x16]… Healthy
InfiniBand controller link speed [0000:84:00.0][8GT/s]… Healthy
InfiniBand controller link width [0000:84:00.0][x16]… Healthy
Check GPUDirect Topology information for consistency… Healthy
NVIDIA Driver Version [450.80.02]…
WARNING: Unhandled exception from task check_nvidia_smi_nvlink_status
WARNING: Run with --log-level=debug for exception details.
BMC Firmware Revision [3.30.30]…
Check BMC sensor thresholds… Unhealthy
PSU4 Input: Observed value "0.0" (Watts) below critical lower threshold "0.0"
Checked 105 sensor values against BMC thresholds.
DGX BaseOS Version [5.0.2]…
BIOS Version [S2W_3A10]…
Linux kernel version [5.4.0-58-generic]…
System Uptime [up 7 hours, 57 minutes]…
DGX Serial Number [#############]…
System Summary
Product Name: DGX-1 with V100-32
Manufacturer: NVIDIA
DGX Serial Number: #############
Uptime: up 7 hours, 57 minutes
Motherboard:
BIOS Version: S2W_3A10
Serial Number: ###########
BMC:
Firmware Version: 3.30.30
IPMI Version: 2.0
GPU:
NVIDIA Driver Version: 450.80.02
Product Name(s): Unknown
VBIOS Version(s): Unknown
Software:
DGX BaseOS Version: 5.0.2
Kernel Version: 5.4.0-58-generic
Health Summary
58 out of 77 checks are healthy
2 out of 77 checks are unhealthy
17 out of 77 checks are unknown
0 out of 77 checks are informational
Overall system status is unhealthy
Problem detected.
- Please run 'sudo nvsm dump health'
- Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/dashboard/
- Attach the log file from /tmp/nvsm-health-1631007374.json
100.0% [=========================================]
Status: unhealthy
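Since the output above is long, a short script can pull out only the checks that are not Healthy. This is a rough sketch that assumes the text format pasted here ("check name", an ellipsis or dots, then the status); the function name group_checks is hypothetical, and the JSON file mentioned in the output may be a better source if its structure is known.

```python
import re
from collections import defaultdict

# A check line ends with an ellipsis (or dots) followed by a status word.
CHECK_LINE = re.compile(
    r"^(?P<name>.+?)(?:…|\.{2,})\s*(?P<status>Healthy|Unhealthy|Unknown)\s*$"
)


def group_checks(lines):
    """Group check names by status; lines without a status are skipped."""
    groups = defaultdict(list)
    for line in lines:
        match = CHECK_LINE.match(line.strip())
        if match:
            groups[match.group("status")].append(match.group("name").strip())
    return dict(groups)


# A few lines copied from the output above.
sample = [
    "Verify disk controllers… Healthy",
    "Verify GPUs VBIOS version consistency… Unknown",
    "No result from nvidia-smi tool",
    "Verify chassis power supply presence… Unhealthy",
    "GPU link speed [0000:89:00.0][None]… Unknown",
    "NVIDIA Driver Version [450.80.02]…",
]

groups = group_checks(sample)
for status in ("Unhealthy", "Unknown"):
    for name in groups.get(status, []):
        print("%-9s %s" % (status, name))
```

Applied to the full output, this should list the PSU4 and BMC threshold checks as Unhealthy and the nvidia-smi-dependent GPU checks as Unknown.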