I am facing a few hardware issues on a DGX system. The system identifies itself through dmidecode as
system-product-name: NVIDIA_DGX_Spark
system-manufacturer: NVIDIA
system-version: A.7.
I need help understanding whether the behavior I am seeing is expected for this particular model or if it indicates a hardware fault.
The issue is that memtester reports failures on nearly all tests except “walking zeroes.” This makes me suspect a problem with one or more memory modules, but I am not sure whether memtester alone is sufficient to conclude a RAM failure. To diagnose this further, I reached out to the NVIDIA support team (case reference number: 251207-000476) and attempted to run the DGX Spark Diagnostic Suite as instructed. This led to another issue: I do not see the NVIDIA DGX diagnostic tools on the system. The directory /usr/local/nvidia/dgx/ does not exist, so the diagnostic suite cannot be run. Is the absence of this directory normal?
While troubleshooting, I also observed that the BMC appears to be missing. Running sudo ipmitool sel list gives an error indicating that /dev/ipmi0 does not exist, even though lsmod | grep -i ipmi shows ipmi_devintf and ipmi_msghandler, so the IPMI modules are loaded. Please confirm whether this system is designed without a BMC or whether IPMI is expected to work differently on this hardware.
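The two checks described above can be combined into one short script. This is only a minimal sketch, assuming a Linux system with lsmod and awk available; the function name check_ipmi is hypothetical and the device path is passed as an optional argument. Loaded ipmi_* modules do not by themselves show that a BMC exists, so the script reports the device node and the modules separately.

```shell
#!/bin/sh
# check_ipmi: hypothetical helper (not an NVIDIA or ipmitool command).
# Reports whether the IPMI character device exists and which ipmi_* kernel
# modules are currently loaded.
check_ipmi() {
    dev="${1:-/dev/ipmi0}"

    if [ -e "$dev" ]; then
        echo "device: $dev present"
    else
        echo "device: $dev missing"
    fi

    # List loaded modules whose name starts with "ipmi_"; print "none" if the
    # list is empty or lsmod is unavailable.
    mods=$(lsmod 2>/dev/null | awk '$1 ~ /^ipmi_/ {print $1}' | tr '\n' ' ')
    echo "modules: ${mods:-none}"
}
```

Calling check_ipmi with no argument checks /dev/ipmi0; a "missing" device together with loaded modules matches the situation described in this post.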
For reference, I have attached the following logs: the full memtester report, journalctl -p err -b, and dmesg filtered for memory/ECC/MCE messages, along with basic system information including kernel version, OS details, and BIOS version.
If any additional logs or system information are required for further analysis, I am happy to provide those as well.
I will get back to you on what memtester output should look like. However, the DGX Spark is not like our other DGX products: it has no BMC and no pre-installed diagnostic suite, because it is not a datacenter system.
The results I shared are from running: sudo memtester 4G 1
Since that test failed, I also tried larger sizes like 10G and 60G, but those failed in the same way.
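The size sweep described above (4G first, then larger sizes) could be scripted so that every run leaves its own log. This is a rough sketch under stated assumptions: sweep_memtester is a hypothetical helper, MEMTESTER_CMD and LOGDIR are variables invented for this example, and it relies only on memtester accepting a size and a loop count and returning a non-zero exit status when a run fails. It needs to be run as root so memtester can lock the memory.

```shell
#!/bin/sh
# sweep_memtester: hypothetical helper; runs one memtester loop per size
# and records PASS/FAIL from the exit status.
# MEMTESTER_CMD and LOGDIR are assumptions of this sketch, not memtester options.
sweep_memtester() {
    cmd="${MEMTESTER_CMD:-memtester}"
    logdir="${LOGDIR:-.}"

    for size in "$@"; do
        log="$logdir/memtester-$size.log"
        # Full output goes to the per-size log; only a summary is printed.
        if $cmd "$size" 1 > "$log" 2>&1; then
            echo "$size PASS"
        else
            echo "$size FAIL (exit $?)"
        fi
    done
}
```

For example, sweep_memtester 4G 10G 60G would write memtester-4G.log, memtester-10G.log and memtester-60G.log to the current directory. Note that memtester tests whatever pages the kernel hands it, so a size that fails does not by itself identify a specific physical region.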
Did a quick 10G 2 run and didn’t encounter any errors:
elsaco@spark1:~$ sudo memtester 10G 2
memtester version 4.6.0 (64-bit)
Copyright (C) 2001-2020 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).
pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 10240MB (10737418240 bytes)
got 10240MB (10737418240 bytes), trying mlock ...locked.
Loop 1/2:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Loop 2/2:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : sok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Done.
In the 2nd run there’s Bit Flip : sok which didn’t show in the 1st run. Not sure how to interpret that.
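To spot lines like this in a longer saved log, a small filter can list every result that is not exactly "ok". This is a sketch only: flag_non_ok is a hypothetical name, and it assumes the log has the "test name : result" layout shown above. It flags lines for manual review and does not interpret what an unexpected result means.

```shell
#!/bin/sh
# flag_non_ok: hypothetical helper; prints every "name : result" line whose
# result is neither empty nor exactly "ok". Reads the files given as
# arguments, or standard input when called without arguments.
flag_non_ok() {
    awk -F':' '
        NF >= 2 {
            result = $2
            # Trim leading and trailing blanks from the result field.
            gsub(/^[ \t]+|[ \t]+$/, "", result)
            # "Loop 1/2:" has an empty result and is skipped.
            if (result != "" && result != "ok") print
        }
    ' "$@"
}
```

Run against the output above (for example, flag_non_ok memtester.log, where the file name is a placeholder), it prints only the Bit Flip : sok line.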
@sushrutl I noticed you are not on the latest kernel and are still using 6.11.x. You might want to upgrade to 6.14.0-1013-nvidia and run memtester again.
Keep in mind that Spark uses unified memory, and memtester doesn’t handle GPU memory well.
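To check whether the running kernel is at least the version mentioned above, the release strings can be compared with a version-aware sort. This is a minimal sketch: kernel_at_least is a hypothetical helper, and it assumes a sort that supports -V (as GNU coreutils does).

```shell
#!/bin/sh
# kernel_at_least: hypothetical helper; succeeds (exit 0) when the first
# release string is the same as or newer than the second.
# Assumes `sort -V` (version sort) is available.
kernel_at_least() {
    current="$1"
    wanted="$2"
    # After a version sort, the lowest of the two comes first; the check
    # passes when that lowest entry is the wanted version.
    lowest=$(printf '%s\n%s\n' "$current" "$wanted" | sort -V | head -n 1)
    [ "$lowest" = "$wanted" ]
}
```

A typical call would be: kernel_at_least "$(uname -r)" 6.14.0-1013-nvidia && echo "kernel is current" || echo "kernel is older".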
I am following up on NVIDIA case reference 251207-000476 and need guidance on how to generate the full DGX Spark Diagnostic Suite report, which is required to proceed with the replacement process. I am encountering persistent memory-related failures on my DGX Spark system. I ran multiple memtests by gradually increasing the memory coverage. Tests covering 4 GB, 16 GB, and 24 GB completed successfully, while tests covering 32 GB and 64 GB consistently failed. This behavior strongly indicates a bad or corrupted RAM region, as lower memory ranges pass reliably while higher ranges fail.
Based on these results, NVIDIA support requested submission of the full DGX Spark Diagnostic Suite report. However, I have been unable to generate this report. I have tried the following:
/usr/local/nvidia/dgx/diagnostics/dgxdiag --full
O/P: -bash: /usr/local/nvidia/dgx/diagnostics/dgxdiag: No such file or directory
sudo nvsm show health
O/P: ERROR:nvsm:Failed to initialize NVSM: Failed to find a matching definition file for this platform. Please toggle autogenerate_pdf flag from nvsm.config for auto generating pdf.
You can find the device specifications in the first post of this thread. I need guidance on the appropriate way to obtain the DGX Spark Diagnostic Suite report.
Hi Sushrutl, this is currently NOT required. It will eventually be required, but not yet. Please reach out to support again. I’ll clarify this with the Support team and let them know you are coming back.