Dcgmi diag memtest fail

When running the DCGM diagnostic memtest on my 8-GPU server, I encountered the following warning: “GPU4 A memory mismatch was detected on GPU 4, but no error was reported by CUDA or NVML. Run a field diagnostic on GPU.”

How can I view more detailed logs? What is the criterion for this warning, and could it be caused by some kind of hardware failure?

dcgmi runs a memtest on the GPUs and detected defective memory cells. Depending on the GPUs used, you might want to enable ECC, rerun the memory test, and then check nvidia-smi -q to see how many cells are affected and mapped out, so you can decide whether to replace the GPU.
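
In case it's useful, a minimal sketch of that workflow for a single suspect GPU. The device index 4 follows the warning above; running the memtest by name assumes a DCGM version whose diag accepts named tests, and on pre-Ampere boards the relevant section is PAGE_RETIREMENT rather than ROW_REMAPPER:

    # Enable ECC on GPU 4 (takes effect after a reboot or GPU reset)
    nvidia-smi -i 4 -e 1
    # Rerun the DCGM memory test against that GPU only
    dcgmi diag -r memtest -i 4
    # Inspect ECC error counters and remapped rows
    nvidia-smi -i 4 -q -d ECC,ROW_REMAPPER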

Thank you very much for the reply. After I enabled ECC, this warning no longer appears, so does that mean this GPU is not faulty?

Additionally, I found in the User Guide that it seems only Tesla products support memtest. Does this mean that memtest requires ECC to be enabled?

I suspect this is only due to the fact that those GPUs don’t drive displays, so the full memory can be tested.

No. ECC just detects memory errors on access and either tries to correct them, or reports them and maps out the affected memory portion.

The nvidia-smi output states that one correctable error was detected. This means that exactly one single bit was faulty, which ECC corrected, so memtest didn’t report an error.
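
If you want to double-check that behavior, one way (again assuming GPU index 4; resetting the counters requires root) is to clear the volatile ECC counts, rerun the memtest, and see whether a new correctable error shows up:

    # Reset the volatile ECC error counters on GPU 4
    nvidia-smi -i 4 -p 0
    # Rerun the memory test
    dcgmi diag -r memtest -i 4
    # The volatile single-bit count now reflects only freshly corrected errors
    nvidia-smi -i 4 -q -d ECC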


I am using an 8-GPU system with RTX 4090s, and the DCGM memtest showed a similar memory mismatch on a single GPU. Should I be concerned about memory integrity or a defective card? What can I do next to verify whether there is a problem with the memory?