Dcgmi diag memtest fail

When running the DCGM diagnostic memtest on my 8-GPU server, I encountered the following warning: “GPU4 A memory mismatch was detected on GPU 4, but no error was reported by CUDA or NVML. Run a field diagnostic on GPU.”

How can I view more detailed logs? What is the criterion for this warning, and could it be caused by some kind of hardware failure?

dcgmi runs a memtest on the GPUs and detected defective memory cells. Depending on the GPUs used, you might want to enable ECC, rerun the memory test, and then check nvidia-smi -q to see how many cells are affected and mapped out, so you can decide whether to replace the GPU.
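
In case it's useful, a minimal sketch of that workflow for a single suspect GPU. The device index 4 follows the warning above; running the memtest by name assumes a DCGM version whose diag accepts named tests, and on pre-Ampere boards the relevant section is PAGE_RETIREMENT rather than ROW_REMAPPER:

    # Enable ECC on GPU 4 (takes effect after a reboot or GPU reset)
    nvidia-smi -i 4 -e 1
    # Rerun the DCGM memory test against that GPU only
    dcgmi diag -r memtest -i 4
    # Inspect ECC error counters and remapped rows
    nvidia-smi -i 4 -q -d ECC,ROW_REMAPPER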

Thank you very much for the reply. After I enabled ECC, this warning no longer appears, so does that mean this GPU is not faulty?

Additionally, I found in the User Guide that it seems only Tesla products support memtest. Does this mean that memtest requires ECC to be enabled?

I suspect this is only due to the fact that those GPUs don’t drive displays, so the full memory can be tested.

No. ECC just detects memory errors on access and either tries to correct them, or reports them and maps out the affected memory portion.

The nvidia-smi output states that one correctable error was detected. This means that exactly one single bit was faulty, which ECC corrected, so memtest didn’t report an error.
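
If you want to double-check that behavior, one way (again assuming GPU index 4; resetting the counters requires root) is to clear the volatile ECC counts, rerun the memtest, and see whether a new correctable error shows up:

    # Reset the volatile ECC error counters on GPU 4
    nvidia-smi -i 4 -p 0
    # Rerun the memory test
    dcgmi diag -r memtest -i 4
    # The volatile single-bit count now reflects only freshly corrected errors
    nvidia-smi -i 4 -q -d ECC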


I am using an 8-GPU system with RTX 4090s, and the DCGM memtest showed a similar memory mismatch on a single GPU. Should I be concerned about memory integrity or a defective card? What can I do next to verify whether there is a problem with the memory?