V100 ECC Error

There are 4 GPUs in the same system.
Only one GPU (GPU3) reports a large number of ECC errors (Volatile and Aggregate), as shown below and in the attached log file.
Unlike the other GPUs, GPU3 shows many ECC errors and the system is not stable.

I suspect a HW issue since so many ECC errors are counted. What is your opinion?

Looking at the table below, the Volatile Single Bit ECC Error and Aggregate Single Bit ECC Error counters show 309 and 751 respectively. Are these figures something that can be ignored?

nvidia-smi.query.log (26.0 KB)

The absolute numbers don’t say much by themselves; they are only meaningful in relation to the amount of memory actually used and the number of accesses done.
Rather, use cuda-memtest to get the full picture.
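
For reference, the per-GPU breakdown of those counters can be read directly. A minimal sketch, assuming a driver recent enough that nvidia-smi supports the ECC display filter:

    # volatile (since last driver load) and aggregate (lifetime) ECC counters for all GPUs
    nvidia-smi -q -d ECC
    # the same report restricted to one GPU, e.g. GPU3
    nvidia-smi -q -d ECC -i 3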

From the HW point of view, is it not a problem even if there are many Volatile/Aggregate Single Bit ECC errors?
Additionally, isn’t the Retired Pages Single Bit ECC count related to bad HW?

Of course it’s HW; the memory is degraded. You effectively asked how degraded it is, and my answer to that was: insufficient data. Having 751 errors while doing 10^30 transfers might be acceptable; while doing 10^3 transfers, definitely not.
Since those ECC errors might also have an external cause (e.g. EM interference from a different device next to it), the device should be removed, tested in a different system, and replaced if the memory still fails.
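
Before moving the card, note its identifiers so the counters can be matched to the same physical board in the other system. A quick way, assuming nvidia-smi is installed on both machines:

    # list GPUs with their UUIDs (stable across systems)
    nvidia-smi -L
    # full report for GPU3, filtered to the serial number and UUID lines
    nvidia-smi -q -i 3 | grep -iE "serial|uuid"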

  1. The supplier told us that ECC error counts are not grounds for an RMA, since single bit errors are detected and corrected. Is that right?

  2. If we run cuda-memtest, what information could we get?
    Could we decide, depending on the result of cuda-memtest, whether the GPU is defective or not?

  3. We removed the V100 GPU and installed it in a different system.
    I ran FieldDiag and it reported “PASS”, and the ECC error counts stayed the same.

Correct, single bit errors are detected and corrected; persistently defective memory cells will also be disabled (aggregate errors / retired pages), see:
https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html
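
You can also see the retired pages directly on the device. A minimal check, assuming a driver that exposes the page retirement section in nvidia-smi:

    # retired page counts (single/double bit ECC) and whether a retirement is still pending
    nvidia-smi -q -d PAGE_RETIREMENT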

I don’t know how you used that card before, if at all, so you should run cuda-memtest so that the full memory gets used and tested and the full picture shows up; maybe there are also broken cells that simply haven’t been used so far. Even then, this is only a snapshot, and you’ll have to watch it over time. If the error numbers don’t rise, it’s a minimal defect.

Thanks. I’m trying to run cuda-memtest and will update you once I get the result.
BTW, by how much would the error numbers have to increase for you to consider the GPU card defective?

Please see this:
https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre
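
The numbers that FAQ refers to can also be pulled in a scriptable form. A sketch, assuming the query field names listed by nvidia-smi --help-query-gpu (availability may vary by driver version):

    # retired pages per GPU: single bit ECC, double bit ECC, and pending retirements
    nvidia-smi --query-gpu=index,retired_pages.single_bit_ecc.count,retired_pages.double_bit_ecc.count,retired_pages.pending --format=csv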

I tried running it with memcheck_demo as shown in the link below.
https://docs.nvidia.com/cuda/cuda-memcheck/index.html#cuda-memcheck-tool-examples

There seems to be a problem. How can I run the memcheck test properly?
The log file is attached.

memcheck_result.log (4.3 KB)

cuda-memtest, not cuda-memcheck. cuda-memcheck is for identifying memory access errors and leaks in CUDA applications.
https://sourceforge.net/projects/cudagpumemtest/

I tried running cuda-memtest. It worked using the following command:
*command: ./cuda_memtest --stress

Is this right?
How long does this test take to complete?

cuda-memtest_ing.log (4.1 KB)

If you run it without options, it will run indefinitely. Hit Ctrl+C to stop it, then check with nvidia-smi whether the aggregate ECC error counts have increased.
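
If you prefer a bounded run, something like this works as well. A sketch, assuming GNU coreutils timeout is available; CUDA_VISIBLE_DEVICES restricts the test to the suspect GPU (index 3 here as an example, adjust to your system):

    # stress the suspect GPU only, for one hour, then stop automatically
    CUDA_VISIBLE_DEVICES=3 timeout 1h ./cuda_memtest --stress
    # afterwards compare the aggregate counters with the values noted before the run
    nvidia-smi -q -d ECC -i 3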

I ran cuda-memtest and stopped it with Ctrl+C.
Then I checked with nvidia-smi whether aggregate ECC errors had occurred.
But there was no new error; the counts are the same as before the test.

You said the aggregate ECC error counts should increase, but they did not.

Please check the “Result log” I attached.

I also have some questions.

  1. I will share the test environment below. Please check it, and reply if there is a problem with this environment or if anything needs to be changed.

    Equipped with the 2 GPUs that show ECC errors
    OS: CentOS Linux release 7.8.2003 (Core)
    NVIDIA Driver Version: 440.33.01
    CUDA Version: 10.2

  2. How long should cuda-memtest be run to get accurate results?

  3. To get accurate results, do I need to pass any additional options to cuda-memtest?

  4. As we have confirmed, if no ECC errors appear after running cuda-memtest, is the HW normal?

cuda-memtest-result.log (9.6 KB)

Depending on age and usage, the results are OK, within thresholds.
As said, I don’t know how long and to what extent you’ve been using the devices. If they are under constant, extensive use, you don’t need cuda-memtest at all; that is what ECC is for. You just need to monitor the development of the error counts.
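
For monitoring, a simple periodic log is enough. A minimal sketch, assuming the query field names listed by nvidia-smi --help-query-gpu and an output path of your choosing:

    # append a timestamped snapshot of the aggregate ECC counters every hour
    while true; do
        nvidia-smi --query-gpu=timestamp,index,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv,noheader >> ecc-history.csv
        sleep 3600
    done

If the corrected count for a GPU keeps climbing between snapshots, that is the signal to act on.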

So, is there a way to get rid of the ECC error counts?
Please let me know if there is.

No, that is historical data about the number of cells that have been found faulty and therefore disabled. The aggregate/retired-page numbers don’t have any influence on the daily operation of the device.