V100 ECC Error

There are 4 GPUs in the same system.
Only one GPU (GPU3) reports a large number of ECC errors (Volatile and Aggregate), as shown below and in the attached log file.
Unlike the other GPUs, GPU3 shows many ECC errors and the system is not stable.

I suspect a HW issue since so many ECC errors are counted. What is your opinion?

Looking at the table below, the Volatile Single Bit ECC Error and Aggregate Single Bit ECC Error counters show 309 and 751 respectively. Are these figures something that can be ignored?

nvidia-smi.query.log (26.0 KB)

The absolute numbers don’t say much by themselves; they are only meaningful in relation to the amount of memory actually used and the number of accesses done.
Rather, use cuda-memtest to get the full picture.
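
For reference, the per-GPU breakdown of those counters can be read directly. A minimal sketch, assuming a driver recent enough that nvidia-smi supports the ECC display filter:

    # volatile (since last driver load) and aggregate (lifetime) ECC counters for all GPUs
    nvidia-smi -q -d ECC
    # the same report restricted to one GPU, e.g. GPU3
    nvidia-smi -q -d ECC -i 3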

From the HW point of view, is it not a problem even if there are many Volatile/Aggregate Single Bit ECC errors?
Additionally, isn’t the Retired Pages Single Bit ECC count related to bad HW?

Of course it’s HW; the memory is degraded. You effectively asked how degraded it is, and my answer to that was: insufficient data. Having 751 errors while doing 10^30 transfers might be acceptable; while doing 10^3 transfers, definitely not.
Since those ECC errors might also have an external cause (e.g. EM interference from a different device next to it), the device should be removed, tested in a different system, and replaced if the memory still fails.
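
Before moving the card, note its identifiers so the counters can be matched to the same physical board in the other system. A quick way, assuming nvidia-smi is installed on both machines:

    # list GPUs with their UUIDs (stable across systems)
    nvidia-smi -L
    # full report for GPU3, filtered to the serial number and UUID lines
    nvidia-smi -q -i 3 | grep -iE "serial|uuid"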

  1. The supplier told us that ECC error counts are not grounds for an RMA, since single bit errors are detected and corrected. Is that right?

  2. If we run cuda-memtest, what information could we get?
    Could we decide, depending on the result of cuda-memtest, whether the GPU is defective or not?

  3. We removed the V100 GPU and installed it in a different system.
    I ran FieldDiag and it reported “PASS”, and the ECC error counts stayed the same.

Correct, single bit errors are detected and corrected; persistently defective memory cells will also be disabled (aggregate errors / retired pages), see:
https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html
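
You can also see the retired pages directly on the device. A minimal check, assuming a driver that exposes the page retirement section in nvidia-smi:

    # retired page counts (single/double bit ECC) and whether a retirement is still pending
    nvidia-smi -q -d PAGE_RETIREMENT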

I don’t know how you used that card before, if at all, so you should run cuda-memtest so that the full memory gets used and tested and the full picture shows up; maybe there are also broken cells that simply haven’t been used so far. Even then, this is only a snapshot, and you’ll have to watch it over time. If the error numbers don’t rise, it’s a minimal defect.

Thanks. I’m trying to run cuda-memtest and will update you once I get the result.
BTW, by how much would the error numbers have to increase for you to consider the GPU card defective?

Please see this:
https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#faq-pre
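
The numbers that FAQ refers to can also be pulled in a scriptable form. A sketch, assuming the query field names listed by nvidia-smi --help-query-gpu (availability may vary by driver version):

    # retired pages per GPU: single bit ECC, double bit ECC, and pending retirements
    nvidia-smi --query-gpu=index,retired_pages.single_bit_ecc.count,retired_pages.double_bit_ecc.count,retired_pages.pending --format=csv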

I tried running it with memcheck_demo as shown in the link below.
https://docs.nvidia.com/cuda/cuda-memcheck/index.html#cuda-memcheck-tool-examples

There seems to be a problem. How can I run the memcheck test properly?
The log file is attached.

memcheck_result.log (4.3 KB)

cuda-memtest, not cuda-memcheck. cuda-memcheck is for identifying memory access errors and leaks in CUDA applications.
https://sourceforge.net/projects/cudagpumemtest/

I tried running cuda-memtest. It worked using the following command:
*command: ./cuda_memtest --stress

Is this right?
How long does this test take to complete?

cuda-memtest_ing.log (4.1 KB)

If you run it without options, it will run indefinitely. Hit Ctrl+C to stop it, then check with nvidia-smi whether the aggregate ECC error counts have increased.
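
If you prefer a bounded run, something like this works as well. A sketch, assuming GNU coreutils timeout is available; CUDA_VISIBLE_DEVICES restricts the test to the suspect GPU (index 3 here as an example, adjust to your system):

    # stress the suspect GPU only, for one hour, then stop automatically
    CUDA_VISIBLE_DEVICES=3 timeout 1h ./cuda_memtest --stress
    # afterwards compare the aggregate counters with the values noted before the run
    nvidia-smi -q -d ECC -i 3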

I ran cuda-memtest and stopped it with Ctrl+C.
Then I checked with nvidia-smi whether aggregate ECC errors had occurred.
But there was no new error; the counts are the same as before the test.

You said the aggregate ECC error counts should increase, but they did not.

Please check the “Result log” I attached.

I also have some questions.

  1. I will share the test environment below. Please check it, and reply if there is a problem with this environment or if anything needs to be changed.

    Equipped with the 2 GPUs that show ECC errors
    OS: CentOS Linux release 7.8.2003 (Core)
    NVIDIA Driver Version: 440.33.01
    CUDA Version: 10.2

  2. How long should cuda-memtest be run to get accurate results?

  3. To get accurate results, do I need to pass any additional options to cuda-memtest?

  4. As we have confirmed, if no ECC errors appear after running cuda-memtest, is the HW normal?

cuda-memtest-result.log (9.6 KB)

Depending on age and usage, the results are OK, within thresholds.
As said, I don’t know how long and to what extent you’ve been using the devices. If they are under constant, extensive use, you don’t need cuda-memtest at all; that is what ECC is for. You just need to monitor the development of the error counts.
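
For monitoring, a simple periodic log is enough. A minimal sketch, assuming the query field names listed by nvidia-smi --help-query-gpu and an output path of your choosing:

    # append a timestamped snapshot of the aggregate ECC counters every hour
    while true; do
        nvidia-smi --query-gpu=timestamp,index,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv,noheader >> ecc-history.csv
        sleep 3600
    done

If the corrected count for a GPU keeps climbing between snapshots, that is the signal to act on.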

So, is there a way to get rid of the ECC error counts?
Please let me know if there is.

No, that is historical data about the number of cells that have been found faulty and therefore disabled. The aggregate/retired-page numbers don’t have any influence on the daily operation of the device.