Cannot disable the ECC on A40

There were three A40s in my server (Ubuntu 22.04), and I successfully turned off ECC on all of them with nvidia-smi. Today I added a fourth A40 to the machine, and ECC is enabled on it by default.

I tried to turn off its ECC with nvidia-smi -e 0, but it did not work, even though the command printed:

Disabled ECC support for GPU 00000000:31:00.0.
ECC support is already Disabled for GPU 00000000:4B:00.0.
ECC support is already Disabled for GPU 00000000:B1:00.0.
ECC support is already Disabled for GPU 00000000:CA:00.0.
All done.
Reboot required.

After rebooting the machine, ECC is still enabled on the new card (GPU 0); the command had no effect.

Wed Mar 20 20:54:03 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     Off | 00000000:31:00.0 Off |                    0 |
|  0%   26C    P8              14W / 300W |     18MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     Off | 00000000:4B:00.0 Off |                  Off |
|  0%   26C    P8              23W / 300W |     18MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A40                     Off | 00000000:B1:00.0 Off |                  Off |
|  0%   26C    P8              14W / 300W |     18MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A40                     Off | 00000000:CA:00.0 Off |                  Off |
|  0%   26C    P8              23W / 300W |     18MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
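To see whether the change was actually queued for GPU 0, it may help to compare the current and pending ECC modes per GPU instead of reading the summary table. This is a minimal sketch, assuming the ecc.mode.current and ecc.mode.pending query fields are available in this driver version; the helper name ecc_pending_changes is hypothetical and not part of nvidia-smi:

```shell
#!/bin/sh
# Hypothetical helper: reads "index, bus_id, current, pending" CSV rows on stdin
# and prints one line per GPU whose pending ECC mode differs from the current one.
ecc_pending_changes() {
    awk -F', *' '$3 != $4 { printf "GPU %s (%s): current=%s pending=%s\n", $1, $2, $3, $4 }'
}

# Only query when nvidia-smi is present, so the snippet is harmless elsewhere.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,pci.bus_id,ecc.mode.current,ecc.mode.pending \
               --format=csv,noheader | ecc_pending_changes
fi
```

Run right after nvidia-smi -e 0 and before the reboot, a line for GPU 0 would suggest the request was accepted and is waiting for the reboot. If nothing is printed and GPU 0 still reports ECC as enabled after the reboot, the setting probably did not persist.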

I don’t know whether this is related to the NVLink setup, since each card is connected to one other card through NVLink:

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     SYS     SYS     0-19,40-59      0               N/A
GPU1    NV4      X      SYS     SYS     0-19,40-59      0               N/A
GPU2    SYS     SYS      X      NV4     20-39,60-79     1               N/A
GPU3    SYS     SYS     NV4      X      20-39,60-79     1               N/A

Here is the bug-report file.
nvidia-bug-report.log.gz.gz (1.0 MB)

I’m having the same problem with 2x A6000 on both driver versions 535.161.07 and 550.54.14. I’ve gone through the checks in How to enable ECC on RTX A4000 - #3 by a09a215, but nothing works.

Hello there. I have solved the problem; it may have something to do with the driver or the NVLink setup. First remove the NVLink bridge, then uninstall the current driver and manually install the latest one. After that, run nvidia-smi -e 0 and reboot to see whether ECC is now disabled. If it is, you can reinstall the NVLink bridge and check that everything works properly.
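The nvidia-smi part of these steps could be sketched roughly as below. This is an assumption-laden outline, not the exact commands used: the helper names run, disable_ecc and check_ecc are hypothetical, the bus ID is the new card's address from the first post, and removing the bridge and reinstalling the driver are left out because they are physical or distribution-specific. By default the script only prints the commands (DRY_RUN=1):

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Request ECC off for one GPU, selected by PCI bus ID or index.
# The change only takes effect after a reboot.
disable_ecc() {
    run sudo nvidia-smi -i "$1" -e 0
}

# After the reboot: print the current ECC mode of the same GPU.
check_ecc() {
    run nvidia-smi -i "$1" --query-gpu=ecc.mode.current --format=csv,noheader
}

disable_ecc 00000000:31:00.0
# ... reboot, then:
# check_ecc 00000000:31:00.0
```

Targeting a single GPU with -i avoids touching the three cards that already have ECC disabled. Set DRY_RUN=0 to actually run the commands.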

Hope this can help you.
