My server (Ubuntu 22.04) had three A40s, and I successfully disabled ECC on them through nvidia-smi. Today I added a fourth A40 to the machine; its ECC is enabled by default.
I tried to disable its ECC with nvidia-smi -e 0, but it had no effect, even though the command reported success:
Disabled ECC support for GPU 00000000:31:00.0.
ECC support is already Disabled for GPU 00000000:4B:00.0.
ECC support is already Disabled for GPU 00000000:B1:00.0.
ECC support is already Disabled for GPU 00000000:CA:00.0.
All done.
Reboot required.
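In hindsight, I could have checked whether the change was actually latched before rebooting; a minimal check, assuming the new card is GPU index 0, would be:

nvidia-smi -q -d ECC -i 0

which should report an ECC Mode section with Current: Enabled and Pending: Disabled if the setting took effect.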
After restarting the machine, ECC is still enabled on the new GPU; the command had no effect:
Wed Mar 20 20:54:03 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 Off | 00000000:31:00.0 Off | 0 |
| 0% 26C P8 14W / 300W | 18MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 Off | 00000000:4B:00.0 Off | Off |
| 0% 26C P8 23W / 300W | 18MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A40 Off | 00000000:B1:00.0 Off | Off |
| 0% 26C P8 14W / 300W | 18MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A40 Off | 00000000:CA:00.0 Off | Off |
| 0% 26C P8 23W / 300W | 18MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
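Note that GPU 0 still shows an error counter (0) rather than Off in the Volatile Uncorr. ECC column, and its usable memory is reduced (46068MiB vs. 49140MiB on the other three cards), both consistent with ECC remaining enabled. My original command did not target a specific GPU; a per-GPU retry, which I have not yet confirmed behaves any differently, would look like:

sudo nvidia-smi -i 0 -e 0
# or, addressing the card by its PCI bus ID:
sudo nvidia-smi -i 00000000:31:00.0 -e 0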
I don’t know if this is caused by the NVLink configuration, since each card is linked to one other card over NVLink (nvidia-smi topo -m output):
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV4 SYS SYS 0-19,40-59 0 N/A
GPU1 NV4 X SYS SYS 0-19,40-59 0 N/A
GPU2 SYS SYS X NV4 20-39,60-79 1 N/A
GPU3 SYS SYS NV4 X 20-39,60-79 1 N/A
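In case NVLink does matter here, the per-link state can be inspected with:

nvidia-smi nvlink -s

though I have no evidence yet that the links affect the ECC setting.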
Here is the bug-report file.
nvidia-bug-report.log.gz (1.0 MB)