How to enable ECC on RTX A4000

Hi all,

I’m trying to enable ECC on RTX A4000 on Ubuntu 22.04, but the following two approaches have failed.

From NVIDIA Settings: After launching it with “sudo /bin/nvidia-settings”, I could tick the “Enable ECC” check box. But after a reboot, the ECC status was still “Disabled.”

From nvidia-smi: Running “sudo nvidia-smi -g 1 -e 1” reported “Enabled ECC support for GPU (id). All done. Reboot required.” But after a reboot, the ECC status was still “Off.”

I also noticed that, in both cases, nvidia-smi reports the ECC status of the A4000 as “Off*” (with an asterisk). I therefore guess that both operations do request ECC to be turned on, but the reboot resets the setting.

Do you have any ideas or alternative approaches for turning ECC on for these GPUs?

Specifications
OS: Ubuntu 22.04.2 LTS
Motherboard: Supermicro X10SRA
CPU: Intel Xeon E5-2697-v3
Memory: 128GB RDIMM (ECC enabled)
GPU: RTX A4000 x 2 & T400 x 1
NVIDIA Driver version: 530.30.02
nvidia-smi version: 530.30.02
NVML version: 12.530.30.02

The T400 drives the monitor. The two RTX A4000s will be used for compute.
Any help would be greatly appreciated.

Kai

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Hi generix,

Thank you for your response. Here is the nvidia-bug-report.log.gz created on the machine.

Thank you for your help,
Kai


nvidia-bug-report.log.gz (909 KB)

Please run
sudo nvidia-smi -i 1 -e 1
without rebooting, then post the output of
nvidia-smi -q -d ECC
and check whether the “pending” mode changes to enabled.
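For anyone who wants to script this check, here is a minimal Python sketch that extracts the “Current” and “Pending” ECC modes per GPU from the text printed by “nvidia-smi -q -d ECC”. The function names (read_ecc_report, parse_ecc_modes, pending_changes) are made up for illustration, and the parser assumes the report layout shown in this thread; other driver versions may print it differently.

```python
import re
import subprocess


def read_ecc_report() -> str:
    """Return the text printed by `nvidia-smi -q -d ECC` (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "-q", "-d", "ECC"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_ecc_modes(report: str) -> dict:
    """Map each GPU bus ID to its 'current' and 'pending' ECC mode strings."""
    modes = {}
    gpu = None
    in_ecc_mode = False
    for raw in report.splitlines():
        line = raw.strip()
        match = re.match(r"GPU\s+([0-9A-Fa-f:.]+)$", line)
        if match:
            # A new per-GPU section starts, e.g. "GPU 00000000:03:00.0".
            gpu = match.group(1)
            modes[gpu] = {}
            in_ecc_mode = False
        elif line == "ECC Mode":
            in_ecc_mode = True
        elif line == "ECC Errors":
            in_ecc_mode = False
        elif gpu and in_ecc_mode and ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            if key in ("Current", "Pending"):
                modes[gpu][key.lower()] = value
    return modes


def pending_changes(modes: dict) -> list:
    """Return the bus IDs whose pending ECC mode differs from the current one."""
    return [
        gpu for gpu, mode in modes.items()
        if mode.get("current") != mode.get("pending")
    ]
```

On a machine with the driver installed, pending_changes(parse_ecc_modes(read_ecc_report())) should list the GPUs that are waiting for a reboot to apply an ECC change. GPUs that report N/A for both modes are not listed.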

Following the instructions, I got output A and found that the “Pending” mode of the second GPU had changed to “Enabled.”
I then rebooted the system and ran “nvidia-smi -q -d ECC” again, which gave output B: “Pending” on the second GPU had reverted to “Disabled.”

Output A (before reboot)
==============NVSMI LOG==============

Timestamp : Fri Mar 10 02:36:16 2023
Driver Version : 530.30.02
CUDA Version : 12.1

Attached GPUs : 3
GPU 00000000:01:00.0
    ECC Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A

GPU 00000000:03:00.0
    ECC Mode
        Current : Disabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A

GPU 00000000:04:00.0
    ECC Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A

Output B (after reboot)
==============NVSMI LOG==============

Timestamp : Fri Mar 10 02:42:46 2023
Driver Version : 530.30.02
CUDA Version : 12.1

Attached GPUs : 3
GPU 00000000:01:00.0
    ECC Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A

GPU 00000000:03:00.0
    ECC Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A

GPU 00000000:04:00.0
    ECC Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A


That’s definitely incorrect behaviour. The only thing I noticed in the logs is that you’re running nvidia-persistenced without options, i.e. with persistence mode enabled, which shouldn’t be used (except on specific systems). Please disable nvidia-persistenced, check in the output of nvidia-smi -q that persistence mode is disabled, then try changing the ECC state again.
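As a rough sketch of that check, persistence mode and both ECC modes can be read in one call through nvidia-smi’s CSV query interface and parsed in Python. The field names (index, name, persistence_mode, ecc.mode.current, ecc.mode.pending) are the ones listed by “nvidia-smi --help-query-gpu”. The helper names are hypothetical, and the exact placeholder printed for GPUs without ECC support (such as the T400) is an assumption that may vary by driver.

```python
import csv
import io
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
QUERY_FIELDS = "index,name,persistence_mode,ecc.mode.current,ecc.mode.pending"


def query_gpu_state() -> str:
    """Return one CSV row per GPU (needs an NVIDIA driver to run)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_gpu_state(csv_text: str) -> list:
    """Turn the CSV rows into dictionaries keyed by short field names."""
    keys = ["index", "name", "persistence_mode", "ecc_current", "ecc_pending"]
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(keys, (field.strip() for field in record))))
    return rows


def gpus_with_persistence(rows: list) -> list:
    """Return the indices of GPUs that still report persistence mode as Enabled."""
    return [row["index"] for row in rows if row["persistence_mode"] == "Enabled"]
```

If gpus_with_persistence(parse_gpu_state(query_gpu_state())) returns an empty list, persistence mode is off on every GPU and the ECC change can be retried.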

Hi generix,

Thank you so much for your further help, and sorry for the late reply. In short, I was able to turn ECC on. Below are the steps I took, in case they help others debug (or avoid) this issue.

  1. I disabled persistence mode as suggested, using “nvidia-smi -pm 0”. This changed the “Persistence Mode” rows in “nvidia-smi -q” to “Disabled.”
  2. Without rebooting, I ran “nvidia-smi -g 1 -e 1”, and the “Pending” ECC state changed to “Enabled.”
  3. After a reboot, “Persistence Mode” was back to “Enabled,” and both the current and pending ECC states were “Disabled.”
  4. Since I did not know when driver version 530 had been installed (I had installed 525 manually), I removed driver 530 and installed version 525 (the “-server” variant; today I could not find a 525 package without “-server” or “-open”).
  5. Right after installing driver 525, persistence mode was “Disabled.” I enabled ECC on both A4000s with “nvidia-smi -g 1 -e 1” and “nvidia-smi -g 2 -e 1”.
  6. After a reboot, I confirmed that both the current and pending ECC states were still “Enabled.”
  7. I then installed CUDA 11.8, which changed the driver version to 530.
  8. After a reboot, persistence mode was on again, but the ECC states remained “Enabled.” I confirmed this in both nvidia-smi and the nvidia-settings GUI.
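The nvidia-smi commands from steps 1, 2 and 5 can be collected into a small helper. This is only a sketch of the command sequence, not a confirmed fix: in this thread ECC only stuck after the driver was reinstalled. The function names are hypothetical, the GPU indices 1 and 2 are the two A4000s on this particular machine, and the “-i” selector from the earlier reply is used in place of “-g”. The helper only builds and prints the commands; running them requires root and a reboot afterwards.

```python
import shlex


def build_ecc_commands(gpu_indices, disable_persistence=True):
    """Build the nvidia-smi invocations as argument lists, without running them."""
    commands = []
    if disable_persistence:
        # Step 1: turn persistence mode off.
        commands.append(["nvidia-smi", "-pm", "0"])
    for index in gpu_indices:
        # Steps 2 and 5: request ECC on one GPU; it takes effect after a reboot.
        commands.append(["nvidia-smi", "-i", str(index), "-e", "1"])
    return commands


def render(commands):
    """Format each argument list as a shell-safe command line for review."""
    return [shlex.join(command) for command in commands]


if __name__ == "__main__":
    # Indices 1 and 2 are assumptions taken from this machine's GPU layout.
    for line in render(build_ecc_commands([1, 2])):
        print("sudo " + line)
```

Printing the commands first makes it easy to review them before running anything as root.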

Thank you again for your time and support.
Kai
