How to enable ECC on RTX A4000

Hi all,

I’m trying to enable ECC on RTX A4000 on Ubuntu 22.04, but the following two approaches have failed.

From NVIDIA Settings: After launching it with “sudo /bin/nvidia-settings,” I could tick the “Enable ECC” check box. But after a reboot, the ECC status remained “Disabled.”

From nvidia-smi: When I ran “sudo nvidia-smi -g 1 -e 1,” the command reported “Enabled ECC support for GPU (id). All done. Reboot required.” But after a reboot, the ECC status remained “Off.”

I also noticed that, for both cases, nvidia-smi reports “Off* (with an asterisk)” as the ECC status of A4000. Therefore, I guess both operations are expected to turn the ECC on, but a reboot resets it.

Do you have any ideas or alternative approaches for turning ECC on for these GPUs?

Specifications
OS: Ubuntu 22.04.02 LTS
Motherboard: Supermicro X10SRA
CPU: Intel Xeon E5-2697-v3
Memory: 128GB RDIMM (ECC enabled)
GPU: RTX A4000 x 2 & T400 x 1
NVIDIA Driver version: 530.30.02
nvidia-smi version: 530.30.02
NVML version: 12.530.30.02

T400 is used for a monitor. Two RTX A4000 will be used for computing purposes.
Any help would be greatly appreciated.

Kai

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Hi generix,

Thank you for your response. Here is the nvidia-bug-report.log.gz created on the machine.

Thank you for your help,
Kai

On Thu, Mar 9, 2023 at 16:58, generix via NVIDIA Developer Forums <notifications@nvidia.discoursemail.com> wrote:

nvidia-bug-report.log.gz (909 KB)

Please run
sudo nvidia-smi -i 1 -e 1
without rebooting, then post the output of
nvidia-smi -q -d ECC
and check whether the “pending” mode changes to enabled.

Following your instructions, I got output A and found that the “Pending” mode of the second GPU had changed to “Enabled.”
I then rebooted the system and ran “nvidia-smi -q -d ECC” again, which gave output B. The “Pending” mode of the second GPU had changed back to “Disabled.”

Output A (before reboot)
==============NVSMI LOG==============

Timestamp : Fri Mar 10 02:36:16 2023
Driver Version : 530.30.02
CUDA Version : 12.1

Attached GPUs : 3
GPU 00000000:01:00.0
    ECC Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A

GPU 00000000:03:00.0
    ECC Mode
        Current : Disabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A

GPU 00000000:04:00.0
    ECC Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A

Output B (after reboot)
==============NVSMI LOG==============

Timestamp : Fri Mar 10 02:42:46 2023
Driver Version : 530.30.02
CUDA Version : 12.1

Attached GPUs : 3
GPU 00000000:01:00.0
    ECC Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A

GPU 00000000:03:00.0
    ECC Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A

GPU 00000000:04:00.0
    ECC Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
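
For anyone comparing long outputs like A and B, here is a minimal sketch (not part of the original posts) of a shell helper that condenses “nvidia-smi -q -d ECC” output to one line per GPU. The function name ecc_modes is hypothetical, and the sketch assumes the ECC-only report, where “Current” and “Pending” each appear once per GPU; the full “nvidia-smi -q” report has other “Current” rows and would confuse it.

```shell
# Summarize "nvidia-smi -q -d ECC" output (read from stdin) as one line per GPU:
#   <PCI bus id> current=<mode> pending=<mode>
# ecc_modes is a hypothetical helper name, not an NVIDIA tool.
ecc_modes() {
    awk '
        # "GPU 00000000:03:00.0" starts a new per-GPU section.
        /^[ \t]*GPU [0-9A-Fa-f]+:/ { gpu = $2 }
        # Remember the current ECC mode (text after the first colon).
        /^[ \t]*Current[ \t]*:/ { sub(/^[^:]*:[ \t]*/, ""); cur = $0 }
        # The pending mode closes the pair, so print the summary line here.
        /^[ \t]*Pending[ \t]*:/ {
            sub(/^[^:]*:[ \t]*/, "")
            printf "%s current=%s pending=%s\n", gpu, cur, $0
        }
    '
}
```

It would be used as “nvidia-smi -q -d ECC | ecc_modes”. For the second GPU in output A, this would print “00000000:03:00.0 current=Disabled pending=Enabled”.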

On Thu, Mar 9, 2023 at 18:31, generix via NVIDIA Developer Forums <notifications@nvidia.discoursemail.com> wrote:

That’s really incorrect behaviour. The only thing I noticed in the logs is that you’re running nvidia-persistenced without options, i.e. using persistence mode, which shouldn’t be used (except on specific systems). Please disable nvidia-persistenced, check with nvidia-smi -q that persistence mode is disabled, then try again to change the ECC state.
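
As a quicker check than reading the whole “nvidia-smi -q” report, here is a minimal sketch (my own, not from the thread) that lists the GPUs still in persistence mode. It assumes the “index” and “persistence_mode” fields of “nvidia-smi --query-gpu”, which as far as I know print lines like “1, Enabled” with “--format=csv,noheader”. The function name persistence_enabled_gpus is hypothetical.

```shell
# Read "index, persistence_mode" CSV lines on stdin and print the index of
# every GPU whose persistence mode is still Enabled.
# Intended input (assumption about the query fields):
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
persistence_enabled_gpus() {
    # Fields are separated by a comma followed by optional spaces.
    awk -F', *' '$2 == "Enabled" { print $1 }'
}
```

If the pipeline “nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader | persistence_enabled_gpus” prints nothing, persistence mode is off on all GPUs.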

Hi generix,

Thank you so much for your further help, and sorry for this late reply. First, I was able to turn ECC on. Below are the steps I took, provided only to help debug (and avoid) this issue.

  1. I tried to disable persistence mode as suggested, using “nvidia-smi -pm 0.” This changed the “Persistence Mode” rows in “nvidia-smi -q” to “Disabled.”
  2. Without rebooting, I ran “nvidia-smi -g 1 -e 1,” and the “Pending” ECC state changed to “Enabled.”
  3. After a reboot, “Persistence Mode” was back to “Enabled,” and both the current and pending ECC states were “Disabled.”
  4. Since I did not know when driver version 530 had been installed (I had installed 525 manually), I removed driver version 530 and installed version 525 (the “-server” variant; I could not find a package without “-server” or “-open” today).
  5. Right after installing driver version 525, persistence mode was “Disabled.” I turned ECC on with “nvidia-smi -g 1 -e 1” (and likewise for GPU 2).
  6. After a reboot, I confirmed that the current and pending ECC states remained “Enabled.”
  7. I then installed CUDA 11.8. This installation changed the driver version to 530.
  8. After a reboot, persistence mode was turned on again, but the ECC states remained “Enabled.” This could be confirmed from both nvidia-smi and the nvidia-settings GUI.

Thank you again for your time and support.
Kai

On Thu, Mar 9, 2023 at 19:29, generix via NVIDIA Developer Forums <notifications@nvidia.discoursemail.com> wrote:

I opened a bug on this issue. See below for the fix and for why this is happening:

This issue is seen with Ubuntu packaged drivers because they enable kernel modesetting by default (nvidia-drm loaded with modeset=1). The workaround is to unload NVIDIA driver modules manually before the ECC state can be changed.

NVIDIA Engineering is aware of the issue. With kernel modesetting enabled, NVIDIA kernel modules like nvidia-drm and nvidia-modeset are treated as applications running on the GPU. This blocks GPU reset and switching ECC modes. Engineering is working on a long term fix which will list all applications and modules that may potentially block the reset.
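
To see whether a given system is in this situation, here is a minimal sketch (an illustration, not an official NVIDIA check). It assumes the nvidia-drm module exposes its modeset parameter at /sys/module/nvidia_drm/parameters/modeset as “Y” or “N”, and that the file may only be readable by root. The function name nvidia_drm_modeset_state is hypothetical.

```shell
# Print "enabled", "disabled", or "unknown" for nvidia-drm kernel modesetting.
# "unknown" means the module is not loaded or the parameter file cannot be
# read (try again as root). The optional argument overrides the sysfs path.
nvidia_drm_modeset_state() {
    param_file=${1:-/sys/module/nvidia_drm/parameters/modeset}
    if [ ! -r "$param_file" ]; then
        echo "unknown"
        return 0
    fi
    case "$(cat "$param_file")" in
        Y|1) echo "enabled" ;;
        *)   echo "disabled" ;;
    esac
}
```

If this prints “enabled”, the unload workaround described here is likely needed before the ECC state can be changed.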

For driver installations that are not from Ubuntu packages, the default steps (nvidia-smi -e 0/1 followed by a reboot or driver reset) should work.

Try the following commands and check whether the ECC state changes as expected.
On some systems, nvidia-modeset can block the driver reset and the ECC toggle. Please unload it manually and retry.

# rmmod nvidia_drm 
# rmmod nvidia_modeset 
# nvidia-smi -r
# nvidia-smi -q -d ECC
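
The commands above can be wrapped in a small script with a dry-run switch, so the sequence can be reviewed before anything is unloaded. This is only a sketch under assumptions: the function names and the DRY_RUN variable are hypothetical, and placing the “nvidia-smi -e” step between the module unload and the reset is my reading of the explanation, not a confirmed procedure. It must run as root, and rmmod will fail while a display server is still using the modules.

```shell
# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run_or_print() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: ecc_toggle_workaround <gpu-index> <0|1>
# Unload the modesetting modules, set the ECC mode, reset the GPU, then
# show the resulting ECC state. Stops at the first failing step.
ecc_toggle_workaround() {
    gpu=$1
    mode=$2
    run_or_print rmmod nvidia_drm &&
    run_or_print rmmod nvidia_modeset &&
    run_or_print nvidia-smi -i "$gpu" -e "$mode" &&
    run_or_print nvidia-smi -r -i "$gpu" &&
    run_or_print nvidia-smi -q -d ECC -i "$gpu"
}
```

Setting DRY_RUN=1 before calling “ecc_toggle_workaround 1 1” only prints the five commands for GPU 1, which makes it easy to check them first.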

I am still having the same problem: I am unable to turn on ECC on Ubuntu 22.04.
NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6

Hi wortiz,

Thank you for the solution, and I apologize for this late reply. I noticed this post yesterday.
Due to an unrelated problem, I reinstalled Windows on the same machine, so I no longer have an Ubuntu machine. I may set one up in the future, and I will try this solution then.

Kai