Nvidia Tesla P100 keeps throwing ECC errors

Hi,

I am trying to set up an Nvidia Tesla P100 16 GB in my workstation running Ubuntu 20.04. Below are the specs of the machine:

Brand: HP Z440 Workstation
CPU: Intel Xeon E5-2666 v3
RAM: 48 GB DDR4 2133 MHz
BIOS: M60 v02.50 11/07/2019
SSD: PCIe SSD 500GB M2
GPU: Nvidia Tesla P100
Graphics card (for output): Radeon HD 7470
PSU: 700W
OS: Ubuntu 20.04.6 LTS
Kernel: 5.15.0-113-generic

I installed a fresh copy of Ubuntu 20.04.6 LTS on the machine, enabling the installation of third-party drivers for graphics and Wi-Fi (to speed up the driver installation for the P100).

I also installed the CUDA Toolkit using the runfile cuda_12.2.0_535.54.03_linux.run to get nvcc on the system.

When I check the status of the GPU with nvidia-smi, persistence mode is on, but I see volatile ECC errors and ERR! for the power usage (see the nvidia-smi -q output further down).

Note that I have commented out all lines in /usr/share/X11/xorg.conf.d/10-nvidia.conf to stop Xorg from showing up in the list of processes in nvidia-smi.

Then, when I compile (successfully) the CUDA sample for matrix transposition (cuda-samples/Samples/6_Performance/transpose in the NVIDIA/cuda-samples repository on GitHub) and try running it, I get the following error:

CUDA error at …/…/…/Common/helper_cuda.h:888 code=214(cudaErrorECCUncorrectable) “cudaSetDevice(devID)”

After checking the gpu log with nvidia-smi -q, here is the output:

‘’’
==============NVSMI LOG==============

Timestamp : Mon Jul 1 13:35:35 2024
Driver Version : 535.183.01
CUDA Version : 12.2

Attached GPUs : 1
GPU 00000000:02:00.0
Product Name : Tesla P100-PCIE-16GB
Product Brand : Tesla
Product Architecture : Pascal
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321217079833
GPU UUID : GPU-f1467e2e-9d9f-1362-3aea-feacec6b2638
Minor Number : 0
VBIOS Version : 86.00.3A.00.03
MultiGPU Board : No
Board ID : 0x200
Board Part Number : 900-2H400-0000-000
GPU Part Number : 15F8-892-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : Unknown Error
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : 2024/07/01 13:35:35.869
Latest Duration : 41 us
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : Yes
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x15F810DE
Bus Id : 00000000:02:00.0
Sub System Id : 0x118F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Device Current : 3
Device Max : 3
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 3348000 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 16384 MiB
Reserved : 107 MiB
Used : 0 MiB
Free : 16276 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : N/A
OFA : N/A
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 2
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 2
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 135
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 135
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 8
Pending Page Blacklist : Yes
Remapped Rows : N/A
Temperature
GPU Current Temp : 45 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : Unknown Error
Current Power Limit : 250.00 W
Requested Power Limit : 250.00 W
Default Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : Unknown Error
SM : Unknown Error
Memory : 715 MHz
Video : Unknown Error
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
Processes : None
‘’’
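The counters that matter in a log this long are easy to miss. Volatile ECC counts are cleared on driver reload (and by --reset-ecc-errors=0), while aggregate counts and retired pages persist on the board. As a rough sketch, the relevant fields can be pulled out of the `nvidia-smi -q` text with a few lines of Python. The function name `summarize_ecc` and the key names in the returned dictionary are made up for illustration, and the parser assumes the field labels shown in the log above (they can differ between driver versions). It does not rely on indentation, since that is lost in the pasted output.

```python
# Excerpt of the `nvidia-smi -q` output posted above (indentation lost in the paste).
SAMPLE = """\
GPU Reset Status
Reset Required : Yes
ECC Errors
Volatile
Single Bit
Device Memory : 0
Total : 0
Double Bit
Device Memory : 2
Total : 2
Aggregate
Single Bit
Device Memory : 0
Total : 0
Double Bit
Device Memory : 135
Total : 135
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 8
Pending Page Blacklist : Yes
"""


def summarize_ecc(log_text):
    """Collect ECC totals and page-retirement fields from `nvidia-smi -q` text.

    Hypothetical helper: field labels are assumed to match the log above.
    """
    summary = {}
    scope = None        # "Volatile" or "Aggregate"
    kind = None         # "Single Bit" or "Double Bit"
    in_retired = False  # True while inside the "Retired Pages" section

    def as_count(text):
        # Counters may be reported as "N/A"; map anything non-numeric to None.
        return int(text) if text.isdigit() else None

    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep:
            # A line without ":" is a section header.
            if key in ("Volatile", "Aggregate"):
                scope, kind, in_retired = key, None, False
            elif key in ("Single Bit", "Double Bit") and scope:
                kind = key
            elif key == "Retired Pages":
                scope, kind, in_retired = None, None, True
            else:
                scope, kind, in_retired = None, None, False
            continue
        if scope and kind and key == "Total":
            summary[f"{scope} {kind} Total"] = as_count(value)
        elif in_retired and key in ("Single Bit ECC", "Double Bit ECC"):
            summary[f"Retired {key}"] = as_count(value)
        elif in_retired and key == "Pending Page Blacklist":
            summary[key] = value
        elif key == "Reset Required":
            summary[key] = value
    return summary


if __name__ == "__main__":
    for name, value in summarize_ecc(SAMPLE).items():
        print(f"{name}: {value}")
```

On a live system the same function could be fed the output of `subprocess.check_output(["nvidia-smi", "-q"], text=True)`. For this card the summary shows double-bit errors in both the volatile and aggregate counters, retired pages, and a pending page blacklist.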

Based on the error, I:

  1. cleared the ECC errors with sudo nvidia-smi --reset-ecc-errors=0, which printed:

Reset volatile ECC errors to zero for GPU 00000000:02:00.0.
All done.

  2. tried to reset the GPU, but got: GPU 00000000:02:00.0 is currently in use by another process.
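The reset attempt is refused because something still has the device open. One way to find out what, sketched below in Python, is to walk /proc and list every process with an open file descriptor under /dev/nvidia*, which is roughly what fuser or lsof would report. The function name `device_holders` and its parameters are hypothetical. It needs to run as root to see other users' processes. A plausible holder here is the nvidia-persistenced daemon or Xorg, but that is an assumption and not something the log confirms.

```python
import os


def device_holders(prefix="/dev/nvidia", proc_root="/proc"):
    """Map PID -> (process name, device paths) for processes holding NVIDIA device files.

    Hypothetical helper; `proc_root` is a parameter only so the scan can be
    pointed at a different directory for testing.
    """
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or not permitted to inspect it
        targets = set()
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith(prefix):
                targets.add(target)
        if targets:
            try:
                with open(os.path.join(proc_root, entry, "comm")) as f:
                    name = f.read().strip()
            except OSError:
                name = "?"
            holders[int(entry)] = (name, sorted(targets))
    return holders
```

Calling `device_holders()` as root returns an empty dictionary when nothing has the GPU open. If it lists a service, stopping that service before retrying the reset is the obvious next thing to try.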

I tried re-running the transposition code from the CUDA Samples and, unfortunately, got the same error (cudaErrorECCUncorrectable).

I also enabled persistence mode, both manually with nvidia-smi -pm 1 and by enabling the nvidia-persistenced service, but it did not change anything.

I have tried different scenarios:

  1. Fresh install of Ubuntu without third party drivers and drivers installed from source (Nvidia) → Same issue (ECC error showing up).
  2. Installation of drivers + nvidia toolkit from run file (Cuda version 12.2). → Same issue (ECC error showing up)
  3. Also tried with Ubuntu 22.04 and Ubuntu 24.04. → Same issue.

I am not sure what next steps I should try here. Any help would be greatly appreciated as I’m starting to think that the P100 is busted and might end up in my museum of dead parts.

Thanks a lot !

Tesla P100 isn’t certified/qualified for use in a workstation (the workstation variant would have been the Quadro GP100). Among other issues, the Tesla P100 requires server flow-through cooling, which a workstation will not provide, so it will overheat quickly. It’s conceivable that high heat could cause ECC errors.

NVIDIA doesn’t recommend or support use of data center GPUs in platforms they were not qualified in.
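To check whether heat is actually a factor, the temperature can be sampled while a workload runs. The Python sketch below parses one line of `nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv,noheader,nounits` output and classifies it against the 82 C slowdown temperature reported in the log above. The function names and the 10 C warning margin are assumptions chosen for illustration. Non-numeric fields (such as an error string in place of the power draw) are treated as missing.

```python
import subprocess

# Query fields and format options as accepted by nvidia-smi's --query-gpu interface.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_sample(csv_line):
    """Return (temperature_c, power_w) from one CSV line; None for unreadable fields."""
    def to_float(text):
        try:
            return float(text)
        except ValueError:
            return None

    fields = [field.strip() for field in csv_line.split(",")]
    temp = to_float(fields[0])
    power = to_float(fields[1]) if len(fields) > 1 else None
    return temp, power


def thermal_status(temp_c, slowdown_c=82.0, margin_c=10.0):
    """Classify a reading; slowdown_c defaults to the value in the log, margin_c is arbitrary."""
    if temp_c is None:
        return "unknown"
    if temp_c >= slowdown_c:
        return "throttling"
    if temp_c >= slowdown_c - margin_c:
        return "hot"
    return "ok"


def read_once():
    """Run nvidia-smi once and return one (temperature, power) tuple per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]
```

Calling `read_once()` in a loop every few seconds during a compute job would show whether the card climbs toward the slowdown temperature. An idle reading of 45 C, as in the log, says little about behaviour under load.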

Besides potential issues with incorrectly configured cooling and/or power supply interfering with proper operation of the part, a defective GPU is a realistic possibility, especially if this is a second-hand GPU of unknown provenance. This GPU could be 8+ years old, and it is certainly possible for DRAM to become defective over that period of time.