Nvidia Tesla P100 keeps throwing ECC errors

drpac · July 1, 2024, 5:58am

Hi,

I am trying to set up an Nvidia Tesla P100 16GB in my workstation running Ubuntu 20.04. Below are the specs of the machine:

Brand: HP Z440 Workstation
CPU: Intel Xeon E5-2666 v3
RAM: 48 GB DDR4 2133 MHz
BIOS: M60 v02.50 11/07/2019
SSD: PCIe SSD 500GB M2
GPU: Nvidia Tesla P100
Graphic Card (for output): Radeon HD 7470
PSU: 700W
OS: Ubuntu 20.04.6 LTS
Kernel: 5.15.0-113-generic

I have installed a fresh version of Ubuntu 20.04.6 LTS on my machine, enabling the installation of the third party drivers for Graphics and Wifi (to make the driver installation for the P100 faster).

I have also installed the CUDA Toolkit using the run file cuda_12.2.0_535.54.03_linux.run to get nvcc on the system.

When I check the status of the GPU with nvidia-smi, I see that Persistence is on but I get Volatile ECC errors and an ERR! for the power usage (see output below):

Note that I have commented all lines in /usr/share/X11/xorg.conf.d/10-nvidia.conf to stop xorg showing in the list of processes in nvidia-smi.

Then, when I compile (successfully) the CUDA Samples for matrix transposition (cuda-samples/Samples/6_Performance/transpose at master · NVIDIA/cuda-samples · GitHub) and try running it, I get the following error:

CUDA error at …/…/…/Common/helper_cuda.h:888 code=214(cudaErrorECCUncorrectable) "cudaSetDevice(devID)"

After checking the gpu log with nvidia-smi -q, here is the output:

‘’’
==============NVSMI LOG==============

Timestamp : Mon Jul 1 13:35:35 2024
Driver Version : 535.183.01
CUDA Version : 12.2

Attached GPUs : 1
GPU 00000000:02:00.0
Product Name : Tesla P100-PCIE-16GB
Product Brand : Tesla
Product Architecture : Pascal
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321217079833
GPU UUID : GPU-f1467e2e-9d9f-1362-3aea-feacec6b2638
Minor Number : 0
VBIOS Version : 86.00.3A.00.03
MultiGPU Board : No
Board ID : 0x200
Board Part Number : 900-2H400-0000-000
GPU Part Number : 15F8-892-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : Unknown Error
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : 2024/07/01 13:35:35.869
Latest Duration : 41 us
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : Yes
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x15F810DE
Bus Id : 00000000:02:00.0
Sub System Id : 0x118F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Device Current : 3
Device Max : 3
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 3348000 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 16384 MiB
Reserved : 107 MiB
Used : 0 MiB
Free : 16276 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : N/A
OFA : N/A
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 2
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 2
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 135
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 135
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 8
Pending Page Blacklist : Yes
Remapped Rows : N/A
Temperature
GPU Current Temp : 45 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : Unknown Error
Current Power Limit : 250.00 W
Requested Power Limit : 250.00 W
Default Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : Unknown Error
SM : Unknown Error
Memory : 715 MHz
Video : Unknown Error
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
Processes : None
‘’’
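
For anyone who wants to track these counters across reboots or driver changes, the ECC section of the log can be pulled out with a few lines of Python. This is only a minimal sketch: the function name parse_ecc_counts and the SAMPLE excerpt are made up for illustration, and it assumes the ECC section is laid out as in the log above (newer GPUs report ECC errors in a different layout).

```python
import re

# Excerpt copied from the log above; real `nvidia-smi -q` output is indented,
# which is why every line is stripped before it is inspected.
SAMPLE = """\
ECC Errors
Volatile
Single Bit
Device Memory : 0
L1 Cache : N/A
Double Bit
Device Memory : 2
Total : 2
Aggregate
Double Bit
Device Memory : 135
Total : 135
Retired Pages
Double Bit ECC : 8
"""

SCOPES = ("Volatile", "Aggregate")
KINDS = ("Single Bit", "Double Bit")
FIELD = re.compile(r"(.+?)\s*:\s*(\S+)")


def parse_ecc_counts(report):
    """Return {(scope, kind, location): count} for the "ECC Errors" section."""
    counts = {}
    in_ecc = False
    scope = kind = None
    for raw in report.splitlines():
        line = raw.strip()
        if line == "ECC Errors":
            in_ecc = True
        elif not in_ecc:
            continue
        elif line in SCOPES:
            scope = line
        elif line in KINDS:
            kind = line
        else:
            match = FIELD.fullmatch(line)
            if match is None:
                if line:
                    break  # next section header, e.g. "Retired Pages"
                continue
            if scope is None or kind is None:
                continue
            name, value = match.groups()
            if value.isdigit():  # skips "N/A" entries
                counts[(scope, kind, name)] = int(value)
    return counts
```

Calling parse_ecc_counts(SAMPLE) gives a dictionary in which the key ("Aggregate", "Double Bit", "Device Memory") maps to 135, the lifetime count of uncorrectable errors in device memory shown in the log.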

From the look of the error, I:

cleared the ECC errors with sudo nvidia-smi --reset-ecc-errors=0, which returned:

Reset volatile ECC errors to zero for GPU 00000000:02:00.0.
All done.

tried to reset the GPU, but got: GPU 00000000:02:00.0 is currently in use by another process.

I tried re-running the transposition code from Cuda Samples and unfortunately, I get the same error (cudaErrorECCUncorrectable).

I also enabled persistence mode, both manually with nvidia-smi -pm 1 and by enabling the nvidia-persistenced service, but it did not change anything.

I have tried different scenarios:

Fresh install of Ubuntu without third party drivers and drivers installed from source (Nvidia) → Same issue (ECC error showing up).
Installation of drivers + nvidia toolkit from run file (Cuda version 12.2). → Same issue (ECC error showing up)
Also tried with Ubuntu 22.04 and Ubuntu 24.04. → Same issue.

I am not sure what next steps I should try here. Any help would be greatly appreciated as I’m starting to think that the P100 is busted and might end up in my museum of dead parts.

Thanks a lot !

Robert_Crovella · July 2, 2024, 2:25pm

Tesla P100 isn’t certified/qualified for use in a workstation (the workstation variant would have been Quadro GP100). Among other issues, the Tesla P100 requires server flow-through cooling, which a workstation will not provide, so it will overheat quickly. It’s conceivable that high heat could cause ECC errors.

NVIDIA doesn’t recommend or support use of data center GPUs in platforms they were not qualified in.
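
One way to test the overheating hypothesis is to log the GPU temperature while a workload runs. Below is a minimal Python sketch, assuming nvidia-smi is on the PATH. The helper names (parse_temperatures, headroom, watch) are hypothetical, and the 82 C threshold is simply the "GPU Slowdown Temp" value from the log in the first post, where the card was at 45 C.

```python
import subprocess
import time

# "GPU Slowdown Temp" from the log in the first post (an assumption for
# other cards; check your own `nvidia-smi -q` output).
SLOWDOWN_TEMP_C = 82

QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Turn the query output (one integer per GPU, one per line) into ints."""
    return [int(line) for line in csv_text.splitlines() if line.strip()]


def headroom(temps, limit=SLOWDOWN_TEMP_C):
    """Degrees left before each GPU reaches the slowdown threshold."""
    return [limit - t for t in temps]


def read_temperatures():
    """Run nvidia-smi once and return the current temperature of each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)


def watch(samples=10, interval_s=5.0):
    """Print temperature and remaining headroom a fixed number of times."""
    for _ in range(samples):
        temps = read_temperatures()
        print(temps, headroom(temps))
        time.sleep(interval_s)
```

Running watch() in one terminal while the CUDA sample runs in another shows whether the temperature climbs toward the threshold before the ECC error appears. If it stays low, cooling is less likely to be the cause.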

njuffa · July 2, 2024, 4:08pm

Besides potential issues with incorrectly configured cooling and/or power supply interfering with proper operation of the part, a defective P100 is a realistic possibility, especially if this is a second-hand GPU of unknown provenance. This GPU could be 8+ years old, and it is certainly possible for DRAM to become defective over that period of time.
