Hi,
I am trying to set up an Nvidia Tesla P100 16 GB on my workstation running Ubuntu 20.04. Below are the specs of the machine:
Brand: HP Z440 Workstation
CPU: Intel Xeon E5-2666 v3
RAM: 48 GB DDR4 2133 MHz
BIOS: M60 v02.50 11/07/2019
SSD: PCIe SSD 500GB M2
GPU: Nvidia Tesla P100
Graphic Card (for output): Radeon HD 7470
PSU: 700W
OS: Ubuntu 20.04.6 LTS
Kernel: 5.15.0-113-generic
I installed a fresh copy of Ubuntu 20.04.6 LTS on the machine and enabled the installation of third-party drivers for graphics and Wi-Fi (to make the driver installation for the P100 faster).
I also installed the CUDA Toolkit from the runfile cuda_12.2.0_535.54.03_linux.run to get nvcc on the system.
When I check the status of the GPU with nvidia-smi, I see that Persistence is on, but I get volatile ECC errors and an ERR! for the power usage (see the output below).
Note that I have commented out all lines in /usr/share/X11/xorg.conf.d/10-nvidia.conf to stop Xorg from showing up in the list of processes in nvidia-smi.
Then, when I compile (successfully) the CUDA Samples for matrix transposition (cuda-samples/Samples/6_Performance/transpose at master · NVIDIA/cuda-samples · GitHub) and try running it, I get the following error:
CUDA error at …/…/…/Common/helper_cuda.h:888 code=214(cudaErrorECCUncorrectable) "cudaSetDevice(devID)"
After checking the GPU log with nvidia-smi -q, here is the output:
‘’’
==============NVSMI LOG==============
Timestamp : Mon Jul 1 13:35:35 2024
Driver Version : 535.183.01
CUDA Version : 12.2
Attached GPUs : 1
GPU 00000000:02:00.0
Product Name : Tesla P100-PCIE-16GB
Product Brand : Tesla
Product Architecture : Pascal
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321217079833
GPU UUID : GPU-f1467e2e-9d9f-1362-3aea-feacec6b2638
Minor Number : 0
VBIOS Version : 86.00.3A.00.03
MultiGPU Board : No
Board ID : 0x200
Board Part Number : 900-2H400-0000-000
GPU Part Number : 15F8-892-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : Unknown Error
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : 2024/07/01 13:35:35.869
Latest Duration : 41 us
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : Yes
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x15F810DE
Bus Id : 00000000:02:00.0
Sub System Id : 0x118F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Device Current : 3
Device Max : 3
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 3348000 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 16384 MiB
Reserved : 107 MiB
Used : 0 MiB
Free : 16276 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : N/A
OFA : N/A
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 2
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 2
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 135
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 135
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 8
Pending Page Blacklist : Yes
Remapped Rows : N/A
Temperature
GPU Current Temp : 45 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : Unknown Error
Current Power Limit : 250.00 W
Requested Power Limit : 250.00 W
Default Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : Unknown Error
SM : Unknown Error
Memory : 715 MHz
Video : Unknown Error
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
Processes : None
‘’’
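Since the log is long, here is a minimal sketch of how the fields that look relevant to the error could be pulled out of saved `nvidia-smi -q` text. The function name `summarize_ecc` and the choice of fields are my own, and the parsing assumes the Pascal-style layout shown above (a "Volatile"/"Aggregate" header followed by "Single Bit"/"Double Bit"); other GPU generations print a different ECC section.

```python
def summarize_ecc(log_text):
    """Extract ECC-related fields from `nvidia-smi -q` text (layout as in the log above)."""
    summary = {}
    scope = kind = None  # current "Volatile"/"Aggregate" and "Single Bit"/"Double Bit" headers
    for raw in log_text.splitlines():
        if not raw.strip():
            continue
        key, sep, value = (part.strip() for part in raw.partition(":"))
        if not sep:
            # A line without ":" is a section header.
            if key in ("Volatile", "Aggregate"):
                scope, kind = key, None
            elif key in ("Single Bit", "Double Bit"):
                kind = key
            else:
                scope = kind = None  # any other header ends the ECC counters
            continue
        if key == "Device Memory" and scope and kind:
            summary[f"{scope} {kind} Device Memory"] = value
        elif key in ("Reset Required", "Double Bit ECC", "Pending Page Blacklist"):
            summary[key] = value
    return summary


# Excerpt copied from the log above (indentation does not matter to the parser).
SAMPLE = """\
Reset Required : Yes
ECC Errors
Volatile
Double Bit
Device Memory : 2
Aggregate
Double Bit
Device Memory : 135
Retired Pages
Double Bit ECC : 8
Pending Page Blacklist : Yes
"""

for field, value in summarize_ecc(SAMPLE).items():
    print(f"{field}: {value}")
```

Values are kept as strings because several fields can be "N/A" rather than a number.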
Based on the error, I:
- cleared the volatile ECC errors with sudo nvidia-smi --reset-ecc-errors=0, which returned:
Reset volatile ECC errors to zero for GPU 00000000:02:00.0.
All done.
- tried to reset the GPU, but got: GPU 00000000:02:00.0 is currently in use by another process.
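To track down what the "in use by another process" message refers to, a sketch like the following could list the processes that hold a /dev/nvidia* device node open by walking /proc. The function name `find_device_holders` and its parameters are hypothetical, and it needs to run as root to see other users' processes. I have not confirmed which process is the holder here; the persistence daemon is only one candidate.

```python
import os


def find_device_holders(prefix="/dev/nvidia", proc_root="/proc"):
    """Map pid -> (command name, device paths) for processes with open fds under `prefix`."""
    holders = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return holders  # no procfs available (not Linux)
    for pid in entries:
        if not pid.isdigit():
            continue
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or not enough permissions
        devices = set()
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed in the meantime
            if target.startswith(prefix):
                devices.add(target)
        if devices:
            try:
                with open(os.path.join(proc_root, pid, "comm")) as f:
                    name = f.read().strip()
            except OSError:
                name = "?"
            holders[int(pid)] = (name, sorted(devices))
    return holders


for pid, (name, devices) in find_device_holders().items():
    print(pid, name, ", ".join(devices))
```

Any process listed would presumably have to be stopped before the reset can go through.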
I tried re-running the transpose sample from the CUDA Samples and, unfortunately, I get the same error (cudaErrorECCUncorrectable).
I also enabled persistence mode, both manually with nvidia-smi -pm 1 and by enabling the nvidia-persistenced service, but it did not change anything.
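For re-checking the counters after each attempt without reading the full log, a minimal sketch using the CSV query mode of nvidia-smi might look like this. The helper names are my own, and the field names are the ones I believe `nvidia-smi --help-query-gpu` lists for this driver; treat them as an assumption and verify them on your system first.

```python
import csv
import io
import subprocess

# Field names assumed from `nvidia-smi --help-query-gpu`; verify before relying on them.
FIELDS = [
    "ecc.errors.uncorrected.volatile.device_memory",
    "ecc.errors.uncorrected.aggregate.device_memory",
    "retired_pages.double_bit.count",
    "retired_pages.pending",
]


def build_query(gpu_index=0):
    """Build the nvidia-smi command line for the fields above."""
    return [
        "nvidia-smi",
        "-i", str(gpu_index),
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader",
    ]


def parse_row(csv_text):
    """Turn one line of `--format=csv,noheader` output into a field -> value dict."""
    row = next(csv.reader(io.StringIO(csv_text), skipinitialspace=True))
    return dict(zip(FIELDS, (cell.strip() for cell in row)))


def read_ecc_state(gpu_index=0):
    """Run nvidia-smi and parse its output (requires the driver to be installed)."""
    result = subprocess.run(
        build_query(gpu_index), capture_output=True, text=True, check=True
    )
    return parse_row(result.stdout)


# Only print the command here; call read_ecc_state() on the machine with the GPU.
print(" ".join(build_query()))
```

Comparing the volatile and aggregate values before and after the reset shows which counters the reset actually touched.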
I have tried different scenarios:
- Fresh install of Ubuntu without third-party drivers, with the drivers installed from the Nvidia package instead. → Same issue (ECC error showing up).
- Installation of the drivers + CUDA Toolkit from the runfile (CUDA version 12.2). → Same issue (ECC error showing up).
- Also tried with Ubuntu 22.04 and Ubuntu 24.04. → Same issue.
I am not sure what to try next. Any help would be greatly appreciated, as I'm starting to think that the P100 is busted and might end up in my museum of dead parts.
Thanks a lot!