Hello,
I have just installed a Tesla P4 compute card in my Manjaro (Arch Linux based) system in order to use it with the Willow Inference Server (WIS) project, among other things.
The expected behavior is that the Docker image starts up, loads the models into VRAM, warms them up, and then sits idle waiting for commands. In terms of GPU performance states, that means one should see a sequence like:
P8 → P0 → P2
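For reference, a quick way to watch the performance state, power draw and temperature over time is a small loop over the pynvml Python bindings (a minimal sketch; the one-second poll interval is arbitrary):

# Minimal GPU state watcher using the pynvml bindings (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first (and only) GPU

try:
    while True:
        pstate = pynvml.nvmlDeviceGetPerformanceState(handle)    # 0 = P0 ... 8 = P8
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"P{pstate}  {watts:5.1f} W  {temp} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()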
Sadly, in my case, the card gets “stuck” in the P0 state, drawing around 30 W while idle, and in less than 10 minutes it reaches 93 C, at which point it simply shuts down from thermal overload.
Do you have any idea what could cause the card to get stuck in P0?
nvidia-smi -q
gives me this when WIS is idling:
Driver Version : 530.41.03
CUDA Version : 12.1

Attached GPUs : 1
GPU 00000000:01:00.0
    Product Name : Tesla P4
    Product Brand : Tesla
    Product Architecture : Pascal
    Display Mode : Enabled
    Display Active : Disabled
    Persistence Mode : Disabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0420218024108
    GPU UUID : GPU-701701f0-6d5b-67c5-3371-afbf742d22b0
    Minor Number : 0
    VBIOS Version : 86.04.55.00.01
    MultiGPU Board : No
    Board ID : 0x100
    Board Part Number : 900-2G414-0000-000
    GPU Part Number : 1BB3-895-A1
    FRU Part Number : N/A
    Module ID : 1
    Inforom Version
        Image Version : G414.0200.00.03
        OEM Object : 1.1
        ECC Object : 4.1
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    GPU Reset Status
        Reset Required : No
        Drain and Reset Recommended : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x01
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1BB310DE
        Bus Id : 00000000:01:00.0
        Sub System Id : 0x11D810DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
                Device Current : 3
                Device Max : 3
                Host Max : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
        Atomic Caps Inbound : N/A
        Atomic Caps Outbound : N/A
    Fan Speed : N/A
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : Not Active
        HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 7680 MiB
        Reserved : 73 MiB
        Used : 3956 MiB
        Free : 3650 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 2 MiB
        Free : 254 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    ECC Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory : 0
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : 0
        Aggregate
            Single Bit
                Device Memory : 0
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : 0
    Retired Pages
        Single Bit ECC : 0
        Double Bit ECC : 0
        Pending Page Blacklist : No
    Remapped Rows : N/A
    Temperature
        GPU Current Temp : 61 C
        GPU Shutdown Temp : 94 C
        GPU Slowdown Temp : 91 C
        GPU Max Operating Temp : N/A
        GPU Target Temperature : N/A
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 27.53 W
        Power Limit : 75.00 W
        Default Power Limit : 75.00 W
        Enforced Power Limit : 75.00 W
        Min Power Limit : 60.00 W
        Max Power Limit : 75.00 W
    Clocks
        Graphics : 1113 MHz
        SM : 1113 MHz
        Memory : 2999 MHz
        Video : 999 MHz
    Applications Clocks
        Graphics : 885 MHz
        Memory : 3003 MHz
    Default Applications Clocks
        Graphics : 885 MHz
        Memory : 3003 MHz
    Deferred Clocks
        Memory : N/A
    Max Clocks
        Graphics : 1531 MHz
        SM : 1531 MHz
        Memory : 3003 MHz
        Video : 1379 MHz
    Max Customer Boost Clocks
        Graphics : 1113 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : N/A
    Fabric
        State : N/A
        Status : N/A
    Processes
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 175366
            Type : C
            Name : gunicorn: worker [main:app]
            Used GPU Memory : 3954 MiB
I understand that it might well be WIS that is at fault here, but to rule that out, can you suggest a Docker image that I could use to test the same kind of behavior? Something that starts up, loads a model into VRAM, and then sits idle without computing anything on the GPU.
I tried the nbody sample Docker image, but it exits after finishing its computation, so the GPU drops back to P8 and the test is inconclusive.
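To illustrate what I mean, something along these lines would be ideal: allocate a chunk of VRAM, then hold the CUDA context while doing nothing. A minimal sketch, assuming a CUDA-enabled PyTorch image such as pytorch/pytorch (untested):

# idle_vram.py -- allocate VRAM, then sit idle while holding the CUDA context,
# mimicking an inference server that has loaded its models and is waiting.
# Run with something like:
#   docker run --rm --gpus all -v "$PWD":/work pytorch/pytorch python /work/idle_vram.py
import time
import torch

# ~2 GiB of float32 zeros standing in for loaded model weights.
buf = torch.zeros(2 * 1024**3 // 4, dtype=torch.float32, device="cuda")
torch.cuda.synchronize()
print(f"Holding {buf.numel() * buf.element_size() // 2**20} MiB on {torch.cuda.get_device_name(0)}")

# No kernels from here on; only the context and the allocation stay alive.
while True:
    time.sleep(60)

If the card drops back to P8 while this runs, the problem is likely on the WIS side; if it stays pinned in P0, that would point at the card being kept clocked up whenever a CUDA context is alive.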
Any suggestion is most welcome, as I’m a bit lost as to what I’m missing here.
Regards