I have a system running RHEL 7.1 with CUDA 7 and two Tesla K40m GPUs installed. It was working normally until a user noticed that no job could make progress, although nvidia-smi always shows utilization at 99% (GPU 0 in the output below). That persists even after resetting both units 0 and 1 (nvidia-smi -r). We have not rebooted the server, and I'd prefer not to.
Any ideas?
Running a simple hello-world CUDA program under strace shows it stuck in a loop of these two calls:
ioctl(3, 0xc020462a, 0x7fff01a2d180) = 0
nanosleep({1, 0}, NULL) = 0
Here fd 3 is /dev/nvidiactl, according to /proc.
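For anyone reading the trace, the request number in the ioctl line can be split into its fields using the generic Linux _IOC bit layout. This is a minimal sketch, assuming the standard x86-64 encoding (2-bit direction, 14-bit size, 8-bit type, 8-bit command number); `decode_ioctl` is a hypothetical helper name, and the decode says nothing about what the command actually does inside the NVIDIA driver.

```python
# Decode a Linux ioctl request number into its _IOC fields.
# Assumption: the generic encoding from asm-generic/ioctl.h as used on x86-64.

def decode_ioctl(request):
    """Split a 32-bit ioctl request number into direction, size, type and nr."""
    directions = {0: "none", 1: "write", 2: "read", 3: "read/write"}
    return {
        "direction": directions[(request >> 30) & 0x3],
        "size": (request >> 16) & 0x3FFF,     # size of the argument struct in bytes
        "type": chr((request >> 8) & 0xFF),   # driver "magic" character
        "nr": request & 0xFF,                 # driver-private command number
    }


if __name__ == "__main__":
    # The request seen in the strace loop above.
    print(decode_ioctl(0xC020462A))
```

For 0xc020462a this gives a read/write call with a 32-byte argument, type 'F', command number 42, so the process is repeatedly issuing the same control request and sleeping one second between attempts.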
$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 346.89     Driver Version: 346.89         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          On   | 0000:1B:00.0     Off |                   0* |
| N/A   29C    P0    65W / 235W |     55MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          On   | 0000:86:00.0     Off |                   0* |
| N/A   19C    P8    19W / 235W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
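The odd combination in that table (99% utilization, no running processes) can be checked for automatically. Below is a rough sketch, not a definitive tool: `find_suspect_gpus` is a hypothetical helper that takes the CSV text of two nvidia-smi queries and returns the UUIDs of GPUs that look busy but own no compute process. The query field names (`uuid`, `utilization.gpu`, `gpu_uuid`, `pid`) are my assumption for this driver version and are worth confirming with `nvidia-smi --help-query-gpu` and `nvidia-smi --help-query-compute-apps`.

```python
import csv
import io

# Assumed input commands (run separately and pass in their output):
#   nvidia-smi --query-gpu=uuid,utilization.gpu --format=csv,noheader,nounits
#   nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader


def find_suspect_gpus(gpu_csv, apps_csv, busy_threshold=90):
    """Return UUIDs of GPUs at or above busy_threshold percent with no compute process."""
    # UUIDs of GPUs that currently have at least one compute process.
    busy_owners = {row[0].strip() for row in csv.reader(io.StringIO(apps_csv)) if row}

    suspects = []
    for row in csv.reader(io.StringIO(gpu_csv)):
        if len(row) < 2:
            continue  # blank or malformed line
        uuid, util = row[0].strip(), row[1].strip()
        if not util.isdigit():
            continue  # non-numeric value such as "[Not Supported]"
        if int(util) >= busy_threshold and uuid not in busy_owners:
            suspects.append(uuid)
    return suspects


if __name__ == "__main__":
    # Sample data matching the state shown in this post: GPU 0 at 99%, no processes.
    gpus = (
        "GPU-5a79ca7e-4139-eae2-7854-081e69374741, 99\n"
        "GPU-816c1f5d-8a85-f8e1-e15b-da3f75f40f72, 0\n"
    )
    print(find_suspect_gpus(gpus, ""))
```

With the values from this post it flags only GPU 0, which matches what the table shows by eye.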
$ nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Thu Sep 24 15:24:19 2015
Driver Version : 346.89
Attached GPUs : 2
GPU 0000:1B:00.0
Product Name : Tesla K40m
Product Brand : Tesla
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 128
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0322714012955
GPU UUID : GPU-5a79ca7e-4139-eae2-7854-081e69374741
Minor Number : 0
VBIOS Version : 80.80.3E.00.01
MultiGPU Board : No
Board ID : 0x1b00
Inforom Version
Image Version : 2081.0202.01.04
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
PCI
Bus : 0x1B
Device : 0x00
Domain : 0x0000
Device Id : 0x102310DE
Bus Id : 0000:1B:00.0
Sub System Id : 0x097E10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
FB Memory Usage
Total : 11519 MiB
Used : 55 MiB
Free : 11464 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 99 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : Enabled
Pending : Disabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 95 C
GPU Slowdown Temp : 90 C
Power Readings
Power Management : Supported
Power Draw : 65.60 W
Power Limit : 235.00 W
Default Power Limit : 235.00 W
Enforced Power Limit : 235.00 W
Min Power Limit : 180.00 W
Max Power Limit : 235.00 W
Clocks
Graphics : 875 MHz
SM : 875 MHz
Memory : 3004 MHz
Applications Clocks
Graphics : 745 MHz
Memory : 3004 MHz
Default Applications Clocks
Graphics : 745 MHz
Memory : 3004 MHz
Max Clocks
Graphics : 875 MHz
SM : 875 MHz
Memory : 3004 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 0000:86:00.0
Product Name : Tesla K40m
Product Brand : Tesla
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 128
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0322714013634
GPU UUID : GPU-816c1f5d-8a85-f8e1-e15b-da3f75f40f72
Minor Number : 1
VBIOS Version : 80.80.3E.00.01
MultiGPU Board : No
Board ID : 0x8600
Inforom Version
Image Version : 2081.0202.01.04
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
PCI
Bus : 0x86
Device : 0x00
Domain : 0x0000
Device Id : 0x102310DE
Bus Id : 0000:86:00.0
Sub System Id : 0x097E10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
FB Memory Usage
Total : 11519 MiB
Used : 55 MiB
Free : 11464 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : Enabled
Pending : Disabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 19 C
GPU Shutdown Temp : 95 C
GPU Slowdown Temp : 90 C
Power Readings
Power Management : Supported
Power Draw : 19.42 W
Power Limit : 235.00 W
Default Power Limit : 235.00 W
Enforced Power Limit : 235.00 W
Min Power Limit : 180.00 W
Max Power Limit : 235.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 745 MHz
Memory : 3004 MHz
Default Applications Clocks
Graphics : 745 MHz
Memory : 3004 MHz
Max Clocks
Graphics : 875 MHz
SM : 875 MHz
Memory : 3004 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
Running nvidia-healthmon ends with “Healthmon timed out” but reports no specific failure:
$ nvidia-healthmon
Using config file path: /etc/nvidia-healthmon/nvidia-healthmon.conf
Loading Config: SUCCESS
Global Tests
Black-Listed Modules: SKIPPED
Black-Listed Drivers: SUCCESS
Load NVML: SUCCESS
NVML Sanity: SUCCESS
Tesla Devices Count: SKIPPED
GPUDirect Comm Matrix
        GPU0    GPU1    mlx4_0  CPU Affinity
GPU0     X      SOC     PHB     0-7,16-23
GPU1    SOC      X      SOC     8-15,24-31
mlx4_0  PHB     SOC      X
Legend:
X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch
CPU Affinity = The cores that are most ideal for NUMA
Result: SUCCESS
Global Test Results: 13 success, 0 errors, 0 warnings, 8 did not run
-----------------------------------------------------------
0000:1B:00.0
NVML Sanity: SUCCESS
InfoROM: SKIPPED
Multi-GPU InfoROM: SKIPPED
ECC DBE: SUCCESS
ECC Enabled Check: SKIPPED
PCIe Maximum Link Generation: SKIPPED
PCIe Maximum Link Width: SUCCESS
CUDA Sanity: SUCCESS
PCI Bandwidth: SKIPPED
Memory: SKIPPED
Healthmon timed out.
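Because every probe against the stuck GPU just hangs, it helps to run test programs under a time limit so they cannot pile up. A minimal sketch, assuming Python 3 is available on the node; `run_with_timeout` is a hypothetical helper, and a process blocked inside the driver in uninterruptible sleep may not actually die when killed.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd; return (returncode, output), or None if it exceeded the time limit."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # merge stderr into the captured output
            universal_newlines=True,    # decode output as text
            timeout=seconds,            # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return done.returncode, done.stdout


if __name__ == "__main__":
    # Stand-in for a hanging CUDA test: a child process that sleeps too long.
    hang = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(hang, 0.5))
```

In practice the command list would be the CUDA test binary itself, and a `None` result would be the signal that the GPU is still wedged.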