I have enabled persistence mode using nvidia-persistenced. These are the NVIDIA driver-related processes running on the machine:
ps aux | grep nvidia
root 4798 0.0 0.0 0 0 ? S 09:34 0:00 [nvidia-modeset]
root 6441 5.9 0.0 8608 1476 ? Ss 09:38 1:07 nvidia-persistenced
root 6444 0.0 0.0 0 0 ? S 09:38 0:00 [irq/80-nvidia]
root 6445 0.0 0.0 0 0 ? S 09:38 0:00 [nvidia]
CUDA version: 8.0
cuDNN version: 6.0
Output of nvidia-smi -q:
==============NVSMI LOG==============
Timestamp : Tue Jan 16 09:53:00 2018
Driver Version : 384.81
Attached GPUs : 1
GPU 00000000:00:1E.0
Product Name : Tesla K80
Product Brand : Tesla
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-b99ba6e9-e4bd-a912-dc7f-96393d494dc4
Minor Number : 0
VBIOS Version : 80.21.1F.00.02
MultiGPU Board : No
Board ID : 0x1e
GPU Part Number : N/A
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : Pass-Through
PCI
Bus : 0x00
Device : 0x1E
Domain : 0x0000
Device Id : 0x102D10DE
Bus Id : 00000000:00:1E.0
Sub System Id : 0x106C10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
FB Memory Usage
Total : 11439 MiB
Used : 0 MiB
Free : 11439 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Texture Shared : N/A
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 45 C
GPU Shutdown Temp : 93 C
GPU Slowdown Temp : 88 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : N/A
Power Limit : 149.00 W
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Video : 405 MHz
Applications Clocks
Graphics : 562 MHz
Memory : 2505 MHz
Default Applications Clocks
Graphics : 562 MHz
Memory : 2505 MHz
Max Clocks
Graphics : 875 MHz
SM : 875 MHz
Memory : 2505 MHz
Video : 540 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : On
Auto Boost Default : On
Processes : None
I am trying to run TensorFlow code, but it just hangs. Once I start the command, even nvidia-smi produces no output and hangs indefinitely. Can someone explain what is happening, and whether something is wrong with my GPU configuration? Any help with debugging this issue is also appreciated.
NOTE: Everything works fine when persistence mode is disabled.
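To check for the hang without tying up a terminal, a small wrapper along these lines might help. This is only a sketch, assuming Python 3.7+; run_with_timeout is a hypothetical helper name, not part of any NVIDIA or TensorFlow tooling, and the 2-second grace period is an arbitrary choice. It runs a command, gives up after a timeout, and reports whether the command finished, failed, timed out, or was not found.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, output).

    status is one of 'ok', 'failed', 'timeout' or 'missing'.
    (Hypothetical helper, for illustration only.)
    """
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
        )
    except FileNotFoundError:
        # The executable is not on PATH.
        return "missing", ""

    try:
        out, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Ask the kernel to kill the child, but do not wait forever:
        # a process blocked inside a driver call may not exit promptly.
        proc.kill()
        try:
            proc.wait(timeout=2)
        except subprocess.TimeoutExpired:
            pass
        if proc.stdout is not None:
            proc.stdout.close()
        return "timeout", ""

    return ("ok" if proc.returncode == 0 else "failed"), out
```

Calling run_with_timeout(["nvidia-smi"], 10.0) before and after starting the TensorFlow job should then show whether nvidia-smi responds normally or returns "timeout" once the job is running.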