Hi,
We’re trying to debug a problem where CUDA seems to mysteriously become unavailable on desktop computers that otherwise show no GPU problems (i.e. someone is logged into a GNOME session).
Our setup:
- Red Hat Enterprise Linux Workstation release 7.9 (Maipo)
- Driver Version: 460.67
- CUDA Version: 11.2
- Quadro M5000
The problem:
- various apps report that no CUDA device is available (PyTorch, TensorFlow, Blender)
What we’ve tried:
- nothing we can find in the nvidia-smi output seems to indicate the problem
- putting the system in runlevel 3 does not fix it
- a reboot fixes it
- it occurs seemingly at random in a room of identical-spec machines
We’re hoping to get some help on how we can avoid this. We are running PyTorch jobs on these machines, and seem to encounter this issue regularly. Rebooting works but is painful as someone has to be watching.
cheers
A few more salient points:
- OpenGL acceleration works (Autodesk Maya)
- OpenCL “seems to work” (Autodesk Maya runs with the “C+G” flag in nvidia-smi)
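Since several unrelated apps report the same thing, it may help to ask the CUDA driver directly rather than going through PyTorch or Blender. The following is a minimal sketch, assuming the driver library is installed under the usual name libcuda.so.1; the function name cuda_driver_status is made up for illustration. It calls cuInit and cuDeviceGetCount from the CUDA driver API via ctypes and returns the raw result code, where 0 is CUDA_SUCCESS and anything else is the driver's error code.

```python
import ctypes


def cuda_driver_status(libname="libcuda.so.1"):
    """Return (result_code, device_count), or None if the library cannot be loaded.

    The library name is an assumption; adjust it if the driver is installed elsewhere.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None

    # CUresult cuInit(unsigned int Flags)
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    rc = lib.cuInit(0)

    count = ctypes.c_int(0)
    if rc == 0:
        # CUresult cuDeviceGetCount(int *count)
        lib.cuDeviceGetCount.argtypes = [ctypes.POINTER(ctypes.c_int)]
        lib.cuDeviceGetCount.restype = ctypes.c_int
        rc = lib.cuDeviceGetCount(ctypes.byref(count))
    return rc, count.value


if __name__ == "__main__":
    status = cuda_driver_status()
    if status is None:
        print("libcuda could not be loaded")
    else:
        print("driver result code: {}, devices: {}".format(*status))
```

Running this on a healthy machine and on one in the failed state, and comparing the result codes, might narrow down where the failure sits.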
What is the output of nvidia-smi -a on one of the machines?
In lieu of that I would make sure that “Compute Mode” is listed as “Default” in that output.
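If this needs checking across a whole room of machines, the field can be pulled out of the report with a few lines of Python. This is only a sketch: smi_field and current_compute_mode are hypothetical helper names, and it assumes the "name : value" layout that nvidia-smi -a prints. It avoids newer subprocess options so it should also run on the older Python 3 releases typical of RHEL 7.

```python
import subprocess


def smi_field(report, name):
    """Return the value of the first 'name : value' line in an nvidia-smi -a report."""
    for line in report.splitlines():
        key, sep, value = line.partition(" : ")
        if sep and key.strip() == name:
            return value.strip()
    return None


def current_compute_mode():
    """Run nvidia-smi -a and return the Compute Mode, or None if it cannot be read."""
    try:
        proc = subprocess.run(
            ["nvidia-smi", "-a"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError:
        # nvidia-smi is not installed or not on PATH
        return None
    return smi_field(proc.stdout, "Compute Mode")


if __name__ == "__main__":
    print("Compute Mode:", current_compute_mode())
```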
Beyond that, I suspect the most likely explanation is that you have processes (e.g. a previously stopped PyTorch job or Python process) that are still “hanging on” to the GPU.
It should be possible to write a bash script, which would probably have to be run as root, to “clean this up” and avoid having to reboot the machine to fix things.
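As a starting point for such a cleanup, here is a rough sketch (in Python rather than bash) that only lists which processes have an NVIDIA device file open; it does not kill anything. It assumes the device nodes live under /dev/nvidia* and that /proc is readable, which generally means running as root to see other users' processes. The function name pids_holding_nvidia is made up for this example.

```python
import os


def pids_holding_nvidia(proc_root="/proc", prefix="/dev/nvidia"):
    """Map PID -> set of open device paths starting with prefix.

    Walks <proc_root>/<pid>/fd and resolves each file descriptor symlink.
    Processes whose fd directory cannot be read are skipped silently.
    """
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or no permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith(prefix):
                holders.setdefault(int(entry), set()).add(target)
    return holders


if __name__ == "__main__":
    for pid, devices in sorted(pids_holding_nvidia().items()):
        print(pid, ", ".join(sorted(devices)))
```

On a desktop machine the X server and the compositor will normally show up in this list too, so any script that goes on to kill processes would need to filter for stale compute jobs rather than everything it finds.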
Output of nvidia-smi -a as follows:
133235@ladybug [/home/133235]$ nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Tue May 25 10:47:11 2021
Driver Version : 460.73.01
CUDA Version : 11.2
Attached GPUs : 1
GPU 00000000:03:00.0
Product Name : Quadro M5000
Product Brand : Quadro
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0322816130082
GPU UUID : GPU-f25eb97e-06c1-0fed-377e-87303b3390ce
Minor Number : 0
VBIOS Version : 84.04.88.00.05
MultiGPU Board : No
Board ID : 0x300
GPU Part Number : N/A
Inforom Version
Image Version : G400.0500.00.04
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x03
Device : 0x00
Domain : 0x0000
Device Id : 0x13F010DE
Bus Id : 00000000:03:00.0
Sub System Id : 0x115210DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 21000 KB/s
Fan Speed : 42 %
Performance State : P5
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : N/A
HW Power Brake Slowdown : N/A
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 8126 MiB
Used : 488 MiB
Free : 7638 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 6 MiB
Free : 250 MiB
Compute Mode : Default
Utilization
Gpu : 4 %
Memory : 6 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 51 C
GPU Shutdown Temp : 105 C
GPU Slowdown Temp : 100 C
GPU Max Operating Temp : N/A
GPU Target Temperature : 79 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 25.96 W
Power Limit : 150.00 W
Default Power Limit : 150.00 W
Enforced Power Limit : 150.00 W
Min Power Limit : 10.00 W
Max Power Limit : 150.00 W
Clocks
Graphics : 670 MHz
SM : 670 MHz
Memory : 810 MHz
Video : 617 MHz
Applications Clocks
Graphics : 861 MHz
Memory : 3305 MHz
Default Applications Clocks
Graphics : 861 MHz
Memory : 3305 MHz
Max Clocks
Graphics : 1126 MHz
SM : 1126 MHz
Memory : 3305 MHz
Video : 1036 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : On
Auto Boost Default : On
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 6580
Type : G
Name : /usr/bin/X
Used GPU Memory : 386 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 7602
Type : G
Name : cinnamon
Used GPU Memory : 32 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 7937
Type : G
Name : /usr/lib/slack/slack --type=gpu-process --field-trial-handle=7968621012182413748,8683948879275216873,131072 --enable-features=WebComponentsV0Enabled --disable-features=CertVerifierService,CookiesWithoutSameSiteMustBeSecure,HardwareMediaKeyHandling,RequestInitiatorSiteLockEnfocement,SameSiteByDefaultCookies,SpareRendererForSitePerProcess,WebRtcHideLocalIpsWithMdns --enable-crash-reporter=290a0cf4-dd2e-4dd2-b6c4-3937dc7ab429,no_channel --global-crash-keys=290a0cf4-dd2e-4dd2-b6c4-3937dc7ab429,no_channel,_productName=Slack,_version=4.14.0 --gpu-preferences=UAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAACAAAAAAAAAA= --shared-files
Used GPU Memory : 31 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 8905
Type : G
Name : /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=6017906675086853055,5177941279676265715,131072 --enable-crash-reporter=79de6820-cc60-4d59-a21d-f93a42837f3c, --gpu-preferences=UAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAACAAAAAAAAAA= --shared-files
Used GPU Memory : 30 MiB
I don’t notice anything that looks problematic. It might be useful to get the nvidia-smi output from a machine with a GPU that is currently in the “unavailable” state. In that case, it would be convenient to have both the nvidia-smi output and the nvidia-smi -a output.
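Because the failure is intermittent, it may be easier to have a job capture both outputs itself at the moment it finds CUDA unavailable. A minimal sketch, assuming nvidia-smi is on the PATH; capture and snapshot are hypothetical helper names, and where the text gets written is left to the caller.

```python
import datetime
import subprocess


def capture(cmd):
    """Run cmd and return its combined stdout/stderr with a small header."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        # e.g. the command is not installed on this machine
        return "$ {}\nfailed to start: {}\n".format(" ".join(cmd), exc)
    return "$ {}\nexit status {}\n{}".format(" ".join(cmd), proc.returncode, proc.stdout)


def snapshot(commands=(("nvidia-smi",), ("nvidia-smi", "-a"))):
    """Return one text blob with a timestamp and the output of each command."""
    stamp = datetime.datetime.now().isoformat()
    parts = [capture(list(cmd)) for cmd in commands]
    return "captured {}\n\n{}".format(stamp, "\n".join(parts))


if __name__ == "__main__":
    print(snapshot())
```

A PyTorch job could call snapshot() and save the result to a file whenever torch.cuda.is_available() returns False, so the state is recorded before anyone reboots the machine.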