CUDA unavailable in RedHat without other GPU issues

Hi,

We’re trying to debug a problem where CUDA seems to mysteriously become unavailable on desktop computers which otherwise show no GPU problems (i.e. someone is logged into a GNOME session).

Our setup:

  • Red Hat Enterprise Linux Workstation release 7.9 (Maipo)
  • Driver Version: 460.67
  • CUDA Version: 11.2
  • Quadro M5000

The problem:

  • various apps report that no CUDA device is available (PyTorch, TensorFlow, Blender)

What we’ve tried:

  • nothing we can discover in nvidia-smi seems to report this
  • putting the system in runlevel 3 does not fix it
  • a reboot fixes it
  • this occurs seemingly at random in a room of identical-spec machines

We’re hoping to get some help on how we can avoid this. We are running PyTorch jobs on these machines, and seem to encounter this issue regularly. Rebooting works but is painful as someone has to be watching.
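
Since a person currently has to watch for the failure, a small pre-flight check at the start of each job might at least make it show up immediately. This is only a sketch: the function names and exit codes are made up for illustration, and the only PyTorch call used is torch.cuda.is_available().

```python
import importlib


def cuda_available():
    """Return torch.cuda.is_available(), or None if PyTorch cannot be imported."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    return bool(torch.cuda.is_available())


def preflight_exit_code(available):
    """Map the check result to an exit code (hypothetical convention).

    0 = CUDA usable, 1 = CUDA not available, 2 = PyTorch not installed.
    """
    if available is None:
        return 2
    return 0 if available else 1
```

A job launcher could call sys.exit(preflight_exit_code(cuda_available())) before starting any work, so that a machine in the bad state is flagged by the scheduler or a cron mail instead of by someone watching.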

cheers

A few more salient points:

  • OpenGL acceleration works (Autodesk Maya)
  • OpenCL “seems to work” (Autodesk Maya runs with the “C+G” flag in nvidia-smi)

What is the output of

nvidia-smi -a

on one of the machines?

In the meantime, I would make sure that “Compute Mode” is listed as “Default” in that output.
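
If you want to check this across a room of machines, something like the following sketch could pull the “Compute Mode” value out of the nvidia-smi -a text. The function names are my own, and it assumes Python 3.7+ and the “Compute Mode : <value>” line layout that nvidia-smi -a prints.

```python
import re
import subprocess


def parse_compute_modes(smi_text):
    """Return the value of every "Compute Mode" line in `nvidia-smi -a` output."""
    pattern = r"^\s*Compute Mode\s*:\s*(.+?)\s*$"
    return re.findall(pattern, smi_text, flags=re.MULTILINE)


def read_smi_output():
    """Run `nvidia-smi -a`; return its text, or None if it is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-a"], capture_output=True, text=True, timeout=30
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None
```

With one GPU per machine, parse_compute_modes(read_smi_output() or "") should give a one-element list, and anything other than ["Default"] would be worth a closer look.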

Beyond that, I suspect the most likely explanation is that you have processes (e.g. a previously stopped PyTorch job or Python process) that are still “hanging on” to the GPU.

It should be possible to write a bash script, which would probably have to be run as root, that “cleans this up” so that you don’t have to reboot the machine to fix things.
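
As a first step toward such a script, here is a rough sketch (in Python rather than bash) that only lists which processes have a /dev/nvidia* device file open. It kills nothing. The function name and parameters are illustrative, and it assumes a Linux /proc filesystem; other users’ processes are only visible when it runs as root.

```python
import os


def pids_holding(path_prefix="/dev/nvidia", proc_root="/proc"):
    """Map PID -> sorted list of open paths that start with path_prefix.

    Works by reading the /proc/<pid>/fd symlinks of every process.
    """
    holders = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return holders  # no /proc available (not Linux)
    for entry in entries:
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or permission denied
        paths = set()
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were looking
            if target.startswith(path_prefix):
                paths.add(target)
        if paths:
            holders[int(entry)] = sorted(paths)
    return holders
```

On a desktop machine the X server and the compositor will legitimately show up in this list, so the interesting entries are leftover python processes from earlier jobs. Whether killing those actually restores CUDA is still an open question at this point.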

Output of nvidia-smi -a as follows:

133235@ladybug [/home/133235]$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Tue May 25 10:47:11 2021
Driver Version                            : 460.73.01
CUDA Version                              : 11.2

Attached GPUs                             : 1
GPU 00000000:03:00.0
    Product Name                          : Quadro M5000
    Product Brand                         : Quadro
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0322816130082
    GPU UUID                              : GPU-f25eb97e-06c1-0fed-377e-87303b3390ce
    Minor Number                          : 0
    VBIOS Version                         : 84.04.88.00.05
    MultiGPU Board                        : No
    Board ID                              : 0x300
    GPU Part Number                       : N/A
    Inforom Version
        Image Version                     : G400.0500.00.04
        OEM Object                        : 1.1
        ECC Object                        : 3.0
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x03
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x13F010DE
        Bus Id                            : 00000000:03:00.0
        Sub System Id                     : 0x115210DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 21000 KB/s
    Fan Speed                             : 42 %
    Performance State                     : P5
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : N/A
            HW Power Brake Slowdown       : N/A
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 8126 MiB
        Used                              : 488 MiB
        Free                              : 7638 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 6 MiB
        Free                              : 250 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 4 %
        Memory                            : 6 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit            
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit            
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit            
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 51 C
        GPU Shutdown Temp                 : 105 C
        GPU Slowdown Temp                 : 100 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : 79 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 25.96 W
        Power Limit                       : 150.00 W
        Default Power Limit               : 150.00 W
        Enforced Power Limit              : 150.00 W
        Min Power Limit                   : 10.00 W
        Max Power Limit                   : 150.00 W
    Clocks
        Graphics                          : 670 MHz
        SM                                : 670 MHz
        Memory                            : 810 MHz
        Video                             : 617 MHz
    Applications Clocks
        Graphics                          : 861 MHz
        Memory                            : 3305 MHz
    Default Applications Clocks
        Graphics                          : 861 MHz
        Memory                            : 3305 MHz
    Max Clocks
        Graphics                          : 1126 MHz
        SM                                : 1126 MHz
        Memory                            : 3305 MHz
        Video                             : 1036 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : On
        Auto Boost Default                : On
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 6580
            Type                          : G
            Name                          : /usr/bin/X
            Used GPU Memory               : 386 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 7602
            Type                          : G
            Name                          : cinnamon
            Used GPU Memory               : 32 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 7937
            Type                          : G
            Name                          : /usr/lib/slack/slack --type=gpu-process --field-trial-handle=7968621012182413748,8683948879275216873,131072 --enable-features=WebComponentsV0Enabled --disable-features=CertVerifierService,CookiesWithoutSameSiteMustBeSecure,HardwareMediaKeyHandling,RequestInitiatorSiteLockEnfocement,SameSiteByDefaultCookies,SpareRendererForSitePerProcess,WebRtcHideLocalIpsWithMdns --enable-crash-reporter=290a0cf4-dd2e-4dd2-b6c4-3937dc7ab429,no_channel --global-crash-keys=290a0cf4-dd2e-4dd2-b6c4-3937dc7ab429,no_channel,_productName=Slack,_version=4.14.0 --gpu-preferences=UAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAACAAAAAAAAAA= --shared-files
            Used GPU Memory               : 31 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 8905
            Type                          : G
            Name                          : /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=6017906675086853055,5177941279676265715,131072 --enable-crash-reporter=79de6820-cc60-4d59-a21d-f93a42837f3c, --gpu-preferences=UAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAACAAAAAAAAAA= --shared-files
            Used GPU Memory               : 30 MiB

I don’t notice anything that looks problematic. It might be useful to get the nvidia-smi output from a machine with a GPU that is currently in the “unavailable” state. In that case, it would be convenient to have both the nvidia-smi output and the nvidia-smi -a output.
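
To make it easy to grab both outputs the moment a machine is found in the bad state, a small collector along these lines might help. It is a sketch under assumptions: the function names and the report layout are invented here, it needs Python 3.7+, and the only external commands it runs are nvidia-smi and nvidia-smi -a.

```python
import datetime
import subprocess

# The two invocations requested above.
DEFAULT_COMMANDS = (["nvidia-smi"], ["nvidia-smi", "-a"])


def run_and_capture(command, timeout=60):
    """Run one command; return a text section with its output or the failure reason."""
    header = "$ " + " ".join(command)
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error messages in the same section
            text=True,
            timeout=timeout,
        )
    except FileNotFoundError:
        return f"{header}\n[command not found]\n"
    except subprocess.TimeoutExpired:
        return f"{header}\n[timed out after {timeout}s]\n"
    return f"{header}\n{result.stdout}[exit code {result.returncode}]\n"


def collect_report(commands=DEFAULT_COMMANDS):
    """Concatenate the captured sections under a timestamp line."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    sections = [run_and_capture(list(cmd)) for cmd in commands]
    return f"Collected: {stamp}\n\n" + "\n".join(sections)
```

Printing collect_report() into a per-host file on a good machine and on a bad one would give two reports that can be diffed directly.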