GPU getting stuck, not able to execute any command using GPU

Somya · January 16, 2018, 3:53pm

I have enabled persistence mode using nvidia-persistenced. These are the NVIDIA driver daemons and kernel threads running on the machine:

ps aux | grep nvidia
root      4798  0.0  0.0      0     0 ?        S    09:34   0:00 [nvidia-modeset]
root      6441  5.9  0.0   8608  1476 ?        Ss   09:38   1:07 nvidia-persistenced
root      6444  0.0  0.0      0     0 ?        S    09:38   0:00 [irq/80-nvidia]
root      6445  0.0  0.0      0     0 ?        S    09:38   0:00 [nvidia]

CUDA version - 8.0
cuDNN version - 6.0

Output of

nvidia-smi -q

==============NVSMI LOG==============

Timestamp                           : Tue Jan 16 09:53:00 2018
Driver Version                      : 384.81

Attached GPUs                       : 1
GPU 00000000:00:1E.0
    Product Name                    : Tesla K80
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-b99ba6e9-e4bd-a912-dc7f-96393d494dc4
    Minor Number                    : 0
    VBIOS Version                   : 80.21.1F.00.02
    MultiGPU Board                  : No
    Board ID                        : 0x1e
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : Pass-Through
    PCI
        Bus                         : 0x00
        Device                      : 0x1E
        Domain                      : 0x0000
        Device Id                   : 0x102D10DE
        Bus Id                      : 00000000:00:1E.0
        Sub System Id               : 0x106C10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
    FB Memory Usage
        Total                       : 11439 MiB
        Used                        : 0 MiB
        Free                        : 11439 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 45 C
        GPU Shutdown Temp           : 93 C
        GPU Slowdown Temp           : 88 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : N/A
        Power Limit                 : 149.00 W
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz
        Video                       : 405 MHz
    Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Default Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 2505 MHz
        Video                       : 540 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None

I am trying to run TensorFlow code, but it just hangs. Once I start the command, even nvidia-smi produces no output and hangs indefinitely. Can someone explain what is happening and whether something is wrong with my GPU configuration? Any help on how to debug this issue would also be appreciated.

NOTE: Everything works fine when persistence mode is disabled.
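
For anyone trying to reproduce this, a small probe that runs nvidia-smi under a timeout can confirm the hang without tying up the terminal. This is only a sketch: the helper name `probe` and the 10-second limit are arbitrary choices, not anything from the driver or its tools.

```python
import subprocess
import sys


def probe(cmd, timeout_s=10.0):
    """Run cmd; return its exit code, or None if it did not finish in time."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Request a kill but do not wait for it: a process blocked inside a
        # driver call may not exit until that call returns.
        proc.kill()
        return None


if __name__ == "__main__":
    try:
        code = probe(["nvidia-smi"], timeout_s=10.0)
    except FileNotFoundError:
        print("nvidia-smi not found on PATH", file=sys.stderr)
    else:
        if code is None:
            print("nvidia-smi did not exit within 10 s (likely hung)")
        else:
            print(f"nvidia-smi exited with code {code}")
```

Running it once before starting the TensorFlow job and once after should show whether the hang starts with the job or is already present.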

generix · January 17, 2018, 11:10am

Is there anything in dmesg when the issue hits?
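
To make that check easier, here is a rough sketch that pulls the NVIDIA-related lines out of the kernel log. It assumes the driver tags its messages with "NVRM" and reports GPU errors as "Xid" events, which is the usual convention; the function name `nvidia_lines` and the pattern are illustrative and may miss some messages. Reading dmesg may require root on some systems.

```python
import re
import subprocess

# Heuristic pattern: NVIDIA kernel module messages are usually tagged "NVRM",
# and GPU error reports appear as "Xid" events. Not an exhaustive list.
NVIDIA_LOG = re.compile(r"NVRM|Xid|nvidia", re.IGNORECASE)


def nvidia_lines(kernel_log):
    """Return the lines of kernel_log that mention the NVIDIA driver."""
    return [line for line in kernel_log.splitlines() if NVIDIA_LOG.search(line)]


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, timeout=30
        )
        log = result.stdout
    except (OSError, subprocess.SubprocessError) as exc:
        log = ""
        print(f"could not run dmesg: {exc}")
    for line in nvidia_lines(log):
        print(line)
```

Running it right after the hang appears and posting whatever it prints would give the thread something concrete to look at.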
