Prolonged execution of CUDA code causes massive computer slowdown

I had a CUDA program running for the best part of 36 hours, and now the computer is all but stalled. For starters, whereas each iteration used to take 10 seconds, it now takes 800. Incoming SSH connections now take about 30 seconds to establish.

I then closed the CUDA program, and the problem did not resolve. Note that the computer’s sole purpose is to run this CUDA program.

What’s worse is that rebooting the computer does not fix the issue!
Does anyone have a clue what’s going on here?

nvidia-smi output for one of the 8 GPUs is below (all 8 GPUs show similar output):

GPU 0000:86:00.0
    Product Name                    : GeForce GTX TITAN
    Display Mode                    : N/A
    Display Active                  : N/A
    Persistence Mode                : Disabled
    Accounting Mode                 : N/A
    Accounting Mode Buffer Size     : N/A
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-39325139-a5b1-083b-0252-b3e4dc505f84
    VBIOS Version                   : 80.10.2C.00.06
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x86
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x100510DE
        Bus Id                      : 0000:86:00.0
        Sub System Id               : 0x84511043
        GPU Link Info
            PCIe Generation
                Max                 : N/A
                Current             : N/A
            Link Width
                Max                 : N/A
                Current             : N/A
    Fan Speed                       : 33 %
    Performance State               : N/A
    Clocks Throttle Reasons         : N/A
    Memory Usage
        Total                       : 6143 MB
        Used                        : 14 MB
        Free                        : 6129 MB
    Compute Mode                    : Default
    Utilization
        Gpu                         : N/A
        Memory                      : N/A
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit           
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit           
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
        Aggregate
            Single Bit           
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit           
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        Gpu                         : 49 C
    Power Readings
        Power Management            : N/A
        Power Draw                  : N/A
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Compute Processes               : N/A

This may have nothing to do with CUDA. I cannot think of an instance where such behavior was caused by CUDA, which of course does not mean there could not be some kind of connection. I use both Linux and Windows, and the common scenarios for massive slowdowns I have seen are:

(1) The application has a memory leak, and over lengthy time periods this causes all user memory to be leaked on the host. The system winds up continuously swapping.

(2) The application spawns processes that are not correctly torn down on abnormal termination, leaving behind zombie processes that may be iterating in a tight loop, eating up all CPU cycles.

If you find evidence of memory leaks, you may want to test this code under valgrind, although at least older versions of valgrind seem to produce false positives with CUDA applications.
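
To make (1) concrete, here is a minimal sketch of the kind of per-iteration host allocation that, left unfreed over 36 hours, slowly pushes a machine into swap; running under valgrind with --leak-check=full would normally report the lost blocks. All names and sizes below are hypothetical, not taken from the actual program.

/* leak_sketch.c -- hypothetical illustration of scenario (1): a per-iteration
 * host allocation that is never freed. Compile against the CUDA runtime,
 * e.g. gcc leak_sketch.c -lcudart (include/library paths may vary). */
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    for (long iter = 0; iter < 1000000L; ++iter) {
        /* BUG: a fresh host buffer is allocated on every iteration and never
         * freed; over a day and a half this exhausts host memory and the
         * system ends up swapping continuously. */
        float *h_buf = (float *)malloc(n * sizeof(float));
        memset(h_buf, 0, n * sizeof(float));

        /* stand-in for the real per-iteration work */
        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

        /* free(h_buf);    <-- the missing call that would fix the leak */
    }

    cudaFree(d_buf);
    return 0;
}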

The strange thing is that I have restarted the computer (I’m unable to power-cycle it at the moment, as it is about an hour away from where I live) and the problem persists, even 12 hours later. Could a memory leak persist through a restart?

I’ve had this problem before and have found that powering down for 30 minutes or so can resolve it. I haven’t been able to do that yet, but I can later today; I thought I might leave it in its current state to do some testing first.

Not much state survives a warm boot, and since the operating system restarts from scratch, a memory leak seems ruled out. The fact that SSH is slow would also pretty much rule out anything CUDA- or GPU-related, as I can’t see a plausible connection between the two.

I think the first thing you would want to do is find out why the system is slow, then eliminate working hypotheses one by one. Is the CPU load high (e.g. due to tasks running that you weren’t aware of)? Are the CPU clocks throttled (e.g. due to a thermal event)? Is there a high load on the I/O subsystems (e.g. heavy disk or network traffic)? Is there a lack of usable system memory?
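
If it helps with that checklist, here is a minimal Linux-only sketch that prints the load average and memory headroom from /proc; the program itself is just an illustration (the /proc file names are standard, though MemAvailable needs a reasonably recent kernel).

/* sys_health.c -- print load average and memory headroom from /proc. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];

    /* CPU load: the first three fields of /proc/loadavg are the
     * 1-, 5- and 15-minute load averages. */
    FILE *f = fopen("/proc/loadavg", "r");
    if (f) {
        if (fgets(line, sizeof(line), f))
            printf("loadavg: %s", line);
        fclose(f);
    }

    /* Memory pressure: very low MemAvailable/SwapFree suggests the box is
     * swapping, which would match the massive slowdown described above. */
    f = fopen("/proc/meminfo", "r");
    if (f) {
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "MemAvailable:", 13) == 0 ||
                strncmp(line, "SwapFree:", 9) == 0)
                printf("%s", line);
        }
        fclose(f);
    }
    return 0;
}

Where the cpufreq interface is available, throttled CPU clocks can be checked in the same spirit by reading /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq.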

I resolved the issue. It may have something to do with the GPUs, though not necessarily with CUDA.

When I reached the server, a power-usage warning light was on: essentially, the computer was trying to draw more power than it was capable of supplying. It runs on 3 separate PSUs, so either the GPUs were at absolute peak or one of the power supplies dropped out. When this happened, the machine must have dropped into a safety mode so as not to damage the hardware.

In short, nothing to do with CUDA, but perhaps one of the joys of HPC: 8 cards at 250 W each is a lot of power to handle! :)

It could be a thermal issue: when you get the machine humming along and the GPUs are drawing full power, they will warm up. Fans that were sitting idle before will start running, and any other cooling hardware you have may activate. Simply restarting the machine may not give the system enough time to cool down adequately, and slow processing times can result from heat-related degradation. You may find it informative to start monitoring any accessible temperature sensors on your CPU, GPUs, or hard drives (overheated hard drives can soft-fail).
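
For the GPU side, NVML (the library behind nvidia-smi) exposes temperature and power readings. Below is a minimal sketch, assuming the nvml.h header and -lnvidia-ml are available on the machine; note that on GeForce boards the driver may not report power at all, which the N/A fields in the nvidia-smi output above suggest is the case here.

/* gpu_monitor.c -- log temperature and power draw for each GPU via NVML.
 * Build (paths may vary): gcc gpu_monitor.c -o gpu_monitor -lnvidia-ml */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "failed to initialize NVML\n");
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
            continue;

        unsigned int temp = 0, power_mw = 0;
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);

        /* nvmlDeviceGetPowerUsage reports milliwatts; it returns an error
         * code on boards where the driver does not expose power readings. */
        if (nvmlDeviceGetPowerUsage(dev, &power_mw) == NVML_SUCCESS)
            printf("GPU %u: %u C, %.1f W\n", i, temp, power_mw / 1000.0);
        else
            printf("GPU %u: %u C, power not reported\n", i, temp);
    }

    nvmlShutdown();
    return 0;
}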