GPU loss while running very simple deep learning code, possibly memory related

I’m running into the GPU loss problem described below; any advice would be much appreciated.
The problem occurs on our lab’s server, which houses a GTX 1080 and runs Ubuntu 16.04. We do not use the graphical interface but instead log in through ssh.
Here’s a minimal example code that causes this problem:

import torch
from torch import nn
from torch import optim
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, d1, d2, d3):
        super(Net, self).__init__()
        # a (d1 x d2) parameter matrix pushed through a single bias-free linear layer
        self.fc1 = nn.Linear(d2, d3, bias=False)
        self.embd = nn.Parameter(torch.randn(d1, d2))

    def forward(self):
        a = self.fc1(self.embd)
        return a

def train(model, device, target, optimizer, epochs):
    target = target.to(device)
    for epoch in range(1, epochs + 1):
        optimizer.zero_grad()
        output = model()
        loss = F.mse_loss(output, target)
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch} complete")

if not torch.cuda.is_available():
    raise RuntimeError("can't experiment without cuda")

device = torch.device("cuda")
# device = torch.device("cpu")

d1, d2, d3 = 30000, 100, 200  # d1 = 20000 completes fine; 30000 triggers the GPU loss

target = torch.randn(d1, d3)

model = Net(d1=d1, d2=d2, d3=d3).to(device)
optimizer = optim.Adagrad(model.parameters(), lr=0.5, weight_decay=0.3)
train(model, device, target, optimizer, 1000)

The torch version is 1.0, installed through pip, with the bundled CUDA 9. The Python version is 3.7.2.
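If it helps, these versions can be double-checked from inside the interpreter; a quick sketch that just prints whatever the local install and driver report:

import sys
import torch

print(sys.version)                    # Python, 3.7.2 here
print(torch.__version__)              # PyTorch, 1.0 here
print(torch.version.cuda)             # CUDA version PyTorch was built against
print(torch.cuda.get_device_name(0))  # should report the GTX 1080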

This program runs slowly, and a few seconds in, the GPU becomes unresponsive:

$ nvidia-smi
Unable to determine the device handle for GPU 0000:03:00.0: GPU is lost. Reboot the system to recover this GPU

What’s worse, the program keeps running but cannot be terminated with Ctrl-C.

Epoch 654 complete
^CTraceback (most recent call last):
  File "test001.py", line 38, in <module>
    train(model, device, target, optimizer, 1000)
  File "test001.py", line 22, in train
    loss.backward()
  File "/opt/python/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/python/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
KeyboardInterrupt

It shows up as a zombie process in ps:

gereka 2821 109 0.0 0 0 pts/1 Zl+ 18:46 0:52 [python3.7]

and it cannot be killed even with kill -9.

However, the same program with d1 = 20000 runs just fine and completes in a few seconds,
so I suspect this has something to do with the amount of memory used.
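As a rough sanity check on that idea, here is a sketch, reusing the Net and train definitions above, that prints PyTorch's own view of device memory for the failing sizes; back-of-the-envelope, the parameters, the 30000x200 output/target and the Adagrad state only come to a few tens of MB:

# Sketch only: report how much device memory PyTorch actually allocates.
model = Net(d1=30000, d2=100, d3=200).to(device)
target = torch.randn(30000, 200)
optimizer = optim.Adagrad(model.parameters(), lr=0.5, weight_decay=0.3)
train(model, device, target, optimizer, 10)  # a few epochs are enough to reach the peak

print(torch.cuda.memory_allocated() / 2**20, "MiB currently allocated")
print(torch.cuda.max_memory_allocated() / 2**20, "MiB peak allocation")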

During execution I ran while true; do nvidia-smi >> out.txt; sleep 0.5; done to save the output of nvidia-smi every 0.5 seconds.
The outputs for both the working case and the non-working case are attached, along with the output of sudo dmesg and the nvidia-bug-report:
gpu_007.txt (16.9 KB)
dmesg.txt (79.7 KB)
gpu_006.txt (19.2 KB)
nvidia-bug-report_007.log.gz (302 KB)

You’re running into

NVRM: Xid (PCI:0000:03:00): 79, GPU has fallen off the bus.

which means either overheating or an insufficient power supply. Check the temperatures with nvidia-smi while running your application.
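A minimal sketch of such a check, assuming the pynvml bindings (the nvidia-ml-py package) are installed; it polls the temperature, power draw and memory use of GPU 0 every half second, much like the nvidia-smi loop above:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the single GTX 1080

try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0         # mW -> W
        limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"{temp} C  {power:.0f}/{limit:.0f} W  {mem.used / 2**20:.0f} MiB used")
        time.sleep(0.5)
finally:
    pynvml.nvmlShutdown()

Keep in mind the readings are sampled, so a brief power spike right before the card drops off the bus can be missed.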

Attached to the original post is gpu_007.txt, which contains nvidia-smi outputs sampled every 0.5 seconds. The temperature rises to 52 C before the GPU falls off the bus, so I don’t think it’s the temperature. How do I check whether it’s an insufficient power supply?

I only saw that nvidia-smi log after posting.
Check that the power connectors on the card are properly seated, remove and reconnect them, try different connectors from the PSU, test the GTX 1080 in another system to rule out general hardware failure, or replace the PSU.

Also check whether the PSU is switched to ‘eco mode’ (if available), which would limit its output, and of course whether it is large enough in the first place.