I’m running into a “GPU is lost” problem, described below. Any advice would be much appreciated.
The problem occurs on our lab’s server, which houses a GTX 1080 and runs Ubuntu 16.04. We do not use the graphical interface; we log in over SSH.
Here’s a minimal example that reproduces the problem:
import torch
from torch import nn
from torch import optim
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, d1, d2, d3):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(d2, d3, bias=False)
        self.embd = nn.Parameter(torch.randn(d1, d2))

    def forward(self):
        # The model takes no input: it just projects the learned embedding.
        a = self.fc1(self.embd)
        return a

def train(model, device, target, optimizer, epochs):
    target = target.to(device)
    for epoch in range(1, epochs + 1):
        optimizer.zero_grad()
        output = model()
        loss = F.mse_loss(output, target)
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch} complete")

if not torch.cuda.is_available():
    raise RuntimeError("can't experiment without cuda")
device = torch.device("cuda")
# device = torch.device("cpu")

d1, d2, d3 = 30000, 100, 200
target = torch.randn(d1, d3)
model = Net(d1=d1, d2=d2, d3=d3).to(device)
optimizer = optim.Adagrad(model.parameters(), lr=0.5, weight_decay=0.3)
train(model, device, target, optimizer, 1000)
The torch version is 1.0, installed through pip, bundled with CUDA 9. The Python version is 3.7.2.
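For completeness, this is a quick way to confirm those versions from the same interpreter (a small sanity-check snippet, not part of the repro):

import torch
print(torch.__version__)              # expected: 1.0.x
print(torch.version.cuda)             # expected: 9.x
print(torch.cuda.get_device_name(0))  # expected: GeForce GTX 1080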
The program runs slowly, and a few seconds in, the GPU becomes unresponsive:
$ nvidia-smi
Unable to determine the device handle for GPU 0000:03:00.0: GPU is lost. Reboot the system to recover this GPU
What’s worse, the program keeps running but cannot be terminated with Ctrl-C:
Epoch 654 complete
^CTraceback (most recent call last):
  File "test001.py", line 38, in <module>
    train(model, device, target, optimizer, 1000)
  File "test001.py", line 22, in train
    loss.backward()
  File "/opt/python/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/python/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
KeyboardInterrupt
The process then shows up as a zombie in ps:
gereka 2821 109 0.0 0 0 pts/1 Zl+ 18:46 0:52 [python3.7]
and cannot be killed even with kill -9.
However, the same program with d1 = 20000 runs just fine and completes in a few seconds, so I suspect this has something to do with the amount of memory used.
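To put a rough number on that, here is a back-of-the-envelope sketch of the float32 tensor sizes involved (my own estimate, not a measurement; actual CUDA allocations will be somewhat larger):

# All tensors here are float32, i.e. 4 bytes per element.
d1, d2, d3 = 30000, 100, 200
elems = (
    d1 * d2              # embd parameter
    + d2 * d3            # fc1 weight
    + d1 * d3            # target
    + d1 * d3            # forward output
    + d1 * d2 + d2 * d3  # .grad buffers
    + d1 * d2 + d2 * d3  # Adagrad keeps one state tensor per parameter
)
print(f"{elems * 4 / 2**20:.0f} MiB")  # ~80 MiB

Even at d1 = 30000 this is only on the order of 80 MiB of tensor data, far below the 8 GB on a GTX 1080, so if memory is the trigger it is presumably not a simple out-of-memory condition.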
During execution I ran the following loop to save the output of nvidia-smi every half second:

while true; do nvidia-smi >> out.txt; sleep 0.5; done
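The same polling can also be done from Python with a timestamp and a timeout on each call, in case nvidia-smi itself stops responding (a sketch assuming nvidia-smi is on PATH; requires Python >= 3.7):

import subprocess
import time

# Poll nvidia-smi every 0.5 s, timestamping each sample; the timeout
# guards against nvidia-smi stalling.
with open("out.txt", "a") as log:
    while True:
        try:
            out = subprocess.run(
                ["nvidia-smi"], capture_output=True, text=True, timeout=5
            ).stdout
        except subprocess.TimeoutExpired:
            out = "nvidia-smi timed out\n"
        log.write(time.strftime("%Y-%m-%d %H:%M:%S\n") + out)
        log.flush()
        time.sleep(0.5)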
The nvidia-smi logs for both the working and the non-working case are attached, along with the output of sudo dmesg and the nvidia-bug-report.
gpu_007.txt (16.9 KB)
dmesg.txt (79.7 KB)
gpu_006.txt (19.2 KB)
nvidia-bug-report_007.log.gz (302 KB)