GPU loss while running very simple deep learning code, possibly memory related

I’m running into the GPU loss problem described below; any advice would be much appreciated.
The problem occurs on our lab’s server, which houses a GTX 1080 and runs Ubuntu 16.04. We do not use the graphical interface but instead log in through ssh.
Here’s a minimal example code that causes this problem:

import torch
from torch import nn
from torch import optim
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, d1, d2, d3):
        super(Net, self).__init__()
        # a (d1 x d2) parameter matrix pushed through a single bias-free linear layer
        self.fc1 = nn.Linear(d2, d3, bias=False)
        self.embd = nn.Parameter(torch.randn(d1, d2))

    def forward(self):
        a = self.fc1(self.embd)
        return a

def train(model, device, target, optimizer, epochs):
    target = target.to(device)
    for epoch in range(1, epochs + 1):
        optimizer.zero_grad()
        output = model()
        loss = F.mse_loss(output, target)
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch} complete")

if not torch.cuda.is_available():
    raise RuntimeError("can't experiment without cuda")

device = torch.device("cuda")
# device = torch.device("cpu")

d1, d2, d3 = 30000, 100, 200  # d1 = 20000 completes fine; 30000 triggers the GPU loss

target = torch.randn(d1, d3)

model = Net(d1=d1, d2=d2, d3=d3).to(device)
optimizer = optim.Adagrad(model.parameters(), lr=0.5, weight_decay=0.3)
train(model, device, target, optimizer, 1000)

The torch version is 1.0, installed through pip, with the bundled CUDA 9. The Python version is 3.7.2.
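If it helps, these versions can be double-checked from inside the interpreter; a quick sketch that just prints whatever the local install and driver report:

import sys
import torch

print(sys.version)                    # Python, 3.7.2 here
print(torch.__version__)              # PyTorch, 1.0 here
print(torch.version.cuda)             # CUDA version PyTorch was built against
print(torch.cuda.get_device_name(0))  # should report the GTX 1080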

This program runs slowly, and a few seconds in, the GPU becomes unresponsive:

$ nvidia-smi
Unable to determine the device handle for GPU 0000:03:00.0: GPU is lost. Reboot the system to recover this GPU

What’s worse, the program keeps running but cannot be terminated with Ctrl-C.

Epoch 654 complete
^CTraceback (most recent call last):
  File "test001.py", line 38, in <module>
    train(model, device, target, optimizer, 1000)
  File "test001.py", line 22, in train
    loss.backward()
  File "/opt/python/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/python/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
KeyboardInterrupt

It shows up as a zombie process in ps:

gereka 2821 109 0.0 0 0 pts/1 Zl+ 18:46 0:52 [python3.7]

and it cannot be killed even with kill -9.

However, the same program with d1 = 20000 runs just fine and completes in a few seconds,
so I suspect this has something to do with the amount of memory used.
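As a rough sanity check on that idea, here is a sketch, reusing the Net and train definitions above, that prints PyTorch's own view of device memory for the failing sizes; back-of-the-envelope, the parameters, the 30000x200 output/target and the Adagrad state only come to a few tens of MB:

# Sketch only: report how much device memory PyTorch actually allocates.
model = Net(d1=30000, d2=100, d3=200).to(device)
target = torch.randn(30000, 200)
optimizer = optim.Adagrad(model.parameters(), lr=0.5, weight_decay=0.3)
train(model, device, target, optimizer, 10)  # a few epochs are enough to reach the peak

print(torch.cuda.memory_allocated() / 2**20, "MiB currently allocated")
print(torch.cuda.max_memory_allocated() / 2**20, "MiB peak allocation")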

During execution I ran while true; do nvidia-smi >> out.txt; sleep 0.5; done to save the output of nvidia-smi every 0.5 seconds.
The outputs for both the working case and the non-working case are attached, along with the output of sudo dmesg and the nvidia-bug-report:
gpu_007.txt (16.9 KB)
dmesg.txt (79.7 KB)
gpu_006.txt (19.2 KB)
nvidia-bug-report_007.log.gz (302 KB)

You’re running into

NVRM: Xid (PCI:0000:03:00): 79, GPU has fallen off the bus.

which means either overheating or an insufficient power supply. Check the temperatures with nvidia-smi while running your application.
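A minimal sketch of such a check, assuming the pynvml bindings (the nvidia-ml-py package) are installed; it polls the temperature, power draw and memory use of GPU 0 every half second, much like the nvidia-smi loop above:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the single GTX 1080

try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0         # mW -> W
        limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"{temp} C  {power:.0f}/{limit:.0f} W  {mem.used / 2**20:.0f} MiB used")
        time.sleep(0.5)
finally:
    pynvml.nvmlShutdown()

Keep in mind the readings are sampled, so a brief power spike right before the card drops off the bus can be missed.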

Attached to the original post is gpu_007.txt, which contains nvidia-smi outputs sampled every 0.5 seconds. The temperature rises to 52 C before the GPU falls off the bus, so I don’t think it’s the temperature. How do I check whether it’s an insufficient power supply?

I only saw that nvidia-smi log after posting.
Check that the power connectors on the card are properly seated, remove and reconnect them, try different connectors from the PSU, test the GTX 1080 in another system to rule out general hardware failure, or replace the PSU.

Also check whether the PSU is switched to ‘eco mode’ (if available), which would limit its output, and of course whether it is large enough in the first place.