Hard shutdown problem on Ubuntu 16.04 with RTX 2080 Ti

Hi, I have a strange problem when training a CNN with TensorFlow.
The strange part is that everything works fine until I use a CNN.
The machine can run the simple MNIST tutorial and can play YouTube.
The problem occurs with the MNIST CNN example. The RTX 2080 Ti works well until the final training steps.
Training on the GPU is much faster than on the CPU alone. But when the CNN training is about to complete (e.g., batches of 100, 1000 steps in total, somewhere between step 900 and step 1000), Ubuntu suddenly shuts down and restarts. It also happens when I run nvidia-smi very quickly at the terminal. Because of this, I can't see any error message.

Just in case, I tried limiting GPU memory in TensorFlow, but it didn't help.

Could you give me some clue about this problem?

My versions are:

CUDA 10.0
NVIDIA driver 418.88
TensorFlow 1.14

(Added) When I tried an MNIST RNN, it worked fine…

This sounds like a broader system or hardware issue rather than a problem with TensorFlow. Have you checked the system logs for error messages? The symptoms you describe could be explained by a failing or under-provisioned power supply. I would also recommend upgrading to the latest NVIDIA driver (currently 430.50).
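If it helps, here is a minimal sketch for scanning the log after one of these reboots. It assumes the default Ubuntu log location /var/log/syslog, and the keyword list is only my guess at what is worth searching for (NVRM and Xid are the prefixes the NVIDIA kernel driver uses for its messages). The function name find_suspect_lines is made up for illustration.

```python
import re
from pathlib import Path

# Assumed keyword list; extend as needed.
# "NVRM" and "Xid" mark messages from the NVIDIA kernel driver.
SUSPECT = re.compile(r"NVRM|Xid|thermal|machine check", re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching any suspect keyword."""
    return [(n, line.rstrip("\n"))
            for n, line in enumerate(lines, start=1)
            if SUSPECT.search(line)]


if __name__ == "__main__":
    log_path = Path("/var/log/syslog")  # assumed default location on Ubuntu
    try:
        with log_path.open(errors="replace") as fh:
            for number, text in find_suspect_lines(fh):
                print(f"{number}: {text}")
    except OSError as exc:
        # Missing file or insufficient permissions (try running with sudo).
        print(f"could not read {log_path}: {exc}")
```

Keep in mind that an abrupt loss of power usually leaves nothing in the log at all, so an empty result is itself a hint that points toward the hardware.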

I tried drivers 430.50 and 410, but the problem is still there.

The syslog contained no errors or warnings related to the GPU.

When I monitor the graphics card in real time with "sudo watch nvidia-smi" and run the convolution (CNN) code, the power usage stays under 50 W and the machine shuts down at random. With my code, the shutdown likewise happens after iteration 999.

(my code)

for i in range(1000):
    batch = mnist.train.next_batch(50)
    print(i)
    if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict={
            x: batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g" % (i, train_accuracy))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

It looks like either a malfunctioning power supply or a driver problem.

So, sadly, I couldn't solve the problem with your previous answer.

Should I take this problem to the hardware department?

With nothing appearing in the syslog, I would turn my attention to the power supply. nvidia-smi shows power draw averaged over a small window of time, so transient peaks within that window may draw considerably more power than the reported figure.
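To get a slightly better picture than watch gives you, here is a rough sketch that samples the power reading more often and reports the mean next to the peak. It assumes nvidia-smi is on the PATH and a single GPU, and it uses the --query-gpu=power.draw, --format=csv,noheader,nounits and -lms options. The helper names, the 100 ms interval and the 5 second duration are my own choices. The values nvidia-smi reports may still be smoothed, so very short spikes can go unseen even at this rate.

```python
import subprocess


def parse_power_samples(lines):
    """Convert nvidia-smi CSV lines (one wattage per line) to floats, skipping junk."""
    samples = []
    for line in lines:
        try:
            samples.append(float(line.strip()))
        except ValueError:
            continue  # e.g. "[N/A]" or blank lines
    return samples


def summarize(samples):
    """Return (mean, peak) in watts, or None if there are no samples."""
    if not samples:
        return None
    return sum(samples) / len(samples), max(samples)


def sample_power(seconds=5, interval_ms=100):
    """Run nvidia-smi in loop mode for about `seconds` and return its raw output."""
    cmd = ["nvidia-smi", "--query-gpu=power.draw",
           "--format=csv,noheader,nounits", "-lms", str(interval_ms)]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        out, _ = proc.communicate(timeout=seconds)
    except subprocess.TimeoutExpired:
        proc.terminate()            # loop mode never exits on its own
        out, _ = proc.communicate()  # collect what was printed so far
    return out


if __name__ == "__main__":
    try:
        result = summarize(parse_power_samples(sample_power().splitlines()))
    except FileNotFoundError:
        result = None  # nvidia-smi is not installed or not on the PATH
    if result is None:
        print("no samples collected")
    else:
        print("mean %.1f W, peak %.1f W" % result)
```

Run it in a second terminal while the CNN trains. A peak well above the mean would support the transient-load theory, although a low peak does not rule it out.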

Things to check:

  1. Make sure your PSU is rated for the card's power requirements (https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/). Note that PSUs lose power capacity as they age.
  2. If your PSU has multiple rails, make sure your system is correctly distributing its loads over the rails.
  3. Make sure the two 8-pin connectors are not both fed from the same power cable.

If those are all OK, the PSU may just be faulty. Try replacing it with another unit and see if the problem persists.
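For the first check, a back-of-the-envelope power budget can be sketched as follows. All wattages here are placeholders, not measurements or official figures; take the real numbers from the spec page linked above and from your other components. The 20% safety margin is a rule of thumb I am assuming, not an NVIDIA recommendation, and psu_headroom is a made-up helper name.

```python
def psu_headroom(psu_watts, loads, margin=0.2):
    """Return spare watts after reserving `margin` of the PSU rating as a buffer."""
    usable = psu_watts * (1.0 - margin)
    return usable - sum(loads.values())


# Placeholder figures only; substitute the values for your own system.
example_loads = {"gpu": 250, "cpu": 95, "board_drives_fans": 75}
spare = psu_headroom(psu_watts=650, loads=example_loads)
status = "OK" if spare >= 0 else "under-provisioned"
print(f"{status}: {spare:.0f} W spare")
```

A negative result suggests the PSU is undersized on paper. A positive one says nothing about an aged or faulty unit, which is why swapping in another PSU is still the most direct test.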