Hard shutdown problem on Ubuntu 16.04 with RTX 2080 Ti

Hi, I have a weird problem when running a CNN with TensorFlow.
The weird part is that everything works fine until I use a CNN: the machine runs the simple MNIST tutorial without trouble and can even play YouTube.
The problem only occurs with the CNN version of MNIST. The RTX 2080 Ti works well up to the final training steps, and training on the GPU is much faster than CPU-only. But just as the CNN training is about to finish (e.g., with batches of 100 and 1000 total steps, somewhere between step 900 and step 1000), Ubuntu suddenly shuts down and restarts. It also happens when I run nvidia-smi very quickly at the terminal. Because of this, I can't capture any error message.

Just in case, I tried limiting TensorFlow's GPU memory usage, but it didn't help.
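For reference, this is roughly what I tried, using the standard TF 1.x session config (the 0.5 fraction is just an example value):

import tensorflow as tf

# Cap the fraction of GPU memory the process may allocate, and/or let the
# allocation grow on demand instead of grabbing everything up front.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # example value
config.gpu_options.allow_growth = True

# InteractiveSession so the training loop's .eval()/.run() calls have a default session.
sess = tf.InteractiveSession(config=config)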

Could you give me some clue about this problem?

My versions are:

Ubuntu 16.04 LTS
CUDA 10.0
NVIDIA driver 418.88
Python 3.5.2
TensorFlow 1.14

(Added) When I tried the MNIST RNN example, it worked fine…

This sounds like a broader system or hardware issue rather than a problem with TensorFlow. Have you checked the system logs for error messages? The symptoms you describe could be explained by a failing or under-provisioned power supply. I would also recommend upgrading to the latest NVIDIA driver (currently 430.50).

I tried drivers 430.50 and 410, but it still didn't work.

In the syslog, there were no error or warning messages related to the GPU.

If I monitor the graphics card in real time with "sudo watch nvidia-smi" while running the convolution (CNN) code, its power usage stays under 50 W, yet the machine still shuts down at random. The shutdown happens the same way, right after step 999.

(my code)

for i in range(1000):
    batch = mnist.train.next_batch(50)
    print(i)
    if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict={
            x: batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g" % (i, train_accuracy))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
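For context, the loop above assumes the usual graph from the old "Deep MNIST for Experts" tutorial. A minimal sketch of how x, y_, keep_prob, accuracy, and train_step are defined (the layer sizes here are illustrative, not necessarily identical to my exact network):

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data", one_hot=True)

x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])
keep_prob = tf.placeholder(tf.float32)

# Two conv/pool blocks, one dense layer with dropout, then the logits.
x_image = tf.reshape(x, [-1, 28, 28, 1])
conv1 = tf.layers.conv2d(x_image, 32, 5, padding="same", activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(conv1, 2, 2)
conv2 = tf.layers.conv2d(pool1, 64, 5, padding="same", activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(conv2, 2, 2)

flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
fc1 = tf.layers.dense(flat, 1024, activation=tf.nn.relu)
fc1_drop = tf.nn.dropout(fc1, keep_prob=keep_prob)
y_conv = tf.layers.dense(fc1_drop, 10)

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())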

It looks like a problem with a failing power supply or a driver issue.

So, sadly, I couldn't solve the problem using your previous answer.

Should I take this problem to the hardware department?

With nothing appearing in the syslogs, I would turn my attention to the power supply. nvidia-smi reports power utilization averaged over a small window of time; within that window the card may have transient peaks that draw considerably more power.

Things to check:

  1. Make sure your PSU is rated for the card's power requirements (https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/). Note that PSUs lose capacity as they age.
  2. If your PSU has multiple rails, make sure your system is correctly distributing its loads over the rails.
  3. Make sure both 8-pin connectors are not wired to the same power cable.

If those are all OK, the PSU may just be faulty. Try replacing it with another unit and see if the problem persists.
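If you want to try to catch a transient spike that the default watch interval averages away, you could log power draw at a faster rate and inspect the file after the reboot (a hard power-off can still lose the last buffered lines). A rough sketch on my part; the query fields and the -lms flag are standard nvidia-smi options, and the output file name is arbitrary:

import subprocess

# Sample power draw every 100 ms and write it to a file that survives the reboot.
cmd = [
    "nvidia-smi",
    "--query-gpu=timestamp,power.draw,utilization.gpu",
    "--format=csv",
    "-lms", "100",
]
with open("gpu_power_log.csv", "w") as log:
    subprocess.run(cmd, stdout=log)  # runs until killed or the machine shuts down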