Using GeForce Titan X to train deep network model, then report "Check failed: error == cudaSucc

wuya · August 26, 2016, 7:34am

device: GeForce Titan X
cuda: V7.5.17
nvidia driver: 352.39
CentOS release 6.8 (Final)

I installed the NVIDIA driver and CUDA 7.5 successfully, and I compiled Caffe in GPU mode successfully. I can run the simple MNIST demo with train_lenet.sh. But when I train a larger network model, it starts properly and then, after some iterations, reports “Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected”. After that, I must reboot before I can run any Caffe model again. I have not been able to finish training a complete network so far because it is always interrupted.

I checked /var/log/messages:

abrtd: Executable '/home/caffe/.build_release/tools/caffe.bin' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Saved core dump of pid 27875 (/home/caffe/.build_release/tools/caffe.bin) to /var/spool/abrt/ccpp-2016-08-25-15:23:50-27875 (30154752 bytes)

I edited /etc/abrt/abrt-action-save-package-data.conf and changed ProcessUnpackaged = no to ProcessUnpackaged = yes, but it didn’t help.

When I train the deep network model, the error message is:

Check failed: error == cudaSuccess (38 vs. 0)  no CUDA-capable device is detected
*** Check failure stack trace: ***
    @     0x7f238740fb5d  google::LogMessage::Fail()
    @     0x7f2387413b77  google::LogMessage::SendToLog()
    @     0x7f23874119f9  google::LogMessage::Flush()
    @     0x7f2387411cfd  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f238cb63e4e  caffe::Caffe::SetDevice()
    @           0x40ba7f  train()
    @           0x407d5f  main
    @     0x7f237feccd1d  __libc_start_main
    @           0x406f49  (unknown)
./xxl/test_googlenet/train_net.sh: line 1: 15057 Aborted                 (core dumped) ./build/tools/caffe train -solver xxl/test_googlenet/solver.prototxt -weights xxl/test_googlenet/bvlc_googlenet.caffemodel
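
To make it easier to spot this failure in a long training log, the error line above can be picked apart with a small script. This is only a minimal sketch, assuming Python 3 is available on the machine; the function name parse_cuda_check_failure and the regular expression are hypothetical helpers written for illustration and are not part of Caffe or CUDA. It extracts the numeric error code and the message exactly as they appear in the log line.

```python
import re

# Pattern for the glog line that Caffe prints when a CUDA call fails, e.g.
# "Check failed: error == cudaSuccess (38 vs. 0)  no CUDA-capable device is detected"
# (hypothetical helper pattern, based only on the line shown in this post)
CHECK_FAILED = re.compile(
    r"Check failed: error == cudaSuccess \((\d+) vs\. 0\)\s+(.*)"
)


def parse_cuda_check_failure(line):
    """Return (error_code, message), or None if the line is not a CUDA check failure."""
    match = CHECK_FAILED.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


# The error line copied from the log above.
line = ("Check failed: error == cudaSuccess (38 vs. 0)  "
        "no CUDA-capable device is detected")
print(parse_cuda_check_failure(line))
```

Running the same function over every line of a saved training log would show at which point the device stopped being detected.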

But after I reboot, it starts properly.
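
Since the GPU only comes back after a reboot, one way to check whether the driver still sees the card before launching a new training run is to ask nvidia-smi for its device list. The following is a rough sketch under assumptions: it assumes Python 3 and that nvidia-smi is on the PATH, and the function name list_gpus is hypothetical. It only reports what `nvidia-smi -L` prints and does not explain why the device disappears.

```python
import shutil
import subprocess


def list_gpus():
    """Return the 'GPU ...' lines printed by `nvidia-smi -L`, or [] if none are visible."""
    # If the tool is not installed or not on the PATH, report no visible GPUs.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The command could not be started or did not return in time.
        return []
    if result.returncode != 0:
        return []
    # Keep only the lines that describe a device.
    return [l for l in result.stdout.splitlines() if l.startswith("GPU ")]


gpus = list_gpus()
if gpus:
    print("Visible GPUs:")
    for gpu in gpus:
        print("  " + gpu)
else:
    print("No GPU is visible to nvidia-smi")
```

If this prints no GPU right after a failed training run but lists the card again after a reboot, that would match the behaviour described here.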

Have you encountered this situation? Could you help me? Thank you very much!

nvidia-bug-report.log: https://drive.google.com/file/d/0B_9Bs_BdOK6CQTk1dXdtaTdCZFE/view?usp=sharing
nvidia-install.log: https://drive.google.com/file/d/0B_9Bs_BdOK6CTHFoUkFHT2J1UnM/view?usp=sharing
