CUDA/Caffe slows down and Check failed: error == cudaSuccess (30 vs. 0) unknown error

Alex1980 · March 3, 2017, 6:05pm

I run Caffe (FCN 8stride) on CUDA 8.0 with cuDNN 5.1 on a Tesla K40 with 12 GB of GPU memory, in a machine with 64 GB of RAM. The network works perfectly for about 120-130 images and then starts to slow down, from 0.6 s/img to 4.5 s/img, and then fails with this:

F0303 16:00:33.861330 17604 math_functions.cu:79] Check failed: error == cudaSuccess (30 vs. 0) unknown error

I looked into the function at that line, and it is:

void caffe_gpu_memcpy(const size_t N, const void* X, void* Y) {
  if (X != Y) {
    CUDA_CHECK(cudaMemcpy(Y, X, N, cudaMemcpyDefault));  // NOLINT(caffe/alt_fn)
  }
}

I tried a few quick fixes, but nothing worked. Any suggestions? Something to do with memory?

Robert_Crovella · March 5, 2017, 9:51pm

Sounds like you’re running out of host memory.

Monitor memory usage by the process, e.g. with top up to the point of failure.
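
For readers who would rather log the numbers than watch top by hand, here is a minimal Python sketch of the same check. It assumes a Linux host, where the kernel reports a process's resident memory as the VmRSS field of /proc/<pid>/status. The helper names parse_vm_rss_kb and log_rss are made up for this illustration and are not part of Caffe or CUDA.

```python
import re
import time


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def log_rss(pid, interval_s=5.0, samples=None):
    """Print the resident memory of `pid` every `interval_s` seconds.

    Stops after `samples` readings (if given) or when the process exits.
    """
    count = 0
    while samples is None or count < samples:
        try:
            with open(f"/proc/{pid}/status") as fh:
                rss_kb = parse_vm_rss_kb(fh.read())
        except (FileNotFoundError, ProcessLookupError):
            print("process has exited")
            return
        if rss_kb is not None:
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} pid={pid} rss={rss_kb / 1024:.1f} MiB")
        count += 1
        time.sleep(interval_s)
```

Running log_rss with the PID of the caffe process in a second terminal gives a timestamped trace. A resident size that keeps growing until the failure would support the host-memory theory, while a flat trace would argue against it.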

Alex1980 · March 6, 2017, 10:49am

Thanks. Could you be a bit more specific as to how I can monitor memory use and top-up? Sorry, I only recently started working with GPGPU.

Robert_Crovella · March 6, 2017, 2:13pm

You seem to have mentioned in your cross-posting:

http://stackoverflow.com/questions/42625904/check-failed-error-cudasuccess-30-vs-0-while-running-inference-stage-pe

that memory usage is not an issue. The suggestion there to monitor GPU temperature of your K40m is a good one.

You can do this with nvidia-smi.

Whether on Linux or Windows, open an additional command prompt/terminal window and run nvidia-smi in a loop there. This additional window can remain open while you start your Caffe run in another window.

To learn how to use nvidia-smi, including how to make it loop automatically, use

nvidia-smi --help

but briefly,

nvidia-smi -l

should be good enough for this monitoring purpose.
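
If a log that can be inspected after the crash is more useful than a scrolling terminal, nvidia-smi can also emit selected fields as CSV through its --query-gpu and --format options. The Python sketch below wraps that call. The function names (parse_gpu_csv, query_gpus, monitor) and the choice of three fields are assumptions made for this example, and some fields can come back as a bracketed placeholder such as [Not Supported] on certain boards, which the parser maps to None.

```python
import subprocess
import time

# Fields requested from nvidia-smi; the order must match the parser below.
QUERY_FIELDS = "temperature.gpu,utilization.gpu,memory.used"


def _to_int(text):
    """Convert a CSV field to int, or None for placeholders like [N/A]."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_gpu_csv(output):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    readings = []
    for line in output.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != 3:
            continue  # skip anything that is not a three-field data row
        temp, util, mem = values
        readings.append({
            "temp_c": _to_int(temp),
            "util_pct": _to_int(util),
            "mem_mib": _to_int(mem),
        })
    return readings


def query_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def monitor(interval_s=5.0):
    """Print a timestamped reading every `interval_s` seconds until Ctrl+C."""
    while True:
        print(time.strftime("%H:%M:%S"), query_gpus())
        time.sleep(interval_s)
```

Calling monitor() in a second terminal while Caffe runs shows whether the slowdown lines up with rising temperature, with memory use, or with neither.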

Alex1980 · March 6, 2017, 5:01pm

@txbob:

You are totally right, it is the temperature! The GPU went from 30 to 90 in a few minutes! One obvious solution is to pause the inference for some time every N steps. Would you recommend something else? There are probably better solutions that don’t take so much time.
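
The pausing idea could be driven by the measured temperature instead of a fixed step count, so the loop only waits when the card is actually hot. The Python sketch below is a workaround for the symptom only and does nothing about cooling. The function name run_with_cooldown, the callables run_step and read_temp_c, and the 80/65 degree thresholds are all assumptions chosen for illustration, not values from NVIDIA documentation.

```python
import time


def run_with_cooldown(items, run_step, read_temp_c,
                      pause_above_c=80, resume_below_c=65,
                      poll_s=5.0, sleep=time.sleep):
    """Process `items` one by one, pausing while the GPU is too hot.

    run_step(item) performs one inference and returns its result.
    read_temp_c() returns the current GPU temperature in degrees C.
    Both callables are supplied by the caller.
    """
    results = []
    for item in items:
        if read_temp_c() >= pause_above_c:
            # Hysteresis: once paused, wait until the card has cooled to the
            # lower threshold so the loop does not flip on and off.
            while read_temp_c() > resume_below_c:
                sleep(poll_s)
        results.append(run_step(item))
    return results
```

Here read_temp_c could wrap a call to nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits, and run_step could wrap one forward pass of the network. Keeping both as parameters means the throttling logic can be tested without a GPU.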

Robert_Crovella · March 6, 2017, 6:32pm

You apparently have a K40m that is in a system that was not designed for it.

This is typical of the trouble you run into.

The right thing to do is to get a K40c, or else get that K40m installed in a properly certified OEM server.
