CUDA/Caffe slows down and Check failed: error == cudaSuccess (30 vs. 0) unknown error

I run Caffe (FCN 8stride) on CUDA 8.0 with cuDNN 5.1 on a Tesla K40 with 12 GB of GPU memory and 64 GB of host RAM. The network works perfectly for about 120-130 images and then starts to slow down, from 0.6 s/img to 4.5 s/img, and then fails with this error:

F0303 16:00:33.861330 17604 math_functions.cu:79] Check failed: error == cudaSuccess (30 vs. 0) unknown error

I looked up the function at that line; it is:

void caffe_gpu_memcpy(const size_t N, const void* X, void* Y) {
  if (X != Y) {
    CUDA_CHECK(cudaMemcpy(Y, X, N, cudaMemcpyDefault));  // NOLINT(caffe/alt_fn)
  }
}

I tried a few quick fixes, but nothing worked. Any suggestions? Something to do with memory?

Sounds like you’re running out of host memory.

Monitor the memory usage of the process, e.g. with top, up to the point of failure.
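If top is awkward to watch by hand, a small script can log the same number. This is only a sketch, assuming Linux (it reads /proc/<pid>/status) and that you know the PID of the caffe process; the function names parse_vmrss_kb and watch_rss are made up for illustration and are not part of Caffe or CUDA.

```python
import re
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS (resident memory) value in kB from the text of
    /proc/<pid>/status, or None if the line is missing."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def watch_rss(pid, interval_s=5.0):
    """Print the resident host memory of process `pid` every `interval_s`
    seconds until the process exits. Linux only (hypothetical helper)."""
    path = "/proc/%d/status" % pid
    while True:
        try:
            with open(path) as f:
                rss_kb = parse_vmrss_kb(f.read())
        except (FileNotFoundError, ProcessLookupError):
            print("process %d has exited" % pid)
            return
        if rss_kb is not None:
            print("%s  RSS = %.1f MiB" % (time.strftime("%H:%M:%S"), rss_kb / 1024.0))
        time.sleep(interval_s)
```

Call watch_rss(pid) from a second terminal with the PID of the running job. If the printed value keeps climbing toward the amount of installed RAM before the failure, host memory is the likely culprit; if it stays flat, look elsewhere.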

Thanks. Could you be a bit more specific as to how I can monitor memory use and top-up? Sorry, I only recently started working with GPGPU.

You seem to have mentioned in your cross-posting:

http://stackoverflow.com/questions/42625904/check-failed-error-cudasuccess-30-vs-0-while-running-inference-stage-pe

that memory usage is not an issue. The suggestion there to monitor GPU temperature of your K40m is a good one.

You can do this with nvidia-smi.

Whether on Linux or Windows, open an additional command prompt/terminal window and run nvidia-smi in a loop there. This additional window can remain open while you start your caffe run in another window.

To learn how to use nvidia-smi, including how to make it loop automatically, use

nvidia-smi --help

but briefly,

nvidia-smi -l

should be good enough for this monitoring purpose.
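If you would rather log the numbers than watch the full nvidia-smi table, the query mode of nvidia-smi prints plain CSV that is easy to parse. The sketch below assumes nvidia-smi is on the PATH; the helper names parse_gpu_csv and read_gpus are invented for this example, and the three fields queried are just one reasonable choice.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = "temperature.gpu,utilization.gpu,memory.used"


def _to_int(value):
    """Convert one CSV field to int; return None for non-numeric values
    such as "[Not Supported]"."""
    try:
        return int(value.strip())
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        temp_c, util_pct, mem_mib = (_to_int(v) for v in line.split(","))
        rows.append({"temp_c": temp_c, "util_pct": util_pct, "mem_mib": mem_mib})
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed readings for all GPUs."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=" + QUERY_FIELDS,
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_gpu_csv(out)
```

Calling read_gpus() every few seconds while the caffe job runs gives a temperature, utilization, and memory trace that you can line up against the point where the slowdown starts.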

@txbob:

You are totally right, it is the temperature! The GPU went from 30 to 90 °C in a few minutes! One obvious solution is to pause the inference for some time every N steps. Would you recommend something else? Perhaps there are better solutions that don't take so much time.
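For what it is worth, the pausing idea could be driven by the measured temperature instead of a fixed N. The following is a rough sketch only: run_with_cooldown is a hypothetical helper, read_temp_c is whatever callable you supply that returns the current GPU temperature in °C, and the 80/60 thresholds are placeholder assumptions, not values recommended by NVIDIA. It is a workaround for inadequate cooling, not a fix for it.

```python
import time


def run_with_cooldown(items, process, read_temp_c,
                      pause_above_c=80, resume_below_c=60,
                      poll_s=10.0, sleep=time.sleep):
    """Process `items` one by one, pausing whenever the GPU gets too hot.

    Before each item the temperature is read; if it is at or above
    `pause_above_c`, the loop sleeps in `poll_s` steps until the reading
    is at or below `resume_below_c`. All thresholds are assumptions.
    """
    results = []
    for item in items:
        if read_temp_c() >= pause_above_c:
            # Wait for the card to cool before submitting more work.
            while read_temp_c() > resume_below_c:
                sleep(poll_s)
        results.append(process(item))
    return results
```

Here process would be the per-image inference call. Using two thresholds (pause high, resume low) avoids the loop flapping on and off around a single temperature.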

You apparently have a K40m that is in a system that was not designed for it.

This is typical of the trouble you run into.

The right thing to do is to get a K40c, or else get that K40m installed in a properly certified OEM server.