NVCaffe crash

We are using NVCaffe to train one of our networks because it has much better support for grouped and depthwise/pointwise convolutions. The speed difference is dramatic: 6 seconds per iteration in Caffe versus 0.68 seconds in NVCaffe! Kudos to the NVIDIA team for making this possible.

Sadly, however, it consistently crashes after 15020 iterations, whereas Caffe does not.

Here is the LOG:

iteration 15020 out of 100000 ; output directory:                                                 ; worker: 31-7  ; GPU: 2 ; iteration time = 0.68, loss = 0.00
F0508 23:06:20.531841 31336 common.cpp:190] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
    @     0x7f838cd5a5cd  google::LogMessage::Fail()
    @     0x7f838cd5c433  google::LogMessage::SendToLog()
    @     0x7f838cd5a15b  google::LogMessage::Flush()
    @     0x7f838cd5ce1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f838501443e  caffe::CudaStream::CudaStream()
    @     0x7f8385016b7f  caffe::Caffe::pstream()
    @     0x7f838501765f  caffe::Caffe::th_cublas_handle()
    @     0x7f83851446fb  caffe::Net::ReduceAndUpdate()
    @     0x7f8384fd9018  caffe::Solver::Reduce()
    @     0x7f8384fe0bfb  boost::detail::thread_data<>::run()
    @     0x7f83846465d5  (unknown)
    @     0x7f8392ad16ba  start_thread
    @     0x7f839280741d  clone
    @              (nil)  (unknown)

We are using the latest NVCaffe Docker image, tag 18.04-py2, on a server with three GTX 1080 cards, although we are not trying to run on multiple GPUs. Not sure if this is related, but strangely the Docker container seems to create two processes: one using 117 MB on one of our GPUs and another using 4500 MB on a different GPU in the same server.
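For what it's worth, this is roughly how we are checking the memory split (a sketch run on the host; the container ID is a placeholder taken from docker ps):

nvidia-smi                  # per-process table shows the ~117 MB and ~4500 MB allocations on different GPUs
docker top <container-id>   # lists the PIDs running inside our container, to compare against the nvidia-smi output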

We are running our training scripts directly inside the Docker container by first opening a bash shell and then launching training manually at the prompt. Here is how we start the container:

nvidia-docker run -it -v /shared:/shared -v /home/brandon:/workspace nvcr.io/nvidia/caffe:18.04-py2  /bin/bash

We then run our own custom training scripts, roughly as sketched below.
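For context, the manual launch inside the container looks roughly like this (a sketch only; the solver path is a placeholder, and our real scripts wrap this invocation):

cd /workspace
caffe train --solver=models/our_net/solver.prototxt --gpu=2 2>&1 | tee train.log   # device index as reported in the log line above; glog writes to stderr, so tee captures the full log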

Any help and advice on how to proceed to debug or fix the situation would be really appreciated.

Hi @twerdster, thanks for reporting this. Please try the following:

  1. Check the 18.05 release.
  2. Add --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 to the docker run command (see the example below).

If nothing helps, please attach the full Caffe log.
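For reference, the launch command above with those flags added would look roughly like this (a sketch; keep your own mounts and whichever image tag you are testing):

nvidia-docker run -it --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v /shared:/shared -v /home/brandon:/workspace nvcr.io/nvidia/caffe:18.04-py2 /bin/bash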