We are using NVCaffe to train one of our networks because it has much better support for grouped and depthwise/pointwise convolutions. The speed difference is dramatic: 6 seconds per iteration in Caffe versus 0.68 seconds in NVCaffe! Kudos to the NVidia team for making this possible.
Sadly, however, it consistently crashes after 15020 iterations, whereas Caffe does not.
Here is the log:
iteration 15020 out of 100000 ; output directory: ; worker: 31-7 ; GPU: 2 ; iteration time = 0.68, loss = 0.00
F0508 23:06:20.531841 31336 common.cpp:190] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
    @     0x7f838cd5a5cd  google::LogMessage::Fail()
    @     0x7f838cd5c433  google::LogMessage::SendToLog()
    @     0x7f838cd5a15b  google::LogMessage::Flush()
    @     0x7f838cd5ce1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f838501443e  caffe::CudaStream::CudaStream()
    @     0x7f8385016b7f  caffe::Caffe::pstream()
    @     0x7f838501765f  caffe::Caffe::th_cublas_handle()
    @     0x7f83851446fb  caffe::Net::ReduceAndUpdate()
    @     0x7f8384fd9018  caffe::Solver::Reduce()
    @     0x7f8384fe0bfb  boost::detail::thread_data<>::run()
    @     0x7f83846465d5  (unknown)
    @     0x7f8392ad16ba  start_thread
    @     0x7f839280741d  clone
    @              (nil)  (unknown)
We are using the latest NVCaffe docker image (tag: 18.04-py2) on a server with three GTX 1080 cards, but we do not try to run on multiple GPUs. Not sure if this is related, but strangely the docker instance seems to be creating two processes: one using 117 MB on one of our GPUs and one using 4500 MB on another GPU on the same server.
We run our training scripts directly inside the container: we open a bash shell and then launch training manually at the prompt. Here is how we start the container:
nvidia-docker run -it -v /shared:/shared -v /home/brandon:/workspace nvcr.io/nvidia/caffe:18.04-py2 /bin/bash
We then run our own custom scripts for training afterwards.
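One thing we have been considering, in case the stray process on the second GPU matters: restricting the container to a single device. This is only a sketch of how we might try it; it assumes the GPU we train on is device 2, and uses the NVIDIA_VISIBLE_DEVICES mechanism supported by nvidia-docker (CUDA_VISIBLE_DEVICES inside the container should work similarly):

```shell
# Expose only GPU 2 to the container, so no process can land on the other cards.
# NVIDIA_VISIBLE_DEVICES is honored by the NVIDIA container runtime;
# the device index here (2) is an assumption matching the "GPU: 2" line in our log.
nvidia-docker run -it \
    -e NVIDIA_VISIBLE_DEVICES=2 \
    -v /shared:/shared \
    -v /home/brandon:/workspace \
    nvcr.io/nvidia/caffe:18.04-py2 /bin/bash

# Alternatively, from inside the container, hide the other GPUs from CUDA
# before launching training:
#   export CUDA_VISIBLE_DEVICES=0   # the single visible device is reindexed to 0
```

If the second process disappears after this, that would at least tell us whether the out-of-memory crash is related to the extra allocation on the other GPU.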
Any help or advice on how to debug or fix this would be really appreciated.