NVCaffe training out of memory

I am trying to train a convolutional neural network on NVCaffe and I am getting what seems to be a memory-related issue.

I am running the NVCaffe 17.12 docker container (which comes packaged with CUDA 9) on Ubuntu 16.04. I am launching the container with the command:

sudo nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -ti nvcr.io/nvidia/caffe:17.12

The version of docker is ‘Docker version 17.09.1-ce, build 19e2cf6’. In terms of hardware, I am training on a Titan V GPU with 12 GB of video RAM and NVIDIA driver 387.34. The training prototxt specifies the FLOAT32 type for both the forward and backward passes.

The training crashes with the following message:

I1220 21:57:46.192134 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv1_1’ with space 4.37G 3/1 1 1 0 (avail 0.06G, req 0G) t: 0 0 1.9
I1220 21:57:46.753433 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv1_2’ with space 4.37G 64/1 1 1 0 (avail 0.06G, req 0G) t: 0 8.64 12.86
I1220 21:57:47.019194 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv2_1’ with space 4.37G 64/1 7 1 1 (avail 0.06G, req 0.94G) t: 0 4.42 5.16
I1220 21:57:47.422283 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv2_2’ with space 4.37G 128/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 6.49 6.97
I1220 21:57:47.618635 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv3_1’ with space 4.37G 128/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 3 2.09
I1220 21:57:47.983085 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv3_2’ with space 4.37G 256/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 5.07 3.56
I1220 21:57:48.351124 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv3_3’ with space 4.37G 256/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 5.11 3.57
I1220 21:57:48.719410 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv3_4’ with space 4.37G 256/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 5.08 3.57
I1220 21:57:48.915444 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv4_1’ with space 4.37G 256/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 2.64 1.78
I1220 21:57:49.292488 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv4_2’ with space 4.37G 512/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 4.7 3.22
I1220 21:57:49.664621 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv4_3’ with space 4.37G 512/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 4.7 3.06
I1220 21:57:50.042997 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv4_4’ with space 4.37G 512/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 4.7 3.21
I1220 21:57:50.242841 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv5_1_s0’ with space 4.37G 512/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 2.48 1.78
I1220 21:57:50.344413 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv5_2_s0’ with space 4.37G 256/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 0.83 0.6
I1220 21:57:50.417989 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s1_conv1_1_joint_vec’ with space 4.37G 128/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 0.58 0.52
I1220 21:57:50.504339 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s1_conv1_2_joint_vec’ with space 4.37G 160/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 0.74 0.7
I1220 21:57:50.590241 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s1_conv1_3_joint_vec’ with space 4.37G 160/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 0.74 0.69
I1220 21:57:50.728608 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s1_conv1_4_joint_vec’ with space 4.37G 160/1 1 0 0 (avail 0.06G, req 1.25G) t: 0 0.67 0.68
I1220 21:57:50.801311 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s1_conv1_5_joint_vec’ with space 4.37G 640/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.28 0.66
I1220 21:57:51.056468 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_1_joint_vec’ with space 4.37G 191/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.87 4.91
I1220 21:57:51.272598 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_2_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.57 3.75
I1220 21:57:51.488917 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_3_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.71
I1220 21:57:51.705770 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_4_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.74
I1220 21:57:51.920711 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_5_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.64 3.79
I1220 21:57:51.967404 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_6_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.2 0.37
I1220 21:57:51.992983 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_7_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.15 0.17
I1220 21:57:52.247993 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_1_joint_vec’ with space 4.37G 191/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.85 4.96
I1220 21:57:52.465217 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_2_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.62 3.78
I1220 21:57:52.682094 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_3_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.75
I1220 21:57:52.899735 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_4_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.61 3.75
I1220 21:57:53.117147 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_5_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.6 3.79
I1220 21:57:53.164228 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_6_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.19 0.35
I1220 21:57:53.189612 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_7_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.13 0.16
I1220 21:57:53.444718 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_1_joint_vec’ with space 4.37G 191/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.82 4.9
I1220 21:57:53.667579 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_2_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.62 3.81
I1220 21:57:53.885004 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_3_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.61 3.78
I1220 21:57:54.103643 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_4_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.78
I1220 21:57:54.320942 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_5_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.62 3.78
I1220 21:57:54.370590 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_6_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.19 0.35
I1220 21:57:54.395680 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_7_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.13 0.15
I1220 21:57:54.649092 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_1_joint_vec’ with space 4.37G 191/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.82 4.89
I1220 21:57:54.868311 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_2_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.8
I1220 21:57:55.088066 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_3_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.72
I1220 21:57:55.306232 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_4_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.57 3.78
I1220 21:57:55.522558 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_5_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.59 3.78
I1220 21:57:55.568832 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_6_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.18 0.33
I1220 21:57:55.592062 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_7_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.12 0.14
I1220 21:57:55.847193 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_1_joint_vec’ with space 4.37G 191/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.84 4.91
I1220 21:57:56.064046 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_2_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.72
I1220 21:57:56.281412 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_3_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.61 3.76
I1220 21:57:56.501744 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_4_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.57 3.7
I1220 21:57:56.717321 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_5_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.61 3.8
I1220 21:57:56.764864 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_6_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.2 0.34
I1220 21:57:56.789643 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_7_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.13 0.16
*** Aborted at 1513807076 (unix time) try "date -d @1513807076" if you are using GNU date ***
PC: @ 0x7f90aa726b60 caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
*** SIGSEGV (@0x0) received by PID 4640 (TID 0x7f90abacb0c0) from PID 0; stack trace: ***
@ 0x7f90a83b84b0 (unknown)
@ 0x7f90aa726b60 caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
@ 0x7f90aa73cce1 caffe::CuDNNConvolutionLayer<>::Reshape()
@ 0x7f90aa52bf0a caffe::Layer<>::Forward()
@ 0x7f90aa8a50fb caffe::Net::ForwardFromTo()
@ 0x7f90aa8a5267 caffe::Net::Forward()
@ 0x7f90aa8a8a45 caffe::Net::ForwardBackward()
@ 0x7f90aa885f65 caffe::Solver::Step()
@ 0x7f90aa887bc0 caffe::Solver::Solve()
@ 0x40f85d train()
@ 0x40c198 main
@ 0x7f90a83a3830 __libc_start_main
@ 0x40ca09 _start
@ 0x0 (unknown)
Segmentation fault (core dumped)

The log appears to say that certain cuDNN convolution layers require 1.25 GB of GPU memory, but only 0.06 GB is available for some reason. I tried increasing the --shm-size and --ulimit stack parameters of the docker container launch command, but the training still crashes at the same place with the same 0.06G message. This is particularly odd since the exact same training setup (same solver/training prototxt and data set) was previously used successfully for training on (vanilla) caffe. Any idea why I am getting this error? The reason we want to move to NVCaffe is to take advantage of FP16 training on Volta GPUs.
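The avail/req comparison described above can be checked mechanically instead of by eye. The following is a minimal Python sketch, assuming the "avail" and "req" fields mean available and requested workspace in GB, as the post reads them; the function names parse_conv_algo_line and layers_short_of_workspace are hypothetical helpers, not part of NVCaffe.

```python
import re

# The forum software turned the quotes around layer names into curly quotes,
# so accept straight and curly variants alike.
QUOTE = "['\"\u2018\u2019]"

# Matches NVCaffe lines of the form:
#   ... Conv Algos (F,BD,BF): 'conv2_1' with space 4.37G 64/1 7 1 1 (avail 0.06G, req 0.94G) t: ...
LINE_RE = re.compile(
    r"Conv Algos \(F,BD,BF\): " + QUOTE + r"(?P<layer>.+?)" + QUOTE
    + r" with space (?P<space>[\d.]+)G.*?"
    + r"\(avail (?P<avail>[\d.]+)G, req (?P<req>[\d.]+)G\)"
)


def parse_conv_algo_line(line):
    """Return the layer name and the three sizes (in GB), or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "layer": match.group("layer"),
        "space_gb": float(match.group("space")),
        "avail_gb": float(match.group("avail")),
        "req_gb": float(match.group("req")),
    }


def layers_short_of_workspace(lines):
    """List layers whose 'req' value exceeds the 'avail' value on the same line.

    Assumption: 'req' > 'avail' means the layer asked for more workspace
    than the log reported as available.
    """
    short = []
    for line in lines:
        record = parse_conv_algo_line(line)
        if record is not None and record["req_gb"] > record["avail_gb"]:
            short.append(record["layer"])
    return short
```

Feeding the saved training log to layers_short_of_workspace (for example, open("train.log") with a file name of your choosing) gives the list of layers to look at first.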

Do you have other GPUs in that system besides the Titan V?
What does nvidia-smi show for memory utilization while the training is running?
What does nvidia-smi show for memory utilization when the GPU is idle?
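To capture those nvidia-smi numbers for every GPU in the system, something like the following Python sketch may help. It assumes nvidia-smi is on the PATH and uses its --query-gpu and --format=csv,noheader,nounits options; the helper names parse_memory_csv and query_gpu_memory are hypothetical.

```python
import csv
import io
import subprocess

# Ask for one CSV row per GPU: index, name, used and total memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Turn the CSV text printed by the query above into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        index, name, used, total = (field.strip() for field in fields)
        used_mib, total_mib = int(used), int(total)
        rows.append({
            "index": int(index),
            "name": name,
            "used_mib": used_mib,
            "total_mib": total_mib,
            "used_pct": 100.0 * used_mib / total_mib,
        })
    return rows


def query_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

Calling query_gpu_memory() once while the GPU is idle and again in a loop during training would answer both questions, and would also show whether the second GPU is holding any memory.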

Thank you for your reply. As a matter of fact, I do have a second Titan V in the same system, but I only train on a single GPU for now. This is the command I use for training:

caffe train --solver=/data/TrainingParameters/solver.prototxt --gpu=0

Based on your suggestion, I ran nvidia-smi during training and it revealed that the peak memory usage is 97%! I subsequently reduced the size of the input images, but the training still crashes at the same place:

I1221 17:57:15.745223 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): ‘s5_conv1_7_joint_vec’ with space 6.25G 160/1 0 0 0 (avail 3.92G, req 0.08G) t: 0 0.04 0.03
I1221 17:57:15.789032 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): ‘s6_conv1_1_joint_vec’ with space 6.25G 191/1 4 0 1 (avail 3.92G, req 0.08G) t: 0 0.12 0.16
I1221 17:57:15.831817 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): ‘s6_conv1_2_joint_vec’ with space 6.25G 160/1 4 0 1 (avail 3.92G, req 0.08G) t: 0 0.13 0.12
I1221 17:57:15.857851 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): ‘s6_conv1_3_joint_vec’ with space 6.25G 160/1 4 0 1 (avail 3.92G, req 0.08G) t: 0 0.13 0.16
I1221 17:57:15.890970 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): ‘s6_conv1_4_joint_vec’ with space 6.25G 160/1 4 0 1 (avail 3.92G, req 0.08G) t: 0 0.12 0.16
I1221 17:57:15.925066 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): ‘s6_conv1_5_joint_vec’ with space 6.25G 160/1 4 0 1 (avail 3.92G, req 0.08G) t: 0 0.14 0.13
I1221 17:57:15.932271 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): ‘s6_conv1_6_joint_vec’ with space 6.25G 160/1 0 0 0 (avail 3.92G, req 0.08G) t: 0 0.05 0.04
I1221 17:57:15.938091 4876 cudnn_conv_layer.cpp:900] [0] Conv Algos (F,BD,BF): ‘s6_conv1_7_joint_vec’ with space 6.25G 160/1 0 0 0 (avail 3.92G, req 0.08G) t: 0 0.04 0.04
*** Aborted at 1513879035 (unix time) try "date -d @1513879035" if you are using GNU date ***
PC: @ 0x7f1eed75bb60 caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
*** SIGSEGV (@0x0) received by PID 4876 (TID 0x7f1eeeb000c0) from PID 0; stack trace: ***
@ 0x7f1eeb3ed4b0 (unknown)
@ 0x7f1eed75bb60 caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
@ 0x7f1eed771ce1 caffe::CuDNNConvolutionLayer<>::Reshape()
@ 0x7f1eed560f0a caffe::Layer<>::Forward()
@ 0x7f1eed8da0fb caffe::Net::ForwardFromTo()
@ 0x7f1eed8da267 caffe::Net::Forward()
@ 0x7f1eed8dda45 caffe::Net::ForwardBackward()
@ 0x7f1eed8baf65 caffe::Solver::Step()
@ 0x7f1eed8bcbc0 caffe::Solver::Solve()
@ 0x40f85d train()
@ 0x40c198 main
@ 0x7f1eeb3d8830 __libc_start_main
@ 0x40ca09 _start
@ 0x0 (unknown)

Notice that the available memory is no longer smaller than the required memory, yet the training still crashes at the same place with essentially the same error. The peak memory usage was 87% right before the crash. The memory footprint was nowhere near that high when training on vanilla caffe. What could be the cause of this issue on NVCaffe?

By pulling a container from nvcr.io you are effectively using NGC, right? (That is, you are logging into the NGC container repository?)

You might want to ask this question on the NGC forums:

https://devtalk.nvidia.com/default/board/231/container-nvcaffe/

At the moment, I don’t have any immediate ideas. If you are using NGC, then I would certainly recommend following the NGC setup guide for Titan V:

http://docs.nvidia.com/ngc/ngc-titan-setup-guide/index.html

and in other respects following instructions for proper use of NGC, but at the moment I don’t see anything you’ve reported here that looks incorrect (for NGC usage).

It might be that a complete test case would be needed (e.g. your caffe prototxt, etc.). It could be a bug in NVCaffe or cuDNN. I’m not able to say based on what is posted here, but there are plenty of smart folks watching the NGC forums and they may be able to spot something.