I am trying to train a convolutional neural network on NVCaffe and I am getting what seems to be a memory-related issue.
I am running the NVCaffe 17.12 Docker container (which comes packaged with CUDA 9) on Ubuntu 16.04. I am launching the container with the command:
sudo nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -ti nvcr.io/nvidia/caffe:17.12
The Docker version is ‘Docker version 17.09.1-ce, build 19e2cf6’. In terms of hardware, I am training on a Titan V GPU with 12 GB of video RAM and NVIDIA driver 387.34. The training prototxt specifies the FLOAT32 type for both the forward and backward passes.
The training crashes with the following message:
I1220 21:57:46.192134 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv1_1’ with space 4.37G 3/1 1 1 0 (avail 0.06G, req 0G) t: 0 0 1.9
I1220 21:57:46.753433 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv1_2’ with space 4.37G 64/1 1 1 0 (avail 0.06G, req 0G) t: 0 8.64 12.86
I1220 21:57:47.019194 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv2_1’ with space 4.37G 64/1 7 1 1 (avail 0.06G, req 0.94G) t: 0 4.42 5.16
I1220 21:57:47.422283 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv2_2’ with space 4.37G 128/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 6.49 6.97
I1220 21:57:47.618635 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv3_1’ with space 4.37G 128/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 3 2.09
I1220 21:57:47.983085 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv3_2’ with space 4.37G 256/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 5.07 3.56
I1220 21:57:48.351124 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv3_3’ with space 4.37G 256/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 5.11 3.57
I1220 21:57:48.719410 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv3_4’ with space 4.37G 256/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 5.08 3.57
I1220 21:57:48.915444 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv4_1’ with space 4.37G 256/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 2.64 1.78
I1220 21:57:49.292488 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv4_2’ with space 4.37G 512/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 4.7 3.22
I1220 21:57:49.664621 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv4_3’ with space 4.37G 512/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 4.7 3.06
I1220 21:57:50.042997 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv4_4’ with space 4.37G 512/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 4.7 3.21
I1220 21:57:50.242841 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv5_1_s0’ with space 4.37G 512/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 2.48 1.78
I1220 21:57:50.344413 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv5_2_s0’ with space 4.37G 256/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 0.83 0.6
I1220 21:57:50.417989 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s1_conv1_1_joint_vec’ with space 4.37G 128/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 0.58 0.52
I1220 21:57:50.504339 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s1_conv1_2_joint_vec’ with space 4.37G 160/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 0.74 0.7
I1220 21:57:50.590241 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s1_conv1_3_joint_vec’ with space 4.37G 160/1 7 5 5 (avail 0.06G, req 1.25G) t: 0 0.74 0.69
I1220 21:57:50.728608 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s1_conv1_4_joint_vec’ with space 4.37G 160/1 1 0 0 (avail 0.06G, req 1.25G) t: 0 0.67 0.68
I1220 21:57:50.801311 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s1_conv1_5_joint_vec’ with space 4.37G 640/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.28 0.66
I1220 21:57:51.056468 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_1_joint_vec’ with space 4.37G 191/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.87 4.91
I1220 21:57:51.272598 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_2_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.57 3.75
I1220 21:57:51.488917 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_3_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.71
I1220 21:57:51.705770 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_4_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.74
I1220 21:57:51.920711 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_5_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.64 3.79
I1220 21:57:51.967404 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_6_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.2 0.37
I1220 21:57:51.992983 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s2_conv1_7_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.15 0.17
I1220 21:57:52.247993 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_1_joint_vec’ with space 4.37G 191/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.85 4.96
I1220 21:57:52.465217 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_2_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.62 3.78
I1220 21:57:52.682094 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_3_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.75
I1220 21:57:52.899735 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_4_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.61 3.75
I1220 21:57:53.117147 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_5_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.6 3.79
I1220 21:57:53.164228 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_6_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.19 0.35
I1220 21:57:53.189612 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s3_conv1_7_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.13 0.16
I1220 21:57:53.444718 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_1_joint_vec’ with space 4.37G 191/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.82 4.9
I1220 21:57:53.667579 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_2_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.62 3.81
I1220 21:57:53.885004 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_3_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.61 3.78
I1220 21:57:54.103643 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_4_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.78
I1220 21:57:54.320942 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_5_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.62 3.78
I1220 21:57:54.370590 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_6_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.19 0.35
I1220 21:57:54.395680 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s4_conv1_7_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.13 0.15
I1220 21:57:54.649092 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_1_joint_vec’ with space 4.37G 191/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.82 4.89
I1220 21:57:54.868311 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_2_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.8
I1220 21:57:55.088066 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_3_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.72
I1220 21:57:55.306232 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_4_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.57 3.78
I1220 21:57:55.522558 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_5_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.59 3.78
I1220 21:57:55.568832 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_6_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.18 0.33
I1220 21:57:55.592062 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s5_conv1_7_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.12 0.14
I1220 21:57:55.847193 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_1_joint_vec’ with space 4.37G 191/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.84 4.91
I1220 21:57:56.064046 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_2_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.58 3.72
I1220 21:57:56.281412 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_3_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.61 3.76
I1220 21:57:56.501744 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_4_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.57 3.7
I1220 21:57:56.717321 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_5_joint_vec’ with space 4.37G 160/1 5 3 2 (avail 0.06G, req 1.25G) t: 0 1.61 3.8
I1220 21:57:56.764864 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_6_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.2 0.34
I1220 21:57:56.789643 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘s6_conv1_7_joint_vec’ with space 4.37G 160/1 1 1 0 (avail 0.06G, req 1.25G) t: 0 0.13 0.16
*** Aborted at 1513807076 (unix time) try “date -d @1513807076” if you are using GNU date ***
PC: @ 0x7f90aa726b60 caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
*** SIGSEGV (@0x0) received by PID 4640 (TID 0x7f90abacb0c0) from PID 0; stack trace: ***
@ 0x7f90a83b84b0 (unknown)
@ 0x7f90aa726b60 caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
@ 0x7f90aa73cce1 caffe::CuDNNConvolutionLayer<>::Reshape()
@ 0x7f90aa52bf0a caffe::Layer<>::Forward()
@ 0x7f90aa8a50fb caffe::Net::ForwardFromTo()
@ 0x7f90aa8a5267 caffe::Net::Forward()
@ 0x7f90aa8a8a45 caffe::Net::ForwardBackward()
@ 0x7f90aa885f65 caffe::Solver::Step()
@ 0x7f90aa887bc0 caffe::Solver::Solve()
@ 0x40f85d train()
@ 0x40c198 main
@ 0x7f90a83a3830 __libc_start_main
@ 0x40ca09 _start
@ 0x0 (unknown)
Segmentation fault (core dumped)
It appears to be saying that certain convolution layers in cuDNN require 1.25 GB on the GPU, but only 0.06 GB is seen as available for some reason. I tried increasing the --shm-size and --ulimit stack parameters of the docker container launch command, but it still crashes at the same place with the same 0.06G message. This is particularly odd since the exact same training setup (same solver/training prototxt and data set) was previously used successfully for training on (vanilla) Caffe. Any idea why I am getting this error? The reason we want to move to NVCaffe is to take advantage of FP16 training on Volta GPUs.
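In case it helps anyone compare logs, here is a minimal Python sketch that pulls the layer name and the "avail"/"req" figures out of the "Conv Algos" lines and lists the layers where the requested amount exceeds the available amount. It assumes my reading of the log is right, i.e. that "avail" and "req" are available and requested workspace in GB; the function name `find_short_layers` and the `SAMPLE` string are only for illustration.

```python
import re

# Matches the layer name and the "(avail XG, req YG)" pair in a Conv Algos line.
# The quotes around the layer name may be straight or curly (U+2018 / U+2019).
LINE_RE = re.compile(
    r"Conv Algos \(F,BD,BF\): ['\u2018\u2019](?P<layer>[^'\u2018\u2019]+)['\u2018\u2019].*?"
    r"\(avail (?P<avail>[\d.]+)G, req (?P<req>[\d.]+)G\)"
)


def find_short_layers(log_text):
    """Return (layer, avail_gb, req_gb) for each line where req exceeds avail."""
    short = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a Conv Algos line
        avail = float(m.group("avail"))
        req = float(m.group("req"))
        if req > avail:
            short.append((m.group("layer"), avail, req))
    return short


# Two lines copied from the log above, used as sample input.
SAMPLE = """\
I1220 21:57:46.753433 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv1_2’ with space 4.37G 64/1 1 1 0 (avail 0.06G, req 0G) t: 0 8.64 12.86
I1220 21:57:47.019194 4640 cudnn_conv_layer.cpp:900] [1] Conv Algos (F,BD,BF): ‘conv2_1’ with space 4.37G 64/1 7 1 1 (avail 0.06G, req 0.94G) t: 0 4.42 5.16
"""

if __name__ == "__main__":
    for layer, avail, req in find_short_layers(SAMPLE):
        print(f"{layer}: avail {avail}G, req {req}G")
```

Run against the full log, this should flag every layer from conv2_1 onward, which is what makes me think the 0.06G figure is the real problem rather than any single layer.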