System freezing with cuDNN and CUDA

Hi,

I am training a convolutional network using Torch. If I use cuDNN, my system freezes at the beginning of training and eventually shuts down. This happens when I add more convolutional layers, increase the number of filters in a convolutional layer, or increase the batch size. The same kind of freeze happens when using CUDA, but only with larger parameters. I monitor the GPU with nvidia-smi, but I do not see any abnormal values.

I use Ubuntu 14.04 with a GTX Titan X. My driver version is 361.28. I use CUDA Toolkit 7.5, cuDNN 4, and Torch 7. My colleague, who uses Caffe, experiences the same problem.

Here is a network I have problems with when training on the CIFAR-10 dataset. If I take out the last convolutional layer with 512 filters, things work. Reducing the number of filters from 512 to 128 makes it work on the first run, but crashes the system on the second run. So I cannot find the boundary parameters at which the system starts to crash.

Your help is greatly appreciated!

nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> output]
(1): cudnn.SpatialConvolution(3 -> 128, 3x3, 1,1, 1,1)
(2): cudnn.ReLU
(3): cudnn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
(4): cudnn.ReLU
(5): cudnn.SpatialMaxPooling(2,2,2,2)
(6): cudnn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
(7): cudnn.ReLU
(8): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
(9): cudnn.ReLU
(10): cudnn.SpatialMaxPooling(2,2,2,2)
(11): cudnn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
(12): cudnn.ReLU
(13): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(14): cudnn.ReLU
(15): cudnn.SpatialMaxPooling(2,2,2,2)
(16): nn.Reshape(8192)
(17): nn.Linear(8192 -> 1024)
(18): nn.Linear(1024 -> 1024)
(19): nn.Linear(1024 -> 10)
}

Could you provide detailed steps to reproduce this problem? Thanks.

I think the problem is related to the power consumption of the GPU. At the beginning of training there is a surge in GPU power consumption. If I limit the maximum power to 210W instead of 250W, the system does not crash, even though my power supply unit provides 1100W in total. I don’t know why the system crashes. Any ideas?
Thanks
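
For anyone who wants to check for the same power surge, here is a minimal sketch of logging power draw once per second with nvidia-smi while training starts. The log file name gpu_power.csv and the RUN switch are placeholders of mine, not something from the posts above, and one-second sampling may miss very short spikes.

```shell
#!/bin/sh
# Sketch: log GPU power draw, power limit, temperature and utilization
# once per second so the readings before a freeze can be inspected later.
LOG_FILE=gpu_power.csv   # hypothetical file name
FIELDS="timestamp,power.draw,power.limit,temperature.gpu,utilization.gpu"
QUERY_CMD="nvidia-smi --query-gpu=$FIELDS --format=csv -l 1"

# Print the command by default; set RUN=1 to actually start logging.
echo "$QUERY_CMD"
if [ "${RUN:-0}" = "1" ]; then
    # tee shows the readings on screen and appends nothing else to the file.
    $QUERY_CMD | tee "$LOG_FILE"
fi
```

Start it in a second terminal with RUN=1 before launching the training job, then compare power.draw against power.limit around the moment the system freezes.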

So, was any solution found for this problem?
I have experienced a similar issue. I was trying to train a VGG type-A network with Caffe, and the system crashes after initializing the last loss layer. Interestingly, it works fine with cuDNN v2 and AlexNet. To remedy the problem with VGG training I tried everything: reinstalled the kernel; tried CUDA 6.5, 7, and 7.5; tried cuDNN v2, v3, and v4 with the corresponding CUDA versions; tried drivers 340, 346, 352, 361, and 364. It still does not work when training VGG.

Also, how do you limit the maximum power from 250W to 210W?

You can set the power limit (within certain minimum and maximum values specific to each card) with: nvidia-smi --power-limit=[watt]
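
As a concrete sketch of the above, using the 210W value reported earlier in this thread: the GPU index 0 and the APPLY switch are assumptions of mine, so adjust them for your machine. Changing the limit normally needs root, and as far as I know the setting is reset after a reboot or driver reload, so it has to be reapplied.

```shell
#!/bin/sh
# Sketch: cap one GPU at 210 W (the value from the earlier post).
GPU_ID=0      # assumption: the Titan X is device 0
LIMIT_W=210   # must lie within the card's min/max enforceable limits

CMD="nvidia-smi -i $GPU_ID --power-limit=$LIMIT_W"

# Print the command by default; set APPLY=1 to actually change the limit.
echo "$CMD"
if [ "${APPLY:-0}" = "1" ]; then
    # Show the current, default, minimum and maximum power limits first.
    nvidia-smi -i "$GPU_ID" -q -d POWER
    # Apply the new limit (requires root privileges).
    sudo $CMD
fi
```

Running it with APPLY=1 first prints the allowed range for the card, which tells you whether 210 is an acceptable value before the limit is applied.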