Object segmentation with CUDA: memory requirements

Recently I’ve started working on object/instance segmentation problems using ConvNets (like FCN-8s and CRF-RNN). They do a great job, and I use their outputs extensively, but there’s always a problem with memory. We currently run a Tesla K40m with 12 GB of memory. That’s more than enough for text tasks, but for images I’ve run into a number of issues, such as running out of memory and overheating.

For example, inference (aka testing) of one 1280x720 image (~1M pixels, just what I need) on the GPU takes only a fraction of a second, which is great (the ConvNet has ~140M parameters, ~537 MB in total). Training, though, immediately reports an out-of-memory error, even with a batch size of 1. Eventually I had to reduce the image size to 500x500 (250K pixels, which is still OK but far worse than 1280x720) to do any training at all (batch size of 1).

So here’s my question: why does this happen? It can’t be the derivatives, because with one derivative per weight the total array should be ~1 GB (see above). So what is it, then? How is one image with ~1M pixels loaded into memory, and how does it get processed by CUDA? Is there any way to predict how much memory I need if I know the image size/count?
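To get a feel for where the memory goes, here’s a rough Python sketch of how the per-layer activations add up; the layer shapes below are made-up VGG-style examples for a 1280x720 input, not the real FCN-8s/CRF-RNN dimensions:

```python
# Back-of-envelope: each conv layer's output (activation) is a full
# N x C x H x W blob of float32, and all of them must be resident
# on the GPU at once (and kept around for the backward pass in training).

BYTES_PER_FLOAT = 4  # float32

def blob_bytes(n, c, h, w):
    """Size in bytes of one N x C x H x W float32 blob."""
    return n * c * h * w * BYTES_PER_FLOAT

# Hypothetical early layers of a VGG-style net on a 1280x720 image;
# illustrative only -- not the actual FCN-8s/CRF-RNN shapes.
layers = [
    (1,  64, 1280, 720),  # conv1_1
    (1,  64, 1280, 720),  # conv1_2
    (1,  64,  640, 360),  # pool1
    (1, 128,  640, 360),  # conv2_1
    (1, 128,  640, 360),  # conv2_2
]

image_mb = blob_bytes(1, 3, 1280, 720) / 2**20
acts_mb = sum(blob_bytes(*s) for s in layers) / 2**20
print(f"input image: {image_mb:.0f} MB, first 5 activations: {acts_mb:.0f} MB")
```

Even these first few layers dwarf both the input image and the question of "1M pixels": the intermediate blobs, not the pixels or the weights, are what fill the card.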



A K40m shouldn’t overheat. What’s the configuration of the system? Is the K40m installed in an OEM server that has been certified for it?

I didn’t set it up. Apparently there was some problem with the box and the noise from the heatsink. Our IT dept is working on it now. However, I’m more concerned with this memory-vs-size thing; the GPU outright refused to process anything above 500x500, even at low temperature.

Maybe the machine needs to be rebooted.

what does nvidia-smi show for available memory?
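For instance, nvidia-smi can report memory non-interactively via `nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv`. A small parsing sketch, using a hard-coded sample output (the numbers are made up, not from a real K40m):

```python
# Parse the CSV output of:
#   nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
# One data row per GPU; values look like "11441 MiB".

def parse_gpu_memory(csv_text):
    """Return a list of dicts with total/used/free memory in MiB per GPU."""
    lines = [line.strip() for line in csv_text.strip().splitlines()]
    header = [h.split(' [')[0].strip() for h in lines[0].split(',')]
    gpus = []
    for line in lines[1:]:
        values = [int(v.strip().split()[0]) for v in line.split(',')]
        gpus.append(dict(zip(header, values)))
    return gpus

# Illustrative sample output, not real measurements:
sample = """memory.total [MiB], memory.used [MiB], memory.free [MiB]
11441 MiB, 9832 MiB, 1609 MiB"""

print(parse_gpu_memory(sample))
```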

No, it’s a hardware issue. They’re fixing it now.

When, specifically? When nothing is running, 12 GB or so is free. When I managed to actually run the training (batch size 1, 500x500 image), something like 2 GB was free, so training eats up ~10 GB, and I don’t understand why. This is the essence of my question.

OK, I’ve done some more calculation. I ran a 1000x600 image through CRF-RNN and saved every layer (i.e. ‘data’, ‘conv1_1’, etc.). All in, that’s 4.5 GB, plus ~1 GB for weights + derivatives, so 5.5 GB in total, way less than the memory actually consumed. I don’t understand why.
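Just a guess: if the backward pass keeps a same-size gradient (diff) buffer for every one of those saved blobs, the arithmetic gets much closer. A quick sanity check of that guess:

```python
# Guess: blobs carry both a 'data' and a same-size 'diff' (gradient)
# array during training, so activation memory roughly doubles.
# The figures below are my measurements from above; any extra
# im2col/cuDNN convolution workspace would come on top.

activations_gb = 4.5   # all saved layer outputs for one 1000x600 image
weights_gb     = 1.0   # weights + weight derivatives

total_gb = 2 * activations_gb + weights_gb   # data + diffs + params
print(f"{total_gb:.1f} GB before any conv workspace")
```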

I don’t know what kind of deep learning framework you are using, but several of them use a memory manager under the hood, which (I suppose) allocates a large chunk of the GPU memory at startup as its ‘memory pool’. This is because the overhead of cudaMalloc/cudaFree can be significant, and it got even worse on Windows in non-TCC mode with recent drivers. E.g., Caffe seems to use CnMem under the hood -> https://github.com/NVIDIA/caffe/pull/11
Actually, it’s not only deep learning frameworks; other frameworks like the image-processing library ArrayFire also seem to use a memory manager under the hood, for the same reasons. So every framework that tries to be performant seems to need a memory manager as a ‘workaround’ to counteract the performance issues of the GPU memory-allocation routines. And when using several frameworks, the GPU memory must be big enough to hold the memory pools of all of them.
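To illustrate the idea, here is a toy Python sketch of a caching allocator (this is not how CnMem actually works internally, just the general principle): freed buffers are kept in a per-size free list and reused, so the expensive underlying allocation calls happen only the first time each size is requested.

```python
# Toy caching allocator illustrating the 'memory pool' idea:
# freed buffers are cached by size and handed back out instead of
# being returned to the (slow) underlying allocator. Real pools
# like CnMem carve buffers out of one big upfront cudaMalloc.

from collections import defaultdict

class CachingPool:
    def __init__(self, raw_alloc):
        self.raw_alloc = raw_alloc           # cudaMalloc in real life
        self.free_lists = defaultdict(list)  # size -> cached buffers
        self.raw_calls = 0                   # count of slow allocations

    def alloc(self, size):
        if self.free_lists[size]:
            return self.free_lists[size].pop()  # reuse: no driver call
        self.raw_calls += 1
        return self.raw_alloc(size)

    def free(self, size, buf):
        self.free_lists[size].append(buf)       # cache instead of freeing

pool = CachingPool(raw_alloc=bytearray)
for _ in range(1000):            # e.g. 1000 training iterations
    buf = pool.alloc(1 << 20)    # same-sized blob every iteration
    pool.free(1 << 20, buf)
print(pool.raw_calls)            # only the first iteration hits the allocator
```

This is also why nvidia-smi keeps showing the memory as "used" after the framework has logically freed it: the pool never gives it back to the driver.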

Thanks, this makes sense. Sorry I didn’t mention it earlier: I use Caffe on Ubuntu 16.04. If I understood you correctly, the only way to reduce memory requirements is to get to the bottom of the memory management done by one of the libraries Caffe employs. That’s way too low-level for me right now, I’m afraid. Could you at least suggest the following two things:

  1. Tricks for reducing memory demand,
  2. Is there any way at all to predict how much memory will be needed, given the model size and the image size (e.g. 1000x600 px)?


Sorry, I don’t know. You’d have to ask on the Caffe forums, e.g. https://groups.google.com/forum/#!forum/caffe-users