How to solve a memory allocation problem in CUDA?

Hi there, I have been working with CUDA for quite a while now and have implemented quite a few programs. But for the first time I am facing a difficulty with an error that says “Fatal error. Memory allocation cannot be possible”. I have searched on the internet and couldn’t find any satisfactory solution. I should mention that I am using CUDA 6 on a 64-bit Linux operating system with 64 GB of RAM and a 2 TB hard disk. I have checked the memory status and it shows only 3% in use. I am using the cudaMalloc() function to allocate memory, but this problem still arises. If anybody could help me out in this regard, it would be much appreciated. Thanks in advance.

The error message you quote does not appear to be a CUDA error message. How are you running your CUDA application? What part of the software produces this error message? Does the error message refer to GPU memory or system memory? Does it occur on a call to cudaMalloc() or some other call, such as cudaMallocHost()?
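If the failure does come from a CUDA call, wrapping every allocation with a return-code check will pinpoint exactly which call fails and with what error. Here is a minimal sketch of that kind of checking; the buffer size and variable names are just placeholders for illustration:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Report the exact file, line, and CUDA error string for any failing call.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main()
{
    // Placeholder size; substitute whatever your application actually requests.
    size_t bytes = 512ull * 1024 * 1024;

    float *d_buf = nullptr;   // device memory
    float *h_buf = nullptr;   // pinned host memory

    CHECK_CUDA(cudaMalloc(&d_buf, bytes));      // fails here if GPU memory is exhausted
    CHECK_CUDA(cudaMallocHost(&h_buf, bytes));  // fails here if pinned host memory is exhausted

    CHECK_CUDA(cudaFreeHost(h_buf));
    CHECK_CUDA(cudaFree(d_buf));
    return 0;
}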

What GPU are you running on? How much physical memory does this GPU have? How much GPU memory is the app trying to allocate at the point the cudaMalloc() call fails?

You state that you have “checked the memory status and it shows only 3% in use”. Exactly how did you perform this check? What tool did you run, and what command-line switches did you use? Running nvidia-smi -q should provide a quick initial check on available GPU memory.
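Besides nvidia-smi, free device memory can also be queried from inside the program with cudaMemGetInfo(), which makes it easy to log how much memory is actually available right before the allocation that fails. A minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_bytes = 0, total_bytes = 0;

    // Reports free and total memory on the current device.
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("GPU memory: %.1f MB free of %.1f MB total\n",
           free_bytes / (1024.0 * 1024.0),
           total_bytes / (1024.0 * 1024.0));
    return 0;
}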

Generally speaking, CUDA applications are limited to the physical memory present on the GPU, minus system overhead. If your GPU supports ECC, and it is turned on, 6.25% or 12.5% of the memory will be used for the extra ECC bits (the exact percentage depends on your GPU). Beyond that, about 100 MB are needed for internal use by the CUDA software stack. If the GPU is also used to support a GUI with 3D features, that may require additional memory.

Whatever is left over should be available for your CUDA application, but if the app makes many allocations and de-allocations of GPU memory, the allocation of a large block can fail even though the request is smaller than the total free memory reported. This is caused by fragmentation, a common issue with many memory allocators, not just the ones used to allocate GPU memory.

Fatal error: Failed to allocate device buffer. (out of memory at …/src/programname:linenumber)

My 3D array is 20 x 200 x 200, and for each value in the array it returns 1331 outcomes, with two values per outcome (one for location and one for difference). Hence, I have to pass a total of three arrays to the GPU, of which one is of size 20 x 200 x 200 and the other two are 20 x 200 x 200 x 1331.

So, I think it is not possible to allocate this much in GPU memory. Is there any other way to handle this problem?

It is impossible to make any recommendations based on this scant information.

Ignoring the specifics of the error message for the moment, a matrix of 20x200x200x1331 is just over a billion elements. Without knowing the type I cannot be specific, but assuming it is a single-precision float, that is over 4 GB. If you want two of these, then that is over 8 GB. What card are you using?
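For reference, here is that arithmetic spelled out (assuming 4-byte single-precision floats; adjust for your actual element type):

#include <cstdio>

int main()
{
    // Element counts for the arrays described above.
    size_t small = 20ull * 200 * 200;   //       800,000 elements
    size_t big   = small * 1331;        // 1,064,800,000 elements

    const double gb = 1e9;              // decimal gigabytes

    printf("small array:    %.2f GB\n", small * 4 / gb);    // ~0.003 GB
    printf("each big array: %.2f GB\n", big * 4 / gb);      // ~4.26 GB
    printf("two big arrays: %.2f GB\n", 2 * big * 4 / gb);  // ~8.52 GB
    return 0;
}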

It is a similar problem to the one I face, so what I do is manual paging of the data (I deal with thousands of images, each 8 MB or more). For example, take your ‘other two’ matrices: instead of allocating each as one giant matrix, store it as 1331 distinct arrays, each behind a pointer from its own malloc. You can then marshal them to and from GPU memory when needed, and you suddenly remove the limitation that all your data has to live on the GPU at once. I was expecting this to have a drastic effect on my performance, but with a bit of tweaking using page-locked memory the downsides were minor.
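To make that concrete, below is a rough sketch of the slice-by-slice idea, under the assumption that the 1331 outcome slices can be processed independently; the kernel is a placeholder for whatever per-slice work your application does, and the names are made up for illustration:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Same error-checking macro as in the earlier sketch.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel: stands in for the real per-slice computation.
__global__ void processSlice(float *slice, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) slice[i] *= 2.0f;   // dummy work
}

int main()
{
    const size_t sliceElems = 20ull * 200 * 200;   // one 20x200x200 slice
    const size_t sliceBytes = sliceElems * sizeof(float);
    const int    numSlices  = 1331;                // total outcomes

    // Page-locked (pinned) host storage for each slice keeps host<->device
    // copies fast; only one ~3 MB slice lives on the GPU at a time.
    float **h_slices = (float **)malloc(numSlices * sizeof(float *));
    for (int s = 0; s < numSlices; ++s)
        CHECK_CUDA(cudaMallocHost(&h_slices[s], sliceBytes));

    float *d_slice = nullptr;
    CHECK_CUDA(cudaMalloc(&d_slice, sliceBytes));  // ~3 MB, not ~4 GB

    dim3 block(256);
    dim3 grid((unsigned)((sliceElems + block.x - 1) / block.x));

    for (int s = 0; s < numSlices; ++s) {
        // Marshal one slice in, process it, marshal it back out.
        CHECK_CUDA(cudaMemcpy(d_slice, h_slices[s], sliceBytes, cudaMemcpyHostToDevice));
        processSlice<<<grid, block>>>(d_slice, sliceElems);
        CHECK_CUDA(cudaGetLastError());
        CHECK_CUDA(cudaMemcpy(h_slices[s], d_slice, sliceBytes, cudaMemcpyDeviceToHost));
    }

    CHECK_CUDA(cudaFree(d_slice));
    for (int s = 0; s < numSlices; ++s)
        CHECK_CUDA(cudaFreeHost(h_slices[s]));
    free(h_slices);
    return 0;
}

With cudaMemcpyAsync() and a couple of streams, the copy for slice s+1 can be overlapped with the kernel for slice s, which is essentially the page-locked-memory tweaking mentioned above.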

As an aside, my general rule anytime I get an error is to first ask myself whether I have done something silly or unexpected. I did exactly that last week when I tried to allocate 33 GB of images and wondered why my program bailed!