Best Practice for Memory Management in OpenCL


I was wondering if anyone could throw me some hints regarding memory management in OpenCL on Nvidia cards. My problem is how to determine, at buffer creation time, whether the device memory is full. While the OpenCL 1.1 specification states that clCreateBuffer returns CL_OUT_OF_RESOURCES if the device memory is exhausted, the Nvidia OpenCL implementation does not seem to conform to this: in my tests I was able to overcommit the device memory at buffer creation. From what I have read, this is possible because the actual device allocation is deferred until a kernel is executed on the buffer. Once I run a kernel on the overcommitted buffer, an error message is delivered to my notification function. So currently I see only two ways of determining whether the device memory is full:


Making manual checks, i.e. ensuring that the combined buffer sizes stay below CL_DEVICE_GLOBAL_MEM_SIZE and that each single allocation is smaller than CL_DEVICE_MAX_MEM_ALLOC_SIZE. However, device memory fragmentation may still cause buffers not to fit even though, by these numbers, they theoretically should.

Checking whether the execution of a kernel fails on the given buffer object. However, this option seems a bit “suboptimal” for my taste :)

So my question is: is there another - more elegant - way of determining at buffer creation time whether the given memory block fits onto the device?



Well, if you can use OpenCL 1.1 (is it already supported on any device?), you can allocate a few huge buffers and do all the memory management yourself using the clCreateSubBuffer function. Of course, the huge buffers can cause fragmentation problems of their own, but it is still a way to go (although far from a perfect one).

The reason for this behaviour is simple and, from one point of view, good. Creating buffers for a context does not imply sending them to the device immediately; as you mentioned, they are only copied to the device when a kernel is launched. This lets you use more memory across all your buffers than the device holds at any one time. If you enqueue several kinds of kernels using different buffers, you need not worry about whether a given buffer currently resides in host or device memory. That is why every buffer has a corresponding copy in host memory: it is what the host uses to swap buffers onto and off the device. As long as each kernel you enqueue uses less than the device's maximum available memory, the OpenCL implementation should guarantee that the kernel gets the memory it needs, with other buffers swapped out to host RAM.

This is favorable from one point of view but undesired from another. You can write complex applications with many buffers without worrying too much about buffer movement, but it takes a little more attention to notice when you overuse device memory.

If I’m not mistaken, the reason is not virtual memory support (which is not in the OpenCL spec, although it is possible under Windows 7 even after actual allocation, with a big performance hit by the way), but rather the fact that memory allocation in OpenCL is context-specific rather than device-specific. This means the OpenCL implementation doesn’t know which device to allocate the memory on until the buffer is actually used, i.e. at kernel launch or at a memory copy. The memory should be allocated when you perform the memory copy, so if you don’t repeatedly allocate and free buffers but keep them around, performing a memory copy (or a memset) right after allocation should force the actual device allocation.

Also, note that NVIDIA doesn’t support OpenCL 1.1 AFAIK.