Hi,
I developed a library that uses OpenCL. The code in this library can also be called from multiple threads, as long as a separate instance of the algorithm implementation class is created for every single thread.
The problem starts once I allocate too much memory. Unfortunately this can happen at any given time, not just when I allocate OpenCL buffers. One example is the following error:
CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_READ_BUFFER on GeForce GTX 750 Ti (Device 0).
Meaning it’s possible that the library was able to allocate the OpenCL buffers in the initialization part of the algorithm, but once I execute another operation (e.g. enqueuing a kernel) it fails (and returns 18446744073709551612 as the error code).
The major problem is that I can control the amount of memory I allocate in my own process, but not outside of it, and as far as I know there is no way to check how much unused memory is available. In theory, Adobe Lightroom could be running at the same time, using up some memory for its own algorithms, and I would have less memory available than expected. Or somebody could start an OpenCL- or OpenGL-based application while my algorithm is already running and allocate some memory → again I would suddenly not have enough memory to execute the next kernel within the algorithm, or to execute any other OpenCL function that requires additional memory.
I tested the same thing on AMD and Intel hardware, where this was not an issue: execution is simply delayed until memory becomes available.
How can this problem be solved with NVIDIA hardware?
Best Regards
Michael
Currently I only found a partial solution:
- I link against cudart and use it (cudaMemGetInfo) to retrieve the amount of currently available device memory. Depending on the result, I either start my algorithm or return an appropriate error code from my API. (The only other way to get this information would be to create a hidden window, initialize an OpenGL context and retrieve it using glGetIntegerv(0x9049, …), i.e. GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX.)
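A minimal sketch of that check (cudaMemGetInfo is the real cudart call; `required_bytes` and the threshold policy are my own placeholders):

```cpp
#include <cuda_runtime_api.h>
#include <cstddef>
#include <cstdio>

// Returns true if the device currently reports at least `required_bytes`
// of free memory. Note the inherent race: another process can allocate
// memory between this check and our own allocations.
bool enoughDeviceMemory(std::size_t required_bytes) {
    std::size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess)
        return false; // be conservative if the query itself fails
    return free_bytes >= required_bytes;
}

int main() {
    std::size_t needed = 512ull * 1024 * 1024; // e.g. 512 MiB (placeholder)
    std::printf("enough memory: %d\n", enoughDeviceMemory(needed) ? 1 : 0);
    return 0;
}
```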
The problem is that this does not cover all situations. If somebody starts another OpenGL or OpenCL application while my algorithm is running, there might again not be enough memory, meaning CL_MEM_OBJECT_ALLOCATION_FAILURE can occur at any time when I call the next function of the OpenCL API.
Any suggestions on how to handle this? Or is this something that should be implemented differently in the driver (meaning delays, paging data off to host memory, …)?
Best Regards
Michael
Is this on windows, or linux, or both?
My project atm only supports Windows.
I wrote a small test tool to reproduce the same error. It reproduces the problem on both Windows 10 and Fedora 26. On Windows I used the driver delivered with the CUDA 8 SDK, and on Fedora the driver packaged in the negativo17 repository. The system is an i7-6700K with 16 GB of memory and the already mentioned GTX 750 Ti.
The test code: https://pastebin.com/raw/TjjLE7zn
And CMakeLists.txt: https://pastebin.com/raw/rcxfmBuf
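In case the links are unreachable: the tool is roughly equivalent to the following sketch (written against the cl.hpp C++ bindings; the chunk size and kernel body are my simplifications, not the exact pastebin contents). It allocates buffers covering the whole device memory, then touches each one with a trivial kernel:

```cpp
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    try {
        // Pick the first device for brevity; the real tool lets you choose.
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        std::vector<cl::Device> devices;
        platforms.at(0).getDevices(CL_DEVICE_TYPE_ALL, &devices);
        cl::Device dev = devices.at(0);

        cl::Context ctx(dev);
        cl::CommandQueue queue(ctx, dev);

        cl_ulong mem = dev.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>();
        std::cout << "Device Memory: " << mem << "\n";

        // Allocate buffers whose total size covers the device memory.
        const cl_ulong chunk = 128ull * 1024 * 1024; // 128 MiB per buffer
        std::vector<cl::Buffer> buffers;
        for (cl_ulong done = 0; done + chunk <= mem; done += chunk)
            buffers.emplace_back(ctx, CL_MEM_READ_WRITE, chunk);

        // Trivial kernel that touches its buffer so it must be resident.
        const char* src =
            "__kernel void touch(__global uchar* p)"
            "{ p[get_global_id(0)] = 1; }";
        cl::Program prog(ctx, src, /*build=*/true);
        cl::Kernel kernel(prog, "touch");

        for (std::size_t i = 0; i < buffers.size(); ++i) {
            std::cout << "Attempting to execute kernel with buffer "
                      << i << "\n";
            kernel.setArg(0, buffers[i]);
            queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                                       cl::NDRange(256));
            // On NVIDIA this is where CL_MEM_OBJECT_ALLOCATION_FAILURE
            // shows up once the device memory is exhausted.
            queue.finish();
        }
        std::cout << "EXEC finished\n";
    } catch (const cl::Error& e) {
        std::cerr << "Exception caught: " << e.what()
                  << ", error code: " << e.err() << "\n";
        return 1;
    }
    return 0;
}
```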
On Windows I can select either the CPU (Intel OpenCL implementation on x86) or the NVIDIA GPU to execute it.
When I select the NVIDIA GPU, this is the result:
C:\User\USER\Documents\opencl\build>Release\opencl_memalloc.exe
Device: 0
Platform: Intel(R) OpenCL
Name: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Device: 1
Platform: NVIDIA CUDA
Name: GeForce GTX 750 Ti
Select a devices: 1
Device Memory: 2147483648
Allocated 2147483648bytes.
ATtempting to execute kernel with buffer 0
ATtempting to execute kernel with buffer 1
CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on GeForce GTX 750 Ti (Device 0).
Exception caught: kernel exec, error code: -4
And when I select the CPU everything seems to be fine:
C:\Users\USER\Documents\opencl\build>Release\opencl_memalloc.exe
Device: 0
Platform: Intel(R) OpenCL
Name: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Device: 1
Platform: NVIDIA CUDA
Name: GeForce GTX 750 Ti
Select a devices: 0
Device Memory: 17117061120
Allocated 17179869184bytes.
ATtempting to execute kernel with buffer 0
ATtempting to execute kernel with buffer 1
...
ATtempting to execute kernel with buffer 14
ATtempting to execute kernel with buffer 15
EXEC finished
Meaning in the second case it is not an issue to allocate more memory than the device has. The same was true when I tried the scenario on a Radeon R7 360 and the Intel HD 530 integrated GPU.
Best Regards
Michael
I received a GTX 1050 Ti today.
The behaviour is the same.
D:\temp\opencl\build\Release
λ .\opencl_memalloc.exe
Device: 0
Platform: Intel(R) OpenCL
Name: Intel(R) HD Graphics 530
Device: 1
Platform: Intel(R) OpenCL
Name: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Device: 2
Platform: AMD Accelerated Parallel Processing
Name: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Device: 3
Platform: NVIDIA CUDA
Name: GeForce GTX 1050 Ti
Select a devices: 3
Device Memory: 4294967296
Allocated 4294967296bytes.
ATtempting to execute kernel with buffer 0
ATtempting to execute kernel with buffer 1
ATtempting to execute kernel with buffer 2
ATtempting to execute kernel with buffer 3
CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on GeForce GTX 1050 Ti (Device 0).
Exception caught: kernel exec, error code: -4
(The only difference is that the error code returned by the function is now more reasonable: -4 instead of 18446744073709551612. The problem itself still exists.)
Regards
Michael
I filed a bug via the NVIDIA bug report utility. The conclusion was that this behaviour is intentional and does not violate the specification.
How I solved this atm:
- When I start the algorithm initialization, I check the available memory using NVAPI.
- Directly after creating the buffers, I use them in a simple dummy kernel → to ensure that they are actually allocated on the device.
- After every kernel enqueue, I directly call queue.finish().
- The algorithm initialization function uses a mutex to ensure that no initialization phase runs in parallel (and might incorrectly read out the available memory).
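The second and third points can be sketched like this (cl.hpp bindings; `createAndTouch` is my own helper name, not part of any API):

```cpp
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <cstddef>

// Trivial kernel used only to force a fresh buffer to become resident.
static const char* kDummySrc =
    "__kernel void touch(__global uchar* p) { p[get_global_id(0)] = 0; }";

// Create a buffer and immediately touch it with the dummy kernel,
// finishing the queue so that an allocation failure surfaces right here,
// where we can still return a clean error code from the initialization
// path, instead of later in the middle of the algorithm.
bool createAndTouch(cl::Context& ctx, cl::CommandQueue& queue,
                    std::size_t bytes, cl::Buffer& out) {
    try {
        out = cl::Buffer(ctx, CL_MEM_READ_WRITE, bytes);
        cl::Program prog(ctx, kDummySrc, /*build=*/true);
        cl::Kernel touch(prog, "touch");
        touch.setArg(0, out);
        queue.enqueueNDRangeKernel(touch, cl::NullRange, cl::NDRange(1));
        queue.finish(); // force the allocation to happen now
        return true;
    } catch (const cl::Error&) {
        return false; // map to an API error instead of failing mid-run
    }
}
```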
Not perfect, but under the circumstances the only working solution.
Regards
Hi MichaelE1000, I get the same bug on an NVIDIA GT 610 when running your sample on Windows 10 x64:
Device: 0
Platform: NVIDIA CUDA
Name: GeForce GT 610
Device: 1
Platform: AMD Accelerated Parallel Processing
Name: Oland
Device: 2
Platform: AMD Accelerated Parallel Processing
Name: Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz
Select a devices: 0
Device Memory: 1073741824
Allocated 1073741824bytes.
ATtempting to execute kernel with buffer 0
CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on GeForce GT 610 (Device 0).
Exception caught: kernel exec, error code: -4
It’s normal to have this issue; according to the NVIDIA developers, this is intended behaviour.
Another thing I recently noticed:
Try enqueuing multiple kernels that use the same buffer. I have a slight suspicion that the NVIDIA driver simply sums up the space required by all operations in the queue and does not check whether the buffers involved are actually the same. If I did not flush after every single operation, I sometimes had multiple kernels enqueued and got this error much earlier, even though there was enough memory, since the kernels were using the same buffer.
Besides that, have a look at my previous post; that’s how I solved the issue (or rather, worked around it).