I’m working on an application that uses OpenCL / OpenGL interop. In short, data is uploaded from the CPU to the GPU into OpenCL buffers, kernels are run on those OpenCL buffers, and the last kernel writes its output to OpenGL buffers that are shared with OpenCL.
With a given data set, I get a hang when calling clFlush(), which is called at the end of all the OpenCL enqueuing work. Here is the call stack that I get when the application gets stuck in the clFlush() call:
    #0 0x00007fc7db4134ed in __lll_lock_wait () at /lib64/libpthread.so.0
    #1 0x00007fc7db40ede6 in _L_lock_941 () at /lib64/libpthread.so.0
    #2 0x00007fc7db40ecdf in pthread_mutex_lock () at /lib64/libpthread.so.0
    #3 0x00007fc78e272c3c in () at /lib64/libnvidia-opencl.so.1
    #4 0x00007fc78e275f78 in () at /lib64/libnvidia-opencl.so.1
    #5 0x00007fc78e2623ca in () at /lib64/libnvidia-opencl.so.1
    #6 0x00007fc78441647b in MyFunctionCallingCLFlushInMyApplication
    [...]
This hang does not always happen at the same point; sometimes I need to run a few iterations before hitting it.
I tried narrowing it down by calling clFlush() more often to isolate which OpenCL call could be the problem (this makes the problem harder to reproduce). The hang always happened after enqueueing one of the kernels that write to the buffer shared between OpenCL and OpenGL, with the same call stack as above.
Note that I am using clEnqueueAcquireGLObjects() to acquire the objects. In the regular application, we use cl_khr_gl_event for synchronization when it’s available, but as part of debugging this issue I added glFinish() calls after the OpenGL work, and the problem still occurs.
To debug the problem further, I surrounded the calls to clEnqueueAcquireGLObjects() with clFinish() (before and after). This time the hang happened in the clFinish() call made right after clEnqueueAcquireGLObjects().
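For reference, the debugging sequence described above looks roughly like the sketch below. It is a simplified illustration, not the actual application code: run_interop_step, queue, kernel, shared_buf and global_size are placeholder names, a single shared buffer bound to kernel argument 0 is assumed, the release call right after the kernel is assumed, and the acquire call is written with its OpenCL API spelling, clEnqueueAcquireGLObjects(). Header paths are the usual Linux ones.

```c
#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/gl.h>

/* Simplified sketch only: every name here is a placeholder. */
static cl_int run_interop_step(cl_command_queue queue, cl_kernel kernel,
                               cl_mem shared_buf, size_t global_size)
{
    cl_int err;

    /* Debug only: wait for all OpenGL work instead of relying on cl_khr_gl_event. */
    glFinish();

    /* Debug only: drain the OpenCL queue before acquiring the shared buffer. */
    err = clFinish(queue);
    if (err != CL_SUCCESS) return err;

    err = clEnqueueAcquireGLObjects(queue, 1, &shared_buf, 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;

    /* Debug only: this is the clFinish() that never returns when the hang hits. */
    err = clFinish(queue);
    if (err != CL_SUCCESS) return err;

    /* Assumption: the shared buffer is kernel argument 0. */
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &shared_buf);
    if (err != CL_SUCCESS) return err;

    /* The kernel that writes into the buffer shared with OpenGL. */
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                                 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;

    /* Assumption: the buffer is handed back to OpenGL right after the kernel. */
    err = clEnqueueReleaseGLObjects(queue, 1, &shared_buf, 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;

    /* Without the debug clFinish() calls, this clFlush() is where the hang shows up. */
    return clFlush(queue);
}
```

With the two debug-only clFinish() calls and the glFinish() removed, this reduces to the flow of the regular application, where the hang appears in the final clFlush().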
Questions for the NVIDIA developers:
- Is there any reason why clFlush() would hang forever? What could this mutex be that the driver is waiting on?
- Could this be a driver bug? Or is there something that we are doing wrong in our application?
- What would the next steps be to debug this further?
Any help would be greatly appreciated.
P.S. I don’t seem to have an option to attach the requested nvidia-bug-report.log.gz to the new topic… I generated it on Linux CentOS 7 with driver 390.77 (although I tried other versions, such as 430.14, and got the same problem). The hardware is a Quadro K5000.
nvidia-bug-report.log.gz (102 KB)