OpenCL - Extremely slow kernel execution

I'm currently in the final stage of my project, optimization, and I'm seeing something very strange that I can't explain.

No matter what I do, there is always one kernel that takes about 10 ms to execute. Interestingly, it isn't always the same kernel from one run to the next. The kernels I'm talking about shouldn't take much more than 100 microseconds. For example, I have this kernel that copies pixels under a simple condition:

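kernel void copyBufferWithMask(global const float* _src, global float* _dst, global const float* _mask, const int _width)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);

    // Skip work-items whose x lies past the end of the row.
    if (x >= _width)
    {
        return;
    }

    // Linear index of pixel (x, y) in a row-major image.
    const int pos = y * _width + x;

    // Copy the pixel only where the mask is non-zero.
    if (_mask[pos] != 0)
    {
        _dst[pos] = _src[pos];
    }
}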

As I said, this kernel takes 10 ms, while it usually takes about 90-100 microseconds.
Yesterday, a multiplication kernel took 11 ms, while today it takes only 138 microseconds.

kernel void multiply(global const float * _src, global const float * _mask, global float * _dst, const int _width)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);

    // Only work-items inside the row do any work.
    if (x < _width)
    {
        // Linear index of pixel (x, y) in a row-major image.
        const int pos = y * _width + x;

        _dst[pos] = _src[pos] * _mask[pos];
    }
}

I'm not sure how relevant it is to the problem, but I'm using a GTX 1080 Ti, and the command queue is initialized with the CL_NON_BLOCK flag. Since this is NVIDIA's OpenCL implementation, I'm on OpenCL 1.2 rather than 2.0.

I searched the Khronos Group specification for a solution, but found nothing about such behavior.

Has anyone run into this problem, or does anyone have ideas for a fix?

Thank you!

Michael.