Combining reads of read_imageui() leads to empty result

EDIT(3): It seems that something is wrong with the C++ bindings of OpenCL (or I’ve somehow misused them in a way that messed up memory). I’ve converted my program to plain C and it works now (more or less). Hence, consider this solved. :)

Hello everybody,

I hate to ask here as I’m quite sure I’m missing something obvious, but I’ve messed with this problem more than a day and it simply won’t reveal its evil character. (Note that I’m taking my first steps…)

Basically I’m blurring an image by summing up the pixels’ R, G, B, A values in a certain area around each pixel and dividing by the number of pixels summed up. It works in the naive approach of two for-loops:

[codebox]__kernel void

medianBlur(__global __read_only image2d_t in,

        __global __write_only image2d_t out)

{

size_t x = get_global_id(0);

size_t y = get_global_id(1);

uint4 sum = (uint4) 0;

for (int dx = -p; dx <= p; ++dx) // dx = [-4 .. 4]

{

    int ix = x + dx;

    for (int dy = -p; dy <= p; ++dy)

    {

        int iy = y + dy;

        sum += read_imageui(in, sampler, (int2)(ix,iy));

    }

}

// ... write_imageui(...)

}

[/codebox]

(Please kindly ignore in the ugliness of the code… I know it could be shorter. ;) )

Next, I tried to unroll the inner loop to improve performance. Quit simple, however the result was a grey box.

I was confused and experimented a lot to find what the problem was. In the end, I came up with this chunk (which does work) to demonstrate the problem:

[codebox]__kernel void

medianBlur(__global __read_only image2d_t in,

        __global __write_only image2d_t out,

        __global uint* buf) // added a int[4] buffer to read sample values back

for (int dx = -p; dx <= p; ++dx)

{

    int ix = x + dx;

    for (int dy = -p; dy <= p; ++dy)

    {

        int iy = y + dy;

        uint4 t = read_imageui(in, sampler, (int2)(ix,iy));

        uint4 t2 = read_imageui(in, sampler, (int2)(ix,iy));

        if (x == 320 && y == 230 && dx == -p) // only take one pixel close to image center

        {

             buf[0] = t.x;

             buf[1] = t.y;

             buf[2] = t.z;

             buf[3] = t.w;

        }

        sum += t;

    }

}

// ...

[/codebox]

Now, if I change the buf[n] code a little, all I get as resulting image is a grey square:

[codebox]// …

    for (int dy = -p; dy <= p; ++dy)

    {

        int iy = y + dy;

        uint4 t = read_imageui(in, sampler, (int2)(ix,iy));

        uint4 t2 = read_imageui(in, sampler, (int2)(ix,iy));

        if (x == 320 && y == 230 && dx == -p) // only take one pixel close to image center

        {

             buf[0] = t.x;

             buf[1] = t.y;

             buf[2] = t.z;

             buf[3] = t2.w; // <-- !! only changed this line

        }

        sum += t;

    }

// ...

[/codebox]

The only difference is that buf is filled with both values from t and t2, but they should be the same. Reading back buf to host memory, buf is filled with EDIT its initial values END EDIT instead of the pixel’s correct values. When not mixing t and t2, buf contains the correct values. What is going on here?

The same strange thing happens when I do the actual loop unrolling, something like this:[codebox] sum += read_imageui(in, sampler, (int2)(ix,iy));

    iy++;

    sum += read_imageui(in, sampler, (int2)(ix,iy));

    iy++;

    // ... etc

[/codebox]

Thanks in advance to everybody for their time. :)

EDIT(1): P.S.: I’m using the 195.39 beta driver and CUDA Toolkit 3.0 beta1.

EDIT(2): The problem remains the same with the latest driver (195.62).

What is the C++ code that you expect is causing the bug? The C++ bindings are being included in part of the 1.1 specification so any feedback is greatly appreciated.

Honestly, I have no clue. Since the problem appeared and disappeared simply by switching one statement in the kernel, there is no obvious culprit on the C++ bindings side. First I thought the values could be read back in the wrong way, but that didn’t seem reasonable considering it worked in one kernel, but didn’t in the other. I have since moved on and not investigated the case anymore, so I’m sorry but I think I cannot help you (it seems I’ve even overwritten the files that contained the problematic code…).

However, I had another issue with the C++ bindings: it didn’t allow to not give a local work group size in order to signal that OpenCL should choose something by itself, which you can do in the plain C method. But that bug is probably already known, I assume…? :)