EDIT(3): It seems that something is wrong with the C++ bindings of OpenCL (or I’ve somehow misused them in a way that messed up memory). I’ve converted my program to plain C and it works now (more or less). Hence, consider this solved. :)
Hello everybody,
I hate to ask here as I’m quite sure I’m missing something obvious, but I’ve messed with this problem more than a day and it simply won’t reveal its evil character. (Note that I’m taking my first steps…)
Basically I’m blurring an image by summing up the pixels’ R, G, B, A values in a certain area around each pixel and dividing by the number of pixels summed up. It works in the naive approach of two for-loops:
[codebox]__kernel void
medianBlur(__global __read_only image2d_t in,
__global __write_only image2d_t out)
{
size_t x = get_global_id(0);
size_t y = get_global_id(1);
uint4 sum = (uint4) 0;
for (int dx = -p; dx <= p; ++dx) // dx = [-4 .. 4]
{
int ix = x + dx;
for (int dy = -p; dy <= p; ++dy)
{
int iy = y + dy;
sum += read_imageui(in, sampler, (int2)(ix,iy));
}
}
// ... write_imageui(...)
}
[/codebox]
(Please kindly ignore in the ugliness of the code… I know it could be shorter. ;) )
Next, I tried to unroll the inner loop to improve performance. Quit simple, however the result was a grey box.
I was confused and experimented a lot to find what the problem was. In the end, I came up with this chunk (which does work) to demonstrate the problem:
[codebox]__kernel void
medianBlur(__global __read_only image2d_t in,
__global __write_only image2d_t out,
__global uint* buf) // added a int[4] buffer to read sample values back
for (int dx = -p; dx <= p; ++dx)
{
int ix = x + dx;
for (int dy = -p; dy <= p; ++dy)
{
int iy = y + dy;
uint4 t = read_imageui(in, sampler, (int2)(ix,iy));
uint4 t2 = read_imageui(in, sampler, (int2)(ix,iy));
if (x == 320 && y == 230 && dx == -p) // only take one pixel close to image center
{
buf[0] = t.x;
buf[1] = t.y;
buf[2] = t.z;
buf[3] = t.w;
}
sum += t;
}
}
// ...
[/codebox]
Now, if I change the buf[n] code a little, all I get as resulting image is a grey square:
[codebox]// …
for (int dy = -p; dy <= p; ++dy)
{
int iy = y + dy;
uint4 t = read_imageui(in, sampler, (int2)(ix,iy));
uint4 t2 = read_imageui(in, sampler, (int2)(ix,iy));
if (x == 320 && y == 230 && dx == -p) // only take one pixel close to image center
{
buf[0] = t.x;
buf[1] = t.y;
buf[2] = t.z;
buf[3] = t2.w; // <-- !! only changed this line
}
sum += t;
}
// ...
[/codebox]
The only difference is that buf is filled with both values from t and t2, but they should be the same. Reading back buf to host memory, buf is filled with EDIT its initial values END EDIT instead of the pixel’s correct values. When not mixing t and t2, buf contains the correct values. What is going on here?
The same strange thing happens when I do the actual loop unrolling, something like this:[codebox] sum += read_imageui(in, sampler, (int2)(ix,iy));
iy++;
sum += read_imageui(in, sampler, (int2)(ix,iy));
iy++;
// ... etc
[/codebox]
Thanks in advance to everybody for their time. :)
EDIT(1): P.S.: I’m using the 195.39 beta driver and CUDA Toolkit 3.0 beta1.
EDIT(2): The problem remains the same with the latest driver (195.62).