OpenCL on Windows much slower than on Mac? A simple convolution test

Hi,

I’m new to OpenCL development. I have a simple convolution kernel for 24-bit RGB images:

__kernel void ConvolveRGB(const global uchar *in,
                          global uchar *out,
                          int width, int height)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int row = width * 3;   /* bytes per image row (packed RGB) */

    /* Skip the one-pixel border so the 3x3 window stays inside the image. */
    if (x > 0 && x < width - 1 && y > 0 && y < height - 1)
    {
        /* One pass per colour channel. */
        for (int b = 0; b < 3; ++b)
        {
            int t = (y * width + x) * 3 + b;
            int v = 1 * in[t - row - 3]
                  + 2 * in[t - row]
                  + 1 * in[t - row + 3]
                  + 2 * in[t - 3]
                  + 4 * in[t]
                  + 2 * in[t + 3]
                  + 1 * in[t + row - 3]
                  + 2 * in[t + row]
                  + 1 * in[t + row + 3];
            v /= 16;             /* the weights sum to 16 */
            if (v > 255) v = 255;
            out[t] = v;
        }
    }
}

The host code that calls it looks like this:

void ProcessRGBOnGPU()
{
    /* Work sizes are arrays with one entry per dimension. */
    size_t g_work_size[2] = {IMAGE_WIDTH, IMAGE_HEIGHT};
    size_t l_work_size[2] = {16, 16};
    cl_event events[1];

    /* Non-blocking upload of the input image. */
    clEnqueueWriteBuffer(CLQueue, CLRGBInBuf, CL_FALSE, 0, IMAGE_WIDTH * IMAGE_HEIGHT * 3, RGBIn, 0, 0, 0);
    clEnqueueNDRangeKernel(CLQueue, CLKnlRGB, 2, 0,
                           g_work_size,
                           l_work_size,
                           0, 0, events);
    /* Blocking read: the call returns only after the data is in RGBOutGPU. */
    clEnqueueReadBuffer(CLQueue, CLRGBOutBuf, CL_TRUE, 0, IMAGE_WIDTH * IMAGE_HEIGHT * 3,
                        RGBOutGPU, 0, 0, 0);
}

The problem is that the same program, running on the same MacBook Pro (9400M/9600M GT), takes about 5 ms for a 2048×2048 image on Mac OS X, but about 250 ms on Windows XP! Why is the performance so different?

Any help is really appreciated. Thanks in advance.

ZhaoYu
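
To put the two timings in perspective, here is a small arithmetic sketch in plain C. It assumes that each call copies one packed RGB image to the device and one back through the API, as the host code above does, and it ignores kernel execution time, so the resulting figures are only lower bounds on the copy rate. The helper names `image_bytes` and `throughput_mb_per_s` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of one tightly packed image. */
size_t image_bytes(size_t width, size_t height, size_t channels)
{
    return width * height * channels;
}

/* Effective throughput in MB/s (1 MB = 1e6 bytes) when `bytes`
   are moved in `ms` milliseconds. */
double throughput_mb_per_s(size_t bytes, double ms)
{
    assert(ms > 0.0);
    return ((double)bytes / 1.0e6) / (ms / 1000.0);
}
```

For a 2048×2048 RGB image, `image_bytes(2048, 2048, 3)` is 12,582,912 bytes, so one call copies about 25.2 MB in total. Calling `throughput_mb_per_s` on that total gives roughly 5033 MB/s for 5 ms and roughly 101 MB/s for 250 ms.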

I don’t know if it’s related, but with the CUDA 3.0 toolkit, compilation was very slow (at least on my GTX 470). I think it’s fixed in 3.1.

I also tried CUDA 3.1, but it made no difference. Maybe it’s because of the different ICD implementations of OpenCL.

You are probably including startup time in your measurement. Check the time of the function alone. OpenCL is initialized when the first function is called, AFAIK.

I only measure the time of the function “void ProcessRGBOnGPU()” in a loop. On Mac OS X 10.6, the very first call takes about 60 ms, but later calls take only 5 ms. On Windows XP, every call takes about 250 ms. The program is unchanged on XP; it is just built with VC2008.
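
The measurement described here (first call versus later calls) can be sketched in plain C as below. This is only an illustration under stated assumptions, not the code used in the test: `average_call_ms`, `bench_fn` and `stand_in_workload` are made-up names, and `timespec_get` needs a C11 compiler, which VC2008 is not. Wall-clock timing of the whole function is meaningful here because ProcessRGBOnGPU() ends with a blocking read, assuming the command queue is in-order.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*bench_fn)(void);

/* Hypothetical stand-in for ProcessRGBOnGPU(); it only counts calls. */
static int g_calls = 0;
static void stand_in_workload(void) { ++g_calls; }

/* Milliseconds elapsed between two timestamps. */
static double elapsed_ms(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec) * 1000.0
         + (double)(b->tv_nsec - a->tv_nsec) / 1.0e6;
}

/* Calls fn() `warmup` times untimed (to absorb one-time start-up cost),
   then `iters` times timed, and returns the average wall-clock time
   per timed call in milliseconds. */
double average_call_ms(bench_fn fn, int warmup, int iters)
{
    struct timespec start, end;
    assert(fn != NULL && warmup >= 0 && iters > 0);

    for (int i = 0; i < warmup; ++i)
        fn();

    timespec_get(&start, TIME_UTC);
    for (int i = 0; i < iters; ++i)
        fn();
    timespec_get(&end, TIME_UTC);

    return elapsed_ms(&start, &end) / iters;
}
```

With the real function, a call such as `average_call_ms(ProcessRGBOnGPU, 1, 20)` would exclude the slow first call and report the steady-state time per call.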

What about running an almost empty kernel?
It could be a compiler or system issue. The compiler may be compiling the kernel badly, or the system driver may be adding too much overhead at run time.

An empty kernel is fine. Actually, in my tests, 90% of the time is spent on the final line “out[t] = v;”.

And do they work in a similar way?

You probably checked this, but are you running on the same GPU? OSX turns the 9600M GT on and off via the Energy Saver section of System Preferences.

I have the same model, but have never put another OS on it. How do you turn the 9600M GT on and off under XP?

Is it possible that you are using the 9600M GT on OSX and the 9400M on XP? That would not explain the entire difference. Still, it would help.

Yes. The source code is identical.
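
Identical source does not by itself show identical output, so one way to check that both platforms compute the same image is to compare the GPU result against a CPU reference. The following is a minimal sketch in plain C, assuming the same packed RGB layout and 1-2-1 / 2-4-2 / 1-2-1 weights as the kernel above; the names `convolve_rgb_reference` and `count_mismatches` are hypothetical and not part of the original program.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for the 3x3 convolution on a packed RGB image.
   Border pixels are left untouched, as in the OpenCL kernel. */
void convolve_rgb_reference(const unsigned char *in, unsigned char *out,
                            int width, int height)
{
    const int row = width * 3;   /* bytes per image row */
    assert(in != NULL && out != NULL && in != out);

    for (int y = 1; y < height - 1; ++y)
        for (int x = 1; x < width - 1; ++x)
            for (int b = 0; b < 3; ++b) {
                const int t = (y * width + x) * 3 + b;
                const int v = 1 * in[t - row - 3] + 2 * in[t - row] + 1 * in[t - row + 3]
                            + 2 * in[t - 3]       + 4 * in[t]       + 2 * in[t + 3]
                            + 1 * in[t + row - 3] + 2 * in[t + row] + 1 * in[t + row + 3];
                /* The weights sum to 16, so v / 16 never exceeds 255. */
                out[t] = (unsigned char)(v / 16);
            }
}

/* Number of byte positions at which two buffers of length n differ. */
size_t count_mismatches(const unsigned char *a, const unsigned char *b, size_t n)
{
    size_t diff = 0;
    for (size_t i = 0; i < n; ++i)
        if (a[i] != b[i])
            ++diff;
    return diff;
}
```

Running the reference on RGBIn and then calling `count_mismatches` on its output and RGBOutGPU, with both output buffers initialised the same way, should give 0 on either OS if the kernel behaves identically.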

I tried both the 9400M and the 9600M GT on OS X. The results are similar, except that the 9600M GT only works in 32-bit (i386) mode. On XP, only the 9600M GT can be used; you can’t switch to the 9400M.