OpenCL jobs are using an unusually large amount of CPU time

Hardware setup:
Phenom II X4 940 overclocked @ 3.2 GHz
1 x GeForce 8800 GTS (G92)
1 x GeForce 8800 GT (G92)
1 x GeForce 9500 GT (G96)

Software setup:
Linux kernel 3.2.32
NVIDIA drivers 312
AMD APP SDK 2.7 (CPU OpenCL driver)

Problem:
Compared to a much slower laptop (Turion ZM-86 + Radeon HD 4570), this machine is incredibly slow when executing an OpenCL job: each element takes at least 4 times as long. The CPU device (which is WAY faster, as it is clocked higher, has 4 cores and a 128-bit SSE2 execution unit) does not get anything done either.
Is there some form of coordination required by the NVIDIA drivers that might be hurting performance when working with multiple cards?

This is my program’s debug output at launch (I guess everybody will understand it even though it is not their program):
--------------[i]
found 1 devices on platform 0
Query device platform name:
AMD Accelerated Parallel Processing
Query vendor platform name:
Advanced Micro Devices, Inc.
Query openCL version name:
OpenCL 1.2 AMD-APP (923.1)
found 3 devices on platform 1
Query device platform name:
NVIDIA CUDA
Query vendor platform name:
NVIDIA Corporation
Query openCL version name:
OpenCL 1.1 CUDA 4.2.1
passing through device 0 on platform 0
available global memory cache size: 65536
available global memory device size: 4143386624
available local memory size: 32768
device name: AMD Phenom™ II X4 920 Processor
passing through device 0 on platform 1
available global memory cache size: 0
available global memory device size: 536674304
available local memory size: 16384
device name: GeForce 8800 GT
passing through device 1 on platform 1
available global memory cache size: 0
available global memory device size: 536674304
available local memory size: 16384
device name: GeForce 9500 GT
passing through device 2 on platform 1
available global memory cache size: 0
available global memory device size: 670367744
available local memory size: 16384
device name: GeForce 8800 GTS
the last call to clBuildProgram on the specified program object for device was successful.
the last call to clBuildProgram on the specified program object for device was successful.

copied algorithm info to shared memory[/i]

And this is the program’s internal data while executing jobs (I attached GDB to it while it was running):


typedef struct {
    communication_data* hosts;
    task_data* info;
    device_selector selected_device;
    cl_kernel invoked_kernel;
    cl_mem kernel_memory_args[1];
    cl_mem constant_memory_args[4];
    int needed_iterator_bytes;
    cl_uchar fixedstring[16];
    unsigned int varstring[8];
    cl_ulong answer;
    volatile pthread_t pid;
    size_t dimension_settings[6]; // 0-2: global dimensions, 3-5: workgroup dimensions
} taskentry;

(gdb) print entry_list[0]
$5 = {hosts = 0x20f36e0, info = 0x20f36a0, selected_device = {platform_number = 1, device_number = 2}, invoked_kernel = 0x205c4d0, kernel_memory_args = {0x2259aa0}, constant_memory_args = {
0x23f5410, 0x20f21b0, 0x2405b80, 0x205ee20}, needed_iterator_bytes = 1, fixedstring = "C\000\000\000`\000\000\000\000\000\000\000\060\000\000", varstring = {67, 0, 0, 0, 0, 0, 0, 0},
answer = 0, pid = 140496206808832, dimension_settings = {8100, 8100, 90, 90, 1, 1}}
(gdb) print entry_list[1]
$6 = {hosts = 0x20a3b40, info = 0x1f5e760, selected_device = {platform_number = 1, device_number = 1}, invoked_kernel = 0x209c650, kernel_memory_args = {0x2044b40}, constant_memory_args = {
0x21dc940, 0x20f5550, 0x20f5c70, 0x20f6370}, needed_iterator_bytes = 1, fixedstring = "A\002\000\000\060\000\000\000\000\000\000\000\201\001\000", varstring = {16705, 0, 0, 0, 0, 0, 0,
0}, answer = 0, pid = 140496198416128, dimension_settings = {8100, 8100, 90, 90, 1, 1}}
(gdb) print entry_list[2]
$7 = {hosts = 0x20453d0, info = 0x2045390, selected_device = {platform_number = 1, device_number = 0}, invoked_kernel = 0x2045410, kernel_memory_args = {0x209a8f0}, constant_memory_args = {
0x20455d0, 0x2071120, 0x2071840, 0x209a200}, needed_iterator_bytes = 1, fixedstring = "B\000\000\000\060\000\000\000\000\000\000\000Á\000\000", varstring = {16962, 0, 0, 0, 0, 0, 0, 0},
answer = 0, pid = 140496190023424, dimension_settings = {8100, 8100, 90, 90, 1, 1}}
(gdb) print entry_list[3]
$8 = {hosts = 0x1c3b0f0, info = 0x2259eb0, selected_device = {platform_number = 0, device_number = 0}, invoked_kernel = 0x2045a00, kernel_memory_args = {0x1fe6d70}, constant_memory_args = {
0x1fc7460, 0x1fc7120, 0x1fe6a90, 0x1fe6c00}, needed_iterator_bytes = 1, fixedstring = "A\000\002\000 \002\000\000\000\000\000\000\060\000\000", varstring = {16708, 0, 0, 0, 0, 0, 0, 0},
answer = 0, pid = 140495896442624, dimension_settings = {8100, 8100, 90, 90, 1, 1}}
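
To put those dimension_settings into numbers, here is a small sketch in plain C (no OpenCL calls needed). The helper names groups_in_dim, total_work_items and total_work_groups are made up for this post and are not part of my program; the only thing taken from it is the layout of dimension_settings (0-2: global dimensions, 3-5: workgroup dimensions).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of work-groups in one dimension.
 * OpenCL 1.x requires the global size to be a multiple of the
 * work-group size, so the division is expected to be exact. */
uint64_t groups_in_dim(size_t global, size_t local)
{
    assert(local > 0 && global % local == 0);
    return (uint64_t)(global / local);
}

/* settings uses the same layout as dimension_settings in taskentry:
 * [0..2] global sizes, [3..5] work-group sizes. */
uint64_t total_work_items(const size_t settings[6])
{
    /* Cast first so the product is computed in 64 bits. */
    return (uint64_t)settings[0] * settings[1] * settings[2];
}

uint64_t total_work_groups(const size_t settings[6])
{
    return groups_in_dim(settings[0], settings[3])
         * groups_in_dim(settings[1], settings[4])
         * groups_in_dim(settings[2], settings[5]);
}
```

With {8100, 8100, 90, 90, 1, 1} this gives 90 x 8100 x 90 work-groups, i.e. 65,610,000 groups and 5,904,900,000 work-items per job and per device, so the total no longer fits in a 32-bit counter.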

The .cl file is written for OpenCL 1.0; I know these cards are old, so I shouldn’t demand too many features from them. The kernel mainly consists of bit-shift operations and bitwise AND / OR / XOR / NOT on 32-bit integers.

Any idea what might be the problem here?

edit
I forgot:

The sizes of the CONSTANT arguments are sizeof(cl_uint4), 256, 64 and 64 bytes.
The size of the GLOBAL argument is sizeof(cl_ulong). This buffer is write-only, and only 1 work-item will write into it (just to inform those who are thinking about memory issues).

All right, I think I found something, but I’d still like confirmation from the NVIDIA CUDA experts:

On Wikipedia: http://en.wikipedia.org/wiki/CUDA

I found that GPUs with compute capability < 2.0 do not support a 3rd dimension in the grid of thread blocks. My guess is that the CPU therefore has to set up and re-send the kernel once for each slice of the 3rd dimension (90 of them), possibly even recompiling it because the kernel calls get_group_id(2).
With 1 GPU that probably wouldn’t be a problem, but ‘compiling’ 8100*8100 work-items (reserving memory, pre-defining variables, constants, …) 90 times per kernel job for each GPU, on top of executing its own kernel, is probably way too much for this CPU.

What do you think? If this is the problem, will a card with compute capability >= 2.0 solve it?
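
If the 3rd dimension really is the issue, one workaround I could try is folding it into the 1st dimension so that the launch becomes 2-dimensional. Below is a minimal sketch of the index arithmetic in plain C. It assumes the NVIDIA OpenCL driver maps work-groups directly onto CUDA grid blocks (I have not confirmed this), and the names fold_groups_x, unfold_group_id, fits_cc1x_grid and group_id3 are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* CUDA limit per grid dimension on compute capability 1.x hardware. */
#define MAX_GRID_DIM_CC1X 65535u

typedef struct { size_t x, y, z; } group_id3;

/* Host side: fold the z groups into the x dimension.
 * Returns the number of groups in the folded x dimension. */
size_t fold_groups_x(size_t groups_x, size_t groups_z)
{
    return groups_x * groups_z;
}

/* Kernel side, simulated in plain C: recover the 3D group id from the
 * 2D one. In the .cl file, folded_x would be get_group_id(0) and y
 * would be get_group_id(1); groups_x is the original group count in x. */
group_id3 unfold_group_id(size_t folded_x, size_t y, size_t groups_x)
{
    group_id3 id;
    assert(groups_x > 0);
    id.x = folded_x % groups_x;  /* original get_group_id(0) */
    id.y = y;                    /* unchanged                */
    id.z = folded_x / groups_x;  /* original get_group_id(2) */
    return id;
}

/* Assumption: both folded grid dimensions must stay within the
 * compute capability 1.x limit for the launch to be native. */
int fits_cc1x_grid(size_t groups_x, size_t groups_y)
{
    return groups_x <= MAX_GRID_DIM_CC1X && groups_y <= MAX_GRID_DIM_CC1X;
}
```

For my sizes, the 90 groups in x and the 90 groups in z fold into 8100 groups, so the 2D launch would use global sizes {729000, 8100} with workgroup sizes {90, 1}, and both group counts (8100 and 8100) stay below 65535. Folding z into the 2nd dimension instead would give 8100 * 90 = 729000 groups, which is over that limit.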