OpenCL jobs are using an unusually large amount of CPU

jpsollie2 · April 14, 2013, 7:57am

Hardware setup:
Phenom II X4 940 overclocked @ 3.2 GHz
1 × GeForce 8800 GTS (G92)
1 × GeForce 8800 GT (G92)
1 × GeForce 9500 GT (G92)

Software setup:
Linux kernel 3.2.32
NVIDIA driver 312
AMD APP SDK 2.7 (CPU OpenCL driver)

Problem:
Compared to a much slower laptop (Turion ZM-86 + Radeon HD 4570), this machine is incredibly slow when executing an OpenCL job: each element takes at least 4 times as long. The CPU, which is far faster (higher clock, 4 cores and a 128-bit SSE2 execution unit), does not produce any results either.
Is there some form of coordination required by the NVIDIA drivers that might be limiting performance when working with multiple cards?

This is my program's debug output at launch (I guess everybody will understand it even though it is not their program):
--------------
found 1 devices on platform 0
Query device platform name:
AMD Accelerated Parallel Processing
Query vendor platform name:
Advanced Micro Devices, Inc.
Query openCL version name:
OpenCL 1.2 AMD-APP (923.1)
found 3 devices on platform 1
Query device platform name:
NVIDIA CUDA
Query vendor platform name:
NVIDIA Corporation
Query openCL version name:
OpenCL 1.1 CUDA 4.2.1
passing through device 0 on platform 0
available global memory cache size: 65536
available global memory device size: 4143386624
available local memory size: 32768
device name: AMD Phenom™ II X4 920 Processor
passing through device 0 on platform 1
available global memory cache size: 0
available global memory device size: 536674304
available local memory size: 16384
device name: GeForce 8800 GT
passing through device 1 on platform 1
available global memory cache size: 0
available global memory device size: 536674304
available local memory size: 16384
device name: GeForce 9500 GT
passing through device 2 on platform 1
available global memory cache size: 0
available global memory device size: 670367744
available local memory size: 16384
device name: GeForce 8800 GTS
the last call to clBuildProgram on the specified program object for device was successful.
the last call to clBuildProgram on the specified program object for device was successful.

copied algorithm info to shared memory

And this is the program's internal data while executing jobs (I attached GDB to it while it was running):

typedef struct {
    communication_data* hosts;
    task_data* info;
    device_selector selected_device;
    cl_kernel invoked_kernel;
    cl_mem kernel_memory_args[1];
    cl_mem constant_memory_args[4];
    int needed_iterator_bytes;
    cl_uchar fixedstring[16];
    unsigned int varstring[8];
    cl_ulong answer;
    volatile pthread_t pid;
    size_t dimension_settings[6]; // 0-2: global dimensions, 3-5: workgroup dimensions
} taskentry;

(gdb) print entry_list[0]
$5 = {hosts = 0x20f36e0, info = 0x20f36a0, selected_device = {platform_number = 1, device_number = 2}, invoked_kernel = 0x205c4d0, kernel_memory_args = {0x2259aa0}, constant_memory_args = {
0x23f5410, 0x20f21b0, 0x2405b80, 0x205ee20}, needed_iterator_bytes = 1, fixedstring = "C\000\000\000`\000\000\000\000\000\000\000\060\000\000", varstring = {67, 0, 0, 0, 0, 0, 0, 0},
answer = 0, pid = 140496206808832, dimension_settings = {8100, 8100, 90, 90, 1, 1}}
(gdb) print entry_list[1]
$6 = {hosts = 0x20a3b40, info = 0x1f5e760, selected_device = {platform_number = 1, device_number = 1}, invoked_kernel = 0x209c650, kernel_memory_args = {0x2044b40}, constant_memory_args = {
0x21dc940, 0x20f5550, 0x20f5c70, 0x20f6370}, needed_iterator_bytes = 1, fixedstring = "A\002\000\000\060\000\000\000\000\000\000\000\201\001\000", varstring = {16705, 0, 0, 0, 0, 0, 0,
0}, answer = 0, pid = 140496198416128, dimension_settings = {8100, 8100, 90, 90, 1, 1}}
(gdb) print entry_list[2]
$7 = {hosts = 0x20453d0, info = 0x2045390, selected_device = {platform_number = 1, device_number = 0}, invoked_kernel = 0x2045410, kernel_memory_args = {0x209a8f0}, constant_memory_args = {
0x20455d0, 0x2071120, 0x2071840, 0x209a200}, needed_iterator_bytes = 1, fixedstring = "B\000\000\000\060\000\000\000\000\000\000\000Á\000\000", varstring = {16962, 0, 0, 0, 0, 0, 0, 0},
answer = 0, pid = 140496190023424, dimension_settings = {8100, 8100, 90, 90, 1, 1}}
(gdb) print entry_list[3]
$8 = {hosts = 0x1c3b0f0, info = 0x2259eb0, selected_device = {platform_number = 0, device_number = 0}, invoked_kernel = 0x2045a00, kernel_memory_args = {0x1fe6d70}, constant_memory_args = {
0x1fc7460, 0x1fc7120, 0x1fe6a90, 0x1fe6c00}, needed_iterator_bytes = 1, fixedstring = "A\000\002\000 \002\000\000\000\000\000\000\060\000\000", varstring = {16708, 0, 0, 0, 0, 0, 0, 0},
answer = 0, pid = 140495896442624, dimension_settings = {8100, 8100, 90, 90, 1, 1}}
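
For readers who want to check the numbers, the dimension_settings values in the dump imply the work sizes computed below. This is only a small sketch in plain C: the helper names (groups_in_dim, total_work_items, total_work_groups) are made up for illustration and are not part of the original program, and the array layout is assumed from the comment in the struct (0-2 global, 3-5 work-group).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout, taken from the struct comment:
 * settings[0..2] = global sizes, settings[3..5] = work-group sizes. */

/* Number of work-groups along one dimension (dim = 0, 1 or 2). */
unsigned long long groups_in_dim(const size_t settings[6], int dim)
{
    assert(dim >= 0 && dim < 3);
    assert(settings[dim + 3] != 0);
    /* OpenCL 1.x requires the global size to be a multiple of the local size. */
    assert(settings[dim] % settings[dim + 3] == 0);
    return (unsigned long long)(settings[dim] / settings[dim + 3]);
}

/* Total number of work-items in one kernel job. */
unsigned long long total_work_items(const size_t settings[6])
{
    return (unsigned long long)settings[0] * settings[1] * settings[2];
}

/* Total number of work-groups in one kernel job. */
unsigned long long total_work_groups(const size_t settings[6])
{
    return groups_in_dim(settings, 0)
         * groups_in_dim(settings, 1)
         * groups_in_dim(settings, 2);
}
```

With {8100, 8100, 90, 90, 1, 1} this gives 90 × 8100 × 90 = 65,610,000 work-groups and 8100 × 8100 × 90 = 5,904,900,000 work-items per job, so the third dimension of the grid is 90 groups deep.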

The .cl file is written in OpenCL 1.0; I know these cards are old, so I should not demand too many features from them. It mainly consists of bit-shift operations and bitwise AND / OR / XOR / NOT on 32-bit integers.

any idea what might be the problem here?

jpsollie2 · April 14, 2013, 5:44pm

Edit, I forgot:

The sizes of the CONSTANT arguments are sizeof(cl_uint4), 256, 64 and 64 bytes.
The size of the GLOBAL argument is sizeof(cl_ulong). This buffer is write-only and only 1 work-item will write into it (just to inform those who are thinking about memory issues).

jpsollie2 · April 14, 2013, 6:59pm

All right, I think I found something, but I’d still like confirmation from the NVIDIA CUDA experts.

On the Wikipedia page for CUDA, I found that GPUs with compute capability < 2.0 do not support a third dimension in the grid of thread blocks. My guess is that the CPU therefore has to re-send the kernel, with a new setup, once for each step of the third dimension (90 times), and also recompile it because the kernel calls get_group_id(2).
With 1 GPU that would not be a problem, but ‘compiling’ 8100*8100 work-items (reserving memory, pre-defining variables, constants, …) 90 times per kernel job for each GPU, on top of executing its own kernel, is probably way too much for this CPU.

What do you think? If this is the problem, will a card with compute capability 2.0 or higher solve it?
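
One way to test this guess without new hardware would be to avoid the third dimension altogether, by folding it into dimension 0 on the host and recovering the 3D index inside the kernel. The sketch below shows only the index arithmetic in plain C. The names index3, fold_z_into_x and unfold_index are hypothetical, it assumes a work-group size of 1 in the third dimension (as in the dump), and it has not been confirmed that the driver behaves as guessed or that this changes the CPU load.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type holding a recovered 3D index. */
typedef struct { size_t x, y, z; } index3;

/* Host side: fold the third dimension into dimension 0.
 * in:  6 values laid out like dimension_settings (global 0-2, local 3-5)
 * out: 4 values: global x, global y, local x, local y */
void fold_z_into_x(const size_t in[6], size_t out[4])
{
    assert(in[5] == 1);          /* sketch only handles local z == 1 */
    assert(in[3] != 0 && in[0] % in[3] == 0); /* groups must not straddle z slices */
    out[0] = in[0] * in[2];      /* folded global x */
    out[1] = in[1];              /* global y is unchanged */
    out[2] = in[3];              /* local x is unchanged */
    out[3] = in[4];              /* local y is unchanged */
}

/* Kernel side, written here as plain C: recover the 3D index from the
 * folded dimension-0 index. size_x is the original global x size. */
index3 unfold_index(size_t folded_x, size_t y, size_t size_x)
{
    index3 id;
    id.x = folded_x % size_x;
    id.y = y;
    id.z = folded_x / size_x;
    return id;
}
```

For {8100, 8100, 90, 90, 1, 1} the folded launch is 729000 × 8100 work-items with 90 × 1 work-groups, which is 8100 × 8100 groups in a 2D grid. In a kernel, folded_x would come from get_global_id(0), and because the work-group size in the third dimension is 1, the recovered z takes the place of get_group_id(2).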
