trying to understand kernel parameters and CL_INVALID_WORK_GROUP_SIZE

I finally figured out a couple of issues with my code running on NVIDIA's OpenCL implementation. These problems were described in my earlier posts.

First of all, the CL_OUT_OF_RESOURCES error I encountered with NVIDIA OpenCL is similar to the “kernel launch timed out” error I had before with CUDA: a kernel that runs for more than about 5~10 seconds on a non-dedicated GPU (one that is also driving a display) will be killed by the driver's watchdog. I reduced my per-launch workload and now it works fine.

I found that the CL_INVALID_WORK_GROUP_SIZE error was related to the constant/shared memory limitations of the NVIDIA device. Passing NULL as the workgroupsize gets around it, but the speed is quite bad, so I still need to find a way to set the workgroupsize explicitly.

For CUDA, I can use -cubin to find out the amount of shared/constant memory and registers used per kernel, but this does not seem to be available for OpenCL.

My kernel has the following parameters:

__kernel void mcx_main_loop(const int nphoton, const int ophoton, __global const uchar media[],
	 float twin0, float twin1, float tmax, ..., __global float4 n_len[], __constant float4 gproperty[])

and I have the following questions:

  1. for a “const” parameter, will it be stored in constant memory?

  2. for the parameter “float twin0”, where will it be stored? In CUDA, I remember kernel arguments go in shared memory.

  3. by using __constant float4 gproperty[], I want to pass an array into constant memory. Is this the right way to do it? (In CUDA, I only need to use cudaMemcpyToSymbol to initialize the constant array, and I don’t need to pass it as a kernel argument.)

  4. how much shared memory and constant memory are consumed by this kernel (I set globalworksize=1024 and workgroupsize=128)? I have an 8800GT card.

thanks in advance for any helpful comments.

the problem of CL_INVALID_WORK_GROUP_SIZE is partially solved; see my follow-up post.

by specifying a block dimension of 128, I am now getting a 40x speed-up compared with a block size of 1x1x1 (with a NULL workgroupsize). However, this OpenCL code is still 8x slower than the equivalent CUDA implementation I wrote.

I ran the profiler; the register count (60) is on the same scale as the CUDA version (53), and the shared memory usage is also similar. In both cases I used fast math.

Is NVIDIA OpenCL significantly slower than CUDA? I have found different views on this: one post says it is 5% slower, another says 4x~5x slower. Which is more realistic?

If it is 4x-5x slower then you are doing something wrong. Just recently I wrote a relatively complicated application in OpenCL, consisting of a lot of different kernels that employed most of the functionality of the latest NVIDIA GPUs (shared memory, atomics, …), and then I did the same in CUDA. To my surprise, the OpenCL version was actually faster than the CUDA one (I think mainly because I used image objects more effectively in OpenCL). But the point is that the difference in performance is rather minimal … unless you are doing something wrong.

If you are using an older OpenCL implementation (such as the one that came with CUDA 2.3), it could be much slower in some cases. You should try the newer OpenCL implementation in the CUDA 3.0 beta if possible.

yes, I was using the CUDA 2.3 + 190.29 OpenCL implementation.

if I want to try what you suggested, should I download the 195.17 driver and CUDA Toolkit from the following link?

I am working on an Ubuntu 9.10 desktop; I have heard of some issues with CUDA 3.0 on 9.10, though.

thank you for confirming this. That’s good news for me. In fact, on ATI hardware (a 4890 OC), I already get decent acceleration compared to CUDA on the 8800GT. Given your comment, I think the OpenCL code should work equally well on NVIDIA’s cards. I will keep looking into this; it is probably just a configuration issue.

by the way, I had the impression that OpenCL currently doesn’t support atomics; has this changed?

for NVIDIA you need compute capability 1.1. Appendix A in the OpenCL Programming Guide outlines this.

Highlights: the only devices still at compute capability 1.0 are:
8800 Ultra, GTX, & GTS
Tesla S870, D870, C870
Quadro Plex 2100 Model S4, 1000 Model IV
Quadro FX 5600, 4600

All the rest have global atomics.

Compute capability 1.2 devices have local atomics, but no actual 1.2 devices exist; 1.3 devices do exist, though.

Amazing! Just by installing the new 3.0 driver and toolkit, the code got a 6x speed boost! I did not change anything in the code.

although it is still not as fast as the CUDA version (about 30% slower), it is at least in the ballpark.

this is great, thank you so much for pointing this out.

by the way, does the -cl-fast-relaxed-math option for clBuildProgram give roughly the same level of optimization as the -use_fast_math option in CUDA?