First of all, the CL_OUT_OF_RESOURCES error I encountered with NVIDIA OpenCL is similar to the “kernel launch timed out” error I had before with CUDA: basically, a kernel that runs for more than 5~10 seconds on a non-dedicated GPU (one that is also driving a display) gets killed by the driver's watchdog. I refined my workload and now it works fine.
I found that the CL_INVALID_WORK_GROUP_SIZE error was related to the shared/constant memory limits of the NVIDIA device. Passing NULL as the work-group size gets around it, but the speed is quite bad, so I still need to find a way to set the work-group size explicitly (see the sketch below).
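What I plan to try is querying the largest work-group size this kernel can support on this device and rounding it down to a warp multiple; a minimal sketch (the kernel/device handles and the actual launch are assumed to exist elsewhere):

    size_t max_wg = 0;
    cl_int err = clGetKernelWorkGroupInfo(kernel, device,
            CL_KERNEL_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
    if (err == CL_SUCCESS && max_wg >= 32) {
        /* round down to a multiple of 32 (the warp size) and pass this
           as the local work size instead of NULL */
        size_t local = (max_wg / 32) * 32;
        /* ... use `local` in clEnqueueNDRangeKernel ... */
    }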
For CUDA, I can compile with -cubin to find out the amount of shared/constant memory and registers used per kernel, but this does not seem to be available for OpenCL.
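One possible substitute I came across (I haven't verified it on every driver version, so treat this as an assumption) is NVIDIA's vendor build option -cl-nv-verbose, which is supposed to dump the ptxas resource usage (registers, shared and constant memory per kernel) into the build log:

    clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);

    char log[8192];
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
            sizeof(log), log, NULL);
    printf("%s\n", log);  /* ptxas info lines list registers/smem/cmem per kernel */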
For a “const” parameter, will it be stored in constant memory?
For the parameter “float twin0”, where will it be stored? In CUDA, I remember kernel arguments are placed in shared memory.
By declaring __constant float4 *gproperty, I want to pass an array into constant memory; is this the right way to do it? (In CUDA, I only need cudaMemcpyToSymbol to initialize the constant array, and I don't need to pass it as an argument.) My current setup is sketched below.
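For context, this is roughly what I have right now; the kernel body and the host-side names (propagate, ctx, NPROP, host_props) are placeholders, so take it as a sketch rather than my exact code:

    /* kernel side: the array arrives through a __constant pointer argument */
    __kernel void propagate(__constant float4 *gproperty,
                            const float twin0,
                            __global float4 *out)
    {
        int i = get_global_id(0);
        out[i] = gproperty[0] * twin0;  /* placeholder body */
    }

    /* host side: an ordinary read-only buffer, bound with clSetKernelArg */
    cl_mem gprop = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            sizeof(cl_float4) * NPROP, host_props, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &gprop);
    clSetKernelArg(kernel, 1, sizeof(cl_float), &twin0);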
How much shared memory and constant memory are consumed by this kernel (I set globalworksize=1024 and workgroupsize=128)? I have an 8800GT card. Is there a way to query this at run time? (See the sketch below.)
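The closest run-time query I could find so far only reports the kernel's __local usage and the device-wide limits, not the constant usage of a specific kernel (on the 8800GT the limits should come back as 16 KB of local memory and 64 KB of constant memory):

    cl_ulong kern_local = 0, dev_local = 0, dev_const = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
            sizeof(kern_local), &kern_local, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
            sizeof(dev_local), &dev_local, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
            sizeof(dev_const), &dev_const, NULL);
    printf("kernel local mem %llu B, device local limit %llu B, constant limit %llu B\n",
            (unsigned long long)kern_local, (unsigned long long)dev_local,
            (unsigned long long)dev_const);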
By specifying a block dimension of 128, I am now getting a 40x speed-up compared with a 1x1x1 block (NULL workgroupsize). However, this OpenCL code is still 8x slower than an equivalent CUDA implementation I wrote. The launch now looks like the sketch below.
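For reference, this is roughly how the launch is set up now (queue/kernel handles assumed; error handling stripped):

    size_t global = 1024;   /* globalworksize */
    size_t local  = 128;    /* workgroupsize; global must be a multiple of local */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
            0, NULL, NULL);
    clFinish(queue);        /* wait so the timing measures the kernel itself */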
I ran the profiler: the register count (60) is on the same scale as the CUDA version (53), and the shared memory usage is also similar. In both cases I used fast math.
Is NVIDIA OpenCL significantly slower than CUDA? I found differing views on this: one post says it is 5% slower, another says 4x~5x slower. Which is more realistic?
If it is 4x~5x slower then you are doing something wrong. Just recently I wrote a relatively complicated application in OpenCL, consisting of many different kernels that employed most of the functionality of the latest NVIDIA GPUs (shared memory, atomics, …), and then I did the same in CUDA. To my surprise, the OpenCL version was actually faster than the CUDA one (I think mainly because I used image objects more effectively in OpenCL). The point is that the difference in performance is rather minimal … unless you are doing something wrong.
If you are using an older OpenCL implementation (such as the one that came with CUDA 2.3), it could be much slower in some cases. You should try the newer OpenCL implementation in the CUDA 3.0 beta if possible.
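A quick way to confirm which implementation and driver you are actually running (a minimal sketch; the platform and device handles are assumed to be obtained already):

    char platform_ver[256], driver_ver[256];
    clGetPlatformInfo(platform, CL_PLATFORM_VERSION,
            sizeof(platform_ver), platform_ver, NULL);
    clGetDeviceInfo(device, CL_DRIVER_VERSION,
            sizeof(driver_ver), driver_ver, NULL);
    printf("platform: %s, driver: %s\n", platform_ver, driver_ver);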
Thank you for confirming this; that's good news for me. In fact, on ATI hardware (a 4890 OC) I already get decent acceleration compared to CUDA on the 8800GT, so given your comment I think the OpenCL code should work equally well on NVIDIA cards. I will keep looking into this; it is probably just a configuration issue.
By the way, I had the impression that OpenCL currently doesn't support atomics; has this changed?
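(From the 1.0 spec it looks like 32-bit integer atomics are exposed as optional extensions rather than core; a sketch of how one might check for and use them, untested on my side:)

    /* host: check the device's extension string */
    char ext[4096];
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL);
    if (strstr(ext, "cl_khr_global_int32_base_atomics") == NULL)
        printf("no global int32 atomics on this device\n");

    /* kernel: enable the extension, then the atom_* functions are available */
    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
    __kernel void count_hits(__global int *counter)
    {
        atom_inc(counter);
    }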