No speedup from running 2 kernels concurrently on a GPU device of compute capability 2.0

I’m creating 2 in-order command queues to execute 2 kernels concurrently on an NVIDIA Tesla C2070, which has compute capability 2.0. The kernels are completely independent of each other. I expected some speedup compared to running the 2 kernels on the same in-order command queue, but there is no speedup at all. Does anyone know what I’m doing wrong?

Here is my code:

Concurrent kernels are not yet supported in OpenCL on NVIDIA (or AMD) GPUs, whether you use a single queue with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE or multiple queues. Concurrent kernels are supported on the GPU in CUDA using multiple streams, so support in OpenCL should be possible, eventually. AMD’s OpenCL supports concurrent kernels on the CPU using multiple queues, but not with out-of-order queues. Unfortunately, the spec has no interface to describe how concurrent kernels are implemented (e.g., for the CPU using AMD’s OpenCL), let alone whether they are supported. http://forums.nvidia.com/index.php?showtopic=207195

That’s really good to know. Thank you very much.

So, I saw this:
“For devices of compute capability 2.0, multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using queues to enable enough kernels to execute concurrently.”
from the OpenCL Programming Guide for the CUDA Architecture, version 3.2. That’s why I thought OpenCL supports concurrent kernels on the GPU. Why does the OpenCL Programming Guide make that statement?

I’m not sure why the programming guide is written that way, but I haven’t gotten concurrent kernels to work in OpenCL, though I have in CUDA. (My OpenCL test program is here: http://domemtech.com/code/ocl-task-parallel.zip .) The only other thing I can think of is that some extension might support concurrent kernels, e.g. using cl_ext_device_fission with one kernel per fission device, but it is not obvious how. As with anything, show me the code, then I will believe.

Ken
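
For anyone who wants to check whether two kernels really overlapped, rather than inferring it from wall-clock speedup alone, here is a minimal sketch in C. It assumes you have already read each kernel's start and end timestamps in nanoseconds (in OpenCL, clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END returns these) and that both sets of timestamps come from the same device clock. The kernel_span struct and the two helper functions are hypothetical names for illustration, not part of any API.

```c
#include <assert.h>
#include <stdint.h>

/* Start/end device timestamps of one kernel, in nanoseconds.
   Hypothetical struct: the values are assumed to have been read
   beforehand from the kernel's profiling event. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} kernel_span;

/* Nanoseconds during which both kernels were executing at once.
   A result of 0 means the two kernels ran back to back (serialized). */
uint64_t overlap_ns(kernel_span a, kernel_span b)
{
    assert(a.end_ns >= a.start_ns && b.end_ns >= b.start_ns);
    uint64_t latest_start = a.start_ns > b.start_ns ? a.start_ns : b.start_ns;
    uint64_t earliest_end = a.end_ns < b.end_ns ? a.end_ns : b.end_ns;
    return earliest_end > latest_start ? earliest_end - latest_start : 0;
}

/* Time from the first kernel starting to the last kernel finishing. */
uint64_t total_span_ns(kernel_span a, kernel_span b)
{
    assert(a.end_ns >= a.start_ns && b.end_ns >= b.start_ns);
    uint64_t first_start = a.start_ns < b.start_ns ? a.start_ns : b.start_ns;
    uint64_t last_end = a.end_ns > b.end_ns ? a.end_ns : b.end_ns;
    return last_end - first_start;
}
```

If overlap_ns comes back as 0 for kernels submitted to two different queues, the driver serialized them, which would match the lack of speedup reported above. Note that OpenCL only records these timestamps when the queue was created with CL_QUEUE_PROFILING_ENABLE.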

I have a problem with concurrent kernels in CUDA (version 3.2).
There is no speedup.
My card is a GTX 570.