Does OpenCL support concurrent kernel execution (CKE)? The docs imply that it does, but there are no samples for it and no real instructions on how to implement it.


We do support concurrent kernel execution in OpenCL, but there are some caveats. Currently it only works with multiple cl_command_queues: you can get overlap between kernels enqueued on different queues.
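A minimal host-side sketch of that setup (the function name, kernel handles, and work sizes here are placeholders, and error checking is omitted):

```c
#include <CL/cl.h>

/* Sketch: enqueue two pre-built kernels on separate in-order queues so the
 * driver is free to overlap them on a Fermi-class GPU. kernelA/kernelB and
 * the work sizes are assumed to be set up by the caller. */
void launch_concurrently(cl_context ctx, cl_device_id dev,
                         cl_kernel kernelA, cl_kernel kernelB,
                         size_t gwsA, size_t gwsB)
{
    cl_int err;
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev,
                                               CL_QUEUE_PROFILING_ENABLE, &err);
    cl_command_queue q2 = clCreateCommandQueue(ctx, dev,
                                               CL_QUEUE_PROFILING_ENABLE, &err);

    clEnqueueNDRangeKernel(q1, kernelA, 1, NULL, &gwsA, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q2, kernelB, 1, NULL, &gwsB, NULL, 0, NULL, NULL);

    /* No clFinish() between the enqueues: flush both queues so the kernels
     * reach the device together, then wait for both only at the end. */
    clFlush(q1);
    clFlush(q2);
    clFinish(q1);
    clFinish(q2);

    clReleaseCommandQueue(q1);
    clReleaseCommandQueue(q2);
}
```

The key point is that nothing blocks between the two clEnqueueNDRangeKernel calls; any synchronization there serializes the kernels.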

In practice it is often tricky to get performance gains from concurrent kernels, but if you have problems please file a bug.

I’ve tried multiple command queues, but that has not worked. Any other ideas? Is there an example from Nvidia that I can look at? I didn’t see one in the SDK samples, which is unfortunate.

What hardware are you using? Concurrent kernels are only supported on Fermi.

I agree we should have an SDK sample, we’ll work on that.

I’m running it on a GTX 480, so yes, Fermi; I know it’s only supported on Fermi.

So basically, I’m trying to run a MatrixTranspose and a SimpleConvolution concurrently. When done without any wait events or clFinish(), the SimpleConvolution executes first (even though it’s enqueued second in the code). Once it finishes, I get a really long execution time for MatrixTranspose, like 47303 (when it actually takes ~8200 run alone), while the time for SimpleConvolution doesn’t change, ~26000.

The problem is that if I block the MatrixTranspose, I get the right run time (~8200), but then there is no concurrency, correct? Because I am blocking it with a wait event or clFinish()?

Now, it’s possible that I may see no benefit with these two kernels, but my problem is that I can’t even tell whether concurrent execution is working at all (which is all I want to know; I’m not really looking for a speedup, but I want to do a comparison with concurrent kernels and need to get this working).

BTW, thanks for the help.

Try running a compute-bound problem in parallel with a memory-bandwidth-bound problem - maybe you can demonstrate some speedup that way.

Or sample the actual order of execution using atomic operations.
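One way to do that sampling (a sketch; the kernel name, buffer arguments, and the host-side readback are assumptions, not something from an SDK sample): give both kernels the same global ticket counter, and have each work-group grab a ticket as it starts. If, after reading the buffers back, the tickets recorded by kernel A interleave with those of kernel B, the two kernels really were on the device at the same time.

```c
/* OpenCL C (1.1) sketch: each work-group takes a ticket from a counter
 * shared by BOTH kernels. Interleaved ticket values across the two kernels'
 * my_tickets buffers indicate genuine concurrent execution. */
__kernel void sampled_kernel(volatile __global int *ticket_counter,
                             __global int *my_tickets) /* one slot per group */
{
    if (get_local_id(0) == 0)
        my_tickets[get_group_id(0)] = atomic_inc(ticket_counter);

    /* ... the kernel's real work would follow here ... */
}
```

Remember to initialize the ticket counter to zero once on the host before launching either kernel.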

See, that is somewhat confusing to me. If only one kernel runs at a time on any given multiprocessor, is there some kernel switching? For example, if one kernel is waiting on fetches from global memory, will it be switched out for another kernel until those fetches arrive?

I also tried a BlackScholes with MatrixTranspose and ran into the same problem. Even if it’s getting a speedup (which I don’t think it is), I can’t tell, because the profiler gives me these really long run times unless I block the kernel.

No, kernels are never switched (at least on current hardware, although the PTX manual reserves the right to change that in the future). Concurrent kernels run on different SMs.

So my suggestion to try a compute-bound and a memory-bound kernel in parallel is indeed based on the assumption that SMs may share their memory bandwidth asymmetrically. This in turn assumed that Nvidia would probably use a ring bus to connect the SMs to the memory controllers. Any of those assumptions may be wrong, of course, particularly remembering how Nvidia blamed the complexity of the fabric for Fermi’s delays.

I wouldn’t rely on the profiler to determine whether kernels execute in parallel, as the profiler itself changes how kernels are executed (I vaguely remember that it prevents concurrent kernel execution entirely?).

Also note that kernels only run in parallel once the first kernel no longer occupies all SMs. Thus parallel execution is best demonstrated when both kernels together use fewer blocks than there are SMs on the GPU (even though that configuration does not achieve full throughput).
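For example (a sketch; q1/q2 and kernelA/kernelB are hypothetical queues and kernels, and error handling is omitted), the GTX 480 has 15 SMs, so launching each kernel with only a handful of work-groups leaves room for both:

```c
/* Sketch: 4 work-groups of 256 work-items per kernel -> 8 groups total,
 * fewer than the GTX 480's 15 SMs, so both kernels can be resident at once. */
size_t local_size  = 256;
size_t global_size = 4 * local_size;   /* 4 work-groups */
clEnqueueNDRangeKernel(q1, kernelA, 1, NULL, &global_size, &local_size,
                       0, NULL, NULL);
clEnqueueNDRangeKernel(q2, kernelB, 1, NULL, &global_size, &local_size,
                       0, NULL, NULL);
```

With sizes like these, any overlap you observe is unambiguous, even though neither kernel saturates the GPU.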