Applications for concurrent kernels

I am curious whether anyone has gotten performance or functionality improvements from concurrent kernel execution.

With some experimentation, I’ve gotten them working (I verify the concurrency using atomic ops in global memory), but I haven’t measured any performance benefit yet. On my to-do list is to see whether two kernels can benefit from sharing Fermi’s L2 cache (say, the subkernels of a Scan operation), but I have not gotten there yet.
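
For anyone who wants to try the same thing, here is a minimal sketch of one way to check overlap with atomics in global memory. It is not necessarily the exact test described above. The names rendezvous, arrived, sawPeer and maxCycles are made up for illustration, and the timeout value is an arbitrary assumption. Each kernel bumps a shared counter, then waits a bounded number of cycles for the other kernel to do the same. If the first kernel to arrive sees the counter reach 2 before it gives up, the two kernels were resident on the device at the same time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical test kernel: announce arrival, then wait (with a timeout)
// for the peer kernel launched in the other stream.
__global__ void rendezvous(volatile int *arrived, int *sawPeer,
                           int id, long long maxCycles)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        atomicAdd((int *)arrived, 1);            // signal "I am running"
        long long start = clock64();
        // Bounded spin so the test cannot hang if the kernels are serialized.
        while (*arrived < 2 && clock64() - start < maxCycles) { }
        sawPeer[id] = (*arrived >= 2) ? 1 : 0;   // did the peer show up in time?
    }
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("device reports concurrentKernels = %d\n", prop.concurrentKernels);

    int *arrived = 0, *sawPeer = 0;
    cudaMalloc(&arrived, sizeof(int));
    cudaMalloc(&sawPeer, 2 * sizeof(int));
    cudaMemset(arrived, 0, sizeof(int));
    cudaMemset(sawPeer, 0, 2 * sizeof(int));
    cudaDeviceSynchronize();

    // Kernels can only overlap if they are issued to different non-default streams.
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    const long long maxCycles = 1LL << 28;       // assumed timeout, tune as needed
    for (int i = 0; i < 2; ++i)
        rendezvous<<<1, 32, 0, s[i]>>>(arrived, sawPeer, i, maxCycles);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int h[2] = {0, 0};
    cudaMemcpy(h, sawPeer, sizeof(h), cudaMemcpyDeviceToHost);

    // The later kernel always sees the counter at 2, so overlap is only
    // demonstrated when BOTH flags are set.
    printf("kernels overlapped: %s\n", (h[0] && h[1]) ? "yes" : "no");

    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    cudaFree(arrived);
    cudaFree(sawPeer);
    return 0;
}
```

Note that this only shows the kernels were co-resident; it says nothing about whether running them concurrently is faster than running them back to back. That would need separate timing, for example with CUDA events around both launch orders.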

Could those who have worked with concurrent kernels post their experiences?

(Even negative results would be good to know.)