CUDA 4.0 concurrent kernels


I have two questions regarding CUDA 4.0 concurrent kernels (on devices with compute capability 2.0). I am looking for a detailed explanation that clarifies these concepts.

  1. When multiple threads launch different computation kernels at the same time on the same device, are those kernels actually executed simultaneously, or one after another?

  2. What is the exact difference between these two scenarios?

    1. Calling two different kernels from two different threads on the same device
    2. Calling two kernels from the same thread on the same device

As long as you use different streams to submit the kernels, you should be fine, and it does not matter how many threads you use to submit the work.

I am still not clear on what you are trying to explain. Could you please explain it in more detail?

To get kernels running concurrently you need two things: hardware that supports it, and kernels submitted to different streams (i.e. a different stream handle as the last parameter of the cudaKernel<<<>>> specification). It does not matter whether you populate your streams with kernels from many threads or from just one thread (as is done here): in either case the kernels are eligible to execute concurrently.
See programming guide paragraph and
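A minimal sketch of the two-stream launch described above, issued from a single host thread. The kernel names `kernelA` and `kernelB`, the buffer sizes, and the use of device 0 are assumptions made for illustration only. The `concurrentKernels` field of `cudaDeviceProp` reports whether the hardware supports concurrent kernel execution; even when it does, overlap is possible rather than guaranteed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels, used only to have two distinct pieces of work.
__global__ void kernelA(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

__global__ void kernelB(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] + 1.0f;
}

int main() {
    // Check the hardware side: does device 0 support concurrent kernels?
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no usable CUDA device\n");
        return 1;
    }
    printf("concurrentKernels = %d\n", prop.concurrentKernels);

    const int n = 1024;
    float *a = NULL, *b = NULL;
    cudaMalloc((void **)&a, n * sizeof(float));
    cudaMalloc((void **)&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // Check the software side: one stream per kernel.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The stream handle is the fourth launch-configuration parameter.
    kernelA<<<n / 256, 256, 0, s1>>>(a, n);
    kernelB<<<n / 256, 256, 0, s2>>>(b, n);

    // Wait for both streams and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return err == cudaSuccess ? 0 : 1;
}
```

Launching both kernels without a stream argument would put them in the default stream, where they run one after another.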

Thanks for the reply. I agree that if the kernels use different stream IDs, it does not matter whether they are launched by one thread or by several. But I do not think these kernels can run concurrently. What I mean is that at any given time the blocks of a kernel are distributed over the different SMs. If kernels can run concurrently, you are saying that some SMs will run blocks from kernel 1 while others run blocks from kernel 2, and so on. I do not think this should be the case.
What is your opinion on this? And if this is not the case, do you have any idea how the execution actually happens?
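One way to look at this question directly is to have each block record which SM it ran on. The sketch below is an assumption-laden illustration, not a definitive test: the kernel `recordSm`, the helper `currentSm`, the block count, and the spin length are all hypothetical, and for brevity the same kernel is launched twice rather than two different ones. It reads the PTX special register `%smid`, whose numbering is not guaranteed to be contiguous. The SM IDs alone do not prove that the two launches overlapped in time; they only show where the blocks were placed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the ID of the SM that the calling thread is currently running on.
__device__ unsigned int currentSm() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Hypothetical kernel: thread 0 of each block records its SM, then spins
// for a while so that a second launch has a chance to overlap with it.
__global__ void recordSm(unsigned int *smOfBlock, long long spinCycles) {
    if (threadIdx.x == 0) {
        smOfBlock[blockIdx.x] = currentSm();
        long long start = clock64();
        while (clock64() - start < spinCycles) { }
    }
}

int main() {
    const int blocks = 4;                     // assumed: fewer blocks than SMs
    const long long spinCycles = 50000000LL;  // assumed spin length
    const size_t bytes = blocks * sizeof(unsigned int);

    unsigned int *d1 = NULL, *d2 = NULL;
    unsigned int h1[blocks], h2[blocks];
    cudaMalloc((void **)&d1, bytes);
    cudaMalloc((void **)&d2, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same kernel, two launches, two streams.
    recordSm<<<blocks, 32, 0, s1>>>(d1, spinCycles);
    recordSm<<<blocks, 32, 0, s2>>>(d2, spinCycles);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h1, d1, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h2, d2, bytes, cudaMemcpyDeviceToHost);

    for (int i = 0; i < blocks; ++i)
        printf("launch 1 block %d -> SM %u\n", i, h1[i]);
    for (int i = 0; i < blocks; ++i)
        printf("launch 2 block %d -> SM %u\n", i, h2[i]);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d1);
    cudaFree(d2);
    return 0;
}
```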

For your kernels to run concurrently, their resource usage (registers, shared memory) should not be too demanding. I don’t know whether kernels with different function code can be overlapped. The best way to get precise answers to your questions is to run a benchmark in which you try different combinations: two identical kernels, two different kernels from the same module, two kernels from different modules, and any other scenario you can think of.
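A rough sketch of such a benchmark for the simplest combination, two launches of the same kernel. It times the pair once with both launches in one stream and once with each launch in its own stream. The kernel `spin`, the helper `timePair`, and the cycle count are assumptions chosen for illustration; the kernel uses a single small block so that resource usage does not prevent overlap. Host-side timing with `std::chrono` is used here, which assumes a compiler with C++11 support.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: busy-wait for roughly the given number of clock cycles.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Launch the kernel once in each given stream; return elapsed wall time in ms.
static double timePair(cudaStream_t first, cudaStream_t second, long long cycles) {
    cudaDeviceSynchronize();  // make sure no earlier work is still running
    auto t0 = std::chrono::steady_clock::now();
    spin<<<1, 32, 0, first>>>(cycles);
    spin<<<1, 32, 0, second>>>(cycles);
    cudaDeviceSynchronize();  // wait for both launches to finish
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    cudaStream_t s1, s2;
    if (cudaStreamCreate(&s1) != cudaSuccess ||
        cudaStreamCreate(&s2) != cudaSuccess) {
        printf("could not create streams\n");
        return 1;
    }

    const long long cycles = 100000000LL;  // assumed spin length per launch

    // Warm-up launch so one-time initialization is not included in the timing.
    spin<<<1, 32, 0, s1>>>(1000);
    cudaDeviceSynchronize();

    double sameStream = timePair(s1, s1, cycles);  // serialized by stream order
    double twoStreams = timePair(s1, s2, cycles);  // eligible to overlap

    printf("same stream : %.2f ms\n", sameStream);
    printf("two streams : %.2f ms\n", twoStreams);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```

If the two-stream time comes out close to half of the same-stream time, the launches most likely overlapped. The other combinations (different kernels, different modules, heavier register or shared memory use) can be tried by swapping the kernel passed to each launch.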

Thanks sergeyn! It seems worth trying out the combinations as you have suggested.