I have two questions regarding CUDA 4.0 concurrent kernels (on devices with compute capability 2.0). I am looking for a detailed explanation that clarifies these concepts.
When multiple host threads launch different computation kernels at the same time on the same device, are those kernels executed one after another, or truly simultaneously?
What is the exact difference between these two scenarios?
Call 2 different kernels from two different threads on the same device
To get kernels running concurrently you need two things: the hardware, and your kernels submitted to different streams (i.e., a different stream handle as the fourth parameter of the kernel<<<grid, block, sharedMem, stream>>> launch configuration). It does not matter whether you populate your streams with kernels from many host threads or from just one thread (as is done here); they should all execute concurrently in either case.
See Programming Guide sections 3.2.5.3 and 3.2.5.5.
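A minimal sketch of what this looks like from a single host thread; kernelA and kernelB here are placeholder kernels, and the only part that matters for concurrency is the stream argument in each launch configuration:

```
#include <cuda_runtime.h>

// Placeholder kernels standing in for real work.
__global__ void kernelA(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

__global__ void kernelB(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *dA, *dB;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));

    // Two non-default streams; kernels in different streams are
    // allowed (not guaranteed) to overlap on CC >= 2.0 hardware.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The fourth launch parameter selects the stream.
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(dA, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(dB, n);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```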
Thanks for the reply. I agree that if you use different stream IDs then it does not matter whether the kernels are launched by one thread or by multiple threads. But I don't think these kernels can run concurrently. What I mean is that at any one time the blocks of a kernel are distributed over different SMs. If kernels could run concurrently, that would mean some SMs run blocks from kernel 1 while others run blocks from kernel 2, and so on. I don't think that should be the case.
What is your opinion in this regard? And if this is not the case, any idea how the execution happens?
The resource usage of your kernels (registers, shared memory) must not be too demanding for your kernels to run concurrently. And I don't know whether kernels with different function code can be overlapped or not. The best way to get precise answers to your questions is to run a benchmark in which you try out different combinations: two identical kernels, two different kernels from the same module, two kernels from different modules, and whatever other scenarios you can think of. A rough sketch of such a benchmark follows.
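This sketch assumes a made-up busyKernel and times a pair of launches with CUDA events; launching only a few blocks per kernel leaves SMs free so overlap is possible at all. Compare the elapsed time when both launches go to the same stream versus two different streams:

```
#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel that just burns time per thread.
__global__ void busyKernel(float *d, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = d[i];
        for (int k = 0; k < iters; ++k)
            v = v * 0.999f + 0.001f;
        d[i] = v;
    }
}

// Time a pair of launches. Pass the same stream twice to force
// serialization, or two different streams to allow overlap.
static float timePair(float *dA, float *dB, int n,
                      cudaStream_t sA, cudaStream_t sB)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                       // recorded in stream 0
    busyKernel<<<4, 256, 0, sA>>>(dA, n, 200000); // few blocks: SMs left idle
    busyKernel<<<4, 256, 0, sB>>>(dB, n, 200000);
    cudaEventRecord(stop);                        // stream 0 waits on both streams
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 4 * 256;
    float *dA, *dB;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    printf("same stream:       %.2f ms\n", timePair(dA, dB, n, s1, s1));
    printf("different streams: %.2f ms\n", timePair(dA, dB, n, s1, s2));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```

If the second measurement is close to half the first, the two kernels did overlap; if both are roughly equal, they were serialized. You can swap busyKernel for your real kernels (same kernel twice, two kernels from the same module, etc.) to test each scenario.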