Easiest way to invoke two different kernels simultaneously?

What is the easiest way to invoke two different kernels simultaneously?
Kernel A + Kernel B

There are two warp schedulers per SM in Fermi.
Is it possible to assign kernel A and kernel B to each of those schedulers, so that they can execute simultaneously?

Or is the only possible way to rely on the GPU’s context-switching capability?
To achieve this, should I spawn two CPU threads and launch the CUDA kernels simultaneously?
If so, the GPU still executes one kernel at a time, right? (it just context-switches…)

Or any other suggestions?
Thank you

Fermi supports launching multiple kernels on the same GPU using the concept of “streams”. Kernels in different streams can execute at the same time on one GPU, while kernels in the same stream execute in order. See the section on streams in the CUDA Programming Guide.
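To make that concrete, here is a minimal sketch of launching two kernels into separate non-default streams. The kernel names, sizes, and the work they do are just placeholders; the point is the fourth launch-configuration parameter, which selects the stream:

```cuda
#include <cstdio>

__global__ void kernelA(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;   // dummy work
}

__global__ void kernelB(float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] *= 2.0f;   // dummy work
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launches in different non-default streams: the driver is
    // allowed (but not required) to overlap their execution.
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();  // wait for both kernels to finish

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Both launches return immediately on the host; synchronization only happens at the `cudaDeviceSynchronize()` call.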

Thanks. Does this mean that if I use a Fermi GPU, which has more than one warp scheduler per SM, two different kernels will be executed on each SM at the same time?

There is no relation between the number of warp schedulers per multiprocessor and the number of concurrent kernels. The limit on Fermi is 16 concurrent kernels, I believe, although it is up to the driver how many actually run simultaneously. (That’s an important note! Streams tell CUDA which kernels may run concurrently, but they do not guarantee concurrency. Streams still work on pre-Fermi GPUs, but the kernels run sequentially.)
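If you want to check at runtime whether your device supports concurrent kernels at all, a quick sketch using the device-properties query (the `concurrentKernels` field is 1 on devices that support it, 0 otherwise):

```cuda
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    printf("%s: concurrentKernels = %d\n", prop.name, prop.concurrentKernels);
    return 0;
}
```

On a Fermi card this should print `concurrentKernels = 1`; on pre-Fermi hardware it prints 0 and streamed kernels serialize.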

According to the Fermi whitepaper (page 18), all SMs are first filled with threads from the first kernel; after that, threads from a second kernel are scheduled. When all SMs are completely filled, threads from the next kernel have to wait until another kernel finishes. So execution is concurrent, but I don’t think there is a way to tell the GPU: use 5 SMs for kernel-1 and the other SMs for kernel-2. Would be nice though.