CUDA Parallel Kernels

Hi all,
I have a simple question: if I have two (or more) consecutive kernel launches in the host code, are they launched asynchronously? Or does the second launch start only after the first one has finished?

example:

kernel1<<<…, …>>>(par1, par2…)
kernel2<<<…, …>>>(par1, par2…)

I want to adopt this kind of programming model to avoid conditional structures that check the thread index and add extra overhead…
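
For comparison, here is a rough sketch of the two alternatives I mean; the kernel names and the split into "task A" / "task B" are just made up for illustration:

// The pattern I want to avoid: one kernel that branches on the thread index
__global__ void fusedKernel(const float *in, float *outA, float *outB, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) outA[i] = in[i] * 2.0f;   // even threads do task A
    else            outB[i] = in[i] + 1.0f;   // odd threads do task B
}

// What I would prefer: one simple kernel per task, launched back to back
__global__ void kernelA(const float *in, float *outA, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) outA[i] = in[i] * 2.0f;        // task A only
}

__global__ void kernelB(const float *in, float *outB, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) outB[i] = in[i] + 1.0f;        // task B only
}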

There is an implicit synchronization barrier between two kernels launched into the same stream: the second kernel will not start executing until the first one has finished. Note that the behavior of launching two kernels in different streams that access the same memory, without any sort of synchronization between the two streams, is undefined.
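
To make the same-stream case concrete, here is a rough host-side sketch; grid, block, and the parameters are placeholders reusing the names from your example:

// Both launches go into the default stream, so the device serializes them:
// kernel2 does not start until kernel1 has completely finished.
kernel1<<<grid, block>>>(par1, par2);
kernel2<<<grid, block>>>(par1, par2);

// The launch calls themselves return to the host immediately;
// block the CPU until all queued device work has completed:
cudaDeviceSynchronize();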

But is it possible using two different streams? Each kernel reads the same portion of memory but writes to two different portions…

… kernel 1 launch …   // doesn't stop (the host continues)

… kernel 2 launch …   // stop (the host waits here)
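
Something like this is what I have in mind; the kernel bodies, the sizes, and all the names below are just placeholders (the shared buffer d_in is only read by both kernels, and each kernel writes exclusively to its own output):

#include <cuda_runtime.h>

// Both kernels read the same input buffer; each writes only its own output.
__global__ void kernel1(const float *in, float *out1, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out1[i] = in[i] * 2.0f;
}

__global__ void kernel2(const float *in, float *out2, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out2[i] = in[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out1, *d_out2;
    cudaMalloc(&d_in,   n * sizeof(float));
    cudaMalloc(&d_out1, n * sizeof(float));
    cudaMalloc(&d_out2, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));   // stand-in for real input data

    // One stream per kernel so the two launches are allowed to overlap
    // (whether they really run concurrently depends on the GPU and its free resources).
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    kernel1<<<grid, block, 0, s1>>>(d_in, d_out1, n);  // returns immediately ("doesn't stop")
    kernel2<<<grid, block, 0, s2>>>(d_in, d_out2, n);  // also returns immediately

    cudaDeviceSynchronize();                           // the "stop": wait for both kernels

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_in);
    cudaFree(d_out1);
    cudaFree(d_out2);
    return 0;
}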