Executing CUDA kernels one after another (sequentially)


As far as I know, multiple CUDA kernels launched from host code are executed in parallel.

How can I make them execute sequentially, i.e. kernel2 must start only after kernel1 has completed?


Section of the CUDA 4.1 Programming Guide says:

Programmers can globally disable asynchronous kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should never be used as a way to make production software run reliably.


Kernels are automatically executed sequentially if they are launched in the same stream (which may just be the default stream, if you haven’t specified any).
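
As a minimal sketch of that behaviour (the kernel bodies, the helper name run_in_order, and the launch sizes are made up for illustration and are not from the question), the two launches below go into the default stream, so kernel2 should only start once kernel1 has finished. kernel2 reads what kernel1 wrote, so the final check only passes if the order was respected:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical first kernel: stores the index i in data[i].
__global__ void kernel1(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Hypothetical second kernel: doubles each element.
// The result is only 2*i if kernel1 has already written data[i].
__global__ void kernel2(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Launches both kernels in the default stream and checks the result.
// Assumes n > 0. Returns true if every element equals 2*i.
bool run_in_order(int n) {
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return false;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // No stream argument: both launches use the default stream,
    // so they are queued and executed in launch order.
    kernel1<<<blocks, threads>>>(d_data, n);
    kernel2<<<blocks, threads>>>(d_data, n);

    // cudaMemcpy is issued to the same stream, so it runs after both
    // kernels, and it blocks the host until the copy is done.
    std::vector<int> h_data(n);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h_data[i] != 2 * i) return false;
    }
    return true;
}

int main() {
    bool ok = run_in_order(1 << 20);
    printf("%s\n", ok ? "kernels ran in order" : "unexpected result or CUDA error");
    return ok ? 0 : 1;
}
```

Note that the launches themselves still return to the host immediately; only the execution on the device is ordered. If the host has to wait without copying anything back, calling cudaDeviceSynchronize() after the second launch is one way to do that.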