A kernel call within a stream is always one stream operation. All blocks and threads finish before the stream continues.
It does not matter whether the next operation in the stream is the same kernel or a different one.
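A minimal sketch of that ordering guarantee, assuming a plain (non-default) stream and two made-up kernels. The names `fill_index`, `reverse_copy` and `run_stream_order_demo` are hypothetical and only serve as illustration. The second kernel reads elements that were generally written by other blocks of the first kernel, which is only safe because the first launch has finished completely before the second one starts:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel A: every thread writes its own global index.
__global__ void fill_index(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Hypothetical kernel B: every thread reads an element that was, in general,
// written by a different block of kernel A.
__global__ void reverse_copy(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[n - 1 - i];
}

// Launches A, then B, in the same stream; returns true if B saw all of A's writes.
bool run_stream_order_demo(int n) {
    cudaStream_t stream;
    int *d_in = nullptr, *d_out = nullptr;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess ||
        cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        cudaFree(d_out);
        cudaStreamDestroy(stream);
        return false;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Two separate stream operations: B does not start before A is done.
    fill_index<<<blocks, threads, 0, stream>>>(d_in, n);
    reverse_copy<<<blocks, threads, 0, stream>>>(d_in, d_out, n);

    // The copy is a third operation in the same stream, ordered after B.
    std::vector<int> h_out(n);
    cudaMemcpyAsync(h_out.data(), d_out, n * sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) {
        ok = (h_out[i] == n - 1 - i);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    cudaStreamDestroy(stream);
    return ok;
}
```

Calling `run_stream_order_demo(1000)` from `main` should be enough to try it. If you merged both steps into one kernel instead, the cross-block read would no longer be safe without extra synchronization.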
Within a kernel call, the blocks and their threads run asynchronously, and yes, you can get all kinds of crazy race conditions.
That is why you try to make the blocks in your algorithm as independent of each other as possible.
The same applies to a lesser degree to warps, and to the least degree to the threads within a warp.
And where you have to share work or data, or reassign which thread is responsible for which data packet (that can make sense even within a kernel), you use synchronization primitives.
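As a rough sketch of what that can look like, here is a block-wise sum. The names `block_sum`, `sum_ones_on_gpu` and `kBlockSize` are hypothetical, and the kernel assumes it is launched with exactly `kBlockSize` threads per block. Inside a block, threads share data through shared memory and `__syncthreads()`, and in every reduction step the responsibility for the partial sums moves to fewer threads. The blocks themselves stay independent and only meet in a single `atomicAdd`:

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // assumed block size, must be a power of two

// Hypothetical kernel: sums `in` into `*total` (which must start at 0).
__global__ void block_sum(const int *in, int *total, int n) {
    __shared__ int partial[kBlockSize];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Each thread is responsible for one input element at first.
    partial[tid] = (i < n) ? in[i] : 0;
    __syncthreads();  // all loads must be visible before anyone reads a neighbor

    // Tree reduction: in each step half of the remaining threads take over
    // the data of the other half.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();  // finish this step before the next one reads
    }

    // Blocks do not wait for each other; they only combine results atomically.
    if (tid == 0) atomicAdd(total, partial[0]);
}

// Sums n ones on the GPU; returns the sum, or -1 on a CUDA error.
int sum_ones_on_gpu(int n) {
    std::vector<int> h_in(n, 1);
    int *d_in = nullptr, *d_total = nullptr;
    int h_total = -1;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess ||
        cudaMalloc(&d_total, sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        cudaFree(d_total);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));

    const int blocks = (n + kBlockSize - 1) / kBlockSize;
    block_sum<<<blocks, kBlockSize>>>(d_in, d_total, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    if (cudaMemcpy(&h_total, d_total, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        h_total = -1;
    }
    cudaFree(d_in);
    cudaFree(d_total);
    return h_total;
}
```

Without the two `__syncthreads()` calls, a thread could read `partial[tid + stride]` before its neighbor has written it, which is exactly the kind of race condition mentioned above.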