How can I wait for one particular thread to finish executing? For instance, the second thread needs to wait for the first thread to finish its task, the third thread needs to wait for the second thread to finish its task, and so on.

__syncthreads() is the only inter-thread synchronization method (execution barrier) and it only works for threads within a single block.

What you describe very much sounds like a serial algorithm, which is not well suited for massively parallel architectures like a GPU.
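To make the point concrete, here is a minimal sketch of such a chain inside a single block, built from __syncthreads() alone. The names chainKernel and runChain and the running-sum task are hypothetical, chosen only for illustration; the sketch assumes the whole chain fits in one block (1 to 1024 threads) and omits error checking. Only one thread does useful work per step while all the others idle at the barrier, which is exactly why this pattern is a poor fit for a GPU.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical example: thread i adds the result of thread i-1 to its own
// input, so every thread depends on its predecessor within the block.
__global__ void chainKernel(const int *in, int *out, int n)
{
    extern __shared__ int partial[];   // one slot per thread
    int tid = threadIdx.x;

    // One barrier per step: at step s only thread s works.
    for (int step = 0; step < n; ++step) {
        if (tid == step) {
            int prev = (tid == 0) ? 0 : partial[tid - 1];
            partial[tid] = prev + in[tid];
            out[tid] = partial[tid];
        }
        // Every thread of the block reaches this barrier on every step,
        // so the call is never made from divergent code.
        __syncthreads();
    }
}

// Hypothetical host helper: runs the chain in ONE block of input.size() threads.
// Assumes 1 <= input.size() <= 1024; error checking is omitted for brevity.
std::vector<int> runChain(const std::vector<int> &input)
{
    int n = static_cast<int>(input.size());
    std::vector<int> result(n);
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    chainKernel<<<1, n, n * sizeof(int)>>>(dIn, dOut, n);
    cudaMemcpy(result.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling runChain({1, 2, 3, 4}) should give the running sums 1, 3, 6, 10. The loop executes n barriers for n additions, so the work is fully serialized.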

For synchronizing two warps, PTX assembly has the “barrier.arrive a, b” and “barrier.sync a, b” instructions, where “a” is the barrier number and “b” denotes the number of participating threads (the number of participating warps times 32).

So what do __syncthreads_count(), __syncthreads_or() and __syncthreads_and() do?
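For reference, the three functions (spelled with single underscores) are block-wide barriers like __syncthreads() that also evaluate an integer predicate across all threads of the block: __syncthreads_count(p) returns the number of threads for which p is nonzero, __syncthreads_and(p) returns nonzero only if p is nonzero for every thread, and __syncthreads_or(p) returns nonzero if p is nonzero for at least one thread. They require compute capability 2.0 or later. Below is a minimal sketch; voteKernel and runVote are hypothetical names, the input is assumed to fit in one block, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical example: each thread tests whether its input is positive,
// then the block combines the per-thread predicates at a barrier.
__global__ void voteKernel(const int *in, int *out)
{
    int pred = in[threadIdx.x] > 0;

    int count = __syncthreads_count(pred);  // how many threads had pred != 0
    int all   = __syncthreads_and(pred);    // nonzero iff pred != 0 for all threads
    int any   = __syncthreads_or(pred);     // nonzero iff pred != 0 for any thread

    // Every thread receives the same three values; thread 0 stores them.
    if (threadIdx.x == 0) {
        out[0] = count;
        out[1] = (all != 0);
        out[2] = (any != 0);
    }
}

// Hypothetical host helper: one block, one thread per input element.
// Assumes 1 <= input.size() <= 1024; returns {count, all, any}.
std::vector<int> runVote(const std::vector<int> &input)
{
    int n = static_cast<int>(input.size());
    std::vector<int> result(3);
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, 3 * sizeof(int));
    cudaMemcpy(dIn, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    voteKernel<<<1, n>>>(dIn, dOut);
    cudaMemcpy(result.data(), dOut, 3 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Like __syncthreads() itself, these only synchronize threads within one block; they do not let one specific thread wait for another specific thread.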

The algorithm I am trying to code has many independent calculations and one dependent calculation for each step. I am trying to do the independent calculations first and delay the dependent steps to maximize parallelism. Can you suggest a way to do it?