Simultaneous execution of kernels in CUDA

I am implementing a pipeline consisting of 4 functional blocks that are connected sequentially. Each block is parallelized to the maximum extent, i.e. each block has kernels that run on the GPU. I need to run the blocks in such a way that when the output of block 1 is ready to be dispatched to block 2, the next set of input enters block 1. So after an initial latency of 4 units of time, there must be an output at the end of the pipeline at every unit of time. Can someone suggest a novel way to do it?

This would be a standard pipelined algorithm. It involves copy-compute overlap, i.e. scheduling cudaMemcpyAsync operations so that they run concurrently with kernel execution. It's a fairly standardized, canonical technique, including in terms of how it is designed and constructed. With a bit of research you can find various tutorials and examples of this. Here is one example:

Here is a good treatment of the topic:

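In addition, your pipeline has 4 compute stages rather than 1. This can also be accomplished with the same stream semantics that are used to arrange copy/compute overlap. Work issued into one stream executes in issue order, while work issued into different streams may overlap, so issuing each data set's 4 kernels into its own stream lets different data sets occupy different stages at the same time.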
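To make the stream-based approach concrete, here is a minimal sketch, not a definitive implementation. Everything specific in it is an assumption for illustration: the four kernels stage1 to stage4 are hypothetical placeholders for your functional blocks, each stage works in place on one buffer, and the chunk count, chunk size and number of streams are arbitrary. Each chunk gets a host-to-device copy, the 4 kernels, and a device-to-host copy, all issued into one stream; consecutive chunks go to different streams so the hardware is free to overlap them.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical placeholder stages: each one updates the chunk in place.
__global__ void stage1(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}
__global__ void stage2(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}
__global__ void stage3(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] -= 3.0f;
}
__global__ void stage4(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 0.5f;
}

// Runs all chunks through the 4 stages; returns true if every result matches.
bool run_pipeline() {
    constexpr int kChunks = 8;            // number of input sets (assumed)
    constexpr int kChunkElems = 1 << 16;  // elements per input set (assumed)
    constexpr int kStreams = 4;           // chunks that may be in flight at once
    const size_t chunkBytes = kChunkElems * sizeof(float);
    const int total = kChunks * kChunkElems;

    // Pinned host memory, so that cudaMemcpyAsync can overlap with kernels.
    float *h_in = nullptr, *h_out = nullptr;
    CHECK(cudaMallocHost((void **)&h_in, kChunks * chunkBytes));
    CHECK(cudaMallocHost((void **)&h_out, kChunks * chunkBytes));
    for (int i = 0; i < total; ++i) h_in[i] = (float)i;

    // One stream and one device buffer per in-flight chunk.
    cudaStream_t streams[kStreams];
    float *d_buf[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc((void **)&d_buf[s], chunkBytes));
    }

    const int threads = 256;
    const int blocks = (kChunkElems + threads - 1) / threads;
    for (int c = 0; c < kChunks; ++c) {
        const int s = c % kStreams;
        cudaStream_t st = streams[s];
        float *d = d_buf[s];
        const size_t off = (size_t)c * kChunkElems;
        // In-order execution within st keeps the stages of this chunk sequential
        // and also protects d from being overwritten before its results are copied out.
        CHECK(cudaMemcpyAsync(d, h_in + off, chunkBytes, cudaMemcpyHostToDevice, st));
        stage1<<<blocks, threads, 0, st>>>(d, kChunkElems);
        stage2<<<blocks, threads, 0, st>>>(d, kChunkElems);
        stage3<<<blocks, threads, 0, st>>>(d, kChunkElems);
        stage4<<<blocks, threads, 0, st>>>(d, kChunkElems);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_out + off, d, chunkBytes, cudaMemcpyDeviceToHost, st));
    }
    CHECK(cudaDeviceSynchronize());

    // ((x + 1) * 2 - 3) * 0.5 == x - 0.5, exact for these float values.
    bool ok = true;
    for (int i = 0; i < total && ok; ++i) ok = (h_out[i] == h_in[i] - 0.5f);

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaFree(d_buf[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return ok;
}

int main() {
    const bool ok = run_pipeline();
    std::printf("%s\n", ok ? "pipeline results match" : "pipeline results MISMATCH");
    return ok ? 0 : 1;
}
```

Note that this does not enforce a lock-step "one unit of time per stage" schedule. It only expresses the dependencies and leaves the overlap to the GPU, and kernels from different streams actually run concurrently only if each one leaves resources free for the others. If the stages were instead placed in separate per-stage streams, the hand-off between stages would need explicit ordering, for example with cudaEventRecord and cudaStreamWaitEvent.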