What is the advantage of implementing a software pipeline over the GPU's own hardware pipelining via the warp scheduler?

CUTLASS implements a software pipeline. Is that because warp scheduling alone is not good enough to keep the compute resources busy in GEMM/conv scenarios? Can you give a guideline on when a software pipeline is needed?

One example where a pipeline is typically used is a producer-consumer arrangement. The “producer” in this case could be the retrieval of data/operands from global memory into shared memory. The subsequent “consumer” operations (e.g. matrix algebra) cannot begin until shared memory is properly populated. The pipeline allows for definition of the producer and consumer workloads, and provides syntax and structure for the necessary synchronization between the two.
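As a minimal sketch of that arrangement using the libcu++ `cuda::pipeline` abstraction (the `TILE` size and the `compute()` consumer are illustrative placeholders, not anything from this thread):

```cuda
// Sketch only: a single-stage producer/consumer pipeline per thread block.
// TILE and compute() are assumed placeholders for real GEMM-style tiling.
#include <cuda/pipeline>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int TILE = 256;

__device__ void compute(const float* tile);  // placeholder consumer work

__global__ void producer_consumer(const float* __restrict__ gmem, int ntiles)
{
    __shared__ float smem[TILE];

    auto block = cg::this_thread_block();
    // One shared pipeline state for the whole block, single stage.
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> state;
    auto pipe = cuda::make_pipeline(block, &state);

    for (int t = 0; t < ntiles; ++t) {
        // Producer: stage the next tile of operands into shared memory.
        pipe.producer_acquire();
        cuda::memcpy_async(block, smem, gmem + t * TILE,
                           sizeof(float) * TILE, pipe);
        pipe.producer_commit();

        // Consumer: wait until the tile is resident, then use it.
        pipe.consumer_wait();
        compute(smem);
        block.sync();            // all threads done with smem before reuse
        pipe.consumer_release();
    }
}
```

Here every thread plays both roles; the pipeline object supplies the acquire/commit/wait/release synchronization that the prose above describes.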

In newer GPUs, for this type of example, the pipeline can also use an “asynchronous” copy for the producer stage, further improving performance by not tying up register footprint for moving data between global and shared memory, and allowing the producer stage to effectively work “in the background” of other warp activity.
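A sketch of that asynchronous copy in isolation, using `cuda::memcpy_async` with a block-scoped `cuda::barrier` (buffer size is an assumption): on hardware with `cp.async` (compute capability 8.0+) the data moves global→shared without passing through registers; on older hardware the same code falls back to a synchronous staged copy.

```cuda
// Sketch: asynchronous global->shared copy that bypasses registers
// on cc 8.0+ hardware; functionally correct (but synchronous) elsewhere.
#include <cuda/barrier>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void async_stage(const float* __restrict__ gmem)
{
    __shared__ float smem[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0)
        init(&bar, block.size());   // one arrival per thread in the block
    block.sync();

    // Issue the copy; the warp is free to do other work until the wait.
    cuda::memcpy_async(block, smem, gmem, sizeof(smem), bar);

    bar.arrive_and_wait();          // copy complete; smem is now usable
    // ... consumer work on smem (e.g. matrix math) would go here ...
}
```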

In the producer-consumer scenario, can we just leave the pipelining work to the warp scheduler, i.e. let the warp scheduler manage the overlap of data fetching and computation? How does that compare with the producer-consumer software pipeline in performance?

You seem to have ignored my statement about synchronization. It would require more than just letting warps overlap things; it would require synchronization of some sort, at least in the case I described.

I don’t have a detailed comparison and analysis to offer you. The two cases you are “comparing” are not well defined to me anyway. You might wish to code up both examples as you see fit, and analyze performance that way.

I think the async copy mechanism I mentioned is generally considered to be valuable, performance-wise, but it will not benefit every case. It may provide benefit when the async nature is valuable (e.g. because other warps are frequently stalled, or the datapaths are not highly utilized) or when the register footprint that I already mentioned is important for performance.

The cuda::pipeline methodology/API/syntax is not the only way to achieve this kind of functionality, but it provides an abstraction that may be of interest for this case.


For the synchronization you mentioned: if it is just sequencing the computation after the data has been fetched, that can easily be guaranteed without a software pipeline. The synchronization problem arises when a software pipeline is applied, i.e. when prefetching the data for the next round of computation.
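The hazard does appear exactly at the prefetch step: the consumer must not read a tile before its copy lands, and the producer must not overwrite a buffer the consumer is still reading. A two-stage (double-buffered) sketch, again with `TILE` and `compute()` as assumed placeholders, shows where `cuda::pipeline` enforces both constraints:

```cuda
// Sketch: double-buffered prefetch; producer_acquire() blocks while a
// buffer is still in use, consumer_wait() blocks until a tile has landed.
#include <cuda/pipeline>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int TILE   = 256;
constexpr int STAGES = 2;      // double buffering

__device__ void compute(const float* tile);  // placeholder

__global__ void prefetch_loop(const float* __restrict__ gmem, int ntiles)
{
    __shared__ float smem[STAGES][TILE];

    auto block = cg::this_thread_block();
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;
    auto pipe = cuda::make_pipeline(block, &state);

    int fetched = 0;

    // Prime the pipeline: issue copies for the first STAGES tiles.
    for (; fetched < STAGES && fetched < ntiles; ++fetched) {
        pipe.producer_acquire();
        cuda::memcpy_async(block, smem[fetched % STAGES],
                           gmem + fetched * TILE, sizeof(float) * TILE, pipe);
        pipe.producer_commit();
    }

    for (int t = 0; t < ntiles; ++t) {
        pipe.consumer_wait();          // tile t is now resident
        compute(smem[t % STAGES]);
        block.sync();                  // all threads finished with this buffer
        pipe.consumer_release();       // buffer may now be overwritten

        if (fetched < ntiles) {        // prefetch the next tile
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem[fetched % STAGES],
                               gmem + fetched * TILE, sizeof(float) * TILE, pipe);
            pipe.producer_commit();
            ++fetched;
        }
    }
}
```

The compute on tile `t` overlaps the in-flight copy of tile `t+1`; the pipeline's stage accounting is precisely the synchronization that a bare reliance on the warp scheduler would not provide.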

Sure. I misunderstood your original question:

No, I cannot. Please disregard my previous posts in this thread. There is no situation where a cuda::pipeline is needed, i.e. where the functionality cannot be done in any other way.