Hello. While reading the CUTLASS documentation (the efficient GEMM doc), I came across a concept I’d like to understand better. In GEMM, when using a persistent kernel, it seems that the number of thread blocks is set to match the number of SMs, and a tile scheduler assigns tiles to these thread blocks. What are the advantages of this approach compared to allocating a thread block for each tile? Is it generally beneficial to divide the work based on the number of SMs, or is this approach specifically advantageous due to the nature of GEMM operations?
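To make sure I’m picturing it correctly, here is a rough sketch of the two strategies as I understand them (not CUTLASS code; `process_tile` and the counter-based scheduler are placeholders of my own):

```cpp
// Sketch only: one-block-per-tile vs. a persistent kernel with a simple
// software tile scheduler (a global atomic counter).
#include <cuda_runtime.h>

__device__ void process_tile(int tile_id) { (void)tile_id; /* ...compute one output tile of C... */ }

// Strategy 1: one thread block per output tile; the hardware block scheduler
// decides when and where each block runs.
__global__ void one_block_per_tile() {
    process_tile(blockIdx.x);
}

// Strategy 2: persistent kernel sized to the GPU; blocks stay resident and
// repeatedly pull the next tile index from a global counter.
__global__ void persistent_gemm(int num_tiles, int* next_tile) {
    __shared__ int tile_id;
    while (true) {
        if (threadIdx.x == 0)
            tile_id = atomicAdd(next_tile, 1);    // software tile scheduler
        __syncthreads();                          // tile_id now visible to the whole block
        if (tile_id >= num_tiles) break;          // no tiles left: the block exits
        process_tile(tile_id);
        __syncthreads();                          // finish this tile before claiming another
    }
}

int main() {
    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);

    int num_tiles = 4096;                          // e.g. (M / tileM) * (N / tileN)
    int* next_tile = nullptr;
    cudaMalloc(&next_tile, sizeof(int));
    cudaMemset(next_tile, 0, sizeof(int));

    one_block_per_tile<<<num_tiles, 256>>>();                 // grid tracks the problem size
    persistent_gemm<<<num_sms, 256>>>(num_tiles, next_tile);  // grid tracks the machine size
    cudaDeviceSynchronize();
    cudaFree(next_tile);
    return 0;
}
```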
Persistent kernels may offer benefits:
- lower latency - there is no per-kernel launch overhead, so if your communication or work-delivery scheme is efficient, it’s sometimes possible to reduce processing latency. Other techniques to reduce latency include copy/compute overlap and the use of CUDA graphs.
- persistent data - a persistent kernel also means that data can “persist”, e.g. in shared memory or even registers (that is, in storage on the GPU die itself, as opposed to off-die storage such as GPU DRAM). This can have a variety of benefits; see the sketch after this list.
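As a small illustration of the “persistent data” point, here’s a sketch (the lookup table and the element-wise transform are made up for illustration) where each resident block loads a table into shared memory once and then reuses it across many work items:

```cpp
#include <cuda_runtime.h>

#define TABLE_SIZE 256

__global__ void persistent_transform(const float* table, const float* in,
                                     float* out, int n) {
    // Loaded once per resident block, then reused for many work items,
    // instead of being re-read from DRAM by every short-lived block.
    __shared__ float s_table[TABLE_SIZE];
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        s_table[i] = table[i];
    __syncthreads();

    // Each resident block then walks many elements, reusing the on-die copy.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n;
         idx += gridDim.x * blockDim.x)
        out[idx] = in[idx] * s_table[idx % TABLE_SIZE];
}

int main() {
    const int n = 1 << 20;
    float *table, *in, *out;
    cudaMalloc(&table, TABLE_SIZE * sizeof(float));
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(table, 0, TABLE_SIZE * sizeof(float));  // placeholder contents
    cudaMemset(in, 0, n * sizeof(float));

    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
    persistent_transform<<<num_sms, 256>>>(table, in, out, n);  // one block per SM
    cudaDeviceSynchronize();
    cudaFree(table); cudaFree(in); cudaFree(out);
    return 0;
}
```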
It’s generally a good idea to balance work across SMs; this relates to the tail effect and other phenomena. A persistent kernel is usually related to the idea of occupancy, in that the kernel will usually be sized (i.e. number of threads and blocks) such that all of its threads/blocks/warps can be resident on one of the SMs “all the time”, i.e. for the entire duration of kernel execution. This type of kernel sizing reduces the time cost/latency of scheduling and depositing blocks onto SMs, since that behavior is removed in the persistent-kernel case, and you are then left with only the latency of warp scheduling.
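A sketch of that sizing idea using the occupancy API (the kernel name and its work loop are placeholders here):

```cpp
#include <cuda_runtime.h>

// Placeholder persistent kernel; imagine a work-pulling loop in the body.
__global__ void my_persistent_kernel(int* work_counter) { (void)work_counter; }

int main() {
    int device = 0, num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

    int threads = 256;
    int blocks_per_sm = 0;
    // Ask the runtime how many blocks of this kernel can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  my_persistent_kernel,
                                                  threads,
                                                  0 /* dynamic smem bytes */);

    int grid = blocks_per_sm * num_sms;   // every launched block can be co-resident
    int* work_counter = nullptr;
    cudaMalloc(&work_counter, sizeof(int));
    cudaMemset(work_counter, 0, sizeof(int));
    my_persistent_kernel<<<grid, threads>>>(work_counter);
    cudaDeviceSynchronize();
    cudaFree(work_counter);
    return 0;
}
```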
Many of these benefits/effects may be “small” in typical practice. So if you design a simple test case and don’t observe any benefit going from a one-data-item-per-thread kernel design to a persistent kernel or a grid-stride loop (i.e. multiple data items per thread), I probably wouldn’t be able to comment further. The basic mechanisms (e.g. the cost of threadblock scheduling) are generally a small cost to begin with, so the “benefit” may be small and hard to measure. You can find forum threads where people have tried simple experiments along these lines and not witnessed anything interesting.
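For reference, the two designs mentioned above look roughly like this for a trivial element-wise scaling operation:

```cpp
#include <cuda_runtime.h>

// One data item per thread: the grid size tracks the data size.
__global__ void scale_one_per_thread(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Grid-stride loop: a fixed-size grid walks the whole array, so each thread
// handles multiple data items; this is the usual stepping stone toward a
// persistent-kernel style of work distribution.
__global__ void scale_grid_stride(float* x, float a, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* x = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    scale_one_per_thread<<<(n + 255) / 256, 256>>>(x, 2.0f, n);  // grid ~ data size

    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
    scale_grid_stride<<<num_sms, 256>>>(x, 2.0f, n);             // grid ~ machine size
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```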
The ways persistent kernels are done, and can be done, have also changed over the years.
There is now the possibility to launch cooperative kernels and to do grid-wide synchronization.
With cooperative kernels there is a guarantee that all blocks run at the same time, which is also what a persistent kernel needs.
So keep this in mind when reading old documentation.
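For example, on devices that support cooperative launch, a cooperative launch plus grid-wide sync looks roughly like this (a sketch only; the two-phase kernel body and names are placeholders):

```cpp
// Build with relocatable device code, e.g.: nvcc -arch=sm_70 -rdc=true coop.cu
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void coop_persistent_kernel(float* data, int n) {
    cg::grid_group grid = cg::this_grid();
    // ...phase 1 of the persistent work loop...
    grid.sync();            // all blocks reach this point before any proceeds
    // ...phase 2...
    (void)data; (void)n;
}

int main() {
    int num_sms = 0, blocks_per_sm = 0, threads = 256;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
    // Size the grid so it is guaranteed to fit on the machine at once.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  coop_persistent_kernel,
                                                  threads, 0);

    int n = 1 << 20;
    float* data = nullptr;
    cudaMalloc(&data, n * sizeof(float));

    void* args[] = { &data, &n };
    // A cooperative launch fails (rather than queuing blocks) if the grid
    // cannot be fully co-resident, which is exactly the guarantee a
    // persistent kernel relies on.
    cudaLaunchCooperativeKernel((void*)coop_persistent_kernel,
                                dim3(num_sms * blocks_per_sm), dim3(threads),
                                args, 0 /* smem */, 0 /* stream */);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```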
Thank you all so much for your responses. I’ll run a few more tests, and if I have further questions, I’ll ask again.