Hi, I’m investigating kernel fusion and fission. Assume there is a fused kernel. If it is split (fissioned) into 2 kernels, each can use less shared memory, so it can have more active blocks per SM. However, if it stays fused, it can reuse data more and reduce kernel launch overhead. So, what is the right choice between kernel fusion and fission in this case? From my basic understanding, if the number of blocks to be scheduled is less than (max active blocks per SM) × (the number of SMs), kernel fusion is better. Is that right?
I’m not sure that makes any sense.
Kernel fusion in my experience usually implies a sequence of dependent operations.
D = A*B + C
The first non-fused kernel is a multiplication kernel that multiplies A and B, producing a “temporary” result. The second non-fused kernel is an addition kernel that adds the temporary result to C to produce the final result D. With kernel fusion, we combine those two kernels together, and one of the primary benefits is that we don’t need to store and then load the temporary result, saving some global memory traffic.
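A minimal sketch of the two approaches, using elementwise multiply and add purely for illustration (the kernel names and launch details are mine, not anything specific):

```cuda
// Unfused: two kernels, with the temporary round-tripping through global memory.
__global__ void mul_kernel(const float* A, const float* B, float* tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = A[i] * B[i];       // store temporary to global memory
}

__global__ void add_kernel(const float* tmp, const float* C, float* D, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) D[i] = tmp[i] + C[i];       // load temporary back from global memory
}

// Fused: one kernel, the temporary lives in a register.
__global__ void fused_kernel(const float* A, const float* B, const float* C,
                             float* D, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) D[i] = A[i] * B[i] + C[i];  // no global-memory round trip for tmp
}
```

The fused version eliminates one global store and one global load per element, plus one kernel launch.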
A by-product of this sort of simplistic view is that the general width of the two non-fused kernels is the same: presumably they operate on the same number of elements. Another by-product is that it makes no sense to consider running the two non-fused kernels at the same time; the second kernel cannot begin until the first kernel has ended.
I’m less familiar with the idea of kernel fission. You could certainly take a fused kernel and split it into its unfused components. I’m not aware of any advantages that brings. If shared memory were used in both unfused kernels, and that shared memory were declared separately (i.e. doubled-up) for the fused case, then that could certainly be an issue. But to a first order approximation I would view that as a foolish way to fuse those two kernels. The parts are clearly separable, so rather than “doubling-up” on shared memory, I would seek to reuse the existing allocation, such that the shared memory demand of the fused kernel is the maximum of the shared memory demand of its unfused components. In that case, I don’t see any (performance) benefit to “kernel fission” or splitting a fused kernel into its unfused components.
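To illustrate what I mean by reusing the allocation, here is a sketch (the kernel and its parameters are hypothetical) using a single dynamic shared memory buffer for both stages:

```cuda
// Sketch: reuse one dynamic shared memory allocation for both stages of a
// fused kernel, instead of declaring two separate ("doubled-up") buffers.
extern __shared__ unsigned char smem[];

__global__ void fused(const float* in, float* out, int n) {
    float* stage1_buf = reinterpret_cast<float*>(smem);  // stage-1 view of the bytes
    // ... stage-1 work reads/writes stage1_buf ...
    __syncthreads();   // all threads are done with the stage-1 contents
    float* stage2_buf = reinterpret_cast<float*>(smem);  // stage-2 reuses the same bytes
    // ... stage-2 work reads/writes stage2_buf ...
}

// Launch with dynamic shared memory sized to the *maximum* of the two stages'
// demands, not their sum:
//   fused<<<grid, block, max(stage1_bytes, stage2_bytes)>>>(in, out, n);
```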
This seems mostly orthogonal to me (i.e. not really applicable to the previous discussion). If the number of blocks to be scheduled is less than the maximum carrying capacity of the GPU, I don’t see how kernel fusion, or the lack of it, will help in that case. Instead you should seek to expose more parallelism. Kernel fission will not help in such a case because you cannot run the dependent operations at the same time.
I’m also doubtful that such sweeping generalizations are either true or useful, but that is a separate topic. You might propose an exact example of what you mean by kernel fusion and kernel fission (just as I have linked an exact example of kernel fusion) to help focus the discussion.
We could probably imagine a case of kernel fusion where the fusion might not be beneficial, roughly for the reason you suggest.
Suppose we have a first-stage unfused kernel that does a lot of work, takes a long time, and uses no shared memory. Furthermore, it can achieve “full” occupancy: max threads per SM times the number of SMs. Maybe our SMs have a maximum carrying capacity of 2048 threads per SM, and we are launching threadblocks of 1024 threads. Each thread processes one element.
Suppose we have a second stage kernel that uses a lot of shared memory for a relatively small number of elements, but doesn’t have to do much processing. Again, each thread processes one element. Maybe it uses 48KB of shared memory for each 128 elements. Now we have an occupancy situation where the second kernel can only support maybe 1-3 threadblocks per SM, so maybe 128-384 threads per SM.
If we fuse the two kernels, it’s quite possible that the net result is going to be a performance reduction, rather than an improvement.
So it’s possible we might come up with a case that disproves a generalization like “kernel fusion is a good idea”.
Rather than trying to condense CUDA programming into a set of canned “rules”, I think it’s better to be able to think about problems this way. But there is room for both. Coalesced global traffic is almost always a good idea. Could you come up with a counter-example? Maybe.
The situation you outline regarding two separate kernels over one “fused” kernel is almost exactly the situation my project is in. The first kernel is just short of the register constraint at 128 registers per thread. The second uses just over 32K of shared memory with a block size of 768 and does a good amount of work, only requiring the kernel-1 result at the very end, so no waiting is required.
The program performs very well, and performance would drop considerably if the kernels were combined. Kernel durations are 0.24 and 0.4 ms respectively (RTX 4080), in a situation where they are tightly looped for up to hours at a time.
Thanks all for the patient explanations. Sorry that I didn’t describe my case in detail. In my case, the first kernel is a batched GEMM and the second uses the result of the first kernel for an EVD and some other calculations. They use the same number of threads but different amounts of shared memory (the second is larger than the first). Register usage has no effect in this case. The max active blocks per SM is 4 for the first kernel and 2 for the second. If they are fused, the advantage is that they can reuse the shared memory; otherwise the first kernel has to copy its result from SMEM to GMEM and the second has to copy it back from GMEM to SMEM. Fusion also avoids kernel launch overhead. What I would like to discuss is this: if the number of blocks to be scheduled is larger than (max active blocks per SM) × (the number of SMs), then in the first stage of the fused case, the limited number of active blocks (actually 2) reduces GEMM performance, while in the fissioned case the active blocks can be 4. However, if the number of blocks is not larger than that product, the reduced active-block count in the first stage seems to have no negative effect. That’s why I mentioned the relationship between the number of blocks and (max active blocks per SM) × (number of SMs). So, does the analysis I mentioned previously contain any misunderstanding? @Robert_Crovella
Rather than trying to figure out something from generalizations, if the investigation were important to me I would just try the experiment. I’ve already indicated I don’t agree with your previous analysis in the general case, and I still don’t. Why not just try it? You already have a hunch that there may be some benefit to leaving data in shared memory, i.e. you have a hunch that kernel fusion might be a good idea, which is the main point I was trying to make. I still consider the cases where it might not be a good idea to be outliers, but of course I haven’t done any statistical analysis, so I certainly wouldn’t argue against other viewpoints. I’m just stating my opinion: kernel fusion might be a good idea. Try it.
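As a starting point for such an experiment, the CUDA runtime can report the max active blocks per SM for each variant directly. A sketch (the kernel names, block size, and shared memory sizes are placeholders for your own):

```cuda
#include <cstdio>

// Placeholder kernels standing in for your actual stage-1 and fused kernels
__global__ void stage1_kernel(float*) { /* batched GEMM work */ }
__global__ void fused_kernel(float*)  { /* GEMM + EVD work */ }

int main() {
    int blocks = 0;
    // Real CUDA runtime API: max resident blocks per SM for a given kernel,
    // block size, and dynamic shared memory usage per block.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, stage1_kernel,
                                                  256, 16 * 1024);
    printf("stage-1 kernel: %d blocks/SM\n", blocks);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, fused_kernel,
                                                  256, 32 * 1024);
    printf("fused kernel:   %d blocks/SM\n", blocks);
    return 0;
}
```

That gives you the occupancy side of the trade; timing both variants with your real workload gives you the answer that actually matters.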