Do Tensor Core fragments help conserve registers?

I have a portion of my algorithm that can, in fact, be phrased as a series of 4x4 matrix multiplications. This could be an excellent use of the new fp32 tensor cores in Ampere cards, even though it is a small portion of the overall run time. At present, we do things like this with lots of floats declared with names like ua_11, aa_20, etc. The register pressure from this part of the code is pretty high. If we could shift this all over to tensor cores, would the fragments that the data gets loaded into also help to conserve registers for other parts of the algorithm running in the same kernel?
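To give a flavor of the current style (simplified, with made-up names of the same sort), one output element looks like this:

```
// Simplified, illustrative: one of the sixteen outputs of a 4x4 product,
// computed from individually named scalars (ua_* = one input matrix,
// aa_* = the other, ra_* = the result).
ra_00 = ua_00 * aa_00 + ua_01 * aa_10 + ua_02 * aa_20 + ua_03 * aa_30;
ra_01 = ua_00 * aa_01 + ua_01 * aa_11 + ua_02 * aa_21 + ua_03 * aa_31;
// ...and so on for the remaining fourteen elements, so a single product
// keeps on the order of 48 named floats live at once.
```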

There is no fp32 tensor core path in Ampere. There is a TF32 path, so if that is what you mean, great. (There is also an FP16/FP32 path.) Presumably you have found the numerical differences to be a non-issue for your case. Furthermore, I know of no exposed 4x4 tensor core ops. The ones that are available are generally 16x16, 32x8, or 8x32 (or arguably larger if we include CUTLASS, cuBLAS, etc.).

If it is still interesting after that, the only proper suggestion I have is to try it. However, I don’t expect any “register conservation” to come with this approach. The fragment system still uses specific registers in specific warp lanes; there is no free lunch here, no “extra” register space that somehow becomes available when the fragment system is being used. So unless your existing storage scheme has some inefficiency, and you expect to iron out that inefficiency in the process of adopting the fragment scheme, I don’t see why anything would change markedly, register-usage-wise.
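For concreteness, the smallest thing you could try on the TF32 path via the wmma API looks roughly like this. It is a single-warp sketch for sm_80; the pointer names and leading dimensions are placeholders, and you would have to pack your 4x4 problems into the 16x16x8 tile yourself:

```
#include <mma.h>
using namespace nvcuda;

// One warp computes one m16n16k8 TF32 tile: C = A*B + C.
// A is 16x8 row-major, B is 8x16 col-major, C is 16x16 row-major (all fp32 in memory).
__global__ void tf32_wmma_sketch(const float *A, const float *B, float *C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);

    wmma::load_matrix_sync(a_frag, A, 8);   // lda = 8
    wmma::load_matrix_sync(b_frag, B, 8);   // ldb = 8

    // fp32 values must be converted to TF32 in the fragments before the MMA.
    for (int i = 0; i < a_frag.num_elements; i++) a_frag.x[i] = wmma::__float_to_tf32(a_frag.x[i]);
    for (int i = 0; i < b_frag.num_elements; i++) b_frag.x[i] = wmma::__float_to_tf32(b_frag.x[i]);

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);  // ldc = 16
}
```

The register point is visible right there: every element of a_frag, b_frag, and c_frag is an ordinary register in some lane of the warp until you store it, so the fragments compete for the same per-thread register budget as the rest of your kernel.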

As an additional aside, you seem to have already acknowledged that this isn’t really about performance. I would concur. In my experience, it’s not reasonable to assume a significant performance impact for code that does scattered small matrix multiplies using tensor cores (TC). If your alternative implementation is pretty bad, then TC will probably run better, but folks often look at the very large peak theoretical numbers of the TC engine and think that will make a big difference perf-wise if they use it in a scattered fashion for a 16x16 op here and there. I would generally caution against that sort of thinking. In order to use TC in a way that approaches the peak theoretical throughput of the engine, you need to have a lot of work to do and to do it correctly. IMO that is not trivial, and the example I would point to as proof is CUTLASS.

If you have the time to spend on these kinds of refactorings, I will say that using TC in this fashion might arguably improve the abstract expression of your code, if perhaps only slightly. That sort of beauty is in the eye of the beholder.

Thanks for your thoughtful reply. Indeed, I don’t expect a big performance win, but I was hoping there could be some fringe benefits. If they are not to be found in registers, then perhaps in the elegance of the expression, which will improve significantly (these are totally dense 4x4 matrices, after all, and we’re basically unrolling the multiplication by hand and writing it out a different way). Having only a couple of groups of 16 numbers in flight at any one time may, in itself, improve the register situation, but time will tell.