code: tangpanyu/mma-gemm
Hello, I used mma to implement two versions of matrix multiplication on tensor cores. Version 1 does not touch local memory or the L1 cache at all, but version 2 does, and this costs a lot of performance. I don't know how to improve version 2 so that it keeps its data in registers the way version 1 does.
Version 2 uses ldmatrix to load the data the tensor cores need, but the loaded fragments end up in local memory, and the mma instructions then have to fetch their operands from local memory as well. This also causes a high warp-stall count.
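For reference, the ldmatrix load in question looks roughly like this (a minimal sketch with illustrative names, not the exact code from the linked repo):

#include <cstdint>

// Loads four 8x8 b16 tiles from shared memory into four 32-bit registers
// per thread. If the destination array is later indexed with a value the
// compiler cannot fold to a constant, the whole array gets demoted from
// registers to local memory.
__device__ __forceinline__ void ldmatrix_x4(uint32_t (&dst)[4],
                                            const void* smem_ptr) {
    uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(dst[0]), "=r"(dst[1]), "=r"(dst[2]), "=r"(dst[3])
        : "r"(addr));
}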
Hi there @8948542, welcome to the NVIDIA developer forums.
I do not really understand the context of your issue, but since you tagged it as CUDA, I would suggest you try some of the sub-categories of our CUDA forums. Maybe the “Programming and Performance” section?
Ideally, reg_read and reg_write should be computed in a way that lets the compiler precompute their values for each loop iteration. Another problem is that K_tiles, which is used both in the if condition and as the outer loop bound, is computed dynamically.
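To illustrate the failure mode (a hedged sketch; all identifiers here are mine, not from the repo): registers are not addressable, so a per-thread array can only stay in registers when every index into it is a compile-time constant after unrolling. Otherwise the compiler places the array in local memory, which shows up as the local/L1 traffic you are seeing.

// frag[] can live in registers only if every index into it folds to a
// compile-time constant.
__device__ float sum_fragment(const float* src, int reg_read) {
    float frag[2][4];

    #pragma unroll
    for (int b = 0; b < 2; ++b)
        #pragma unroll
        for (int j = 0; j < 4; ++j)
            frag[b][j] = src[b * 4 + j];

    // Bad: reg_read is a runtime value here, so frag[] spills to local
    // memory and every read below goes through the L1/local path.
    float s = 0.f;
    #pragma unroll
    for (int j = 0; j < 4; ++j)
        s += frag[reg_read][j];
    return s;
}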
Here is a modified version of the code that does not use local memory: Compiler Explorer
The outer loop, which depends on K_tiles, is unrolled with an unroll factor of 2, and reg_read and reg_write are computed from the loop index. You will need to check that the transformations are correct.
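In outline, the transformation looks like this (a sketch; load_fragment and mma_compute are hypothetical stand-ins for the real ldmatrix and mma steps, and the loop is assumed to start at 0):

#include <cstdint>

__device__ __forceinline__ void load_fragment(uint32_t (&dst)[4], int k) {
    for (int j = 0; j < 4; ++j) dst[j] = static_cast<uint32_t>(k);  // placeholder for ldmatrix
}

__device__ __forceinline__ void mma_compute(const uint32_t (&src)[4]) {
    (void)src;  // placeholder for the mma.sync step
}

__device__ void mma_loop(int K_tiles, uint32_t (&frag)[2][4]) {
    #pragma unroll 2
    for (int i = 0; i < K_tiles; ++i) {
        // Derived from the loop index: within each copy of the unrolled
        // body the parity of i is fixed, so both indices fold to constants.
        int reg_write = (i + 1) & 1;
        int reg_read  = i & 1;
        load_fragment(frag[reg_write], i);
        mma_compute(frag[reg_read]);
    }
}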
Thank you, this question has troubled me for a long time. But why does version 1 keep its data in registers? Its reg_store_idx and reg_load_idx are also changed at runtime. I also tried initializing reg_write and reg_read to 0 and 1 and flipping them in the outer loop, but that did not work properly either:
int reg_write = 0;
int reg_read  = 1;
......
#pragma unroll
for (int i = 0; i < K_tiles; ++i) {
    reg_write ^= 1;  // ping-pong between the two fragment buffers
    reg_read  ^= 1;
    ......
}
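One plausible explanation, consistent with the earlier point about K_tiles (a hedged guess, not verified against the repo): if version 1's trip count is a compile-time constant, #pragma unroll can fully unroll the loop, and then even the runtime-looking XOR flips become constants in every copy of the body, so the fragment array stays in registers. With a dynamically computed K_tiles the full unroll fails and the same pattern forces a spill. A minimal sketch of the constant-trip-count case:

constexpr int K_TILES = 8;  // assumption: known at compile time in version 1

__device__ float version1_style(const float* src) {
    float frag[2][4] = {};
    int reg_write = 0;
    int reg_read  = 1;
    float acc = 0.f;

    #pragma unroll  // full unroll: the trip count is the constant K_TILES
    for (int i = 0; i < K_TILES; ++i) {
        reg_write ^= 1;  // folds to a constant in each unrolled copy
        reg_read  ^= 1;
        #pragma unroll
        for (int j = 0; j < 4; ++j) {
            frag[reg_write][j] = src[i * 4 + j];
            acc += frag[reg_read][j];
        }
    }
    return acc;
}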