Why is my array not stored in registers even though I index it with constants?

code: tangpanyu/mma-gemm: mma gemm
Hello, I used mma to implement two versions of matrix multiplication with tensor cores. Version 1 does not touch local memory or the L1 cache at all, but version 2 does, and that causes a serious loss of efficiency. I don't know how to change version 2 so that it keeps its data in registers like version 1.
Version 2 uses ldmatrix to load the data needed by the tensor cores, but the loaded fragments end up in local memory, and the mma instructions then have to fetch them back from local memory. This also results in a high warp stall rate.



But version 1's data is placed directly in registers, and I don't know what happens in between that causes the difference between the two.

Part of v2's local memory usage does not show up as register spilling (ptxas reports 0 spill bytes, yet there is a 256-byte stack frame).

Output of cuobjdump -ptx last:
.extern .shared .align 16 .b8 smem[];
.visible .entry _Z18mma_swizzle_kstageI6__halfEvPKT_S3_PS1_iii(
.param .u64 _Z18mma_swizzle_kstageI6__halfEvPKT_S3_PS1_iii_param_0,
.param .u64 _Z18mma_swizzle_kstageI6__halfEvPKT_S3_PS1_iii_param_1,
.param .u64 _Z18mma_swizzle_kstageI6__halfEvPKT_S3_PS1_iii_param_2,
.param .u32 _Z18mma_swizzle_kstageI6__halfEvPKT_S3_PS1_iii_param_3,
.param .u32 _Z18mma_swizzle_kstageI6__halfEvPKT_S3_PS1_iii_param_4,
.param .u32 _Z18mma_swizzle_kstageI6__halfEvPKT_S3_PS1_iii_param_5
)
{
.local .align 16 .b8 __local_depot0[256];
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .pred %p<16>;
.reg .b32 %r<1380>;
.reg .b64 %rd<102>;
.loc	1 72 0
nvcc -Xptxas=-v -arch=sm_86 -o last last.cu 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z18mma_swizzle_kstageI6__halfEvPKT_S3_PS1_iii' for 'sm_86'
ptxas info    : Function properties for _Z18mma_swizzle_kstageI6__halfEvPKT_S3_PS1_iii
    256 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 126 registers, used 1 barriers, 256 bytes cumulative stack size, 388 bytes cmem[0]

Hi there @8948542, welcome to the NVIDIA developer forums.

I do not really understand the context of your issue, but since you tagged it as CUDA, I would suggest you try some of the sub-categories of our CUDA forums. Maybe the “Programming and Performance” section?

Thanks!

Ok, thank you.

Please do not post code as image.

This is just a snippet of the ncu analysis; I gave the link to GitHub because the code is too long to post.

Your register accesses do not use constant indexing.
For example, reg_write is changed based on a runtime condition.

if (i_tile_read < K_tiles) {
#pragma unroll
    for (size_t i = 0; i < warp_tile_i; ++i) {
        size_t bank_row = istage_read * 64 + (warp_id / block_row_warps) * MMA_M + (lane_id % MMA_M);
        size_t bank_col = (i * C_ + lane_id / MMA_M) ^ (bank_row & 7);
        uint32_t a_smem_srt = __cvta_generic_to_shared(&smem[bank_row][bank_col * M_]);
        LDMATRIX_X4(RA[reg_write][i][0], RA[reg_write][i][1], RA[reg_write][i][2], RA[reg_write][i][3], a_smem_srt);
    }
#pragma unroll
    for (size_t i = 0; i < warp_tile_j; ++i) {
        size_t bank_row = B_offset + istage_read * 32 + (warp_id % block_row_warps) * MMA_K + i / 2 * 8 + lane_id % 8;
        size_t bank_col = (i / 2 * 2 + (lane_id / 8) % 2) ^ (bank_row & 7);
        uint32_t b_smem_srt = __cvta_generic_to_shared(&smem[bank_row][bank_col * M_]);
        LDMATRIX_X2(RB[reg_write][i][0], RB[reg_write][i][1], b_smem_srt);
    }
    reg_write ^= 1;   // <--- runtime toggle: not a compile-time constant
    istage_read = (istage_read + 1) % Stage;
    ++i_tile_read;
}

Ideally, reg_read and reg_write should be computed in a way that lets the compiler precompute their values for each loop iteration. Another problem is that K_tiles, which is used both in the if and as the outer loop counter, is computed dynamically, so the compiler cannot fully unroll the loop.

Here is a modified version of the code which does not use local memory: Compiler Explorer
The outer loop, which depends on K_tiles, is unrolled with an unroll factor of 2, and reg_read and reg_write are computed from the loop index. You will need to check whether the transformations are correct.
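The core of the transformation can be sketched as follows (a minimal sketch, not the exact Compiler Explorer code; names like RA and K_tiles follow the snippet above, and load_fragments/compute_mma stand in for the ldmatrix and mma steps):

```cuda
// With an unroll factor of 2, (i_tile & 1) becomes the literal
// constants 0 and 1 in the two unrolled copies of the loop body,
// so every access to RA/RB uses a compile-time index and ptxas
// can keep the fragments in registers instead of local memory.
#pragma unroll 2
for (int i_tile = 0; i_tile < K_tiles; ++i_tile) {
    const int reg_write = i_tile & 1;       // constant per unrolled copy
    const int reg_read  = reg_write ^ 1;    // likewise

    load_fragments(reg_write);  // ldmatrix into RA[reg_write]/RB[reg_write]
    compute_mma(reg_read);      // mma from RA[reg_read]/RB[reg_read]
}
```

The point is that the buffer index is derived from the loop counter rather than toggled in a mutable variable, so after unrolling the compiler sees only literal indices.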

Thank you, this question has troubled me for a long time. But why does version 1 store its data in registers correctly? Its reg_store_idx and reg_load_idx are also changed at runtime. I also tried initializing reg_write and reg_read to 0 and 1 and toggling them in the outer loop, but that did not work either.

    int reg_write = 0;
    int reg_read = 1;
    ......
#pragma unroll   
    for(int i=0;i<K_tiles;++i){
        reg_write ^= 1;
        reg_read ^=1;
        ......


Does the first point mean that the array size must be a compile-time constant at initialization, or that only constant indices can be used at runtime?

They also have to be constant across the threads (i.e. not depend on threadIdx or blockIdx).
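A minimal illustration of both cases (a hypothetical kernel, not from the repo): the first array is indexed only through a fully unrolled loop and stays in registers, while the second is indexed with a value derived from threadIdx, which forces ptxas to demote it to local memory.

```cuda
__global__ void index_demo(float* out) {
    float a[4];
#pragma unroll
    for (int i = 0; i < 4; ++i)   // fully unrolled: i is a constant
        a[i] = (float)i;          // -> 'a' can live in registers

    float b[4];
    int j = threadIdx.x & 3;      // runtime, thread-dependent index
    b[j] = 1.0f;                  // -> 'b' is demoted to local memory
    out[threadIdx.x] = a[0] + b[j];
}
```

Compiling with -Xptxas=-v would show a nonzero stack frame for the second array but not the first.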

Ok, thank you.