How to understand Sectors/Req?

Hello! I am learning how to analyze kernel performance using Nsight Compute. I ran a simple matrix transposition kernel; the code is as follows:

// m = 8192, n = 4096
__global__ void transposeNative(float *input, float *output, int m, int n){
    int colID_input = threadIdx.x + blockDim.x * blockIdx.x;
    int rowID_input = threadIdx.y + blockDim.y * blockIdx.y;
    if(rowID_input < m && colID_input < n){
        int index_input = colID_input + rowID_input * n;
        int index_output = rowID_input + colID_input * m;
        output[index_output] = input[index_input];
    }
}

I then profiled it with Nsight Compute. In the Memory Workload Analysis section, I have some questions about how to judge memory coalescing from the Sectors/Req values.

In the L2 Cache table, the value of Sectors/Req for L1/TEX Load is 4, which in my understanding should be optimal and minimal. However, the value of Sectors/Req for L1/TEX Store is 1. Why is this?

Setup: RTX 3090, CUDA 11.8, Nsight Compute 2023.3.0.0

Can someone explain this to me? It's really been bugging me for a long time. Thanks a million!

index_input is coalesced: consecutive threads in a warp read consecutive floats.
input[index_input] results in 1 request of 4 sectors (4 x 32B) = 128B/instruction

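index_output is strided by m = 8192 floats (32768B) between consecutive threads, so each thread in a warp writes to a different sector.
output[index_output] results in 32 requests of 1 sector (32B) = 1024B/instruction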

There are 32x more store requests than load requests.
There is 8x more store data than load data (1024B/instruction vs. 128B/instruction)