Hello! I am learning how to analyze kernel performance using Nsight Compute. I ran a simple matrix transposition kernel; the code is as follows:
// m = 8192, n = 4096
__global__ void transposeNative(float *input, float *output, int m, int n){
    int colID_input = threadIdx.x + blockDim.x * blockIdx.x;
    int rowID_input = threadIdx.y + blockDim.y * blockIdx.y;
    if(rowID_input < m && colID_input < n){
        int index_input = colID_input + rowID_input * n;
        int index_output = rowID_input + colID_input * m;
        output[index_output] = input[index_input];
    }
}
Then I profiled it with Nsight Compute. In the Memory Workload Analysis section, I have some questions about how to judge from Sectors/Req whether memory accesses are coalesced.
In the L2 Cache table, the value of Sectors/Req for L1/TEX Load is 4, which in my understanding should be the optimal (minimal) value. However, the value of Sectors/Req for L1/TEX Store is 1. Why is this?
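For reference, here is a minimal host-side sketch of the per-warp sector arithmetic behind the kernel's two access patterns. It rests on several assumptions that are not stated in the post: a block shape in which one warp covers 32 consecutive threadIdx.x values with the same threadIdx.y (for example a hypothetical 32x32 block), 32-byte sectors, and buffers aligned to at least 32 bytes. The helper names (sectorsPerRequest, loadSectors, storeSectors) are made up for illustration. The sketch only counts how many distinct sectors a single warp-wide instruction touches; it does not model how Nsight Compute groups sectors into requests in the L2 Cache table.

```cpp
#include <cstdint>
#include <set>

// Assumed constants: 32 lanes per warp, 32-byte memory sectors.
constexpr int kWarpSize = 32;
constexpr std::uint64_t kSectorBytes = 32;

// Count the distinct 32-byte sectors touched by one warp-wide access,
// given a function mapping each lane to the float element index it uses.
template <typename IndexFn>
int sectorsPerRequest(IndexFn elementIndex) {
    std::set<std::uint64_t> sectors;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        std::uint64_t byteOffset = elementIndex(lane) * sizeof(float);
        sectors.insert(byteOffset / kSectorBytes);
    }
    return static_cast<int>(sectors.size());
}

// Load pattern from the kernel: input[col + row * n].
// Assumption: the warp sits at row 0 and covers columns 0..31.
int loadSectors(std::uint64_t n) {
    const std::uint64_t row = 0;
    return sectorsPerRequest([=](int lane) {
        std::uint64_t col = static_cast<std::uint64_t>(lane);
        return col + row * n;
    });
}

// Store pattern from the kernel: output[row + col * m].
// Same warp placement as above, so consecutive lanes are m floats apart.
int storeSectors(std::uint64_t m) {
    const std::uint64_t row = 0;
    return sectorsPerRequest([=](int lane) {
        std::uint64_t col = static_cast<std::uint64_t>(lane);
        return row + col * m;
    });
}
```

With the sizes from the post, loadSectors(4096) evaluates to 4 and storeSectors(8192) evaluates to 32 under these assumptions. In other words, one warp's load touches 4 contiguous sectors, while its store touches 32 separate ones.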
Setup: RTX 3090, CUDA 11.8, Nsight Compute 2023.3.0.0
Can someone explain this to me? It has been bugging me for a long time. Thanks a million!