How to understand Sectors/Req?

whix · December 29, 2023, 1:26am

Hello! I am learning how to analyze kernel performance using Nsight Compute. I ran a simple matrix transposition kernel; the code is as follows:

// m = 8192, n = 4096
__global__ void transposeNative(float *input, float *output, int m, int n){
    int colID_input = threadIdx.x + blockDim.x * blockIdx.x;
    int rowID_input = threadIdx.y + blockDim.y * blockIdx.y;
    if(rowID_input < m && colID_input < n){
        int index_input = colID_input + rowID_input * n;
        int index_output = rowID_input + colID_input * m;
        output[index_output] = input[index_input];
    }
}

Then I profiled it with Nsight Compute. In the Memory Workload Analysis section, I have a question about how to judge memory-access coalescing from Sectors/Req.

In the L2 Cache table, the value of Sectors/Req for L1/TEX Load is 4, which in my understanding should be optimal and minimal. However, the value of Sectors/Req for L1/TEX Store is 1. Why is this?

Setup: RTX 3090, CUDA 11.8, Nsight Compute 2023.3.0.0

Can someone explain this to me? It has really been bugging me for a long time. Thanks a million!

Greg · January 3, 2024, 2:35am

index_input is coalesced: adjacent threads in a warp read adjacent floats.
input[index_input] results in 1 request of 4 sectors (128B) = 128B/instruction

index_output is strided by m = 8192 elements (32KB) between adjacent threads in a warp
output[index_output] results in 32 requests of 1 sector (32B) = 1024B/instruction

There are 32x more store requests than load requests.
There is 8x more store data than load data (1024B/instruction vs. 128B/instruction)
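
To make the arithmetic above concrete, here is a minimal host-side C++ sketch that replays the kernel's index math for one warp and counts what it touches. It is a simplified model, not the hardware's actual logic: it assumes 32B sectors, treats every distinct 128B line a warp touches as one request, and assumes blockDim.x is a multiple of 32 with 128B-aligned buffers. The names countWarpTraffic, warpLoadTraffic and warpStoreTraffic are hypothetical helpers, not part of CUDA or Nsight Compute.

```cpp
#include <cstdint>
#include <set>

// Simplified model (an assumption, not the hardware's documented logic):
// a sector is 32 bytes, and each distinct 128-byte line a warp touches
// is counted as one request.
constexpr int kWarpSize = 32;
constexpr std::uint64_t kSectorBytes = 32;
constexpr std::uint64_t kLineBytes = 128;

struct WarpTraffic {
    int requests;  // distinct 128-byte lines touched by the warp
    int sectors;   // distinct 32-byte sectors touched by the warp
};

// indexOfLane maps a lane (0..31) to the float element index it accesses.
template <typename IndexFn>
WarpTraffic countWarpTraffic(IndexFn indexOfLane) {
    std::set<std::uint64_t> lines, sectors;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const std::uint64_t byteAddr =
            static_cast<std::uint64_t>(indexOfLane(lane)) * sizeof(float);
        lines.insert(byteAddr / kLineBytes);
        sectors.insert(byteAddr / kSectorBytes);
    }
    return {static_cast<int>(lines.size()), static_cast<int>(sectors.size())};
}

// Lanes of one warp differ only in colID_input; rowID_input is fixed.
// Mirrors index_input = colID_input + rowID_input * n.
WarpTraffic warpLoadTraffic(std::uint64_t n, std::uint64_t row = 0) {
    return countWarpTraffic([=](int lane) { return lane + row * n; });
}

// Mirrors index_output = rowID_input + colID_input * m.
WarpTraffic warpStoreTraffic(std::uint64_t m, std::uint64_t row = 0) {
    return countWarpTraffic([=](int lane) { return row + lane * m; });
}
```

With m = 8192 and n = 4096, warpLoadTraffic(4096) gives 1 request and 4 sectors (4 Sectors/Req, 128B), while warpStoreTraffic(8192) gives 32 requests and 32 sectors (1 Sectors/Req, 1024B), which matches the numbers in the profile.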

