Dram_sectors_read.sum confusing in Nsight Compute

firmoonlight · February 13, 2025, 2:15am

Hi, I'm just learning CUDA. I created a test case to study L1 and L2 cache behavior and ran into an issue.

I read 64 bytes from global memory. The request first goes to the L1 cache and misses on both sectors, then goes to the L2 cache for those 2 sectors and misses again, so it finally goes to DRAM. However, I found that 4 sectors are read from DRAM, where I expected 2 sectors, since the granularity of the L2 cache is 32 bytes.

I am running the test on an RTX4000.

The kernel is very simple:

static __global__ void sumArraysGPU(short* a, short* b, short* res, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // bounds check on this thread's own index
        res[i] = a[i];
    }
}

Robert_Crovella · February 13, 2025, 3:17am

The GPU might be prefetching a full cache line.

firmoonlight · February 13, 2025, 5:48am

Weird. If I set the block size to 1024, dram_sectors_read.sum is always 4 sectors larger than the required number of sectors, which is 64.
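
For reference, the expected sector counts in this thread follow from simple arithmetic. The sketch below assumes 32-byte sectors and contiguous, 32-byte-aligned loads of 2-byte `short` elements. The helper name `minSectors` is made up for illustration, and the thread count of 32 for the first experiment is inferred from the 64 bytes mentioned above, not stated in the posts.

```cuda
#include <cstddef>
#include <cstdio>

// Assumed sector size in bytes (the L2 granularity mentioned in the thread).
constexpr std::size_t kSectorBytes = 32;

// Hypothetical helper: minimum number of sectors needed to cover `bytes`
// of contiguous data that starts on a sector boundary.
constexpr std::size_t minSectors(std::size_t bytes)
{
    return (bytes + kSectorBytes - 1) / kSectorBytes;
}

int main()
{
    // 32 threads, each reading one 2-byte short: 64 bytes.
    std::printf("%zu\n", minSectors(32 * sizeof(short)));    // prints 2
    // 1024 threads, each reading one 2-byte short: 2048 bytes.
    std::printf("%zu\n", minSectors(1024 * sizeof(short)));  // prints 64
    return 0;
}
```

Both measurements are therefore a small number of sectors above the minimum (4 vs. 2, and 68 vs. 64), not a fixed multiple of it.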

Curefab · February 13, 2025, 7:52am

I would not test those numbers at such low occupancy. There could be other memory accesses, such as loading parameters or constants, which amount to just a few bytes overall but distort your calculation. I am not sure whether this is the issue here, but, for example, shared memory bank conflict counts are known to vary slightly for small numbers.

Use a loop and load from memory addresses chosen so that the data of each iteration is far apart (not consecutive) in memory. That way, any prefetching still shows up as a measurable effect.
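
A minimal sketch of such a loop, under stated assumptions: the kernel name `stridedRead`, the helper `inputElems`, the single-warp launch, the iteration count, and the 4096-byte stride are all illustrative choices, not values from this thread. Each iteration reads 64 contiguous bytes per warp, and successive iterations are 4096 bytes apart.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of input elements the kernel may touch.
constexpr std::size_t inputElems(std::size_t iters, std::size_t strideElems)
{
    return iters * strideElems;
}

// Each thread performs `iters` loads. Consecutive iterations are
// `strideElems` elements apart, so the sectors touched by one iteration
// are not adjacent to those of the next.
__global__ void stridedRead(const short* in, short* out, int iters, std::size_t strideElems)
{
    std::size_t tid = blockIdx.x * (std::size_t)blockDim.x + threadIdx.x;
    int acc = 0;
    for (int it = 0; it < iters; ++it) {
        acc += in[(std::size_t)it * strideElems + tid];
    }
    out[tid] = (short)acc;  // store the sum so the loads are not optimized away
}

int main()
{
    const int threads = 32, blocks = 1;       // assumed launch: a single warp
    const int iters = 1024;                   // assumed iteration count
    const std::size_t strideElems = 2048;     // 2048 shorts = 4096 bytes (assumed)
    const std::size_t nIn = inputElems(iters, strideElems);

    short *in = nullptr, *out = nullptr;
    cudaMalloc(&in, nIn * sizeof(short));
    cudaMalloc(&out, (std::size_t)threads * blocks * sizeof(short));
    cudaMemset(in, 0, nIn * sizeof(short));

    stridedRead<<<blocks, threads>>>(in, out, iters, strideElems);
    std::printf("%s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

With these values the minimum is 2 sectors per iteration, 2048 in total. If the measured `dram__sectors_read.sum` scales with the iteration count beyond that, the extra traffic is tied to each access; if it stays a handful of sectors above 2048, it is more likely fixed overhead or measurement noise.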

Greg · February 13, 2025, 10:06pm

The tools and the GPU HWPM system are not designed to be as accurate as you are requesting. Reasons you can run into issues:

- A metric may be collected over multiple passes. Given that there is no method for 100% deterministic replay, this can result in variance/error.
- The GPU has many independent, simultaneous engines that may increment a PM.
- On more recent GPUs, the tools have moved to _realtime metrics. These are not 100% accurate. The error is small for typical sample periods (e.g. ±32 for 10000 cycles), but on a small sample you may not see an increment.
- There are hardware features layered on hardware features. On 100 class HBM GPUs and Ada GPUs, L2 has 64B promotion enabled by default. 100 class GPUs may also have ECC turned on, which can increase traffic.

When testing PMs, I generally launch an equal number of warps per SM sub-partition (an optimal launch) and produce enough work that any other source of increments results in a small variance that is lost in the noise.
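
A rough sketch of such a launch, under stated assumptions: 4 sub-partitions per SM, 2 warps per sub-partition, one block per SM, and a 128 MiB input are illustrative values, and the names `copyShorts` and `threadsPerBlock` are hypothetical. The CUDA runtime does not guarantee that blocks land one per SM, so this only approximates an even distribution.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: block size that gives every SM sub-partition the
// same number of warps.
constexpr int threadsPerBlock(int subPartitions, int warpsEach, int warpSize)
{
    return subPartitions * warpsEach * warpSize;
}

// Grid-stride copy: every thread handles many elements, so total traffic is
// large compared with any fixed overhead.
__global__ void copyShorts(const short* a, short* res, std::size_t n)
{
    std::size_t stride = (std::size_t)gridDim.x * blockDim.x;
    for (std::size_t i = blockIdx.x * (std::size_t)blockDim.x + threadIdx.x; i < n; i += stride) {
        res[i] = a[i];
    }
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }

    // Assumptions: 4 sub-partitions per SM, 2 warps on each.
    const int block = threadsPerBlock(4, 2, prop.warpSize);
    const int grid = prop.multiProcessorCount;      // one block per SM (placement not guaranteed)
    const std::size_t n = (std::size_t)64 << 20;    // 64 Mi shorts = 128 MiB read

    short *a = nullptr, *res = nullptr;
    cudaMalloc(&a, n * sizeof(short));
    cudaMalloc(&res, n * sizeof(short));
    cudaMemset(a, 0, n * sizeof(short));

    copyShorts<<<grid, block>>>(a, res, n);
    std::printf("%s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaFree(a);
    cudaFree(res);
    return 0;
}
```

With 128 MiB read, the minimum is 4,194,304 sectors of 32 bytes, so a deviation of a few sectors becomes negligible relative to the total.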
