Cause of "Compute Memory Access Throughput Internal Activity” being the dominant metric

Lai-YT · March 27, 2026, 4:03am

TL;DR

I would like to ask for the possible cause of GPU: Compute Memory Access Throughput Internal Activity being the dominant metric, especially for the following kernel.

The Kernel

I’m profiling a poorly written kernel. No __restrict__, no shared memory, w/ uncoalesced accesses.

__global__ void gesummv_kernel(int n, float alpha, float beta, float *A,
                               float *B, float *tmp, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  if (i < n) {
    int j;
    for (j = 0; j < n; j++) {
      tmp[i] += A[i * n + j] * x[j];
      y[i] += B[i * n + j] * x[j];
    }
    y[i] = alpha * tmp[i] + beta * y[i];
  }
}

Profiling Report

GPU: RTX 3050 Laptop

Current Hypothesis

My understanding is:

Since the pointers are not marked with __restrict__, the compiler must assume possible aliasing. This force repeated global memory loads/stores inside the loop (no register promotion).
Because global stores are write-through from L1 to L2, this could result in heavy L2 traffic.
As a result, L2 bandwidth might become the bottleneck.

However, if L2 is the limiting factor, I would expect metric L2: T Sector to be more prominent.

Greg · March 30, 2026, 8:06pm

It would be helpful if you can post the full report and state the value of “n”. If “n” is large then each warp lane will have very poor access pattern for A and B.

All throughputs < 50% indicates a latency limited kernel. The Memory Table can help by stating number of sectors per request. As n increases the number of sectors per request will increase resulting in much higher latency.

There are a few private metrics in the GPU Compute Memory Access Throughput Internal Activity. I believe the limiter is the LTS input interface. Each LDG request is 1 cycle. If the LDG instruction is not coalesced due to larger n then it will be broken up into multiple requests. The STG are consecutive so if tmp and y are 128-byte aligned then there should be 1 request for 4 sectors which at LTS is 4 cycles. There are 2.5-3 loads per store. The compiler which can be reduced to 2 loads per store.

As you noted adding restrict and const will greatly change the optimizations that can be performed by the compiler.

Lai-YT · March 31, 2026, 1:45am

Ah, that explains it. The value of n is quite large (16,384).
Thank you Greg.

Here is the full profiling report:

nvcc.ncu-rep.zip (288.3 KB)

Lai-YT · March 31, 2026, 8:44am

Additionally, I would like to confirm my understanding:
LTS is capable of handling a cache line per cycle (uncoalesced takes 4 cycles because the 4 sectors scatters in 4 different cache lines).
Is this correct?

Greg · April 2, 2026, 2:35am

Each LTS (L2 slice) can

perform 1 tag lookup per cycle.
process 1 32B sector for write per cycle from a client
return 1-2 32B sectors per cycles; depends on the GPU architecture

You are correct that if the warp instruction in uncoalesced resulting in loading/storing to 4 cachelines then the limiter will be the tag lookup. The next limiters to reviews if there is some coalescing is the write throught and read return throughput.

Lai-YT · April 3, 2026, 3:21am

Got it!

Topic		Replies	Views
Kernel with very low eligible warps despite fully coalesced memory access CUDA Programming and Performance	6	1394	July 3, 2023
Uncoalesced global loads CUDA Programming and Performance	3	1813	April 29, 2022
Visual debugger to see if mem access is coalesced CUDA Programming and Performance	7	1149	November 1, 2011
coalesced vs. uncoalesced access why not speed-up of 16x? CUDA Programming and Performance	13	6424	October 29, 2008
Same SOL for memory and SM Throughput Nsight Compute	1	342	October 9, 2024
Memory access should be coalesced but is not CUDA Programming and Performance	6	1227	May 16, 2019
Global memory access bottleneck CUDA Programming and Performance	8	3673	September 4, 2015
How to know where the bottleneck is? CUDA Programming and Performance	3	4377	February 29, 2008
Do these two global memory coalesced access pattern have same performance in thoery? CUDA Programming and Performance cuda	2	406	December 17, 2022
Profiling coalesced memory accesses confusion Nsight Compute	1	742	April 15, 2021

Cause of "Compute Memory Access Throughput Internal Activity” being the dominant metric

TL;DR

The Kernel

Profiling Report

Current Hypothesis

Related topics