Understanding memory throughput

I have a CUDA kernel that looks like the following:

    __global__ void f(float* __restrict__ X, float* __restrict__ W, int H) {
        X_shared = X[((blockIdx.x / ((H + 127) >> 7)) * 128 * 768) + ...];
        W_shared = W[((blockIdx.x % ((H + 127) >> 7)) * 128 * 768) + ...];
        // ...
    }

which is launched as:

    f<<<H / 16, 256>>>(X, W, H);

When I execute the above kernel with different values of H, I observe different compute throughput. According to the Nsight Compute memory workload analysis, the cause seems to be the load throughput:

Memory Workload Analysis when H = 768

    Section: Memory Workload Analysis
    ---------------------------------------------------------------------- --------------- ------------------------------
    Memory Throughput                                                         Gbyte/second                          16.22
    Mem Busy                                                                             %                          31.54
    Max Bandwidth                                                                        %                          26.38
    L1/TEX Hit Rate                                                                      %                          21.02
    L2 Hit Rate                                                                          %                          89.78
    Mem Pipes Busy                                                                       %                          24.37
    ---------------------------------------------------------------------- --------------- ------------------------------

Memory Workload Analysis when H = 3072

    Section: Memory Workload Analysis
    ---------------------------------------------------------------------- --------------- ------------------------------
    Memory Throughput                                                         Gbyte/second                          38.23
    Mem Busy                                                                             %                          46.84
    Max Bandwidth                                                                        %                          39.19
    L1/TEX Hit Rate                                                                      %                          17.55
    L2 Hit Rate                                                                          %                          84.38
    Mem Pipes Busy                                                                       %                          36.23
    ---------------------------------------------------------------------- --------------- ------------------------------

What I do not understand is why the memory throughput of the former is only about half that of the latter. Could anyone please offer me some help on this?

What GPU did you run this on?

What’s the value of W? It is possible that the case with smaller H is simply unable to utilize available memory bandwidth fully because too little data is moved.

The access pattern for the global memory accesses has an influence on the maximum bandwidth that can be achieved. By not showing the full address computation, you impede others in analyzing this aspect.

For questions like this, it is useful to post a minimal self-contained program that reproduces the issue and that others can build and run.

The only way to help is with a minimal reproducible example. Even with one, it would be useful if you expanded the addresses for each thread in the warp. As written, the code suggests that all threads in a thread block read the same location, but the "…" may hide per-thread indexing.

There is so little code that it is hard to answer. The first launch (768/16 = 48 thread blocks) will fit on most GPUs in a single wave. The second launch (3072/16 = 192 thread blocks) could be a single wave of thread blocks on a 24-SM GPU, or could result in many waves of thread blocks.