I have a CUDA kernel that looks like the following:
__global__ void f(float* __restrict__ X, float* __restrict__ W, int H) {
    X_shared = X[((blockIdx.x / ((H + 127) >> 7)) * 128 * 768) + ...];
    W_shared = W[((blockIdx.x % ((H + 127) >> 7)) * 128 * 768) + ...];
    ...
}

// launched as:
f<<<H / 16, 256>>>(X, W, H);
When I execute the above kernel with different values of H, I observe different compute throughput. According to the Nsight Compute memory workload analysis, the difference seems to come from the load throughput:
Memory Workload Analysis when H = 768:
Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Memory Throughput Gbyte/second 16.22
Mem Busy % 31.54
Max Bandwidth % 26.38
L1/TEX Hit Rate % 21.02
L2 Hit Rate % 89.78
Mem Pipes Busy % 24.37
---------------------------------------------------------------------- --------------- ------------------------------
Memory Workload Analysis when H = 3072:
Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Memory Throughput Gbyte/second 38.23
Mem Busy % 46.84
Max Bandwidth % 39.19
L1/TEX Hit Rate % 17.55
L2 Hit Rate % 84.38
Mem Pipes Busy % 36.23
---------------------------------------------------------------------- --------------- ------------------------------
What I do not understand is why the memory throughput of the former is only about half that of the latter. Could anyone help me understand what is limiting the load throughput at the smaller H?