I would like to ask for the possible cause of GPU: Compute Memory Access Throughput Internal Activity being the dominant metric, especially for the following kernel.
The Kernel
I’m profiling a poorly written kernel. No __restrict__, no shared memory, w/ uncoalesced accesses.
__global__ void gesummv_kernel(int n, float alpha, float beta, float *A,
float *B, float *tmp, float *x, float *y) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
int j;
for (j = 0; j < n; j++) {
tmp[i] += A[i * n + j] * x[j];
y[i] += B[i * n + j] * x[j];
}
y[i] = alpha * tmp[i] + beta * y[i];
}
}
Since the pointers are not marked with __restrict__, the compiler must assume possible aliasing. This force repeated global memory loads/stores inside the loop (no register promotion).
Because global stores are write-through from L1 to L2, this could result in heavy L2 traffic.
As a result, L2 bandwidth might become the bottleneck.
However, if L2 is the limiting factor, I would expect metric L2: T Sector to be more prominent.
It would be helpful if you can post the full report and state the value of “n”. If “n” is large then each warp lane will have very poor access pattern for A and B.
All throughputs < 50% indicates a latency limited kernel. The Memory Table can help by stating number of sectors per request. As n increases the number of sectors per request will increase resulting in much higher latency.
There are a few private metrics in the GPU Compute Memory Access Throughput Internal Activity. I believe the limiter is the LTS input interface. Each LDG request is 1 cycle. If the LDG instruction is not coalesced due to larger n then it will be broken up into multiple requests. The STG are consecutive so if tmp and y are 128-byte aligned then there should be 1 request for 4 sectors which at LTS is 4 cycles. There are 2.5-3 loads per store. The compiler which can be reduced to 2 loads per store.
As you noted adding restrict and const will greatly change the optimizations that can be performed by the compiler.
Additionally, I would like to confirm my understanding:
LTS is capable of handling a cache line per cycle (uncoalesced takes 4 cycles because the 4 sectors scatters in 4 different cache lines).
Is this correct?
process 1 32B sector for write per cycle from a client
return 1-2 32B sectors per cycles; depends on the GPU architecture
You are correct that if the warp instruction in uncoalesced resulting in loading/storing to 4 cachelines then the limiter will be the tag lookup. The next limiters to reviews if there is some coalescing is the write throught and read return throughput.