There is a big difference between the Arithmetic Intensity reported by Nsight Compute and the theoretical value. Is it reasonable?

I’m using the simplest sgemm code to learn Nsight Compute:

// Column-major indexing: A is M x K, B is K x N, C is M x N.
#define matA(i, j) (a[(i)+(j)*M])
#define matB(i, j) (b[(i)+(j)*K])
#define matC(i, j) (c[(i)+(j)*M])

__global__ void sgemm(const float *a, const float *b, float *c, int M, int N, int K) {
    int tx = blockIdx.x*blockDim.x + threadIdx.x;   // row index into C
    int ty = blockIdx.y*blockDim.y + threadIdx.y;   // column index into C

    if (tx < M && ty < N) {
        float sum = 0.0f;
        for (int i = 0; i < K; ++i) {
            sum += matA(tx, i)*matB(i, ty);         // dot product of row tx of A and column ty of B
        }
        matC(tx, ty) = sum;
    }
}
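
The host-side launch code is not shown in the post; a minimal sketch for the configuration described below (hypothetical, with allocation but no initialization or error checking) would be roughly:

int M = 2048, N = 2048, K = 2048;
float *a, *b, *c;
cudaMalloc((void**)&a, sizeof(float) * M * K);   // A: M x K (initialization omitted)
cudaMalloc((void**)&b, sizeof(float) * K * N);   // B: K x N
cudaMalloc((void**)&c, sizeof(float) * M * N);   // C: M x N

dim3 block(16, 16);                              // one thread per element of C
dim3 grid((M + block.x - 1) / block.x, (N + block.y - 1) / block.y);
sgemm<<<grid, block>>>(a, b, c, M, N, K);
cudaDeviceSynchronize();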

When M = N = K = 2048 and block.x = block.y = 16, on an RTX 3080 12 GB, the AI reported by Nsight Compute is about 185.

But the theoretical result is about 320. Reference: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html

Is this difference reasonable?

Which number are you specifically looking at in the doc? Can you provide a more precise location?

Thanks for your response.
I’m following the equation in section 2 of that doc. However, I’m using M = N = K = 2048 and float32 (4 bytes per element), so the theoretical AI is

2 * M^3 / (4 * 3 * M^2) = M / 6 ≈ 341
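
As a quick check of that arithmetic, here is a minimal sketch (assuming, as the doc's formula does, that each matrix is read or written exactly once):

#include <stdio.h>

// Theoretical arithmetic intensity of an M x N x K FP32 GEMM, assuming each
// matrix is moved to/from DRAM exactly once.
int main(void) {
    double M = 2048, N = 2048, K = 2048;
    double flops = 2.0 * M * N * K;                    // one multiply + one add per (m, n, k)
    double bytes = 4.0 * (M * K + K * N + M * N);      // read A and B, write C, 4 bytes per float
    printf("theoretical AI = %.1f FLOP/byte\n", flops / bytes);  // ~341.3 for M = N = K = 2048
    return 0;
}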

What GPU did you run on?

An RTX 3080 with 12 GB of memory.

It’s normal to see some variance from the theoretical limit, but it’s difficult to say why without more information. Are you able to share the Nsight Compute report? We can check whether the number of bytes read is higher than expected and try to identify where and why that is happening. For example, poor memory access patterns can lead to extra data being read, which decreases AI.
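
As a rough illustration of that check, a back-of-envelope sketch (the measured values below are placeholders; substitute the FLOP count and DRAM bytes from the actual report):

#include <stdio.h>

// Compare the DRAM traffic implied by the measured AI against the ideal
// single pass over A, B, and C. A ratio well above 1x suggests extra reads,
// e.g. from poor access patterns or limited data reuse.
int main(void) {
    double M = 2048, N = 2048, K = 2048;
    double ideal_bytes    = 4.0 * (M * K + K * N + M * N);  // each FP32 matrix touched once
    double measured_flops = 2.0 * M * N * K;                // replace with the report's FLOP count
    double measured_ai    = 185.0;                          // AI reported by Nsight Compute
    double measured_bytes = measured_flops / measured_ai;   // implied DRAM traffic
    printf("implied DRAM traffic = %.2fx the ideal\n", measured_bytes / ideal_bytes);
    return 0;
}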