I'm using the simplest SGEMM code to learn NSC:
// Column-major indexing: A is M x K, B is K x N, C is M x N.
#define matA(i, j) (a[(i) + (j) * M])
#define matB(i, j) (b[(i) + (j) * K])
#define matC(i, j) (c[(i) + (j) * M])

__global__ void sgemm(const float *a, const float *b, float *c, int M, int N, int K) {
    // One thread per output element: tx indexes rows of C, ty indexes columns.
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    if (tx < M && ty < N) {
        float sum = 0.0f;
        for (int i = 0; i < K; ++i) {
            sum += matA(tx, i) * matB(i, ty);
        }
        matC(tx, ty) = sum;
    }
}
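The host side looks roughly like this (a minimal sketch just to show the launch configuration; buffer names are illustrative and the data is left uninitialized since only the profiled AI matters here):

#include <cuda_runtime.h>

int main() {
    const int M = 2048, N = 2048, K = 2048;

    // Device buffers (illustrative names); contents don't matter for profiling.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, sizeof(float) * M * K);
    cudaMalloc(&d_b, sizeof(float) * K * N);
    cudaMalloc(&d_c, sizeof(float) * M * N);

    dim3 block(16, 16);
    dim3 grid((M + block.x - 1) / block.x,   // 128 blocks along x (rows of C)
              (N + block.y - 1) / block.y);  // 128 blocks along y (columns of C)
    sgemm<<<grid, block>>>(d_a, d_b, d_c, M, N, K);
    cudaDeviceSynchronize();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}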
When M = N = K = 2048, block.x = block.y = 16, on an RTX 3080 12 GB, the arithmetic intensity (AI) reported by NSC is about 185.
But the theoretical result is about 320. Reference: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
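My reading of that page is that the ideal AI assumes each matrix is moved between DRAM and the chip exactly once; for FP32 that back-of-the-envelope calculation (my own working, so please correct me if I am misreading it) is:

AI_ideal = FLOPs / bytes moved
         = 2*M*N*K / (4 * (M*K + K*N + M*N))    // 4 bytes per FP32 element
         = 2*2048^3 / (4 * 3 * 2048^2)
         = 2048 / 6
         ≈ 341 FLOP/byte

which is in the same ballpark as the ~320 above.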
Is this difference reasonable?