I'm using the simplest SGEMM code to learn NSC:
// Column-major indexing: A is M x K, B is K x N, C is M x N.
#define matA(i, j) (a[(i) + (j) * M])
#define matB(i, j) (b[(i) + (j) * K])
#define matC(i, j) (c[(i) + (j) * M])

__global__ void sgemm(const float *a, const float *b, float *c, int M, int N, int K) {
    // One thread per output element: tx indexes rows of C, ty indexes columns.
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    if (tx < M && ty < N) {
        float sum = 0.0f;
        for (int i = 0; i < K; ++i) {
            sum += matA(tx, i) * matB(i, ty);
        }
        matC(tx, ty) = sum;
    }
}
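The host side looks roughly like this (a minimal sketch just to show the launch configuration; buffer names are illustrative and the data is left uninitialized since only the profiled AI matters here):

#include <cuda_runtime.h>

int main() {
    const int M = 2048, N = 2048, K = 2048;

    // Device buffers (illustrative names); contents don't matter for profiling.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, sizeof(float) * M * K);
    cudaMalloc(&d_b, sizeof(float) * K * N);
    cudaMalloc(&d_c, sizeof(float) * M * N);

    dim3 block(16, 16);
    dim3 grid((M + block.x - 1) / block.x,   // 128 blocks along x (rows of C)
              (N + block.y - 1) / block.y);  // 128 blocks along y (columns of C)
    sgemm<<<grid, block>>>(d_a, d_b, d_c, M, N, K);
    cudaDeviceSynchronize();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}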
When M = N = K = 2048, block.x = block.y = 16, on an RTX 3080 12 GB, the arithmetic intensity (AI) reported by NSC is about 185.
But the theoretical result is about 320. Reference: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
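My reading of that page is that the ideal AI assumes each matrix is moved between DRAM and the chip exactly once; for FP32 that back-of-the-envelope calculation (my own working, so please correct me if I am misreading it) is:

AI_ideal = FLOPs / bytes moved
         = 2*M*N*K / (4 * (M*K + K*N + M*N))    // 4 bytes per FP32 element
         = 2*2048^3 / (4 * 3 * 2048^2)
         = 2048 / 6
         ≈ 341 FLOP/byte

which is in the same ballpark as the ~320 above.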
Is this difference reasonable?