Hi, I am a beginner with CUDA and I want to use it for a matrix-matrix multiplication. The algorithm to be optimized is as follows:
for (int a = 0; a < N; a++)
    for (int b = 0; b < N; b++)
        for (int c = 0; c < N; c++)
            for (int d = 0; d < N; d++)
                sum[a][b][c] += A[a][b][d] * C[d][c];
I am having trouble getting this right in CUDA. I tried to use tiling and shared memory, and below is the CUDA code that I have written,
but I am not sure whether it is correct. Can someone help me?
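One way to check whether a kernel is correct is to compare its output against a plain CPU version of the loop above. The following is only a minimal sketch of such a check, not part of the CUDA code itself. It assumes the arrays hold float values, are flattened in row-major order, and that sum starts at zero. The names reference_multiply and all_close are hypothetical helpers introduced here for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for sum[a][b][c] += A[a][b][d] * C[d][c].
// Assumption: arrays are flattened in row-major order, so A[a][b][d] is
// stored at A[(a * N + b) * N + d] and C[d][c] at C[d * N + c].
// Assumption: sum starts at zero, so the result is a pure product.
std::vector<float> reference_multiply(const std::vector<float>& A,
                                      const std::vector<float>& C,
                                      std::size_t N) {
    std::vector<float> sum(N * N * N, 0.0f);
    for (std::size_t a = 0; a < N; ++a)
        for (std::size_t b = 0; b < N; ++b)
            for (std::size_t c = 0; c < N; ++c) {
                float acc = 0.0f;  // accumulate locally, store once
                for (std::size_t d = 0; d < N; ++d)
                    acc += A[(a * N + b) * N + d] * C[d * N + c];
                sum[(a * N + b) * N + c] = acc;
            }
    return sum;
}

// Returns true when both vectors have the same size and every element of
// gpu is within tol of the matching element of ref.
bool all_close(const std::vector<float>& gpu,
               const std::vector<float>& ref,
               float tol = 1e-3f) {
    if (gpu.size() != ref.size()) return false;
    for (std::size_t i = 0; i < ref.size(); ++i)
        if (std::fabs(gpu[i] - ref[i]) > tol) return false;
    return true;
}
```

After copying the kernel's result back to the host, calling all_close(gpu_result, reference_multiply(A, C, N)) on a small N (for example one that is not a multiple of the tile size) should expose indexing or boundary mistakes. The tolerance is there because the GPU may add the products in a different order than the CPU.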