Hi, I am a CUDA beginner. I want to use CUDA for a matrix-matrix multiplication. The algorithm I want to optimize is as follows:

    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
            for (int c = 0; c < N; c++)
                for (int d = 0; d < N; d++)
                    sum[a][b][c] = sum[a][b][c] + A[a][b][d] * C[d][c];
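
For reference, here is a minimal CPU sketch of the same loop nest. It is not the CUDA version, only a baseline that a GPU result could be compared against. It assumes all three arrays are stored as contiguous row-major floats, which the post does not state. Under that assumption (a, b) can be flattened into a single row index, so the whole operation is one (N*N) x N matrix times an N x N matrix. The names batched_matmul_ref and all_close are hypothetical helpers, not from any library.

```c
#include <stddef.h>

/* CPU reference for sum[a][b][c] += A[a][b][d] * C[d][c].
   Assumption: A and sum hold N*N*N floats, C holds N*N floats,
   all contiguous and row-major. */
void batched_matmul_ref(const float *A, const float *C, float *sum, int N)
{
    size_t n = (size_t)N;
    /* Flatten (a, b) into one row index r = a*N + b. */
    for (size_t r = 0; r < n * n; r++) {
        for (size_t c = 0; c < n; c++) {
            float acc = sum[r * n + c];   /* accumulate onto the existing value */
            for (size_t d = 0; d < n; d++)
                acc += A[r * n + d] * C[d * n + c];
            sum[r * n + c] = acc;
        }
    }
}

/* Hypothetical helper: returns 1 if every element of got is within
   tol of the matching element of want, otherwise 0. */
int all_close(const float *got, const float *want, size_t count, float tol)
{
    for (size_t i = 0; i < count; i++) {
        float diff = got[i] - want[i];
        if (diff < 0.0f)
            diff = -diff;
        if (diff > tol)
            return 0;
    }
    return 1;
}
```

A typical use would be to run batched_matmul_ref on small inputs, copy the GPU output back to the host, and compare the two with all_close using a tolerance suited to float accumulation.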

However, I am having trouble with it. I tried to use tiling and shared memory; below is the CUDA code that I have written,

but I am not quite sure if this is correct. Can someone help me?